We are running a POC of InfluxDB 2.0 on Kubernetes, with the aim of migrating from on-prem InfluxDB 1.x to the new and shiny InfluxDB 2.0.
We are pushing 6,000 messages per second via Kafka. Write consumption is excellent, but as soon as we do a read, the container just dies: it consumes all of the allocated memory and is then killed by Kubernetes due to the liveness probe.
Below is the setup we are running:
Memory: 8G
CPU: 2 CPUs
Liveness probe: 30s
Readiness probe: 30s
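For reference, that allocation corresponds roughly to the following container resources block (a minimal sketch using the standard Kubernetes fields and assuming requests are set equal to limits; the Helm chart's actual values keys may differ):

resources:
  limits:
    cpu: "2"      # the 2 CPUs allocated to the pod
    memory: 8Gi   # the 8G allocation; the container is killed once this is exhausted
  requests:
    cpu: "2"
    memory: 8Gi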
Has anyone else faced a similar issue? Is there anything that we are missing completely?
Hi! Could you provide some detail on the liveness probe configuration? What endpoint did you configure, and what is the frequency of the checks? Also, does Kubernetes kill your container because of memory usage, i.e. does the state of the container change to OOMKilled? What is the time range of your query and the number of expected results?
Hi, I am using the Helm chart to deploy InfluxDB and using the default values for the readiness and liveness probes.
Liveness: http-get http://:15020/app-health/influxdb2/livez delay=0s timeout=1s period=10s #success=1 #failure=3
Readiness: http-get http://:15020/app-health/influxdb2/readyz delay=0s timeout=1s period=10s #success=1 #failure=3
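For reference, those defaults map to roughly this probe spec on the container (a minimal sketch; the field names are the standard Kubernetes ones, and the /app-health/... endpoints on port 15020 are the paths the Istio sidecar rewrites application probes to):

livenessProbe:
  httpGet:
    path: /app-health/influxdb2/livez
    port: 15020
  initialDelaySeconds: 0   # delay=0s
  timeoutSeconds: 1        # timeout=1s, the timeout being exceeded in the events below
  periodSeconds: 10        # period=10s
  successThreshold: 1      # #success=1
  failureThreshold: 3      # #failure=3
readinessProbe:
  httpGet:
    path: /app-health/influxdb2/readyz
    port: 15020
  initialDelaySeconds: 0
  timeoutSeconds: 1
  periodSeconds: 10
  successThreshold: 1
  failureThreshold: 3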
Kubernetes doesn’t kill the pod due to OOM
Last State: Terminated
Reason: Error
Exit Code: 137
Started: Wed, 23 Dec 2020 17:08:33 +0000
Finished: Tue, 29 Dec 2020 11:32:58 +0000
The event logs indicate the liveness and readiness probes failing:
Events:
Type     Reason     Age                  From                               Message
----     ------     ----                 ----                               -------
Warning  Unhealthy  79s (x7 over 5d18h)  kubelet, hostname.mycompany.local  Readiness probe failed: Get http://192.168.220.161:15021/healthz/ready: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Warning  Unhealthy  78s (x4 over 5d18h)  kubelet, hostname.mycompany.local  Liveness probe failed: Get http://192.168.220.161:15020/app-health/influxdb2/livez: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Warning  Unhealthy  76s (x2 over 5d18h)  kubelet, hostname.mycompany.local  Readiness probe failed: Get http://192.168.220.161:15020/app-health/influxdb2/readyz: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Warning  Unhealthy  50s (x4 over 5d18h)  kubelet, hostname.mycompany.local  Readiness probe failed: Get http://192.168.220.161:15021/healthz/ready: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning  Unhealthy  49s (x6 over 5d18h)  kubelet, hostname.mycompany.local  Readiness probe failed: Get http://192.168.220.161:15020/app-health/influxdb2/readyz: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
We are expecting around 21 million data points to be retrieved.
Please note that we ran the same InfluxDB version on a VM of similar spec and it returned results within seconds.
Hi @jmukhtar, sorry for the delay. When applying the Helm chart, are you using this configuration or a different one? helm-charts/charts/influxdb2 at master · influxdata/helm-charts · GitHub
Hi @bondanthony, no worries at all. Yes, we are applying the default config for InfluxDB.
The only custom values that we added are as below:
image:
  repository: {{ proxy_srever }}l/influxdb/influxdb
resources:
  limits:
    cpu: 4000m
    memory: 4Gi
  requests:
    cpu: 600m
    memory: 4Gi
We are running Istio on the Kubernetes cluster, which was showing this behavior.
Another thing we noticed is that Grafana is able to show the query results absolutely fine, but when we run the exact same query via the InfluxDB GUI, it gets stuck.
It appears that the issue lies with the rendering of graphs in the InfluxDB UI, rather than InfluxDB being unable to return the data.
@bondanthony Any suggestions on how to fix this?
Hi @jmukhtar, I find this a little odd. Do the logs of the previous container have anything useful? Would it be possible to test without having the istio-proxy sidecar added? I have a test deployment running that matches your configuration with the istio-proxy deployed. I’ll run through a few tests to see if I can get it to crash on my side.
I have tried it on a deployment without Istio and the behavior is still strange. The Chronograf tab freezes and then the browser reports that the tab is not responding and asks whether I want to kill it.
At the same time, Grafana is able to show the data absolutely fine without any issues.
I have opened a ticket as well and am awaiting a response on that: Chronograf Not Rendering the Query Result · Issue #20436 · influxdata/influxdb · GitHub
How are you exposing the UI? Ingress gateway or port-forward? We run an InfluxDB OSS instance internally and use the Chronograf UI. The only time I’ve seen a problem similar to this was related to pod memory limits.
These errors are interesting though.
ts=2021-01-04T11:48:24.428184Z lvl=info msg="Error writing response to client" log_id=0RVB0BSl000 handler=flux error="csv encoder error: write tcp 192.168.212.78:8086->192.168.159.198:49720: write: broken pipe"
Do you by chance have an idle timeout on the ingress object?
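For illustration, if the UI is exposed through an Istio ingress gateway, a per-route timeout on the VirtualService like the sketch below would cut a long-running query response short and produce exactly this kind of broken-pipe error (a hypothetical example; the names, host, and the 15s value are made up, not taken from this deployment):

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: influxdb2                  # hypothetical name
spec:
  hosts:
    - influxdb.example.com         # hypothetical external host
  gateways:
    - influxdb-gateway             # hypothetical gateway
  http:
    - timeout: 15s                 # responses slower than this are cancelled mid-stream
      route:
        - destination:
            host: influxdb2.monitoring.svc.cluster.local   # hypothetical service name
            port:
              number: 8086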
Hi @bondanthony, I have deployed InfluxDB on a Kubernetes cluster without Istio and the behavior is still not correct. The tab ends up getting killed due to unresponsiveness.
Hi @jmukhtar, sorry for the long delay again. I was able to reproduce the issue and it relates to the way influx-stress creates mock data. The large number of unique tags is causing a problem with cardinality.
Could you confirm this issue relates to the specific data influx-stress created? Would it be possible to change the format of the data being ingested to store the unique values as fields rather than tags? Each unique tag value creates a new series, while unique field values do not, so high-cardinality data belongs in fields.
I was able to query 1,000 tables without too much latency (influx-stress insert -s 1000).
Environment:
InfluxDB 2 Helm chart with Istio in front of the platform.
Based on my testing, Istio doesn't change the results of this test.