InfluxDB 2.0 Kubernetes Memory Consumption

We are running a POC of InfluxDB 2.0 on Kubernetes, with the aim of migrating from on-prem InfluxDB 1.x to the new and shiny InfluxDB 2.0.

We are pushing 6,000 messages per second via Kafka. Ingestion works very well, but as soon as we do a read the container just dies: it consumes all of the allocated memory and is then killed by Kubernetes when the liveness probe fails.

Below is the setup we are running:

Memory: 8G
CPU: 2
Liveness probe: 30s
Readiness probe: 30s

Has anyone else faced a similar issue? Is there anything we are missing completely?

Hi! Could you provide some detail on the liveness probe configuration? What endpoint did you configure, and how frequent are the checks? Also, does Kubernetes kill your container because of memory usage, i.e. does the state of the container change to OOMKilled? What is the time range of your query and the number of expected results?


Hi, I am using the Helm chart to deploy InfluxDB and the default values for the readiness and liveness probes.

    Liveness:     http-get http://:15020/app-health/influxdb2/livez delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness:    http-get http://:15020/app-health/influxdb2/readyz delay=0s timeout=1s period=10s #success=1 #failure=3

Kubernetes doesn’t kill the pod due to OOM

Last State: Terminated
Reason: Error
Exit Code: 137
Started: Wed, 23 Dec 2020 17:08:33 +0000
Finished: Tue, 29 Dec 2020 11:32:58 +0000

and the event logs indicate the liveness and readiness probes failing:

Events:
  Type     Reason     Age                  From                                 Message
  ----     ------     ----                 ----                                 -------
  Warning  Unhealthy  79s (x7 over 5d18h)  kubelet, hostname.mycompany.local  Readiness probe failed: Get http://192.168.220.161:15021/healthz/ready: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy  78s (x4 over 5d18h)  kubelet, hostname.mycompany.local  Liveness probe failed: Get http://192.168.220.161:15020/app-health/influxdb2/livez: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy  76s (x2 over 5d18h)  kubelet, hostname.mycompany.local  Readiness probe failed: Get http://192.168.220.161:15020/app-health/influxdb2/readyz: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy  50s (x4 over 5d18h)  kubelet, hostname.mycompany.local  Readiness probe failed: Get http://192.168.220.161:15021/healthz/ready: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy  49s (x6 over 5d18h)  kubelet, hostname.mycompany.local  Readiness probe failed: Get http://192.168.220.161:15020/app-health/influxdb2/readyz: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
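
For what it's worth, the failures above are probe timeouts (the 1s timeout is exceeded), not HTTP error responses, so the kubelet restarts a pod that is busy serving the query rather than one that is actually dead. As a sketch only, assuming your chart version lets you override the probes (otherwise patch the generated Deployment directly), relaxed probe timing gives a loaded instance room to respond:

livenessProbe:
  httpGet:
    path: /health        # InfluxDB 2.0 health endpoint
    port: 8086
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5      # the default 1s is easy to exceed under query load
  failureThreshold: 6
readinessProbe:
  httpGet:
    path: /health
    port: 8086
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 6

With Istio injected, the probes are rewritten to the sidecar on port 15020 as shown above, but the timeout and failure-threshold values still apply unchanged.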

We are expecting around 21 million data points to be retrieved.

Please note that we ran the same InfluxDB version on a VM with similar specs and it returned results within seconds.
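
For reference, pulling roughly 21 million raw points into a single browser session is heavy regardless of how fast the server answers. As a sketch only (bucket, measurement, and window size are placeholders), downsampling in Flux before the data reaches the UI keeps the rendered result set small:

from(bucket: "my-bucket")
  |> range(start: -7d)
  |> filter(fn: (r) => r._measurement == "my_measurement")
  |> aggregateWindow(every: 1m, fn: mean, createEmpty: false)

If the Grafana panels aggregate to an interval like this while the InfluxDB UI query returns raw points, that alone could explain why one renders fine and the other freezes the tab.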

Hi @jmukhtar, sorry for the delay. When applying the Helm chart, are you using this configuration or a different one? helm-charts/charts/influxdb2 at master · influxdata/helm-charts · GitHub

Hi @bondanthony, no worries at all. Yes, we are applying the default config for InfluxDB.

The only custom values that we added are below:

image:
  repository: {{ proxy_srever }}l/influxdb/influxdb

resources:
  limits:
    cpu: 4000m
    memory: 4Gi
  requests:
    cpu: 600m
    memory: 4Gi

We are running Istio on the Kubernetes cluster, and that is where this behavior was showing up.

Another thing we noticed is that Grafana is able to show the query results absolutely fine, but when we run the same exact query via the InfluxDB GUI it gets stuck.

It appears that the issue lies in how the InfluxDB UI renders the graphs, rather than InfluxDB being unable to return the data.

@bondanthony Any suggestions on how to fix this?

Hi @jmukhtar, I find this a little odd. Do the logs of the previous container have anything useful? Would it be possible to test without having the istio-proxy sidecar added? I have a test deployment running that matches your configuration with the istio-proxy deployed. I’ll run through a few tests to see if I can get it to crash on my side.

I have tried it on a deployment without Istio and the behavior is still strange. The Chronograf tab freezes and then the browser asks whether I want to kill the unresponsive tab.

At the same time, Grafana is able to show the data absolutely fine without any issues.

I have opened a ticket as well and am awaiting a response: Chronograf Not Rendering the Query Result · Issue #20436 · influxdata/influxdb · GitHub

How are you exposing the UI? Ingress gateway or port-forward? We run an InfluxDB OSS instance internally and use the Chronograf UI. The only time I’ve seen a problem similar to this was related to pod memory limits.
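
If it is the ingress gateway, one quick way to rule the proxy layer in or out is to port-forward straight to the pod and load the UI from there (the deployment name below is an assumption; adjust it to your release):

kubectl port-forward deploy/influxdb2 8086:8086

Then open http://localhost:8086 in the browser and rerun the same query.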

These errors are interesting though.

ts=2021-01-04T11:48:24.428184Z lvl=info msg="Error writing response to client" log_id=0RVB0BSl000 handler=flux error="csv encoder error: write tcp 192.168.212.78:8086->192.168.159.198:49720: write: broken pipe"

Do you by chance have an idle timeout on the ingress object?
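
A broken pipe while encoding the CSV response means the client side of the connection went away while InfluxDB was still streaming results, which is exactly what a proxy request or idle timeout would look like. As a sketch only (host, gateway, and service names are placeholders), the per-route timeout can be raised on the Istio VirtualService in front of InfluxDB:

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: influxdb2
spec:
  hosts:
    - influxdb.example.com
  gateways:
    - influxdb-gateway
  http:
    - route:
        - destination:
            host: influxdb2
            port:
              number: 8086
      timeout: 300s   # give long-running query responses time to finish streaming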

Hi @bondanthony, I have deployed InfluxDB on a Kubernetes cluster without Istio and the behavior is still incorrect. The tab gets killed due to unresponsiveness.

Hi @jmukhtar, sorry for the long delay again. I was able to reproduce the issue and it relates to the way influx-stress creates mock data. The large number of unique tags is causing a problem with cardinality.

Could you confirm this issue relates to the specific data influx-stress created? Would it be possible to change the format of the ingested data to use fields for the unique values instead of tags?
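
For illustration (measurement and key names are made up), here is the same point in line protocol with the unique value stored as a tag versus as a field; the tag version creates a new series per unique value, the field version does not:

# unique ID as a tag: one series per request_id, so cardinality explodes
requests,host=server01,request_id=9f86d081 duration_ms=42 1609459200000000000
# unique ID as a field: series cardinality stays bounded by the tag set
requests,host=server01 duration_ms=42,request_id="9f86d081" 1609459200000000000

High series cardinality is a common cause of memory blow-ups on reads, which would fit the behavior described earlier in the thread.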

I was able to query 1,000 tables without too much latency. influx-stress insert -s 1000

Environment:
InfluxDB 2 Helm chart with Istio in front of the platform.
Based on my testing, Istio doesn't change the results.