InfluxDB - Kubernetes - Helm - CrashLoopBackOff

foggy-glasses · October 2, 2020, 7:38am

I deployed an InfluxDB 1.8 StatefulSet in our Kubernetes cluster (EKS) using Helm. InfluxDB is backed by a datastore based on the Elastic File System (EFS) provided by AWS.

The entire setup was working fine until a team member decided to stress test it. Now the InfluxDB pod refuses to come up, with both the liveness and readiness probes reporting failures. I increased the timeout and initial delay on both probes, but the pod still does not come up.
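For anyone debugging the same symptom, one quick check is whether influxd answers its health endpoint at all. Below is a minimal sketch, not the chart's actual probe: InfluxDB 1.x answers `GET /ping` with `204 No Content`, and the function name `influx_ping` and the base URL you pass in (for example a port-forwarded `http://localhost:8086`) are assumptions for illustration.

```python
import urllib.request


def influx_ping(base_url: str, timeout_s: float = 5.0) -> bool:
    """Return True if GET <base_url>/ping answers 204 within timeout_s seconds."""
    url = f"{base_url.rstrip('/')}/ping"
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            # InfluxDB 1.x replies 204 No Content when the HTTP service is up.
            return resp.status == 204
    except OSError:
        # Connection refused, timeouts, and HTTP error statuses all land here
        # (urllib.error.URLError is a subclass of OSError).
        return False
```

Calling `influx_ping("http://localhost:8086", timeout_s=10)` in a loop while the pod starts gives a rough idea of how long startup takes, which can help decide whether longer probe delays would make any difference.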

Redeploying gives the same result: the pod tries to come up and fails, causing an endless CrashLoopBackOff restart cycle. My last resort is to delete the data and wal directories on the EFS before redeploying, but I’d like to know if I can bring it up with the data intact.

The pod memory request is set to 8 GB and the limit is set to 16 GB.

tim.hall · October 5, 2020, 6:02pm

I would start by upping the memory. The stress test has likely added a significant amount of data, which may require more memory, particularly if you are using the inmem index configuration. Also, there may be additional memory required to compact the data that has landed. The typical first step is to increase memory availability and see if you can get to stability.

foggy-glasses · October 6, 2020, 2:01am

Thanks @tim.hall! Is there a reference on setting up the performance parameters for influxdb on kubernetes?

Could you please tell me more about the inmem index configuration or point me to where I can find more info about it?

As for our case, I set the client-body-size in our ingress to 0 and reran the stress test. It at least kicked off well and is still ongoing, though I’m not sure yet whether it will finish cleanly.

tim.hall · October 15, 2020, 6:10pm

We haven’t created any documentation which is specific to performance and Kubernetes. I would generalize this to say, this just looks like typical behavior. Meaning, when you run InfluxDB there are a set of resources (CPU, memory, disk, etc.) that are allocated to it and if you exceed those resources, you can find yourself with InfluxDB crashing due to being out of resources. If you happen to be using virtualization tools and other daemons – it typically will be a crash loop as those tools/technologies attempt to restart the process.

In terms of docs here are some to look at:

Config:

inmem was the default for much of the 1.x line. But, in practice, we’ve switched all of the deployments we run now to use the TSI setting.

Details here:

and

In terms of monitoring InfluxDB, there are guides here with associated dashboards to assist:

One of the things we typically see is that folks will get started with InfluxDB and run along quite happy for a long while continuing to throw more and more workload at it. At some point, the resources available are insufficient for the workload and a crash occurs. Without establishing visibility into the various metrics and resource usage, people are in the dark as to why. So, best practice is to set this up early, understand and visualize what is happening with the various metrics and continue to observe what happens when you change/add workload. You will see things like the increase in CPU utilization or memory usage…and that gives you some clue about how much head room is left.
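As a rough illustration of the "head room" idea, here is a minimal sketch that compares peak memory usage against the configured limit. The function name `memory_headroom` and the sample values are hypothetical; in practice the samples would come from whatever monitoring you set up.

```python
GIB = 1024 ** 3


def memory_headroom(samples_bytes, limit_bytes):
    """Fraction of the memory limit still unused at the observed peak.

    A negative result means the peak exceeded the limit.
    """
    if limit_bytes <= 0:
        raise ValueError("limit_bytes must be positive")
    if not samples_bytes:
        raise ValueError("need at least one usage sample")
    peak = max(samples_bytes)
    return 1.0 - peak / limit_bytes


# Hypothetical usage samples against a 16 GiB limit.
samples = [6 * GIB, 9 * GIB, 12 * GIB]
print(f"headroom: {memory_headroom(samples, 16 * GIB):.0%}")
```

Tracking this number as workload is added shows how quickly the remaining margin shrinks, before the process actually runs out of memory.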

foggy-glasses · October 18, 2020, 2:29am

@tim.hall Thanks for responding!

We switched over to TSI, and I’ll deploy dashboards to find out more about what’s going on behind the scenes. You’re right that my team will need more visibility into what happens during a crash in order to investigate properly.
