InfluxDB - Kubernetes - Helm - CrashLoopBackOff

I deployed an InfluxDB 1.8 StatefulSet in our Kubernetes cluster (EKS) using Helm. The InfluxDB data store sits on an AWS Elastic File System (EFS) volume.

The entire setup was working fine until a team member decided to stress test it. Now the InfluxDB pod refuses to come alive, with the liveness and readiness probes reporting failures. I increased the timeout and initial delay on both probes, but the pod still refuses to come up.
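
For context, the probes on the InfluxDB container now look roughly like this. This is a sketch, assuming the chart's usual /ping health check on port 8086; the numbers are just the kind of increases I tried, not anything official:

```yaml
# Sketch of increased probe settings (illustrative values,
# assuming the chart probes InfluxDB's /ping endpoint on port 8086)
livenessProbe:
  httpGet:
    path: /ping
    port: 8086
  initialDelaySeconds: 300   # give InfluxDB time to replay the WAL after a crash
  timeoutSeconds: 30
  failureThreshold: 10
readinessProbe:
  httpGet:
    path: /ping
    port: 8086
  initialDelaySeconds: 60
  timeoutSeconds: 30
```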

Redeploying gives the same result: the pod tries to come up, fails, and ends up in an endless CrashLoopBackOff restart cycle. My last resort is to delete the data and wal directories on the EFS before redeploying, but I'd like to know whether I can bring it up with the data intact.

The pod memory request is set to 8 GB and the limit is set to 16 GB.

I would start by upping the memory. The stress test has likely added a significant amount of data, which may require more memory, particularly if you are using the inmem index configuration. There may also be additional memory required to compact the data that has landed. The typical first step is to increase the memory available to InfluxDB and see if you can get to stability.
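
For example, if the pod currently requests 8Gi with a 16Gi limit, a first experiment might look something like this. The numbers are illustrative only; size them to what your nodes can actually give the pod:

```yaml
# Illustrative only -- raise the limit, watch actual usage, then settle on numbers
resources:
  requests:
    memory: 8Gi     # current request
  limits:
    memory: 24Gi    # bumped from 16Gi as a first experiment
```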


Thanks @tim.hall! Is there a reference on tuning the performance parameters for InfluxDB on Kubernetes?

Could you please tell me more about the inmem index configuration or point me to where I can find more info about it?

As for our case, I set the client body size limit on our ingress to 0 and re-ran the stress test. It at least kicked off well and is still ongoing; I'm not sure yet whether it will finish cleanly.
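
For reference, the setting in question is the request body size limit on the ingress in front of InfluxDB. Assuming ingress-nginx, it is controlled by this annotation ("0" disables the limit):

```yaml
# Sketch, assuming the ingress-nginx controller; "0" removes the body size cap
metadata:
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "0"
```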

We haven't created any documentation specific to performance on Kubernetes. I would generalize this as typical behavior: when you run InfluxDB, there is a set of resources (CPU, memory, disk, etc.) allocated to it, and if you exceed those resources you can find yourself with InfluxDB crashing because it has run out. If you happen to be using virtualization tools or other supervising daemons, this typically shows up as a crash loop, because those tools keep attempting to restart the process.
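
If memory is the culprit, the pod status usually says so. Running kubectl describe pod (or kubectl get pod -o yaml) against the InfluxDB pod will show something like this illustrative fragment when the container was OOM-killed:

```yaml
# Illustrative pod status fragment after an out-of-memory kill
status:
  containerStatuses:
    - name: influxdb
      restartCount: 12
      lastState:
        terminated:
          reason: OOMKilled
          exitCode: 137
```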

In terms of docs, here are some to look at:

Config:

inmem was the default for much of the 1.x line, but in practice we've switched all of the deployments we run to the TSI (tsi1) setting.
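
If you are on the stock chart, the simplest way to flip it is usually an environment variable on the container, since InfluxDB 1.x maps INFLUXDB_<SECTION>_<KEY> variables onto the config file. A minimal sketch:

```yaml
# Sketch: select the disk-based TSI index instead of inmem
env:
  - name: INFLUXDB_DATA_INDEX_VERSION   # maps to [data] index-version
    value: "tsi1"
```

Keep in mind that existing shards keep whatever index they were written with; they can be converted offline with influx_inspect buildtsi, otherwise only new shards pick up the change.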

Details here:

and

In terms of monitoring InfluxDB, there are guides here with associated dashboards to assist:

One of the things we typically see is that folks get started with InfluxDB and run along quite happily for a long while, continuing to throw more and more workload at it. At some point the available resources are insufficient for the workload and a crash occurs. Without visibility into the various metrics and resource usage, people are in the dark as to why. So the best practice is to set this up early, understand and visualize what is happening with the various metrics, and keep observing what happens when you change or add workload. You will see things like rising CPU utilization or memory usage, and that gives you some clue about how much headroom is left.
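
If you don't have a monitoring stack in place yet, InfluxDB's own self-monitoring is a low-effort starting point: the [monitor] settings, which are on by default, record runtime stats into the _internal database that you can query and graph. A sketch of pinning that down explicitly via the same env-var mapping:

```yaml
# Sketch: InfluxDB 1.x self-monitoring settings (these are the defaults, shown explicitly)
env:
  - name: INFLUXDB_MONITOR_STORE_ENABLED    # maps to [monitor] store-enabled
    value: "true"
  - name: INFLUXDB_MONITOR_STORE_DATABASE   # maps to [monitor] store-database
    value: "_internal"
```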


@tim.hall Thanks for responding!

We switched over to TSI, and I'll deploy the dashboards to find out more about what's going on behind the scenes. You're right that my team needs more visibility into what happens during a crash so we can investigate properly.