I’m still pretty new to Influx; I’ve only had it running for a few months. We’re using it for long-term storage, with remote writes coming from several Prometheus instances in our Kubernetes clusters. I have RPs and CQs set up to downsample the data, and for the most part everything is working well.
What I’ve been struggling with is InfluxDB’s use of memory, CPU, and disk I/O.
Influx is running in a VMware VM with 8 vCPUs and 64GB of RAM. The disk is backed by a NetApp all-flash storage array, which is screaming fast for our modest infrastructure. I’ve switched all the databases over to the TSI1 index to help with memory.
Influx usually runs fine for a day or two. Then I start seeing CPU load go from 1-2 up to 8+. During this time swap usage climbs until it fills the entire 4GB of swap space defined, and disk read I/O suddenly goes through the roof.
I also see .tsm.tmp files being generated and not being consolidated at all. I’m currently sitting at 116 .tsm.tmp files. Almost all of them are from the same database, and in fact most are from the same shard.
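For anyone who wants to run the same check, here is a minimal Python sketch that counts .tsm.tmp files per shard. It assumes the usual InfluxDB 1.x on-disk layout of `<data dir>/<database>/<retention policy>/<shard id>/`. The function name `count_tmp_files` is made up for this example, and `/var/lib/influxdb/data` is only a common default, so adjust it to your own `[data] dir` setting.

```python
from collections import Counter
from pathlib import Path


def count_tmp_files(data_dir):
    """Count *.tsm.tmp files per (database, retention policy, shard).

    Assumes the layout <data_dir>/<database>/<retention_policy>/<shard_id>/<file>.
    """
    root = Path(data_dir)
    counts = Counter()
    if not root.is_dir():
        # Nothing to scan; return an empty result instead of raising.
        return counts
    for path in root.rglob("*.tsm.tmp"):
        parts = path.relative_to(root).parts
        # Skip anything that does not sit at least three directories deep.
        if len(parts) >= 4:
            counts[parts[:3]] += 1
    return counts


if __name__ == "__main__":
    # Assumed default data directory; change it to match your config.
    for (db, rp, shard), n in count_tmp_files("/var/lib/influxdb/data").most_common():
        print(f"{n:5d}  {db}/{rp}/{shard}")
```

The output is sorted with the worst shard first, which makes it easy to see whether the temp files are piling up in a single shard.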
Do you have subscriptions set up?
I have seen inadvertent loops set up that exhibit this behavior.
How about Kapacitor?
Sometimes when Influx gets restarted, it can create extra subscription jobs.
It doesn’t hurt to go in and delete all the Kapacitor-created subscriptions, since the valid ones will be recreated.
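As a rough sketch of that cleanup, the Python helper below turns a list of subscriptions (copied by hand from the output of `SHOW SUBSCRIPTIONS`) into InfluxQL `DROP SUBSCRIPTION` statements you can review before running them. The helper name `drop_statements` is made up for this example. The `kapacitor-` name prefix is an assumption about how Kapacitor names the subscriptions it creates, so check it against what your own `SHOW SUBSCRIPTIONS` output shows.

```python
def drop_statements(subscriptions, prefix="kapacitor-"):
    """Build DROP SUBSCRIPTION statements for matching subscriptions.

    subscriptions: iterable of (database, retention_policy, name) tuples.
    prefix: assumed naming prefix of Kapacitor-created subscriptions.
    """
    def quote(ident):
        # Double-quote the identifier, escaping any embedded double quotes.
        return '"' + ident.replace('"', '\\"') + '"'

    return [
        f"DROP SUBSCRIPTION {quote(name)} ON {quote(db)}.{quote(rp)}"
        for db, rp, name in subscriptions
        if name.startswith(prefix)
    ]


if __name__ == "__main__":
    # Hypothetical rows; replace them with your own SHOW SUBSCRIPTIONS output.
    rows = [
        ("prometheus", "autogen", "kapacitor-example-id"),
        ("prometheus", "autogen", "my-manual-subscription"),
    ]
    for statement in drop_statements(rows):
        print(statement)
```

This only prints the statements and never touches the server, so anything you want to keep (such as a manually created subscription) can be dropped from the list before you paste it into the `influx` shell.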