I’m debugging high CPU usage (50%) in InfluxDB 1.6.4 OSS when idle, and I would love a few pointers!
Our instance’s CPU usage has gradually increased from about 15% a year ago to 50% now. There is very little traffic on the instance (less than 5 req/s on average).
The instance is an EC2 r4.large with 2 vCPUs and 15 GB of RAM. The CPUs are listed as “Up to 2.3 GHz Intel Xeon Scalable Processor”.
Watching the logs I can see it’s mostly “Cache snapshot” and “Compacting file” messages. Here’s an excerpt: influxdb.log · GitHub
From reading about TSM, the cache, and the WAL in the InfluxDB docs, I am guessing that it is spending a lot of time flushing the cache to disk. I’m not sure why that is when there are so few reads and writes; I would assume the cache would not grow.
I was just taking a look at TSI indices. They seem to reduce the dependency on RAM by moving the index to disk. In our case RAM utilization is quite low, about 15% on average, so unless I’m missing something, this is probably not the right approach.
Thank you for submitting the GitHub issue. It looks like that problem was fixed in 2017, a year before the release of our version (1.6.4) so I’m assuming that we are running the fix already.
I just tried changing the “cache-snapshot-memory-size” from the default of 25M to 100M, and this does seem to help with CPU load. The “cache snapshot written” messages in the logs went from every 2-3s to every 30s, and our CPU load is down from 50% to 10%.
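For reference, the change looks roughly like this in influxdb.conf. This is a sketch of just the one setting I touched, assuming the stock 1.x layout where it lives under the [data] section; everything else is left at its default.

```toml
[data]
  # Cache size at which the engine snapshots the cache and writes it
  # to a TSM file. The default is "25m"; raised to "100m" as an experiment.
  cache-snapshot-memory-size = "100m"
```

The service needs a restart for the new value to take effect.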
I would love to understand the behavior of the cache system a bit more, and specifically why the cache is being written to disk when there are no reads or writes on the DB. Currently we have absolutely no traffic and I can still see this happening.
Is there someone I can reach out to regarding this, or is there a resource somewhere describing this?