I am experiencing problems with some of our servers running InfluxDB. I have written a detailed issue on GitHub:
Another observation we made today is that InfluxDB manages to write around 40% of the data to disk, while the rest appears to be stuck in memory and is lost when the container is restarted. Previously, the issue only ever happened on Sundays at 00:00, when the shard-group duration was 7 days. We recently changed it to 1 day, and now it has happened on a Thursday at the same time, which strongly suggests it is tied to a shard being closed and a new one being opened.
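For reference, the shard-group duration change was made roughly like this (a minimal sketch assuming the 2.x influx CLI; the bucket name and ID are placeholders, not our real ones):

```
# Look up the bucket and its current shard-group duration (bucket name is an example)
influx bucket list --name telemetry

# Change the shard-group duration from 7d to 1d
influx bucket update --id <bucket-id> --shard-group-duration 24h
```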
We have also experienced the issue on a single host that is not running Docker Swarm. Before we restarted the container I took a copy of the /metrics data, so if anything in there is useful I will gladly share it.
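In case it matters, the snapshot was taken roughly like this (host, port and output path are just examples):

```
# Scrape the Prometheus-style metrics endpoint before restarting the container
curl -s http://localhost:8086/metrics > influx-metrics-$(date +%F).txt
```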
I am not sure how to troubleshoot this further and would appreciate some help.
Hello @cripyy,
Welcome! Thanks for your question and for creating an issue. I’m afraid 2.x is known for unexpected memory spikes like this (part of the reason the engineering team invested in a rewrite for 3.x). The best place for that type of help is GitHub.
The problem is that it is not a general memory spike: memory keeps growing until the process eventually gets OOM-killed. While this is happening, qc_queing_active goes straight up to 1024, which is the limit, and InfluxDB stops responding to any queries from Grafana. The only thing that resolves it is restarting the container. As a temporary workaround, I have had to create a bash script that checks whether qc_queing_active is at the limit and, if so, restarts the container (rough sketch below), but that is not an ideal “fix” for the issue.
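For context, the watchdog looks roughly like this (a minimal sketch; the metrics URL, container name and threshold are examples, and the metric name is taken from what our /metrics endpoint exposes):

```
#!/usr/bin/env bash
# Temporary watchdog: restart the InfluxDB container when the query-controller
# queue gauge is saturated. URL, container name and limit are examples.
set -euo pipefail

METRICS_URL="http://localhost:8086/metrics"
CONTAINER="influxdb"
LIMIT=1024

# Read the current value of the queueing gauge from the metrics endpoint
queueing=$(curl -s "$METRICS_URL" | awk '/^qc_queing_active/ {print $2}')

# Restart the container once the queue reaches the configured limit
if [ -n "$queueing" ] && [ "${queueing%.*}" -ge "$LIMIT" ]; then
  docker restart "$CONTAINER"
fi
```

We run it from cron every minute, which keeps Grafana alive but obviously does not address the underlying problem.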
I have posted the issue on GitHub, but have yet to receive any response.