I am experiencing problems with some of our servers running InfluxDB. I have written a detailed issue on GitHub:
Another observation we made today is that InfluxDB manages to write around 40% of the data to disk, while the rest appears to be stuck in memory and is lost when the container is restarted. Previously, the issue only ever happened on Sundays at 00:00, when the shard-group duration was 7 days. We recently changed it to 1 day, and now it has happened on a Thursday at the same time, which strongly suggests it is tied to a shard being closed and a new one being opened.
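For reference, the shard-group duration change was made roughly like this (a minimal sketch assuming the 2.x influx CLI; the bucket name and ID are placeholders, not our real ones):

```
# Look up the bucket and its current shard-group duration (bucket name is an example)
influx bucket list --name telemetry

# Change the shard-group duration from 7d to 1d
influx bucket update --id <bucket-id> --shard-group-duration 24h
```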
We have also experienced the issue on a single host that is not running Docker Swarm. Before we restarted the container I took a copy of the /metrics data, so if anything in there is useful I will gladly share it.
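In case it matters, the snapshot was taken roughly like this (host, port and output path are just examples):

```
# Scrape the Prometheus-style metrics endpoint before restarting the container
curl -s http://localhost:8086/metrics > influx-metrics-$(date +%F).txt
```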
I am not sure how to troubleshoot this further and would appreciate some help.
Hello @cripyy,
Welcome! Thanks for your question and for creating an issue. I’m afraid 2.x is known for unexpected memory spikes like this (part of the reason the engineering team invested in a rewrite for 3.x). The best place for that type of help is GitHub.
The problem is that it is not a general memory spike: memory keeps growing until the process eventually gets OOM-killed. While this is happening, qc_queing_active goes straight up to 1024, which is the limit, and InfluxDB stops responding to any queries from Grafana. The only thing that resolves it is restarting the container. As a temporary workaround, I have had to create a bash script that checks whether qc_queing_active is at the limit and, if so, restarts the container (rough sketch below), but that is not an ideal “fix” for the issue.
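For context, the watchdog looks roughly like this (a minimal sketch; the metrics URL, container name and threshold are examples, and the metric name is taken from what our /metrics endpoint exposes):

```
#!/usr/bin/env bash
# Temporary watchdog: restart the InfluxDB container when the query-controller
# queue gauge is saturated. URL, container name and limit are examples.
set -euo pipefail

METRICS_URL="http://localhost:8086/metrics"
CONTAINER="influxdb"
LIMIT=1024

# Read the current value of the queueing gauge from the metrics endpoint
queueing=$(curl -s "$METRICS_URL" | awk '/^qc_queing_active/ {print $2}')

# Restart the container once the queue reaches the configured limit
if [ -n "$queueing" ] && [ "${queueing%.*}" -ge "$LIMIT" ]; then
  docker restart "$CONTAINER"
fi
```

We run it from cron every minute, which keeps Grafana alive but obviously does not address the underlying problem.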
I have posted the issue on GitHub, but have yet to receive any response.