We started having an issue with InfluxDB 2 back in October where it stops accepting writes and queries until a restart is performed. We have since upgraded to InfluxDB 2.6.1 and added Prometheus monitoring to gather information when queries stop working, hoping to gain insight into the issue. From the start of our monitoring, it looked like storage_shard_fields_created was creating fields for the bucket, and shortly afterwards everything became blocked. While queries are blocked, the qc_executing_active metric shows a constant increase of about 2 per minute and does not drop until a restart.
Seeing that made us wonder whether shard creation was taking longer than it should, with writes and queries halted until the shard is fully created. However, we changed the storage-shard-precreator-advance-period configuration from the default of 30m to 2h, and later to 8h, and that has not made any difference. Shard creation seemed to occur a little earlier, but not by much, and InfluxDB still reached a point where queries and writes were blocked. At this point we are running out of ideas given our basic knowledge of how InfluxDB is designed and works, so we are hoping to get some ideas and/or help with the situation here.
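For reference, this is how we applied the change (a sketch assuming a config.toml deployment; influxd also accepts the same setting as the INFLUXD_STORAGE_SHARD_PRECREATOR_ADVANCE_PERIOD environment variable or the --storage-shard-precreator-advance-period flag):

```toml
# /etc/influxdb/config.toml (path may differ per install)
# Default is 30m; we tried "2h" and then "8h" with no change in behavior.
storage-shard-precreator-advance-period = "8h"
```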
$ rpm -q influxdb2
influxdb2-2.6.1-1.x86_64
$ cat /etc/redhat-release
CentOS Linux release 7.9.2009 (Core)
$ uname -srm
Linux 3.10.0-1160.42.2.el7.x86_64 x86_64
The bucket is written to by telegraf running on 5,816 servers at a 1-minute interval; the bucket in question has a retention of 2160h0m0s and a shard group duration of 24h0m0s.
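To make the load concrete, here is a quick back-of-the-envelope calculation from the numbers above (simple arithmetic, not measured from the server; it assumes writes arrive roughly evenly):

```python
# Rough load and shard counts implied by our telegraf + bucket settings.

telegraf_hosts = 5816   # servers writing via telegraf
interval_s = 60         # telegraf interval: 1 minute

retention_h = 2160      # bucket retention: 2160h = 90 days
shard_group_h = 24      # shard group duration: 24h

# Write batches per second hitting the bucket, assuming even arrival:
batches_per_sec = telegraf_hosts / interval_s
print(round(batches_per_sec, 1))  # ≈ 96.9 batches/sec

# Shard groups kept live at steady state; one new group is created per day:
live_shard_groups = retention_h // shard_group_h
print(live_shard_groups)  # 90
```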
Attached is a screenshot of Grafana during the typical scenario. The break in metrics is when we restarted InfluxDB.
We have tried reviewing debug logs and profiles of InfluxDB during the event; however, they do not reveal much to us.
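In case it helps others dig into the profiles with us: influxd exposes the standard Go pprof endpoints, and a full goroutine dump (debug=2 format, e.g. from /debug/pprof/goroutine?debug=2) taken mid-hang usually shows where everything is parked. A small script (a sketch; the sample dump below is made up to show the format) that counts goroutines by wait state, so a dump with thousands of entries stuck in the same state stands out:

```python
import re
from collections import Counter

def count_goroutine_states(dump: str) -> Counter:
    """Count goroutines in a Go goroutine dump (debug=2 format) by state,
    e.g. 'semacquire', 'chan receive', 'IO wait', 'running'."""
    states = Counter()
    for m in re.finditer(r"^goroutine \d+ \[([^\]]+)\]:", dump, re.MULTILINE):
        # Drop the elapsed-time suffix, e.g. 'semacquire, 12 minutes'
        states[m.group(1).split(",")[0]] += 1
    return states

# Tiny fabricated sample in the debug=2 format; a real dump from a hung
# influxd will have thousands of entries.
sample = """goroutine 1 [running]:
main.main()
\t/app/main.go:10 +0x20

goroutine 42 [semacquire, 12 minutes]:
sync.runtime_SemacquireMutex(...)
\t/usr/local/go/src/sync/mutex.go:71

goroutine 43 [semacquire, 12 minutes]:
sync.runtime_SemacquireMutex(...)
\t/usr/local/go/src/sync/mutex.go:71
"""
print(count_goroutine_states(sample).most_common())
# → [('semacquire', 2), ('running', 1)]
```

Many long-lived goroutines stuck in the same semacquire/chan state would point at lock or channel contention rather than slow shard creation.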
Please let me know if there is any information I did not provide that would be helpful in solving this issue.