Queries and writes blocked until restart of InfluxDB


We started having an issue with InfluxDB 2 back in October where it would stop accepting writes and queries until a restart was performed. We have since upgraded to InfluxDB 2.6.1 and added Prometheus monitoring to gather information when queries stop working, hoping to gain insight into the issue.

From the start of our monitoring, it first looked like storage_shard_fields_created was creating fields for the bucket, and shortly after that everything was blocked. While queries are blocked, the qc_executing_active metric shows a constant increase of about 2 per minute and does not drop until a restart.

Seeing that made us wonder whether shard creation was taking longer than it should and writes/queries were halted until the shard was fully created. However, we changed the storage-shard-precreator-advance-period configuration from the default of 30m to 2h, and later to 8h, and that has not made any difference. Shard creation seemed to occur a little earlier, but not by much, and InfluxDB still got to the point where queries and writes were blocked.

At this point we are running out of ideas, given our basic knowledge of how InfluxDB is designed and works, so we are hoping to get some ideas and/or help with the situation here.
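For reference, this is roughly how we set the precreator advance period (a minimal sketch, assuming the standard InfluxDB 2.x environment-variable mapping for config keys; the config-file path in the comment is the default and may differ on your system):

```shell
# Equivalent to setting the following in /etc/influxdb/config.toml:
#   storage-shard-precreator-advance-period = "2h"
# InfluxDB 2.x maps config keys to INFLUXD_* environment variables,
# so the same option can be set for the influxd process like this
# (we tried "2h" first, then "8h"):
export INFLUXD_STORAGE_SHARD_PRECREATOR_ADVANCE_PERIOD=2h
```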


$ rpm -q influxdb2
$ cat /etc/redhat-release
CentOS Linux release 7.9.2009 (Core)
$ uname -srm
Linux 3.10.0-1160.42.2.el7.x86_64 x86_64

The bucket is written to by Telegraf running on 5,816 servers with a 1-minute interval. The bucket in question has its retention set to 2160h0m0s and its shard group duration set to 24h0m0s.
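For a sense of scale, a quick back-of-the-envelope using the numbers above (integer arithmetic, so the second figure is approximate):

```shell
# With 2160h retention and 24h shard groups, the bucket keeps
# roughly this many shards live at any one time:
echo $(( 2160 / 24 ))    # 90 shards

# 5,816 Telegraf agents flushing once a minute works out to about
# this many incoming write batches per second for the one bucket:
echo $(( 5816 / 60 ))    # 96 batches/second
```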

Attached is a screenshot of Grafana during the typical scenario. The break in the metrics is when we restart InfluxDB.

We have tried reviewing debug logs and profiles of InfluxDB during the event; however, they do not reveal much to us.
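In case it helps anyone trying to gather the same data, this is roughly how the profiles can be captured from a running instance (using influxd's standard Go pprof endpoint; the host and port here are the defaults and may differ in your setup):

```shell
# Bundle all pprof profiles, including a 30-second CPU sample,
# from the running influxd into a single tarball:
curl -o profiles.tar.gz "http://localhost:8086/debug/pprof/all?cpu=30s"
```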

Please let me know if there is any information I did not provide that would be helpful in solving this issue.

Thank you,

James Coleman

Hello @James_Coleman_LW,
Unfortunately I don’t know the answer to this question. But I’m asking around. Thank you for your patience.

If the profile is of any help, here is a profile captured during the downtime we had last night (shared via Pastebin).


Just to provide an update: the way we resolved this issue internally was to split the metrics being stored into multiple smaller buckets. It seems the issue was related to the performance of our hardware and the amount of data we were trying to store in a single bucket.
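For anyone wanting to try the same mitigation, the split looks something like this (the bucket names are made up for illustration; the 2160h retention matches our original bucket, and the flags are standard `influx` 2.x CLI options):

```shell
# Create several smaller buckets instead of one large one, then
# point different subsets of the Telegraf fleet at different buckets
# via each agent's outputs.influxdb_v2 bucket setting:
influx bucket create --name telegraf-group-a --retention 2160h
influx bucket create --name telegraf-group-b --retention 2160h
```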

Hope this helps someone if they come across this issue themselves.