Queries and writes blocked until restart of InfluxDB

James_Coleman_LW · April 20, 2023, 6:17pm

Hello,

We started having an issue with InfluxDB 2 back in October where it stops accepting writes and queries until it is restarted. We have since upgraded to InfluxDB 2.6.1 and added Prometheus monitoring to gather information when queries stop working, hoping to gain insight into the issue.

From the start of our monitoring, the pattern looked like this: storage_shard_fields_created shows fields being created for the bucket, then shortly afterwards everything is blocked. While queries are blocked, the qc_executing_active metric increases steadily at about 2 per minute and does not drop until a restart.

That made us wonder whether shard creation was taking longer than it should, with writes and queries halted until the shard is fully created. However, we changed the storage-shard-precreator-advance-period configuration from the default of 30m to 2h, and later to 8h, and it made no difference. Shard creation seemed to occur a little earlier, but not by much, and InfluxDB still reached a point where queries and writes were blocked.

At this point we are running out of ideas given our basic knowledge of how InfluxDB is designed and works, so we are hoping to get some ideas or help with the situation here.
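
For anyone wanting to alert on the same symptom, here is a minimal sketch of the check described above: read qc_executing_active from scraped metrics text and flag the case where it keeps climbing without ever dropping. The function names, the five-sample window, and the assumption that sample lines carry no trailing timestamp are illustrative choices, not anything InfluxDB prescribes.

```python
def parse_gauge(metrics_text, name):
    """Sum every sample of `name` in Prometheus text-format output.

    Assumes each sample line ends with the value (no trailing timestamp).
    Returns None if the metric is not present at all.
    """
    total = 0.0
    found = False
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        metric, _, value = line.rpartition(" ")
        base = metric.split("{", 1)[0]  # drop any {label="..."} part
        if base == name:
            total += float(value)
            found = True
    return total if found else None


def looks_stuck(samples, min_points=5):
    """True if the last `min_points` samples never decreased and rose overall."""
    if len(samples) < min_points:
        return False
    recent = samples[-min_points:]
    never_dropped = all(b >= a for a, b in zip(recent, recent[1:]))
    return never_dropped and recent[-1] > recent[0]
```

Fed one parsed value per minute, looks_stuck would trip after five minutes of the roughly 2-per-minute climb we saw, while a healthy server whose active query count rises and falls would not.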

OS/Version:

$ rpm -q influxdb2
influxdb2-2.6.1-1.x86_64
$ cat /etc/redhat-release
CentOS Linux release 7.9.2009 (Core)
$ uname -srm
Linux 3.10.0-1160.42.2.el7.x86_64 x86_64

The bucket is written to by Telegraf running on 5,816 servers with an interval of 1 minute. The bucket in question has retention set to 2160h0m0s and shard group duration set to 24h0m0s.
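
To put those numbers in perspective, here is a rough back-of-the-envelope calculation. It only uses the figures stated above; the actual number of write requests also depends on each agent's flush settings and batch sizes, which are not given here.

```python
from datetime import timedelta

servers = 5816                      # Telegraf agents writing to the bucket
interval = timedelta(minutes=1)     # collection interval per agent
retention = timedelta(hours=2160)   # bucket retention period
shard_group = timedelta(hours=24)   # shard group duration

# Average number of agents completing a collection round each second.
collections_per_second = servers / interval.total_seconds()

# Retention expressed in days, and how many shard groups fit inside it.
retention_days = retention / timedelta(days=1)
shard_groups_retained = retention / shard_group

print(f"{collections_per_second:.1f} collection rounds per second")
print(f"{retention_days:.0f} days of retention")
print(f"{shard_groups_retained:.0f} shard groups within retention")
```

That works out to roughly 97 collection rounds per second on average, 90 days of retention, and about 90 daily shard groups live at any time.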

Attached is a screenshot of Grafana during the typical scenario. The break in metrics is when we restart InfluxDB.

We have tried reviewing debug logs and profiles of InfluxDB taken during the event, but they do not reveal much to us.

Please let me know if there is any information I did not provide that would be helpful in solving this issue.

Thank you,

James Coleman

Anaisdg · April 26, 2023, 4:14pm

Hello @James_Coleman_LW,
Unfortunately I don’t know the answer to this question. But I’m asking around. Thank you for your patience.

James_Coleman_LW · April 26, 2023, 6:16pm

If the profile is of any help, here is a profile from the downtime we had last night (Pastebin.com link, titled "~/influx-profile/profiles$ ls -lah total 12384 drwxr-xr-x 9 jcoleman sta").

James_Coleman_LW · September 22, 2023, 5:01pm

Hello,

Just to provide an update, the way we resolved this issue internally was to split the metrics being stored into multiple smaller buckets. It seems the issue was related to the performance of our hardware and the amount of data we were trying to store in a single bucket.
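
As a purely illustrative sketch of one way such a split could be done (not necessarily how we did it), each host can be assigned to one of several buckets with a stable hash, so a given server always writes to the same bucket. The bucket prefix, the bucket count, and the hostnames below are hypothetical.

```python
import hashlib


def bucket_for(host, bucket_count, prefix="telegraf"):
    """Map a hostname to one of `bucket_count` buckets, stably across runs.

    Uses SHA-256 rather than Python's built-in hash(), which is randomized
    per process for strings and so would not give a stable assignment.
    """
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % bucket_count
    return f"{prefix}_{index}"


# Hypothetical hostnames, one per server, spread over four buckets.
hosts = [f"server{n:04d}.example.com" for n in range(5816)]
counts = {}
for host in hosts:
    name = bucket_for(host, 4)
    counts[name] = counts.get(name, 0) + 1
```

The resulting name would then be set as the target bucket in each agent's output configuration. Splitting by measurement or by team instead of by host is just as valid; the point is only that each bucket ends up holding a fraction of the data.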

Hope this helps someone if they come across this issue themselves.

kjetil_mjos · October 17, 2024, 10:02am

Hi @James_Coleman_LW

Do you happen to have any numbers around when it is wise to split data into multiple smaller buckets?

We are experiencing exactly the same issue as you are describing. https://github.com/influxdata/influxdb/issues/25296

It looks like some of the cases in this issue are also experiencing the same thing:
https://github.com/influxdata/influxdb/issues/23956
