Query engine stuck

Hello all, big problem here with v2.7.5 on Linux. The machine running influxd has plenty of resources (RAM, CPU, network). Everything runs fine for several hours (at least one) and then, out of the blue, the query engine gets stuck. Queries are no longer executed and new queries get queued until the queue fills up, after which they are rejected. When this happens, influxd starts using more CPU (not much, and not all the time) and its memory consumption starts to grow.

Writes are unaffected: new data points are still accepted and persisted normally. Our incoming write rates are quite stable, though there can be short bursts (a couple of minutes at most) from time to time.

At this point we either restart influxd, or we let it run and after some time (1-2 hours) the OOM killer kicks in.

Any hint?

P.S. I have collected metrics from several of these cases, if anyone is interested…

Before anyone asks: I had the log level set to “error” and there were no errors in the logs. I have now switched to “info” and am waiting for the query engine to stall again.
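In case it helps anyone trying to reproduce this, this is roughly how I changed the log level (just a sketch; how you pass options depends on how influxd is started on your system):

    # pass the option directly when starting influxd
    influxd --log-level=info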

Query rates: between 3 and 15 per minute.

Have you made any recent changes to your writes or queries, or was this truly out of the blue with no changes on your side? Also, when did you last upgrade, just so I know whether this might be an old bug or a more recent issue.

No changes to the queries. On the write side I am not 100% sure, but it looks like nothing was changed. The software was last upgraded in January. The problem started at a very precise moment (about 10 days ago) and has not stopped since.

With log level “info” enabled there are no unusual errors or messages at the time of the freeze (we had two episodes last night). There have been zero “warn” entries in the past 24 hours.

Any resolution here? Seeing the same thing, same symptoms.

No solution yet. I manage 4 servers, and two of them are now showing this problem (the second started a couple of months ago, with no updates and no changes in usage patterns).

I was able to mitigate it a bit by reducing the maximum number of concurrent compactions and by increasing the shard group duration (which results in fewer compaction runs).
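For anyone who wants to try the same mitigation, this is roughly what I changed (a sketch only; the bucket ID is a placeholder, the values depend on your workload, and a new shard group duration only affects shard groups created after the change):

    # limit concurrent TSM compactions (influxd startup option)
    influxd --storage-max-concurrent-compactions=1

    # increase the shard group duration for the affected bucket
    # (<bucket-id> is a placeholder; 168h = 7-day shard groups)
    influx bucket update --id <bucket-id> --shard-group-duration 168h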

Any guidance from @influxdb ?

Small update:

  • Of the 4 servers I manage, 3 are now (or were) showing the problem.
  • I enabled detailed logging and found that, at the time of the freeze-ups, several compactions (series partition and TSI) all started at the exact same time and ended much, much later (they usually finish within ~1 minute; here they took 30-60 minutes). The pattern I used to isolate them was to grep the influxd logs for the string " compaction" (see the snippet after this list).
  • At the same time, system monitoring showed heavy disk reads and, at the very end, a very small amount of disk writes.
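This is essentially the grep I used (a sketch; the service name and log location depend on your installation):

    # when influxd runs under systemd
    journalctl -u influxdb --since today | grep " compaction"

    # or, if influxd logs to a file
    grep " compaction" /var/log/influxdb/influxd.log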

I had already set the maximum number of concurrent compactions to 1, but this parameter apparently does not apply to the TSI compactions (only to the TSM ones).

I started looking into the indexes (TSI). Searching around, I got to the documentation page describing how to rebuild the TSI index.

I followed the procedure described in the doc (it took a long time to rebuild all the indexes) and now things look much, much better. We’ll see how it evolves…
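For reference, the rebuild boiled down to something like this (a sketch from my notes; the paths and the “influxdb” system user are the defaults of my package installation, and the exact flags may differ between versions, so please check the official doc before running anything):

    # stop influxd first
    sudo systemctl stop influxdb

    # rebuild the TSI index from the TSM data
    # run as the user that owns the data directory so file ownership stays intact
    sudo -u influxdb influxd inspect build-tsi \
      --data-path /var/lib/influxdb/engine/data \
      --wal-path /var/lib/influxdb/engine/wal

    # start influxd again
    sudo systemctl start influxdb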