Query engine stuck

Hello all, I have a big problem with v2.7.5 on Linux. The machine running influxd has plenty of resources (RAM, CPU, network). Everything runs fine for one or more hours and then, out of the blue, the query engine gets stuck. Queries no longer execute, and new queries are queued until the queue is full, after which they are rejected. When this happens, influxd starts using more CPU (not much, and not continuously) and its memory consumption starts to grow.

Writes still work: new data points are accepted and written. Our incoming data flow is quite stable, but there can be short bursts (a couple of minutes at most) from time to time.

At this point we either restart influxd or let it run, in which case the OOM killer kicks in after 1 to 2 hours.

Any hints?

P.S. I have collected metrics from several of these cases, if anyone is interested…
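For anyone who wants to compare numbers, here is a minimal sketch of how query-related gauges could be sampled from a Prometheus-style text endpoint such as influxd's `/metrics`. The URL, the `qc_` prefix, the metric names in the comments and the helper names `parse_metrics` and `fetch_metrics` are assumptions for illustration, not something confirmed in this thread. The parser also assumes simple `name{labels} value` lines.

```python
import urllib.request


def parse_metrics(text, prefix="qc_"):
    """Return {series: value} for samples whose name starts with `prefix`.

    `text` is Prometheus text exposition output. The default prefix is an
    assumption about how the query-controller metrics are named.
    """
    samples = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        if "}" in line:
            # Labelled sample: everything up to the last "}" is the series.
            head, _, tail = line.rpartition("}")
            series = head + "}"
        else:
            # Unlabelled sample: the series name ends at the first space.
            series, _, tail = line.partition(" ")
        fields = tail.split()
        if not series.startswith(prefix) or not fields:
            continue
        try:
            samples[series] = float(fields[0])
        except ValueError:
            # Ignore lines whose value is not numeric.
            continue
    return samples


def fetch_metrics(url="http://localhost:8086/metrics", timeout=5):
    """Download the raw metrics text. The URL is an assumed default."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Calling `parse_metrics(fetch_metrics())` once a minute and logging the result would show whether the queued and executing counts climb in the minutes before a freeze, which is easier to correlate with the memory growth than a snapshot taken afterwards.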

Before anyone asks: I had log-level set to error and there were no errors in the logs. I have now switched to info and am waiting for the next freeze.

Query rate: between 3 and 15 per minute.

Have you made any recent changes to your writes or queries? Or was this truly out of the blue, with no changes on your side? Also, when was the last time you upgraded? I ask so that I know whether this might be an old bug or a more recent issue.

No changes to the queries. On the write side I am not 100% sure, but it looks like nothing was changed. The software was last upgraded in January. The problem started at a very precise moment (about 10 days ago) and has kept recurring ever since.

With log-level set to info, there are no unusual errors or messages at the time of the freeze (we had two episodes last night). There have been zero “warn” entries in the past 24 hours.