Execution of heavy queries results in a crash

While testing the limits of my system, I found that running an extremely heavy query, one that may take tens of minutes or even hours to complete, inevitably crashes InfluxDB.

The system runs version 2.1.1 in a Docker environment and is given 1GB of RAM (I also tried values from 512MB to 4GB, with similar results). The data schema consists of a single bucket, a single measurement, no tags and a few hundred (~600) fields; data is ingested every few (~1-10) seconds on average. The query I’m running for the test is the following:

data_from_bucket = from(bucket: "mybucket")
  |> range(start: -30d, stop: now())
  |> filter(fn: (r) => r["_measurement"] == "mymeasurement")

data_all = data_from_bucket
  |> pivot(rowKey:["_time"], columnKey: ["_field"], valueColumn: "_value")

data_field = data_from_bucket
  |> filter(fn: (r) => r["_field"] == "myfield")

join(tables: {key1: data_all, key2: data_field}, on: ["_time"], method: "inner")

From my tests, simply running this query puts a heavy load on CPU and memory and leads to an inevitable crash.
Trying to stop the execution of the heavy query doesn’t change the result, probably because cancelling the query request doesn’t really stop the internal execution. As ‘docker stats’ shows, the resource usage of the influxdb process stays high after issuing the stop.
In the meantime, any concurrent, almost instant query (a few seconds) completes without a crash.

Query free run (mem_limit set in Docker Compose):

mem_limit       crash after
4GB             4m 20s
2GB             1m 10s
1GB             18s
512MB           8s

Query run once and stopped after 5s:

mem_limit       crash after
4GB             no crash, but after 50-60s the log shows: warn msg="internal error not returned to client" log_id=0X~AJGuW000 handler=error_logger error="context canceled"
2GB             3m
1GB             20s
512MB           10s

In a loop, I ran the query, stopped it after 10s, ran it again, stopped it after 10s, and so on until the crash:

mem_limit       crash during run number
4GB             5
2GB             3
1GB             2
512MB           1
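
For reference, the run/stop loop above could be scripted roughly as follows. This is a minimal sketch, not the exact procedure behind the numbers in the table. It assumes the standard InfluxDB 2.x HTTP query endpoint (POST /api/v2/query); the URL, organization, token and the helper names (build_request, run_and_cancel, runs_until_failure) are placeholders made up for illustration. "Stopping" the query here only means that the client drops the connection once the deadline has passed.

```python
import http.client
import socket
import time
import urllib.error
import urllib.parse
import urllib.request

# Placeholders, not taken from the real setup: adjust before use.
INFLUX_URL = "http://localhost:8086"
ORG = "myorg"
TOKEN = "my-token"

# The test query from this post.
FLUX_QUERY = """
data_from_bucket = from(bucket: "mybucket")
  |> range(start: -30d, stop: now())
  |> filter(fn: (r) => r["_measurement"] == "mymeasurement")

data_all = data_from_bucket
  |> pivot(rowKey:["_time"], columnKey: ["_field"], valueColumn: "_value")

data_field = data_from_bucket
  |> filter(fn: (r) => r["_field"] == "myfield")

join(tables: {key1: data_all, key2: data_field}, on: ["_time"], method: "inner")
"""


def build_request(flux, url=INFLUX_URL, org=ORG, token=TOKEN):
    """Build the POST request that sends a Flux script to /api/v2/query."""
    endpoint = f"{url}/api/v2/query?{urllib.parse.urlencode({'org': org})}"
    return urllib.request.Request(
        endpoint,
        data=flux.encode("utf-8"),
        headers={
            "Authorization": f"Token {token}",
            "Content-Type": "application/vnd.flux",
            "Accept": "application/csv",
        },
        method="POST",
    )


def run_and_cancel(flux, cancel_after_s=10.0):
    """Run the query and give up after roughly cancel_after_s seconds.

    Returns "completed", "cancelled" (the client gave up) or "failed"
    (server unreachable, connection reset, HTTP error, ...).
    """
    deadline = time.monotonic() + cancel_after_s
    try:
        with urllib.request.urlopen(build_request(flux), timeout=cancel_after_s) as resp:
            # Read in chunks so the deadline is also checked when the server
            # streams partial results. The check is approximate: a single
            # blocked read can still wait up to the socket timeout.
            while resp.read(64 * 1024):
                if time.monotonic() >= deadline:
                    return "cancelled"
        return "completed"
    except (socket.timeout, TimeoutError):
        return "cancelled"
    except urllib.error.URLError as exc:
        timed_out = isinstance(exc.reason, (socket.timeout, TimeoutError))
        return "cancelled" if timed_out else "failed"
    except (OSError, http.client.HTTPException):
        return "failed"


def runs_until_failure(run_once, max_runs=20, pause_s=0.0):
    """Call run_once repeatedly; return the run number that failed, or None."""
    for run_number in range(1, max_runs + 1):
        if run_once() == "failed":
            return run_number
        time.sleep(pause_s)
    return None
```

Calling runs_until_failure(lambda: run_and_cancel(FLUX_QUERY)) would then give the "crash during run number" value, with the caveat that a "failed" result only says the server stopped answering, not why.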

I tried acting on the query controller configuration, imposing a concurrency quota of 1 and a query memory limit (query memory bytes) equal to half of the total memory, but apparently nothing changed. This was just a single test under this configuration; I didn’t go deeper because it seemed useless. I’m reporting it just to explicitly rule out this option.

Query free run:

mem_limit       crash after
2GB             1m 10s
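
To make that configuration concrete, here is a small sketch of how the two values could be derived and passed to the container. It assumes that the settings map to the influxd options query-concurrency and query-memory-bytes, set through their INFLUXD_ environment variables, and that Docker's "2GB" means 2 GiB. The helper name query_controller_env is made up for illustration.

```python
GIB = 1024 ** 3


def query_controller_env(mem_limit_bytes, concurrency=1):
    """Return influxd environment variables for the query controller.

    Assumption: per-query memory is capped at half of the container limit,
    as in the test described above.
    """
    return {
        "INFLUXD_QUERY_CONCURRENCY": str(concurrency),
        "INFLUXD_QUERY_MEMORY_BYTES": str(mem_limit_bytes // 2),
    }


# Values for the 2GB test, printed in the KEY=value form used by env files.
for name, value in query_controller_env(2 * GIB).items():
    print(f"{name}={value}")
# prints:
# INFLUXD_QUERY_CONCURRENCY=1
# INFLUXD_QUERY_MEMORY_BYTES=1073741824
```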

I know this is not a real-world situation, but I think the system should not behave this way: it should stop gracefully and/or return an error, not just crash.

Anyway, this is just a test I came up with in an attempt to tackle some crashes that I’m experiencing, in a way that is not easily reproducible, while navigating and refreshing some Grafana dashboards composed of 4-6 panels that rely on queries taking 30-90 seconds to load. I’m trying to understand the limits of the system to get a better view of where to aim my efforts to improve the overall stability of my installation.

Crashes are not the worst of the problems: sometimes the system just hangs, with no way to recover it other than a manual restart, and that’s very bad. InfluxDB becomes completely unresponsive; I can see from the resource monitor that it is still working on something heavy, but no crash occurs and I’m forced to restart it. Once I left it alone to see how it would behave in the long run, and after about an hour it came back to normal without crashing. Unfortunately, I have yet to find a way to reproduce this consistently.

My next thoughts on this are:

  • The hardware resources are limited, but I could make the effort to improve them if I were sure that it would solve the problem; from the tests, however, that doesn’t seem to be the case. Giving more RAM seems to make things smoother, but crashes can happen anyway
  • The queries in the Grafana dashboards are pretty heavy, but that’s something I can only trim a bit, not really change for good. I’m thinking I could run them in some periodic task and just retrieve the results in the dashboards, but would that really solve the stability issue? Those heavy queries would be run by a different actor, but if the underlying controller handles them in the same way, it wouldn’t change much in the end
  • Adding verbosity to the logs. I’d love to have some more information to work on, but the logs are pretty useless in this scenario: no error, warning or information about the crash/hang gets written. Is there a way to increase the verbosity? Could more log statements be added to tackle this issue?
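
While the InfluxDB logs stay silent, one extra source of information outside InfluxDB is the container state that Docker records. The sketch below reads it with `docker inspect` and extracts the fields that could show whether the process was killed for exceeding mem_limit. It assumes the container is named "influxdb" (a placeholder) and that it has not yet been restarted, since a restart overwrites the recorded state. The helper names are made up for illustration.

```python
import json
import subprocess


def summarize_state(state_json):
    """Pick the crash-related fields out of Docker's container State JSON."""
    state = json.loads(state_json)
    return {
        "status": state.get("Status"),
        "exit_code": state.get("ExitCode"),
        "oom_killed": state.get("OOMKilled"),
        "finished_at": state.get("FinishedAt"),
    }


def container_state(container="influxdb"):
    """Ask Docker for the state of the given container (name is a placeholder)."""
    result = subprocess.run(
        ["docker", "inspect", "--format", "{{json .State}}", container],
        check=True,
        capture_output=True,
        text=True,
    )
    return summarize_state(result.stdout)
```

An exit code of 137 together with oom_killed set to true would point to the kernel killing the process at the memory limit rather than InfluxDB failing on its own, which would also explain why nothing gets written to the logs.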

If anyone has advice on improving the stability, I’m very open to suggestions.