Queries that time out end up crashing InfluxDB 2.7, which then restarts into setup mode


After some work I ended up with the correct data in InfluxDB.

The query is the following:

import "strings"

from(bucket: "connections")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => not strings.containsStr(v: r["interface_name"], substr: "Management"))
  |> filter(fn: (r) => not strings.containsStr(v: r["interface_name"], substr: "Loopback"))
  |> filter(fn: (r) => r["_field"] == "/interfaces/interface/state/oper-status")
  |> filter(fn: (r) => r["_value"] == "UP")
  |> window(every: v.windowPeriod)
  |> truncateTimeColumn(unit: v.windowPeriod)
  |> group(columns: ["_time", "_measurement"])
  |> count(column: "interface_name")
  |> duplicate(column: "interface_name", as: "_value")
  |> drop(columns: ["_start", "_stop", "interface_name"])
  |> group(columns: ["_measurement"])
  |> rename(columns: {"_measurement": "switch_hostname"})
  |> sort(columns: ["_time"], desc: false)

This gives a correct graph in both the Influx web UI and Grafana. However, I have data points every 10 seconds for 3 switches (which results in 3 lines on a single graph), and as soon as I push the query range beyond 15 minutes, to 1 h or more, it simply times out in Grafana. The Influx web UI shows the query running for >60 seconds before timing out.

This is running on an AWS EC2 t3.large instance, so 2 vCPUs and 8 GB of memory. I can see the CPU spike to ~35%, but my guess is that the memory is simply being exhausted. I will try to confirm this.
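To confirm that, a rough sketch for sampling the daemon's resident memory while the query runs (this assumes the process is named influxd; adjust as needed):

```shell
# Sample influxd's resident set size (VmRSS) once per second while it is alive.
# If influxd is not running, the loop simply never starts.
pid=$(pidof influxd || true)
while [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; do
  grep VmRSS "/proc/$pid/status"
  sleep 1
done
```

Watching VmRSS climb toward the instance's 8 GB while the query runs would confirm the OOM theory.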

However, after running this query two or three times I suddenly start getting "unauthorized" errors on my queries, and the logs show this:

Apr 03 08:21:48 - systemd[1]: influxdb.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Apr 03 08:21:48 - systemd[1]: Unit influxdb.service entered failed state.
Apr 03 08:21:48 - systemd[1]: influxdb.service failed.
Apr 03 08:21:48 - systemd[1]: influxdb.service holdoff time over, scheduling restart.
Apr 03 08:21:48 - systemd[1]: Stopped InfluxDB is an open-source, distributed, time series database.
Apr 03 08:21:48 - systemd[1]: Starting InfluxDB is an open-source, distributed, time series database...

Alright, so instead of handling the load gracefully, the server is crashing outright. What’s more, after the crash the server comes back up in setup mode, having forgotten all the users and all the tokens, which basically leaves a useless server.

This cannot be the intended way to handle an out-of-memory condition caused by an unoptimized query.
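For what it's worth, the influxd configuration reference does list query-controller settings that should make an oversized query fail on its own instead of taking the daemon down. A sketch with placeholder values I have not tuned (e.g. in /etc/influxdb/config.toml on a packaged install):

```toml
# Placeholder values, untested; see the influxd configuration reference.
query-concurrency = 10                # queries allowed to execute concurrently
query-queue-size = 10                 # queries allowed to wait for execution
query-memory-bytes = 104857600        # memory budget for a single query (100 MiB)
query-max-memory-bytes = 1073741824   # cap across all running queries; exceeding
                                      # it should error the query, not the process
```

Whether these defaults are sane for a t3.large is an open question, but a hard cap below the instance's 8 GB seems like the obvious mitigation.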

So from this one post I have three issues:

  • How does a simple query like this cause such a huge delay?
  • Why doesn't the server gracefully handle out-of-memory queries?
  • Why does the server go into setup mode after crashing on queries like the above?

I’m the only one using this server and the only one running these queries, so I’m 100% sure there isn’t anybody else using it.

The InfluxDB version is 2.7.5.

@Matti Thanks for the details. I think this is important enough to report. Could you file an issue on the InfluxDB GitHub repo?

Hey, sure thing. I have confirmed that the culprit is indeed out of memory:

Apr 3 08:21:42 - influxd-systemd-start.sh: fatal error: runtime: out of memory

I have also confirmed that the .bolt file is new; checking the file with xfs_io & statx shows:

stat.btime = Wed Apr 3 08:21:49 2024

So this shows that the old user data is indeed reset; however, looking through the data files, some of them still have creation dates from when the server was first brought online.
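For anyone reproducing the check without xfs_io: GNU stat can print the birth time directly. Sketched here on a throwaway file; on the server, point it at the bolt file instead (/var/lib/influxdb/influxd.bolt by default on packaged installs):

```shell
# GNU stat's %w format prints the file's birth (creation) time,
# or "-" if the filesystem does not record it.
f=$(mktemp)
stat -c 'btime: %w' "$f"
rm -f "$f"
```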

I assume the bug to report is the recovery after the OOM error: the server restarted into setup mode and has seemingly forgotten the majority of its metadata?

As a separate question: the documentation on optimizing Flux queries states that the

|> group() |> count()

sequence of operations is not supported (presumably because it cannot be pushed down to the storage engine), and I suspect this is the part that is eating up all the memory.

Is there a way to handle this gracefully? Or, alternatively, could I run this query every 30 seconds or every minute, with the start and stop times covering just that last interval, and push the output into a new bucket? That would keep memory utilization to a minimum while still letting me query a longer time range quickly, since at that point I would only be querying the already prepared data.
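To sketch the downsampling idea as an InfluxDB task (the task name, schedule, and destination bucket are made up, and the destination bucket would need to be created first):

```flux
// Hypothetical downsampling task: every minute, count the interfaces that are
// "UP" per switch over the last minute and write the result to a
// pre-aggregated bucket, so dashboards query the small bucket instead.
option task = {name: "up_interface_count", every: 1m}

from(bucket: "connections")
  |> range(start: -task.every)
  |> filter(fn: (r) => r["_field"] == "/interfaces/interface/state/oper-status")
  |> filter(fn: (r) => r["_value"] == "UP")
  |> group(columns: ["_measurement"])
  |> aggregateWindow(every: task.every, fn: count, createEmpty: false)
  |> set(key: "_field", value: "up_interface_count")
  |> to(bucket: "connections_downsampled")
```

Because each run only touches one minute of data, the expensive group/count work stays bounded regardless of how far back the dashboards query.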