Fatal error: Out of Memory

Environment:

  • Raspberry Pi 4 with 2 GB RAM
  • InfluxDB v1.7.9
  • InfluxDB running in a Docker container

Issue:

The Docker container running InfluxDB had been working correctly for 8 months. I then ran a “stress test” query against InfluxDB through Grafana. Grafana displayed a network timeout error during the long query, and I noticed that InfluxDB was no longer working. Upon investigation, I discovered the following:

  1. The Docker container was restarting roughly every 10 seconds. (This was expected, since I had purposely configured the container to restart; the Docker commands sketched after this list can pause that loop during investigation.)
  2. The Docker log revealed an issue in the trace (every time at the same point): 2020-09-27T00:36:20.237969581Z ts=2020-09-27T00:36:20.237659Z lvl=info msg="Opened shard" log_id=0PVFfL~l000 service=store trace_id=0PVFfMQG000 op_name=tsdb_open index_version=inmem path=/var/lib/influxdb/data/mydb/autogen/210 duration=375.918ms
  3. 2020-09-27T00:36:20.476105067Z runtime: out of memory: cannot allocate 1852145664-byte block (28639232 in use)
  4. 2020-09-27T00:36:20.476173066Z fatal error: out of memory
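
(For anyone else hitting the same restart loop: a few standard Docker commands make this kind of investigation easier. A rough sketch, where the container name influxdb is only a placeholder:)

  # Temporarily stop the automatic restart loop while investigating
  docker update --restart=no influxdb
  # Show the most recent log lines, including the out-of-memory trace
  docker logs --tail 100 influxdb
  # Confirm the restart policy and any memory limit on the container
  docker inspect -f '{{.HostConfig.RestartPolicy.Name}} {{.HostConfig.Memory}}' influxdb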

It seems as if InfluxDB is either still trying to execute or complete the query on startup, or some persistent state left behind by the query has “stuck” the startup.

Question:

I am able to run the container without starting InfluxDB. Is there anything I can repair in the environment (e.g. remove or edit a file) so that InfluxDB will restart without hitting this memory error?

Update on this issue:

Through some digging I was able to determine that the “out of memory” error was due to the index being held in memory (“inmem” in the Docker log lines) and simply using up all available RAM. It turns out that “inmem” is the default value of the index-version parameter in the influxdb.conf file. After further reading, I found that one remedy is to set index-version = "tsi1" in influxdb.conf so that the index is kept on disk rather than in RAM.
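
For reference, the relevant setting lives in the [data] section of influxdb.conf; the paths shown are the stock package/container locations (matching the log line above) and may differ in other setups:

  [data]
    dir = "/var/lib/influxdb/data"
    wal-dir = "/var/lib/influxdb/wal"
    # Default is "inmem"; "tsi1" keeps the series index on disk instead of in RAM
    index-version = "tsi1"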

However, after configuring this parameter and running influxd again, the problem persisted (the Docker container continued to restart because it was still running out of memory). Upon further digging, I discovered that a re-index is required when switching from “inmem” to “tsi1”. Running influx_inspect buildtsi -datadir /var/lib/influxdb/data -waldir /var/lib/influxdb/wal repaired the issue and InfluxDB was able to run. This was good.
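
The rough sequence is sketched below. It assumes a container named influxdb, the stock data/WAL paths, and that influx_inspect can reach the data directory (on the host via a bind mount, or inside the container); as a later reply notes, the conversion should run as the user that owns the data files:

  # Stop InfluxDB before rebuilding the index
  docker stop influxdb
  # Rebuild the TSI index from the existing TSM/WAL files, running as the data owner
  sudo -u influxdb influx_inspect buildtsi -datadir /var/lib/influxdb/data -waldir /var/lib/influxdb/wal
  # Start InfluxDB again with index-version = "tsi1" set in influxdb.conf
  docker start influxdb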

Now queries run against the database…but my stress test still causes an “out of memory” condition, although it no longer causes an unrecoverable issue. (I am still digging into this “new” out-of-memory condition to see whether it can be resolved.) More to come on this thread.


@chrisgray13247 - Your investigation here is top notch. Thanks for sharing your steps and results. The community will benefit from your post!

Managing memory with InfluxDB can be tough. We don’t have a knob that sets a maximum memory limit; it can only be controlled indirectly. We’re actively working on improvements here, but they won’t be available for a bit.

For your case, I do recommend using the tsi1 file-based index instead of inmem - that will help with overall memory usage. TSM and TSI files are still mapped into memory, but the OS should only keep the actively used pages resident. The catch is a complicated query, which can demand more memory than the system has in order to produce the result.

A few options:

  1. Since you are stress testing, gradually reduce the complexity of your query to find the maximum stress/complexity your hardware and data set can support.
  2. Check your swap settings. If there are other memory users on the RPi, they could be swapped out to free up some memory.
  3. Reduce other software running on the RPi for the same reason.
  4. Double-check the container settings to ensure the full 2 GB is available to it (a quick check is sketched after this list).
  5. Reduce the data being stored. This is often a tough pill, but perhaps you don’t need some subset of the data and can delete it or move it off the RPi.
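
For item 4, a couple of standard Docker commands can confirm whether a memory limit is set on the container (the container name influxdb is a placeholder):

  # A memory limit of 0 means the container may use all host RAM
  docker inspect -f '{{.HostConfig.Memory}}' influxdb
  # One-shot view of the container's current memory usage and limit
  docker stats --no-stream influxdb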

Let us know what you figure out.


Philjb,

Thanks for reading my thread on this and for your feedback. Much appreciated. I do need to do some further investigation on the issue, as it has a significant impact on the type of implementation. (Time is my constraint at the moment, as I have limited bandwidth for the “deep dives” required.)

Perhaps you have alluded to this in some of your comments, but is there a way for InfluxDB to diversify its memory needs by spanning both RAM and storage, so that these “complex” queries can take whatever temporary memory they need to complete? Also, is there a “graceful” way the service could be configured so that it does not stop when a “rogue” query is run? In addition (and this may be too much to ask), is there a way for the service to return some sort of error response indicating that a query was too expensive to run given the resources available to the service?

I should give you some better context around the “stress test” I am referring to.

Presentation Layer: Grafana
Time Series Ingestion Rate: Data is being stored in InfluxDB at a one-second sample rate
Queries: A total of eight (8) different tags are selected over a 30-day period using a query like this (an interpolated example appears after this list): SELECT max("value") FROM "CPU" WHERE ("tagname" = 'temperature') AND $timeFilter GROUP BY time($__interval) fill(null) Note: Using $__interval essentially controls the GROUP BY clause so that the returned values represent no more than the number of pixels the trend tool (Grafana) can display. $timeFilter is the time range from the Grafana trend tool, which in my case is the 30-day period.
Stress Test Break Point: At the 30-day query level, the InfluxDB service stops. Prior to that, InfluxDB is able to “grind” through the queries and return results.
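
To make the load concrete, with a 30-day range and a typical panel width the Grafana macros above would interpolate to something like the query below. The 30m interval is purely illustrative (an assumption, not the value my dashboard actually chose):

  SELECT max("value") FROM "CPU"
  WHERE ("tagname" = 'temperature') AND time >= now() - 30d
  GROUP BY time(30m) fill(null)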

Any further thoughts on how to better configure the service would be greatly appreciated.

@chrisgray13247 -

You can set or increase the swap file in your Pi OS to increase the perceived amount of memory. InfluxDB will not know the difference but swapfiles will be slower.
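
On Raspberry Pi OS the swap file is typically managed by dphys-swapfile; a minimal sketch, assuming that package is in use (the 2048 MB size is only an example):

  # Edit /etc/dphys-swapfile and set, for example: CONF_SWAPSIZE=2048
  sudo dphys-swapfile swapoff
  sudo dphys-swapfile setup
  sudo dphys-swapfile swapon
  # Verify the new swap size
  free -h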

I addressed some of your other questions around degradation over here in this post: Out-of-memory on backup + queries

Because you are on a Pi, adjusting the TSM cache size downward will free up some memory for other uses. Its default (~1 GB) is high for a Pi.
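
That cache limit is also configured in the [data] section of influxdb.conf; the 200m figure below is only an illustrative value for a 2 GB Pi, not a recommendation from this thread:

  [data]
    # Maximum size a shard's in-memory cache can reach before it rejects writes
    # (defaults to 1g in InfluxDB 1.x)
    cache-max-memory-size = "200m"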

Do you only need query periods of up to 30 days, or do you need longer periods as well?


I have the same issue. The parameter is already set to index-version = "tsi1" in influxdb.conf, but sometimes heavy bursts of queries still break it and produce an OOM with the following messages:

Feb 20 08:57:06 host kernel: 23592830 pages RAM
Feb 20 08:57:06 host kernel: 0 pages HighMem/MovableOnly
Feb 20 08:57:06 host kernel: 432550 pages reserved
Feb 20 08:57:06 host kernel: [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
Feb 20 08:57:06 host kernel: [10298] 539 10298 99532637 22514834 96817 1487325 0 influxd
Feb 20 08:57:06 host kernel: Out of memory: Kill process 10298 (influxd) score 944 or sacrifice child

I think I should disable ‘transparent_hugepage’ and enable ‘hugepages’ with a static number of pages, and also change the kernel ‘sem’ values; is that a good idea? Can you advise us on this problem, please?

THANK YOU. This solved my high CPU issue. For a week or two InfluxDB had been using 30-40% CPU continuously. I stopped writing to the _internal database thinking that might help, but it didn’t change anything. After looking at the journalctl logs for influxdb.service (sudo journalctl -u influxdb.service --since today) I saw the out-of-memory error, which led me here. I’m not sure what kind of performance hit I’ll take by moving the index to disk instead of RAM, but at least the Raspberry Pi isn’t a space heater any more.

I’m kind of surprised, since my database isn’t collecting that much data (maybe 15 devices sending JSON sensor data, plus collectd sending system info). I would love to find a good resource for tracking down the query (or whatever it was) that caused the issue, but compared with a SQL database it’s much harder to see what’s flowing into InfluxDB from all its sources.

Of note, it’s important (and InfluxDB warns you of this) to run the influx_inspect command as the user that will be accessing the data (the influxdb user, I believe), and not as root via plain sudo. I couldn’t access my data until I fixed those permissions.
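
If the conversion has already been run as root and the index files ended up owned by root, a sketch of the repair, assuming the package-default influxdb user and data location:

  # Give ownership of the data directory (including the new index files) back to the influxdb user
  sudo chown -R influxdb:influxdb /var/lib/influxdb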