Fatal error: Out of Memory

Environment:

  • Raspberry Pi 4 with 2 GB RAM
  • InfluxDB ver 1.7.9
  • Docker container running InfluxDB

Issue:

The Docker container running InfluxDB had been running correctly for 8 months. I then ran a “stress test” query against InfluxDB through Grafana. During the long-running query, Grafana displayed a Network Timeout error, and I noticed that InfluxDB was no longer working. Upon investigation, I discovered the following:

  1. The Docker container was restarting roughly every 10 seconds (this was because I had deliberately configured the container to restart).
  2. The Docker log showed the failure at the same point in the trace on every start. The last message before the error: 2020-09-27T00:36:20.237969581Z ts=2020-09-27T00:36:20.237659Z lvl=info msg="Opened shard" log_id=0PVFfL~l000 service=store trace_id=0PVFfMQG000 op_name=tsdb_open index_version=inmem path=/var/lib/influxdb/data/mydb/autogen/210 duration=375.918ms
  3. 2020-09-27T00:36:20.476105067Z runtime: out of memory: cannot allocate 1852145664-byte block (28639232 in use)
  4. 2020-09-27T00:36:20.476173066Z fatal error: out of memory
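
For scale, the numbers in the runtime message can be converted to more readable units. A quick sketch in Python; the byte counts are copied from the log lines above, and treating the board's nominal 2 GB as 2 GiB is an assumption made only for the comparison:

```python
# Byte counts copied from the Go runtime message in the Docker log.
requested_bytes = 1_852_145_664   # the single block that could not be allocated
in_use_bytes = 28_639_232         # memory the runtime reported as in use

# Assumption: the Pi's "2 GB" is treated as 2 GiB. The OS and other
# processes need part of this, so less is available in practice.
total_ram_bytes = 2 * 1024**3

GIB = 1024**3
MIB = 1024**2

requested_gib = requested_bytes / GIB
in_use_mib = in_use_bytes / MIB
share_of_ram = requested_bytes / total_ram_bytes

print(f"requested: {requested_gib:.2f} GiB")
print(f"in use:    {in_use_mib:.1f} MiB")
print(f"one allocation = {share_of_ram:.0%} of nominal RAM")
```

In other words, a single allocation of about 1.72 GiB was requested while only about 27 MiB was in use, which is consistent with the process failing at the same point on every restart.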

It seems as if InfluxDB is either still trying to execute or complete the query, or the query has left behind some persistent problem that has “stuck” the startup.

Question:

I am able to run the container without running InfluxDB. Is there anything I can repair in the environment so that InfluxDB will start without hitting this memory error (e.g. remove or edit a file)?

Update on this issue:

Through some digging, I worked out that the “out of memory” error was caused by the index being held in memory (“inmem”, as shown in the Docker log) and simply using up all available RAM. It turns out that index-version = "inmem" is the default setting in the influxdb.conf file. After further reading, I found that one remedy is to set index-version = "tsi1" in influxdb.conf so that the index uses disk rather than RAM.

However, after setting this parameter and running influxd again, the problem persisted (the Docker container kept restarting because it was still running out of memory). Upon further digging, I discovered that the index must be rebuilt when switching from “inmem” to “tsi1”. I ran influx_inspect buildtsi -datadir /var/lib/influxdb/data -waldir /var/lib/influxdb/wal, which repaired the issue, and InfluxDB was able to run again.
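
For anyone scripting the config change, here is a minimal sketch in Python of the edit described above. It assumes index-version lives in the [data] section of influxdb.conf; the helper name set_data_option and the sample config text are hypothetical, and this is a plain text edit rather than a full TOML parser:

```python
import re

def set_data_option(conf_text: str, key: str, value: str) -> str:
    """Return conf_text with `key = "value"` set inside the [data] section.

    Handles the key being present, commented out, or missing.
    """
    lines = conf_text.splitlines()
    new_line = f'  {key} = "{value}"'
    # Matches `key = ...` and the commented-out form `# key = ...`.
    key_re = re.compile(rf'^\s*#?\s*{re.escape(key)}\s*=')
    in_data = False
    data_header_idx = None
    for i, line in enumerate(lines):
        stripped = line.strip()
        if stripped.startswith("[") and stripped.endswith("]"):
            in_data = stripped == "[data]"
            if in_data:
                data_header_idx = i
            continue
        if in_data and key_re.match(line):
            lines[i] = new_line
            return "\n".join(lines) + "\n"
    if data_header_idx is None:
        # No [data] section at all: append one.
        lines += ["[data]", new_line]
    else:
        # Section exists but the key does not: add it after the header.
        lines.insert(data_header_idx + 1, new_line)
    return "\n".join(lines) + "\n"

# Hypothetical sample, shaped like a stock influxdb.conf excerpt.
sample = (
    '[meta]\n'
    '  dir = "/var/lib/influxdb/meta"\n'
    '\n'
    '[data]\n'
    '  dir = "/var/lib/influxdb/data"\n'
    '  # index-version = "inmem"\n'
)
updated = set_data_option(sample, "index-version", "tsi1")
print(updated)
```

The config change on its own only selects the index type; as noted above, the existing shards still needed the influx_inspect buildtsi step before InfluxDB would start.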

Queries now run against the database, but my stress test still causes an “out of memory” condition, although it no longer leaves InfluxDB in an unrecoverable state. (I am still digging into this “new” out of memory condition to see whether it can be resolved.) More to come on this thread.

@chrisgray13247 - Your investigation here is top notch. Thanks for sharing your steps and results. The community will benefit from your post!

Managing memory with InfluxDB can be tough. We don’t have a knob that sets a maximum memory limit; we can only do it indirectly. We’re actively working on improvements here, but they won’t be available for a bit.

For your case, I do recommend using the tsi1 file-based index instead of inmem; that will help with overall memory usage. TSM and TSI files are still mapped into memory, but the OS should only keep in memory what is actively in use. The catch is that a complicated query can still demand more memory to produce its result than the system has.

A couple of options:

  1. Since you are stress testing, you can gradually reduce the complexity of your query to see the maximum stress/complexity your hardware and data set can support.
  2. Check your swap settings. If there are other memory users on the RPi, they could be swapped out to free up some memory.
  3. Reduce other software running on the RPi for the same reason.
  4. Double check the container settings to ensure the full 2 GB is available to it.
  5. Reduce your data being stored. This is often a tough pill but perhaps you don’t need some subset of the data and can delete it or move it off the RPi.
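
For the swap and container checks, a rough sketch in Python that can be run inside the container. It reads the standard Linux /proc/meminfo format and the usual cgroup limit files; the sample values below are hypothetical and only illustrate the format:

```python
from pathlib import Path

def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field: value in bytes}."""
    result = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if not parts:
            continue
        value = int(parts[0])
        # Most fields are reported in kB (1024 bytes); a few are plain counts.
        if len(parts) > 1 and parts[1] == "kB":
            value *= 1024
        result[name.strip()] = value
    return result

def read_cgroup_limit():
    """Return the cgroup memory limit in bytes, or None if unset or unreadable.

    Note: on cgroup v1 an unlimited container reports a very large number
    instead of "max", so compare the result against MemTotal.
    """
    candidates = [
        Path("/sys/fs/cgroup/memory.max"),                    # cgroup v2
        Path("/sys/fs/cgroup/memory/memory.limit_in_bytes"),  # cgroup v1
    ]
    for path in candidates:
        try:
            raw = path.read_text().strip()
        except OSError:
            continue
        if raw == "max":
            return None
        try:
            return int(raw)
        except ValueError:
            continue
    return None

# Hypothetical sample text; on the Pi use Path("/proc/meminfo").read_text().
sample = "MemTotal:        1917292 kB\nSwapTotal:        102396 kB\nHugePages_Total:       0\n"
info = parse_meminfo(sample)
limit = read_cgroup_limit()

print(f"MemTotal:  {info['MemTotal'] / 1024**2:.0f} MiB")
print(f"SwapTotal: {info['SwapTotal'] / 1024**2:.0f} MiB")
print(f"cgroup memory limit: {limit if limit is not None else 'none found'}")
```

If the reported limit is well below MemTotal, the container is not getting the full memory of the board.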

Let us know what you figure out.

Philjb,

Thanks for reading my thread and for your feedback. Much appreciated. I do need to investigate the issue further, as it has a significant impact on the type of implementation. (Time is my constraint at the moment, as I have limited bandwidth for the “deep dives” required.)

Perhaps you have alluded to this in some of your comments, but is there a way for InfluxDB to spread its memory needs across both RAM and storage, so that these “complex” queries can take whatever temporary memory they need to complete? Also, is there a “graceful” way to configure the service so that it does not stop when a “rogue” query is run? In addition (and this may be too much to ask), is there a way for the service to return some sort of error response indicating that a query was too expensive to run with the resources available to the service?

I should give you some better context around the “stress test” I am referring to.

Presentation Layer: Grafana
Time series ingestion rate: Data is stored in InfluxDB at a one-second sample rate.
Queries: A total of eight (8) different tags are selected over a 30-day period, each using a query like this: SELECT max("value") FROM "CPU" WHERE ("tagname" = 'temperature') AND $timeFilter GROUP BY time($__interval) fill(null) Note: Using $__interval essentially controls the GROUP BY clause so that the returned values number no more than the pixels that can be displayed in the trend tool (Grafana). $timeFilter is the time range from the Grafana trend tool, which in my case is the 30-day period.
Stress test break point: At the 30-day query test level, the InfluxDB service stops. Prior to this, InfluxDB is able to “grind” through the queries and return results.
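
To put rough numbers on this test, here is a back-of-the-envelope sketch in Python. The sample rate, time range, and query count come from the description above; the panel width of 1000 pixels and the idea that each query reads a single series are assumptions for illustration, and Grafana may round the interval it actually uses:

```python
SAMPLE_PERIOD_S = 1   # from the post: one sample per second
RANGE_DAYS = 30       # from the post: 30-day time range
NUM_QUERIES = 8       # from the post: eight tags queried

# Assumption: the panel is about 1000 pixels wide and Grafana aims for
# roughly one returned point per pixel. The real $__interval depends on
# the panel and data source settings.
PANEL_WIDTH_PX = 1000

range_seconds = RANGE_DAYS * 24 * 60 * 60
# Assumption: each query reads one series, so raw points = seconds in range.
points_per_query = range_seconds // SAMPLE_PERIOD_S
points_total = points_per_query * NUM_QUERIES
approx_interval_s = range_seconds / PANEL_WIDTH_PX
points_per_bucket = approx_interval_s / SAMPLE_PERIOD_S

print(f"raw points per query:  {points_per_query:,}")
print(f"raw points, 8 queries: {points_total:,}")
print(f"approx. interval:      {approx_interval_s:.0f} s")
print(f"raw points per bucket: {points_per_bucket:.0f}")
```

Under these assumptions each query reads about 2.6 million raw points to return about 1000 values, so the cost is driven by the time range scanned rather than by the number of points returned.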

Any further thoughts on how to better configure the service would be greatly appreciated.

@chrisgray13247 -

You can set up or enlarge the swap file in your Pi OS to increase the perceived amount of memory. InfluxDB will not know the difference, but swap will be slower than RAM.

I addressed some of your other questions around degradation over here in this post: Out-of-memory on backup + queries

Because you are on a Pi, reducing the TSM cache size will free up some memory for other uses. Its default of about 1 GB is high for a Pi.
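
Since InfluxDB here runs in Docker, one way to apply such settings is through environment variables on the container. The sketch below, in Python, assumes the InfluxDB 1.x convention of INFLUXDB_<SECTION>_<KEY> overrides and assumes the TSM cache size corresponds to cache-max-memory-size in the [data] section; the 256 MiB value is purely hypothetical, so check the configuration docs for your version before relying on it:

```python
def env_override(section: str, key: str) -> str:
    """Build the environment variable name for an influxdb.conf setting.

    Assumes the InfluxDB 1.x convention INFLUXDB_<SECTION>_<KEY>,
    upper-cased, with dashes replaced by underscores.
    """
    return "INFLUXDB_" + f"{section}_{key}".upper().replace("-", "_")

# Hypothetical value: 256 MiB written as a plain byte count. Choose a size
# that suits your own write load; this number is only an illustration.
cache_bytes = 256 * 1024**2

settings = {
    ("data", "cache-max-memory-size"): str(cache_bytes),
    ("data", "index-version"): "tsi1",
}

# Build the `-e NAME=value` arguments to add to a `docker run` command.
docker_args = []
for (section, key), value in settings.items():
    docker_args += ["-e", f"{env_override(section, key)}={value}"]

print(" ".join(docker_args))
```

The same two settings can instead be written directly into the [data] section of influxdb.conf if you mount your own config file into the container.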

Do you only need query periods of up to 30 days, or do you need longer periods?
