Limit RAM consumption of InfluxDB (in a Docker container on an ARM device)

We are using InfluxDB on one of our embedded devices for monitoring
and statistical data. Since the embedded device also hosts other applications
with critical resource demands (in particular RAM usage), we need to limit InfluxDB’s
RAM consumption. However, InfluxDB uses more RAM than we allow for, and we haven’t found
the right setting yet.

I’m going to go into quite some detail here to describe what the problem is, what
our setup is and what steps we’ve tried so far. This is a critical, high-priority
issue for us, so help is highly appreciated.

The problem

We run InfluxDB on an embedded ARM device in a Docker container. To make the database
behave well with other applications on the device, we run InfluxDB in a container with 512 MiB of RAM. Most of the time, InfluxDB works satisfactorily (both in terms of response times and RAM usage).

However, there are some cases where RAM usage exceeds the amount of RAM given to
the container, so that the container gets OOM killed. This is very bad for us, since
we want to use InfluxDB for critical data and need it to have the highest possible
uptime.

The scenarios where the RAM consumption spikes are not yet completely clear to us.
One hypothesis is that InfluxDB uses a lot of RAM during compaction. But we’ve managed
to provoke these OOM situations and observed cases where compaction does not
seem to be the cause of the RAM usage spikes. Also, the RAM spikes occur infrequently
(sometimes the system runs stably for more than a week, sometimes only for a day).

Currently, we are stress testing InfluxDB and writing a lot of data to it (though with
low series cardinality). In production, we’ll probably have less data, but we need
to get the memory problem under control before then.

Our setup

  • InfluxDB OSS v2.0.4 (based on arm64v8/influxdb:2.0.4-alpine)
  • ARM device with 4 GB of RAM in total
  • Linux Kernel v4.19.67
  • Buckets and series cardinality:
    • 4 relevant buckets
    • Series cardinality between 121 and 360
  • Storage size on disk around 3.5 GB
  • Relevant docker parameters:
    • --ulimit nofile=65536:65536
    • --memory 536870912 (512 MiB)
  • Relevant environment vars:
    • INFLUXD_REPORTING_DISABLED="true"
    • INFLUXD_STORAGE_CACHE_MAX_MEMORY_SIZE=419430400 (400 MiB)
    • INFLUXD_STORAGE_CACHE_SNAPSHOT_WRITE_COLD_DURATION="120s"
    • INFLUXD_STORAGE_COMPACT_FULL_WRITE_COLD_DURATION="1h0m0s"
    • INFLUXD_STORAGE_COMPACT_THROUGHPUT_BURST=8388608 (8 MiB)
    • INFLUXD_STORAGE_MAX_CONCURRENT_COMPACTIONS=2
    • INFLUXD_STORAGE_SERIES_FILE_MAX_CONCURRENT_SNAPSHOT_COMPACTIONS=2
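
For reference, here is a quick back-of-the-envelope check of the two memory numbers above, as a minimal Python sketch. It treats the cache limit as one single budget inside the container, which is a simplification on our side; the helper name `headroom_bytes` is just made up for this illustration.

```python
MIB = 1024 * 1024


def headroom_bytes(container_limit: int, cache_max: int) -> int:
    """Bytes left in the container if the storage cache grew to its configured maximum."""
    return container_limit - cache_max


# Values copied from the setup list above.
container_limit = 536870912  # docker --memory
cache_max = 419430400        # INFLUXD_STORAGE_CACHE_MAX_MEMORY_SIZE

left = headroom_bytes(container_limit, cache_max)
print(f"headroom next to a full cache: {left / MIB:.0f} MiB")
```

So if the cache ever filled up completely, only 112 MiB would remain for everything else in the process (queries, compactions, Go runtime overhead).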

What we’ve tried so far

We’ve identified several other configuration options (see this list) that might need tweaking.
The list isn’t short, however, and after each change we need to monitor system
stability for some days to see the effect. So there is a lot of testing ahead of us.

What we haven’t done yet is check whether setting the GOGC environment variable helps.
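
To illustrate why GOGC might matter: the Go runtime starts a collection once the heap has grown by GOGC percent over the live heap left after the previous collection (the default is 100). The following Python sketch only models that rule. The 200 MiB live heap is a hypothetical number, not a measurement from our system, and the model ignores stacks and other non-heap memory, so whether this actually helps InfluxDB is still untested.

```python
MIB = 1024 * 1024


def heap_target_bytes(live_heap: int, gogc: int = 100) -> int:
    """Approximate heap size at which the next Go GC cycle is triggered.

    Simplified model: target = live heap * (1 + GOGC / 100).
    """
    return live_heap * (100 + gogc) // 100


live = 200 * MIB  # hypothetical live heap, for illustration only

for gogc in (100, 50, 25):
    target = heap_target_bytes(live, gogc)
    print(f"GOGC={gogc:>3}: next GC at about {target // MIB} MiB")
```

In this model a lower GOGC trades more GC CPU time for a smaller peak heap, which is the direction we would want inside a 512 MiB container.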

Some observations

We monitor container statistics (RAM usage and OOM score among them) every 10 s.
We observe that there are situations of high OOM score and high relative memory
consumption (relative to the container limit) that apparently don’t lead to
the container being killed.

We conclude that the spikes happen on a shorter timescale than our sampling, i.e. we
don’t capture them with our 10 s monitoring interval.
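
One way to catch such short spikes without sampling faster might be to read the peak counters the kernel keeps for the container's memory cgroup. The sketch below assumes cgroup v1 with the memory controller visible at `/sys/fs/cgroup/memory` inside the container; we have not verified that this is the layout on our device, and on cgroup v2 the file names differ. The function names are made up for this example.

```python
from pathlib import Path

# Assumption: cgroup v1 memory controller as seen from inside the container.
CGROUP_V1_MEMORY_DIR = Path("/sys/fs/cgroup/memory")


def read_counter(cgroup_dir: Path, name: str) -> int:
    """Read a single integer counter file from the memory cgroup directory."""
    return int((cgroup_dir / name).read_text().strip())


def peak_report(cgroup_dir: Path = CGROUP_V1_MEMORY_DIR) -> dict:
    """Collect the limit, the recorded peak usage, and the limit-hit counter."""
    limit = read_counter(cgroup_dir, "memory.limit_in_bytes")
    peak = read_counter(cgroup_dir, "memory.max_usage_in_bytes")
    failcnt = read_counter(cgroup_dir, "memory.failcnt")
    return {
        "limit_bytes": limit,
        "peak_bytes": peak,
        "peak_ratio": peak / limit,
        "failcnt": failcnt,
        "hit_limit": failcnt > 0,  # usage has run into the limit at least once
    }


if __name__ == "__main__":
    # Only run against the real cgroup files if they exist in this layout.
    if (CGROUP_V1_MEMORY_DIR / "memory.max_usage_in_bytes").exists():
        print(peak_report())
```

Since `memory.max_usage_in_bytes` records the highest usage seen so far, polling it every 10 s would still reveal a spike that happened between two samples.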

Can you help us?

As you can see, testing all the different configurations will take a considerable amount
of time. So can you help us limit InfluxDB’s RAM consumption?