Addressing the growing RAM usage issue, aka unexpected "out of memory"

For years I’ve been battling the dreaded “out of memory” crashes in Influx, and following the relevant reports and advice. Despite some initial attention, it seems many of those reports have now run out of solutions. An abridged list appears below.

My particular issue is very low load (cardinality <1000, data insertion rate < 1 point per minute, queries < 1 per minute) and ever-increasing RAM usage. This is across a dozen instances on different servers with different databases and different use cases. 512MB used to be enough; now 1GB is not enough. We experimented with v1.5 and v1.6, and with the in-memory index vs TSI, but quickly rolled back to v1.4.2 because it was just not workable.

So are there any prospects for managing RAM on constrained hardware with low load use cases? We tend to spin up VPS instances to run demos for clients or other experiments, so have many instances each with very simple requirements. But the VPS has to be sized entirely to accommodate these odd RAM explosions.

Is there any work on the horizon to tune or set parameters to limit RAM usage, even at the expense of performance?
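For reference, the 1.x config does expose a few knobs that cap individual memory consumers (the write cache, series growth, query fan-out), though none of them bound the process as a whole. A sketch of the relevant influxdb.conf settings follows; the key names are from the 1.x configuration reference, but the values are illustrative, not recommendations:

```toml
# Illustrative influxdb.conf excerpts for a 1.x instance.

[data]
  # Cap the in-memory write cache and snapshot it to disk sooner (bytes)
  cache-max-memory-size = 536870912        # ~512 MB instead of the default
  cache-snapshot-memory-size = 26214400    # ~25 MB

  # Reject runaway series growth before it becomes a RAM problem
  max-series-per-database = 1000000
  max-values-per-tag = 100000

[coordinator]
  # Bound what individual queries may do
  max-concurrent-queries = 2
  max-select-point = 10000000
  max-select-series = 100000
```

Whether these keep a low-load instance inside a fixed budget is exactly the open question, since the index and mmap usage sit outside these limits.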

Heath, thanks for raising this. I’m in the same boat. I’ve been using influxdb for about 2 years and this has been a continual problem. I have some instances using ~100GB of memory and I’d appreciate any insight into how to reduce this or even how to figure out where it’s all going.

We also struggle with this. So far our SOP is to watch memory usage creep up over a month or two, and then manually restart the InfluxDB process during our maintenance window, which resets it down to a reasonable consumption.
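The check itself is trivial; a sketch of the kind of watchdog one could cron instead of waiting for a maintenance window (the threshold, service name, and restart command are examples, not our actual setup):

```shell
#!/bin/sh
# Sketch of a memory watchdog for influxd. The threshold and the restart
# command are illustrative; adjust both for your own deployment.

LIMIT_KB=786432   # 768 MB resident set, an example ceiling

rss_kb() {
  # Total RSS (kB) of all influxd processes; prints 0 if none is running
  ps -C influxd -o rss= 2>/dev/null | awk '{s += $1} END {print s + 0}'
}

should_restart() {
  # usage: should_restart <rss_kb> <limit_kb>  -> prints "yes" or "no"
  if [ "$1" -gt "$2" ]; then echo yes; else echo no; fi
}

if [ "$(should_restart "$(rss_kb)" "$LIMIT_KB")" = "yes" ]; then
  echo "influxd RSS above ${LIMIT_KB} kB; restarting"
  # systemctl restart influxdb   # uncomment for a real deployment
fi
```

It trades a brief outage for a bounded footprint, which is the same bargain the manual restart makes.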

Hi Heath, Sloach, Lukecyca,

Can you share

  • the memory usage (e.g. “free -g”) on the server(s) and by the influxd process?
  • the retention policy info across your databases
I am curious as I only started using TICK two weeks ago.
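For example, the retention info and a rough cardinality check can be pulled per database with something like this (the database name is a placeholder):

```sql
-- "mydb" is a placeholder database name
SHOW RETENTION POLICIES ON "mydb"

-- Rough series-cardinality check (available in recent 1.x releases)
SHOW SERIES CARDINALITY ON "mydb"
```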

Hi Jayesh. I don’t think it’s going to help much. If I take just one instance the answers are 909MB used total, 811MB by influx (as reported by ps) with all databases using autogen. Ask me tomorrow and the numbers will be different again, but the theme is the same - almost all the memory on any of my Influx servers is used by Influx until it runs out and crashes.

Here’s what I see on my server. And at least in my case, 32 GB of memory is used for filesystem caching, which I believe is mostly for memory-mapped files.

Many years ago (2011-2013), I had a similar situation with MongoDB wherein as I added more data, MongoDB’s memory footprint would grow and the only way out was to restart MongoDB as lukecyca described.

Wondering if it is the same situation for you too.

> ps aux | egrep 'CPU|influxd'
influxdb  3429 20.2 37.9 103602708 18717376 ?  Ssl  Aug13 4407:06 /usr/bin/influxd -config /etc/influxdb/influxdb.conf

> free -g
              total        used        free      shared  buff/cache   available
Mem:             46          12           1           0          32          33
Swap:            23           2          21

> influx -version
InfluxDB shell version: 1.4.2

> cat /etc/redhat-release
CentOS Linux release 7.3.1611 (Core)

I was able to simulate your situation (and it seems obvious once you think about it).
So I have a virtual machine with 8 cores and 46 GB RAM.
I am pushing about 816+ million data points across many measurements (OpenTSDB format/mode) into the system with 7 day retention.
When I queried the system (from Chronograf) for 7 days of data across 4 of the largest measurements, the InfluxDB memory footprint went from 10GB to 40GB (of 46 GB on the server).

Not only that, it “messed up” things so much that my dashboard stopped updating for a while, and the load remained high even after I changed the “past” time to 6 hours.

So essentially this says two things:

  • InfluxDB will try hard to respond to queries (and hence can scale with your hardware)
  • if you push it beyond its limits, there can be unpredictable problems :slight_smile:
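One partial mitigation, if the data itself can’t shrink, is to keep each query’s working set small: shorter time ranges and coarser GROUP BY time() intervals so the server aggregates instead of materializing raw points. A sketch, with made-up measurement and field names:

```sql
-- Instead of pulling 7 days of raw points from the largest measurements,
-- pre-aggregate over a bounded window ("cpu_load"/"value" are placeholders)
SELECT MEAN("value")
FROM "cpu_load"
WHERE time > now() - 6h
GROUP BY time(1m)
```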

I guess that’s my fears confirmed. We can’t keep chasing Influx’s RAM usage up and up, unbounded, but there doesn’t appear to be any interest in getting a hold of it. It’s going to be painful, but I think we need to look elsewhere before we get any more deeply ingrained with TICK.

I found that the 1.6.3 release helped a bit with memory usage for my instances, probably due to this:

Basically this fix lowers the floor of the heap, which leaves more breathing room for whatever is using tens of GB during runtime (heap use varies between around 50-100GB on my instances).

We really need some better tools for understanding where it’s going, so I can have some clue of how to re-structure things to fix this…
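Same here. One starting point for “where it’s all going” is Go’s heap profiler, which influxd exposes over HTTP when profiling is enabled (pprof-enabled in influxdb.conf). A sketch; the bind address and binary path are the defaults and may differ on your servers:

```
# Dump a heap profile from a running influxd (default HTTP bind address)
curl -o heap.pprof "http://localhost:8086/debug/pprof/heap"

# Show the top allocation sites (requires a Go toolchain)
go tool pprof -top /usr/bin/influxd heap.pprof
```

It won’t explain the mmap/page-cache side, but it does attribute the heap portion to specific call sites.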