Memory increases slowly over 17 hours, until the OOM killer kills it

I'm using the /write API and a Java HTTP connection to migrate some data: 10- and 30-minute interval data, spanning from a few months up to 12-16 months, written to a shared set of 8-12 measurements, distinguished by tags for each datasource. We're writing the data oldest to newest. This is a one-time operation to migrate some data into Influx measurements.
I'm also batching, flushing the data out every 5 seconds. Each flush normally equates to about 3-4 days of timestamps, and each timestamp has about 1000 values (it can vary depending on the datasource, but it's always written to the same field keys in a measurement, though not all field keys always have values).
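For context, the write loop is roughly shaped like the sketch below; the URL, database name and error handling are simplified placeholders rather than our actual code:

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;
    import java.util.ArrayList;
    import java.util.List;

    public class BatchWriter {
        // Hypothetical endpoint: database, retention policy and precision are placeholders.
        private static final String WRITE_URL =
                "http://localhost:8086/write?db=migration&rp=autogen&precision=s";
        private final List<String> batch = new ArrayList<>();
        private long lastFlush = System.currentTimeMillis();

        // Queue one line-protocol point; flush roughly every 5 seconds.
        public void add(String lineProtocolPoint) throws Exception {
            batch.add(lineProtocolPoint);
            if (System.currentTimeMillis() - lastFlush >= 5000) {
                flush();
            }
        }

        // POST the accumulated batch as newline-separated line protocol.
        public void flush() throws Exception {
            if (batch.isEmpty()) {
                return;
            }
            byte[] body = String.join("\n", batch).getBytes(StandardCharsets.UTF_8);
            HttpURLConnection conn = (HttpURLConnection) new URL(WRITE_URL).openConnection();
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);
            try (OutputStream out = conn.getOutputStream()) {
                out.write(body);
            }
            int code = conn.getResponseCode(); // /write returns 204 No Content on success
            conn.disconnect();
            if (code != 204) {
                throw new RuntimeException("write failed: HTTP " + code);
            }
            batch.clear();
            lastFlush = System.currentTimeMillis();
        }
    }

Each flush POSTs the accumulated line-protocol batch to /write and then clears the buffer.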


In this picture, the system was at steady state, just writing some live data. Then we began this import/migration of summarized data around 9am. It ran until 2am the next day, when it was OOM killed for using too much memory. The green line is total memory, the red line is resident memory. The server had 8GB of memory. All the data should be going into about 7 different retention policies.

Any thoughts on why the total memory for the influxd process continues to increase slowly over the 17-hour time frame? It's as if something isn't being flushed or cleared in memory. Just trying to get an idea of what the cause is.

How was your disk space? I once saw InfluxDB run out of disk space, throw errors, and then start eating memory until it was OOM killed.

On behalf of @Jeffery_K: we have a lot of free disk space, so it doesn't seem to be a disk space issue.

Could it be high cardinality?

Hardware Sizing

We write a lot of data every hour to two nodes with 128GB of memory each; these average about 90% memory usage, 80% when it isn't busy. But if a lot of data is being written, it will start to eat memory.

Is it possible to write the data in smaller batches? If you know roughly how many points, tags, fields and values you will be adding in each batch, you can use INCH.

You could set up a test environment and give it a beating with INCH; increasing the number of fields/tags as you go along should help work out the peak rate before OOM. That's what I did, anyway; see the example below.
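Something along these lines as a starting point (the exact flag names can differ between inch versions, so check inch -h; the cardinality and batch numbers here are just illustrative):

    inch -host http://localhost:8086 -db stress \
         -c 8 -b 5000 -t 10,10,10 -p 100000 -f 10

That would write with 8 concurrent writers, 5000-point batches, a 10x10x10 tag cardinality, 100,000 points per series and 10 fields per point; ramp those numbers up until you find where memory falls over.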

Ultimately though, memory consumption was an issue until the latest update of InfluxDB. As I say, it now averages between 80% and 90% on my nodes; before the update it was hitting 98% on a regular basis.

I don't know tonnes about the Influx stack, as I'm still picking it up as I go along, but I hope it helps.

Phil

I had this happen again. The series cardinality is 57,606 right now.
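(For reference, on 1.x you can check that number from the influx CLI with the SHOW CARDINALITY statements, for example:

    SHOW SERIES CARDINALITY
    SHOW SERIES EXACT CARDINALITY ON mydb

where mydb is a placeholder for the database name.)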


This is over a 9-hour period, ending with influx being killed by the OS for OOM.
This is on a 32GB system, and it was using about 20GB of real memory at most, but the resident memory keeps climbing, as if it isn't releasing its paging file, or thinks it needs that much resident memory. It's an 8-core box.
Does anyone from Influx know why the resident memory of the influx process continues to rise like this over time?

I also discovered some decently high IOPS spikes on the disk leading up to the crash. I'm going to investigate why it seemed to go higher towards the end of this 9-hour window. It's possible that the insert rate was higher at that point in time.

Thoughts? I'm going to re-run my test writing to an SSD (this was a spindle disk, 7200rpm), and I'll post the results.

Well, running with the SSD wasn't much better; in fact, I had a whole lot more crashes towards the end of the run. As in, influxd went from 6GB to 70GB of memory in 60 seconds! It would only live for 3-4 minutes before the OOM killer killed it.
Here's the memory chart. The different colors are because they are different process IDs.

Here is the disk % busy and the IOPS for the same time range for the SSD (where the influx data is stored).
Some fairly heavy I/O.


Now, interestingly, the drive I used yesterday, sdc, is the main Linux LVM drive.

It also showed extremely high I/O. As I think about it more, I bet this is the OS paging the 70GB of memory that influx wants to use when it can only access 20GB. The solution is obviously to add more memory to this system, but I wanted to see if someone could tell me whether this is normal behavior or potentially a bug in Influx. My cardinality is low (around 105,000), and it's on an 8-core system with 32GB of RAM, using solid state. I'm not sure what my points-per-second insert rate is. Is that somewhere in the _internal database?

Hi @Jeffery_K

I'm not sure about a writes-per-second measurement, but you can find out how many bytes have been written using the _internal database. From there you could probably work out an average writes per second over a set time period.

You can check that in influx with: SELECT * FROM "write" ORDER BY time DESC LIMIT 10 (just so it doesn't query the whole database) while you're using the _internal database.
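If you want an actual points-per-second figure, something like the query below against _internal should get you close; the pointReq field name is from memory, so double-check it with SHOW FIELD KEYS if it doesn't match:

    SELECT non_negative_derivative(max("pointReq"), 1s) AS points_per_sec
    FROM "_internal"."monitor"."write"
    WHERE time > now() - 1h
    GROUP BY time(10s)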

I use the following queries with Grafana to work out bytes written:
SELECT difference(mean("read_bytes")) FROM "autogen"."diskio" WHERE ("host" =~ /^$datasource$/)

SELECT difference(mean("write_bytes")) FROM "autogen"."diskio" WHERE ("host" =~ /^$datasource$/)

If you swap out the regex part for your data source you should be able to get some info and work from there. I have a dashboard that shows this in MB/GB so I can see how much data is being written.
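For the MB scaling, something along these lines works in a Grafana panel; the derivative, time grouping and divisor are my own additions on top of the queries above:

    SELECT non_negative_derivative(mean("write_bytes"), 1s) / 1048576
    FROM "autogen"."diskio"
    WHERE ("host" =~ /^$datasource$/) AND $timeFilter
    GROUP BY time($__interval), "name"

That gives MB written per second, per disk device.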

As far as I can remember, InfluxDB will use as much memory as it can. I think it's to do with caching when it starts (I did find out from the support people but I honestly can't remember for certain). After a while it should settle down.

For the SSD run, did you change the Influx config to use the on-disk index instead of the in-memory one? I think the default is inmem but you can change it to tsi1. Still, I'm not sure that will help with the initial system start, as Influx still needs the memory to get going.
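For reference, that setting lives in the [data] section of influxdb.conf (the path to the config file depends on your install):

    [data]
      index-version = "tsi1"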

The problem is that if it OOMs every time, it won't settle down. Giving it more RAM will just mean Influx uses it, which might get past the initial loading, but I had 2 nodes with 200GB in each and they would still fill up over time.

Thanks for the info. I'll check into those statistics. We are using index-version = "tsi1", and have been for a while, mainly to reduce startup times.