How to prevent OOM exceptions when importing billions of data points

romainrichard · September 11, 2017, 5:10pm

Hi,

TL;DR: I’m currently trying to bulk insert billions of data points into InfluxDB using the -import flag. After fewer than 0.5 billion data points have been ingested, the InfluxDB process gets killed because the system runs out of memory (InfluxDB is using all the RAM). Is there any way to prevent that?

I have a data set of about 500 billion points and I’m considering using InfluxDB to store them.
I’m trying to insert about 1% of the data as a POC to see how it’ll behave. Unfortunately, I’m only able to get to 0.1% (about 0.5 billion points) before the process gets killed because of an OOM exception.

The system I’m currently using for the import has 16GB of RAM, which should be plenty. It also uses an SSD.
By the time the process gets killed, I have about 200 series in the database (so it’s not an issue with series cardinality).
The data is stored in a single measurement, with one tag and one field.
In the first few minutes, InfluxDB is inserting data at around 300,000 points a second. After about 10 minutes (when it started using most of the RAM), that drops down to about 200,000. 20 minutes later, we’re under 100,000 and the process gets killed.

What can I do to prevent this from happening? I have a few ideas but not sure which one I should go with:

Wait n seconds between each import to let InfluxDB catch up
Wait n seconds every m imports
Force and wait for the WAL to be processed (if that’s even possible) every m imports
Tweak the default settings somehow
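
To make the first two ideas concrete, here is a minimal sketch of a throttled import in Python. It swaps the -import flag for plain HTTP POSTs of line protocol to the InfluxDB 1.x /write endpoint. The URL, the database name, the batch size and the pause values are all assumptions for illustration, not tuned recommendations.

```python
import time
import urllib.request
from itertools import islice


def batches(lines, size):
    """Yield lists of at most `size` items from any iterable of lines."""
    it = iter(lines)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def write_throttled(lines, url, batch_size=5000, pause_s=1.0, pause_every=10,
                    send=None, sleep=time.sleep):
    """POST line-protocol batches, pausing `pause_s` seconds every
    `pause_every` batches. Returns the number of points sent."""
    if send is None:
        # Default sender: one HTTP POST per batch to the 1.x /write endpoint.
        def send(body):
            req = urllib.request.Request(url, data=body, method="POST")
            with urllib.request.urlopen(req) as resp:
                resp.read()

    sent = 0
    for n, chunk in enumerate(batches(lines, batch_size), start=1):
        send("\n".join(chunk).encode("utf-8"))
        sent += len(chunk)
        if n % pause_every == 0:
            sleep(pause_s)  # give InfluxDB time to flush its cache
    return sent


# Hypothetical usage; "mydb" and "points.txt" are placeholders:
# with open("points.txt") as f:
#     write_throttled((l.rstrip("\n") for l in f),
#                     "http://localhost:8086/write?db=mydb")
```

Setting pause_every=1 gives "wait n seconds between each import"; a larger value gives "wait n seconds every m imports".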

Thanks

tim.hall · October 5, 2017, 8:41pm

What batch size are you using to perform the writes?

Is your data time ordered? (i.e. oldest to newest or newest to oldest?) This is recommended.
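
If you are not sure, a quick check along these lines can tell you. This is only a sketch: it assumes every point line ends with an integer timestamp, that the input contains only points and # comment lines (no DDL statements), and it only checks oldest-to-newest order.

```python
def is_time_ordered(lines):
    """Return True if line-protocol timestamps never decrease."""
    prev = None
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # The timestamp is the last space-separated token of a point.
        ts = int(line.rsplit(" ", 1)[1])
        if prev is not None and ts < prev:
            return False
        prev = ts
    return True
```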

What is the shard group duration set to?

I’d suggest reviewing this: Schema Design | InfluxDB OSS 1.3 Documentation

jason · October 5, 2017, 9:13pm

I’d also suggest taking a look at this FAQ entry regarding backfilling sparse data.

For backfilling data, there are a couple of things that need to be adjusted depending on the shape of your data.

Range of time - If you are backfilling years of data, you will most likely need to increase the shard duration on your retention policy as the default of 1w will end up creating lots of shards. If you do not plan on deleting the data, the larger the duration the better.
Density - If you have sparse data, for example, stock ticker data with 1 value per day for years, you will also need to increase your shard duration to avoid creating lots of small sparse shards.
Cache Config - Each shard has a cache of recently written points. By default, the cache is snapshotted to disk once the shard has gone cold, which means it has received no writes for 10m. When backfilling, you frequently end up writing to a lot of shards in a short period of time if the default shard duration is used. It’s recommended to lower your cache-snapshot-write-cold-duration to 10s during the backfilling so that each shard is snapshotted more quickly once you move on to the next.
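
To get a feel for the first point, here is a rough back-of-the-envelope sketch in Python. The 5-year range and the 52w alternative are made-up numbers for illustration, and the real count can be off by one depending on how the data lines up with shard boundaries.

```python
import math
from datetime import timedelta


def shard_group_count(time_range, shard_duration):
    """Approximate number of shard groups needed to cover `time_range`."""
    return math.ceil(time_range / shard_duration)


five_years = timedelta(days=5 * 365)

# Default 1w shard duration versus a hypothetical 52w duration.
default_count = shard_group_count(five_years, timedelta(weeks=1))
larger_count = shard_group_count(five_years, timedelta(weeks=52))
print(default_count, larger_count)
```

With the 1w default, five years of data lands in a few hundred shard groups; with 52w it is a handful. In InfluxDB 1.x the duration can be changed with something like ALTER RETENTION POLICY "autogen" ON "mydb" SHARD DURATION 52w (the database name is a placeholder), and it only applies to shard groups created after the change.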
