How to prevent OOM exceptions when importing billions of data points

influxdb

#1

Hi,

TL;DR: I’m currently trying to bulk insert billions of data points into InfluxDB using the -import flag. After less than 0.5 billion data points have been ingested, the InfluxDB process gets killed because the system ran out of memory (InfluxDB is using all the RAM). Any ways to prevent that?

I have a data set of about 500 billion points and I’m considering using InfluxDB to store them.
I’m trying to insert about 1% of data as a POC to see how it’ll behave. Unfortunately, I’m only able to get to 0.1% before the process gets killed because of an OOM exception.

The current system I’m using for the import has 16GB of RAM, which should be plenty. It also uses SSD.
By the time the process gets killed, I have about 200 series in the database (so it’s not an issue with series cardinality).
The data is stored in a single measurement, with one tag and one field.
In the first few minutes, InfluxDB is inserting data at around 300,000 points a second. After about 10 minutes (when it started using most of the RAM), that drops down to about 200,000. 20 minutes later, we’re under 100,000 and the process gets killed.

What can I do to prevent this from happening? I have a few ideas but not sure which one I should go with:

  1. Wait n seconds between each import to let InfluxDB catch up
  2. Wait n seconds every m imports
  3. Force and wait for the WAL to be processed (if that’s even possible) every m imports
  4. Tweak the default settings somehow
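
Options 1 and 2 could be sketched roughly as below. This is only an illustration, not how the -import flag works internally: `throttled_import`, the `send` callback, and the default values for `batch_size`, `pause_every`, and `pause_seconds` are all hypothetical and would need tuning against the real system. `send` stands in for whatever actually writes a batch of line-protocol lines to InfluxDB.

```python
import time
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def batches(lines: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` lines."""
    it = iter(lines)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def throttled_import(
    lines: Iterable[str],
    send: Callable[[List[str]], None],  # hypothetical: writes one batch to InfluxDB
    batch_size: int = 5000,             # assumed value, tune for your setup
    pause_every: int = 100,             # "m" in option 2
    pause_seconds: float = 1.0,         # "n" in options 1 and 2
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Send lines in batches, pausing after every `pause_every` batches.

    Returns the total number of lines sent.
    """
    sent = 0
    for i, batch in enumerate(batches(lines, batch_size), start=1):
        send(batch)
        sent += len(batch)
        if i % pause_every == 0:
            sleep(pause_seconds)  # give the server time to catch up
    return sent
```

Setting `pause_every=1` gives option 1 (a pause after every batch); larger values give option 2.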

Thanks


#2

What batch size are you using to perform the writes?

Is your data time ordered? (i.e. oldest to newest or newest to oldest?) This is recommended.

What is the shard group duration set to?

I’d suggest reviewing this: https://docs.influxdata.com/influxdb/v1.3/concepts/schema_and_data_layout/#shard-group-duration-overview
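
If the shard group duration turns out to be too short, it can be changed with an InfluxQL `ALTER RETENTION POLICY` statement. A small sketch that builds one is below; the helper name `alter_shard_duration`, the database name `mydb`, and the `52w` value are assumptions for illustration, and `autogen` is assumed to be the retention policy in use. As far as I know the new duration only applies to shard groups created afterwards, so it would need to be set before starting the import.

```python
def alter_shard_duration(database: str, policy: str, shard_duration: str) -> str:
    """Build an InfluxQL statement that changes a retention policy's shard group duration."""
    return (
        f'ALTER RETENTION POLICY "{policy}" ON "{database}" '
        f"SHARD DURATION {shard_duration}"
    )


# Hypothetical names and value; run the printed statement in the influx shell.
print(alter_shard_duration("mydb", "autogen", "52w"))
```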


#3

I’d also suggest taking a look at this FAQ entry regarding backfilling sparse data.

For backfilling data, there are a couple of things that need to be adjusted depending on the shape of your data.

  1. Range of time - If you are backfilling years of data, you will most likely need to increase the shard duration on your retention policy as the default of 1w will end up creating lots of shards. If you do not plan on deleting the data, the larger the duration the better.
  2. Density - If you have sparse data, for example, stock ticker data with 1 value per day for years, you will also need to increase your shard duration to avoid creating lots of small sparse shards.
  3. Cache Config - Each shard has a cache of recently written points. By default, the cache is snapshotted to disk once the shard goes cold, that is, after it has received no writes for 10m. When backfilling with the default shard duration, you frequently end up writing to a lot of shards in a short period of time. It’s recommended to lower your cache-snapshot-write-cold-duration to 10s during the backfill so that each shard is snapshotted more quickly once you move on to the next.
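
To get a feel for point 1, the number of shard groups a backfill creates can be estimated by dividing the time range by the shard duration. The sketch below is a rough estimate only: the function name `estimated_shard_groups` and the five-year range are assumptions for illustration, and the real count can differ slightly because shard groups start on fixed time boundaries.

```python
import math
from datetime import timedelta


def estimated_shard_groups(time_range: timedelta, shard_duration: timedelta) -> int:
    """Approximate how many shard groups are needed to cover `time_range`."""
    return math.ceil(time_range / shard_duration)


# Assumed example: five years of historical data.
five_years = timedelta(days=5 * 365)

for label, duration in [("1w", timedelta(weeks=1)), ("52w", timedelta(weeks=52))]:
    print(label, estimated_shard_groups(five_years, duration))
```

With these assumed numbers, the default 1w duration gives about 261 shard groups, while 52w gives 6.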
