Backfill 16 years of data

Hello, I have 16 years of data to backfill into InfluxDB v2.

  • The data consists of around 20,000 separate data streams.
  • Each data stream has a timestamp and a value.
  • I have one measurement per data stream, with a single value field holding the data.

I initially put something together in Python to select from the data source and write the points to InfluxDB, along the lines of the sketch below.
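Roughly, the write side looked like this, using the influxdb-client package with batched writes. The URL, token, bucket name and the read_from_source() helper are placeholders, not my actual values, and the batch settings are just illustrative:

```python
# Rough sketch of the backfill loop, assuming the influxdb-client package.
# Connection details, bucket name and read_from_source() are placeholders.
from influxdb_client import InfluxDBClient, Point, WriteOptions

with InfluxDBClient(url="http://localhost:8086", token="my-token", org="my-org") as client:
    # Batch writes rather than writing point-by-point; tune batch_size and
    # flush_interval (ms) for your hardware.
    with client.write_api(write_options=WriteOptions(batch_size=5000, flush_interval=10_000)) as write_api:
        for stream_name, timestamp, value in read_from_source():  # hypothetical reader over the source data
            point = Point(stream_name).field("value", value).time(timestamp)
            write_api.write(bucket="backfill", record=point)
```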

This was going fine until I stopped the process after backfilling about a year of data; after restarting Influx, CPU and memory usage would sit at 100%.

  • 8 Cores
  • 16GB Memory

I saw somewhere that setting the “Shard group duration” to 52 weeks might help. I was also backfilling in descending order, which was not ideal.

So I have started again with a 52-week ‘Shard group duration’ and am backfilling in ascending order.
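In case it helps, this is roughly how I set the shard group duration when recreating the bucket with the Python client. I'm assuming here that BucketRetentionRules exposes a shard_group_duration_seconds field in your client version (check before relying on it); the same thing can also be done with the influx CLI:

```python
# Rough sketch: recreate the bucket with a 52-week shard group duration.
# Assumes the influxdb-client package; the bucket/org names are placeholders and
# the shard_group_duration_seconds field should be verified against your client version.
from influxdb_client import InfluxDBClient, BucketRetentionRules

with InfluxDBClient(url="http://localhost:8086", token="my-token", org="my-org") as client:
    retention = BucketRetentionRules(
        type="expire",
        every_seconds=0,                                   # 0 = keep data forever
        shard_group_duration_seconds=52 * 7 * 24 * 3600,   # 52 weeks per shard group
    )
    client.buckets_api().create_bucket(
        bucket_name="backfill",
        retention_rules=retention,
        org="my-org",
    )
```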

Would anyone have any ideas about the CPU and memory issues after attempting this?

I was getting these errors in the InfluxDB logs:

fatal error: out of memory allocating heap arena metadata

InfluxDB’s memory requirement depends heavily on the data.
Increasing the shard group duration will compress your data more (you will have just one shard for 52 weeks instead of, e.g., 52 shards of 1 week each), but at the same time it will require more memory to perform any operation (read/write/compaction)…

Your issue could be caused by a cardinality problem or simply by the sheer amount of data. I highly suggest checking the series cardinality and reviewing your current schema to understand whether that is the issue.
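For example, you can get the series cardinality of a bucket with the Flux influxdb.cardinality() function. A minimal sketch running it through the Python client (the bucket name and connection details are placeholders, and the -100y range is just meant to cover all of your history):

```python
# Minimal sketch: check the series cardinality of a bucket via Flux.
# Connection details and the bucket name are placeholders.
from influxdb_client import InfluxDBClient

flux = '''
import "influxdata/influxdb"

influxdb.cardinality(bucket: "backfill", start: -100y)
'''

with InfluxDBClient(url="http://localhost:8086", token="my-token", org="my-org") as client:
    for table in client.query_api().query(flux):
        for record in table.records:
            print("series cardinality:", record.get_value())
```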

In some cases, memory errors can be avoided by tuning some database settings (e.g. storage-max-concurrent-compactions if the system crashes during compaction).