Backfill 16 years of data

Hello, I have 16 years of data to backfill into InfluxDB v2

  • The data consists of around 20,000 separate data streams.
  • Each data stream has a timestamp and value
  • I have a measurement for each data stream and a value field for the data.

I initially put something together with Python to select from the data source and insert into InfluxDB.
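Something along these lines (a simplified sketch rather than the actual script; the source database, bucket, token and org values are placeholders, and the official influxdb-client package is assumed):

# Simplified sketch of the backfill loop (not the actual script).
# sqlite3 stands in for whatever the real data source is.
import sqlite3

from influxdb_client import InfluxDBClient, Point, WritePrecision
from influxdb_client.client.write_api import SYNCHRONOUS

SOURCE_QUERY = "SELECT stream_id, ts, value FROM samples"  # ordering omitted here

with sqlite3.connect("source.db") as src, InfluxDBClient(
    url="http://localhost:8086", token="my-token", org="my-org"
) as client:
    write_api = client.write_api(write_options=SYNCHRONOUS)
    batch = []
    for stream_id, ts, value in src.execute(SOURCE_QUERY):
        # One measurement per data stream, with a single "value" field.
        point = Point(f"tag_{stream_id}").field("value", float(value)).time(ts, WritePrecision.S)
        batch.append(point)
        if len(batch) >= 5000:  # write in batches to keep memory bounded
            write_api.write(bucket="backfill", record=batch)
            batch = []
    if batch:
        write_api.write(bucket="backfill", record=batch)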

This was going fine until I stopped the process after backfilling 1 year of data. After restarting InfluxDB, the CPU and memory would sit at 100% usage.

  • 8 Cores
  • 16GB Memory

I saw somewhere that setting the “Shard group duration” to 52 weeks might help. I was also backfilling in descending order, which was not ideal.

So I have started again with a 52-week ‘Shard group duration’ and am backfilling in ascending order.
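For reference, the shard group duration can be set when creating the bucket; a minimal sketch via the Python client is below (URL/token/org/bucket names are placeholders, and the shard_group_duration_seconds parameter may depend on the influxdb-client version; the influx CLI also exposes a shard-group-duration option when creating or updating a bucket):

# Sketch: create a bucket with an infinite retention period and a 52-week
# shard group duration (connection details and names are placeholders).
from influxdb_client import InfluxDBClient, BucketRetentionRules

FIFTY_TWO_WEEKS = 52 * 7 * 24 * 60 * 60  # in seconds

with InfluxDBClient(url="http://localhost:8086", token="my-token", org="my-org") as client:
    rules = BucketRetentionRules(
        type="expire",
        every_seconds=0,  # 0 = keep data forever
        shard_group_duration_seconds=FIFTY_TWO_WEEKS,  # may require a recent influxdb-client version
    )
    client.buckets_api().create_bucket(bucket_name="backfill", retention_rules=rules, org="my-org")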

Would anyone have any ideas about the CPU and memory issues after attempting this?

I was getting these errors in the InfluxDB logs:

fatal error: out of memory allocating heap arena metadata

InfluxDB's memory requirement depends heavily on the data.
Increasing the shard group duration will compress your data more (as you will have just one shard for 52 weeks instead of, e.g., 52 shards of 1 week each), but at the same time it will require more memory to perform any operation (read/write/compaction).


Your issue could be caused by a cardinality problem or simply by the amount of data. I highly suggest checking your cardinality and having a look at your current schema to understand whether that’s the issue.
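For example, series cardinality can be checked with Flux's influxdb.cardinality() function; a minimal sketch run through the Python client (bucket/org/token values are placeholders):

# Sketch: report total series cardinality for a bucket using Flux's
# influxdb.cardinality() (connection details and bucket name are placeholders).
from influxdb_client import InfluxDBClient

FLUX = '''
import "influxdata/influxdb"

influxdb.cardinality(bucket: "backfill", start: -100y)
'''

with InfluxDBClient(url="http://localhost:8086", token="my-token", org="my-org") as client:
    for table in client.query_api().query(FLUX):
        for record in table.records:
            print("series cardinality:", record.get_value())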

In some cases, memory errors can be avoided by changing some database settings (e.g. storage-max-concurrent-compactions, in case the system crashes during compaction).

Hello Giovanni, thanks for the advice.

I found the issue (the schema), although I think there is also an issue with how InfluxDB was trying to handle the fields.idxl index files, which did not help.

The Setup

OS: Windows Server 2019
Memory: 16GB
CPU: 8
InfluxDB: 2.7.1

Task
We had another data store that contained data from 2006 → 2021 (~34 billion timestamps and values associated with around 25,000 tags); we wanted to see if we could migrate all of the data into InfluxDB OSS v2.

Original Schema Chosen

  • One bucket
  • A measurement for each tag (25,000)
  • A single field for each measurement
  • Shard group duration 84 days

Example
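(The original example isn't reproduced here; as an illustration, the line protocol for this schema would have looked roughly like the following, with made-up tag names and values — one measurement per tag, each with a single value field.)

tag_00001 value=21.5 1136073600000000000
tag_00002 value=0.93 1136073600000000000
tag_00003 value=118.0 1136073600000000000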

Looking back on this schema now, it might not follow some of the best practices. However, because we did not see any issues until 41 days later, after all the data had been moved into InfluxDB and the processing of the idxl files was triggered, we assumed it would be OK.

Inserting data
There were no performance issues while inserting all the data from 2006 → 2021. InfluxDB was storing the data quickly, and I could also query it without problems.

The issues started after the first restart of InfluxDB once all of the data had been inserted. While starting, InfluxDB would always run out of memory, even after increasing the memory to 128 GB, and would show the following in the logs.

ts=2024-01-02T05:11:57.763390Z lvl=info msg=“index opened with 8 partitions” log_id=0mU9CmDW000 service=storage-engine index=tsi
ts=2024-01-02T05:11:57.764286Z lvl=info msg=“loading changes (start)” log_id=0mU9CmDW000 service=storage-engine engine=tsm1 op_name=“field indices” op_event=start
ts=2024-01-02T05:11:57.798411Z lvl=info msg=“index opened with 8 partitions” log_id=0mU9CmDW000 service=storage-engine index=tsi
ts=2024-01-02T05:11:57.799376Z lvl=info msg=“loading changes (start)” log_id=0mU9CmDW000 service=storage-engine engine=tsm1 op_name=“field indices” op_event=start
ts=2024-01-02T05:11:57.808020Z lvl=info msg=“index opened with 8 partitions” log_id=0mU9CmDW000 service=storage-engine index=tsi
ts=2024-01-02T05:11:57.809423Z lvl=info msg=“loading changes (start)” log_id=0mU9CmDW000 service=storage-engine engine=tsm1 op_name=“field indices” op_event=start
ts=2024-01-02T05:11:57.850115Z lvl=info msg=“index opened with 8 partitions” log_id=0mU9CmDW000 service=storage-engine index=tsi
ts=2024-01-02T05:11:57.853269Z lvl=info msg=“loading changes (start)” log_id=0mU9CmDW000 service=storage-engine engine=tsm1 op_name=“field indices” op_event=start
ts=2024-01-02T05:11:57.971738Z lvl=info msg=“index opened with 8 partitions” log_id=0mU9CmDW000 service=storage-engine index=tsi
ts=2024-01-02T05:11:57.971738Z lvl=info msg=“loading changes (start)” log_id=0mU9CmDW000 service=storage-engine engine=tsm1 op_name=“field indices” op_event=start
ts=2024-01-02T05:11:58.006032Z lvl=info msg=“index opened with 8 partitions” log_id=0mU9CmDW000 service=storage-engine index=tsi
ts=2024-01-02T05:11:58.007230Z lvl=info msg=“loading changes (start)” log_id=0mU9CmDW000 service=storage-engine engine=tsm1 op_name=“field indices” op_event=start
ts=2024-01-02T05:13:06.534138Z lvl=info msg=“loading changes (end)” log_id=0mU9CmDW000 service=storage-engine engine=tsm1 op_name=“field indices” op_event=end op_elapsed=68724.797ms
ts=2024-01-02T05:13:08.743053Z lvl=info msg=“Opened file” log_id=0mU9CmDW000 service=storage-engine engine=tsm1 service=filestore path=F:\InfluxData\.influxdbv2\engine\data\990a8c4f7ef61b90\autogen\159\000000838-000000002.tsm id=0 duration=1769.063ms
ts=2024-01-02T05:13:08.753084Z lvl=info msg=“Opened shard” log_id=0mU9CmDW000 service=storage-engine service=store op_name=tsdb_open index_version=tsi1 path=F:\InfluxData\.influxdbv2\engine\data\990a8c4f7ef61b90\autogen\159 duration=71956.496ms
ts=2024-01-02T05:13:09.203008Z lvl=info msg=“index opened with 8 partitions” log_id=0mU9CmDW000 service=storage-engine index=tsi
ts=2024-01-02T05:13:09.203008Z lvl=info msg=“loading changes (start)” log_id=0mU9CmDW000 service=storage-engine engine=tsm1 op_name=“field indices” op_event=start
ts=2024-01-02T05:15:03.589323Z lvl=info msg=“loading changes (end)” log_id=0mU9CmDW000 service=storage-engine engine=tsm1 op_name=“field indices” op_event=end op_elapsed=186218.152ms
ts=2024-01-02T05:15:04.412393Z lvl=info msg=“Opened file” log_id=0mU9CmDW000 service=storage-engine engine=tsm1 service=filestore path=F:\InfluxData\.influxdbv2\engine\data\990a8c4f7ef61b90\autogen\113\000000427-000000002.tsm id=0 duration=690.736ms
ts=2024-01-02T05:15:04.452393Z lvl=info msg=“Opened shard” log_id=0mU9CmDW000 service=storage-engine service=store op_name=tsdb_open index_version=tsi1 path=F:\InfluxData\.influxdbv2\engine\data\990a8c4f7ef61b90\autogen\113 duration=187904.339ms
ts=2024-01-02T05:15:05.432625Z lvl=info msg=“index opened with 8 partitions” log_id=0mU9CmDW000 service=storage-engine index=tsi
ts=2024-01-02T05:15:05.446352Z lvl=info msg=“loading changes (start)” log_id=0mU9CmDW000 service=storage-engine engine=tsm1 op_name=“field indices” op_event=start
fatal error: runtime: cannot allocate memory
runtime stack:
runtime.throw({0x40ab064?, 0x1194570?})
/go/src/runtime/panic.go:1047 +0x65 fp=0x3f84fc60 sp=0x3f84fc30 pc=0x43b905
runtime.persistentalloc1(0x3fc0, 0xffffffff00012ad4?, 0x5f00a40)
/go/src/runtime/malloc.go:1440 +0x24f fp=0x3f84fca8 sp=0x3f84fc60 pc=0x41030f
runtime.persistentalloc.func1()

After spending a while trying to figure out what the issue was, I believe InfluxDB was trying to load all of the fields.idxl files into memory at the same time.

Each of the shard folders under autogen had a very large idxl file. I searched around GitHub to see if I could find any reference to what these files are, but could not find any details apart from this issue:
https://github.com/influxdata/influxdb/issues/23653


I tried changing some of the configuration options so that it would not try to load and process them all simultaneously, but had no luck.

Because inserting all the data took around 40 days, I wanted to find a way to work around the issue. This bucket will also be cold storage with no new data added, so if I could get it working, I would avoid spending another 40 days re-inserting the data with a changed schema.

Working around the issue
By moving all of the shard folders out of the autogen folder, I could force InfluxDB to load and process only one at a time. To save time, I created a script to do this for me, slowly putting the folders back in one by one.
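The script was roughly along these lines (a sketch rather than the original; paths are placeholders, it assumes the shard folders were first parked in a staging directory, and influxd has to be restarted after each move so that the shard's fields.idxl gets processed on startup):

# Sketch: move shard folders back into autogen one at a time so InfluxDB only
# has one large fields.idxl to process per restart. Paths are placeholders;
# restart influxd manually (or however it is managed) at each prompt.
import shutil
from pathlib import Path

STAGING = Path(r"F:\InfluxData\staging")  # shard folders parked here beforehand
AUTOGEN = Path(r"F:\InfluxData\.influxdbv2\engine\data\990a8c4f7ef61b90\autogen")

for shard_dir in sorted(STAGING.iterdir(), key=lambda p: int(p.name)):  # shard folders are numeric
    print(f"Moving shard {shard_dir.name} back into autogen...")
    shutil.move(str(shard_dir), str(AUTOGEN / shard_dir.name))
    input("Restart influxd, wait for the shard to open and fields.idxl to shrink, then press Enter...")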

It would be good to know if there was something I missed in the configuration that would have solved this for me.

After each fields.idxl file was processed into a fields.idx file, the result was significantly smaller.

Before
[screenshot: fields.idxl before processing on InfluxDB start]

After
[screenshot]

3.7 GB → 519 KB

There were no memory issues once all the idxl files were processed into idx files.

Next time
Once that was sorted, I investigated other schemas, including one suggested by InfluxDB.

Instead of having a measurement and field for every tag, there would be a single measurement, a tag identifying each of the 25,000 tags, and fields for the different data types.

Single Measurement / Multiple Tags / Data Type Fields
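As an illustration (the measurement, tag key and field names here are made up), the same points from the earlier example would be written like this under the new schema:

data,tag=tag_00001 float_value=21.5 1136073600000000000
data,tag=tag_00002 float_value=0.93 1136073600000000000
data,tag=tag_00003 float_value=118.0 1136073600000000000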

To verify that this was a better schema and would avoid the memory issues I was seeing, I spun up two identical instances of InfluxDB and inserted identical data.

I found that the new schema did not have issues with large fields.idxl files after inserting the data.

Below are two separate instances with different schemas and identical data. The one on the right uses my original schema with the large idxl file, and the one on the left uses the new schema with a tiny idxl file after inserting the data.

During the insert of the data, there were no differences in the CPU / memory resources used.

After restarting InfluxDB, which triggered the processing of the fields.idxl files, there was a significant difference: the original schema (in pink) ran out of memory when trying to process fields.idxl.

The orange line is the new schema (Single Measurement / Multiple Tags / Data Type Fields), which had no issues during testing. I have yet to test the new schema with as much data as the previous attempt, although it looks promising based on the testing I did.

TL;DR: If you have memory issues and logs similar to the ones posted above, check whether you have some huge idxl files, then see if your schema can be improved, or process the idxl files one by one to work around the issue temporarily.
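A quick way to spot oversized idxl files is to walk the engine's data directory, for example (path is a placeholder; adjust to your .influxdbv2 location):

# List fields.idxl files under the engine data directory, largest first.
from pathlib import Path

DATA_DIR = Path(r"F:\InfluxData\.influxdbv2\engine\data")  # placeholder path

idxl_files = sorted(DATA_DIR.rglob("fields.idxl"), key=lambda p: p.stat().st_size, reverse=True)
for idxl in idxl_files:
    print(f"{idxl.stat().st_size / (1024 * 1024):10.1f} MB  {idxl}")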


Hi Brenton,

Thank you very much for providing your analysis in detail. I faced the same issue while processing idxl files and came looking for similar experiences. So far, I am not sure how to trigger the fields.idxl → fields.idx processing periodically without having to restart influxd.exe each time. There doesn’t seem to be anything in the config options relating to this either.