Hello Giovanni, thanks for the advice.
I found the issue (schema), although I think there is also an issue with how InfluxDB was trying to handle the fields.idxl files, which did not help.
The Setup
OS: Windows Server 2019
Memory: 16GB
CPU: 8
InfluxDB: 2.7.1
Task
We had another data store containing data from 2006–2021 (~34 billion timestamps and values associated with around 25,000 tags); we wanted to see if we could migrate all of that data into InfluxDB OSS v2.
Original Schema Chosen
- One bucket
- A measurement for each tag (25,000)
- A single field for each measurement
- Shard group duration 84 days
Example
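To illustrate the original schema, here is a minimal sketch (not the actual migration code) of what a single write looked like, using the Python influxdb-client. The bucket name, tag name, and connection details are made up for the example:

```python
from influxdb_client import InfluxDBClient, Point
from influxdb_client.client.write_api import SYNCHRONOUS

# Hypothetical connection details for illustration only.
client = InfluxDBClient(url="http://localhost:8086", token="my-token", org="my-org")
write_api = client.write_api(write_options=SYNCHRONOUS)

# Original schema: one measurement per source tag, each with a single field.
# "TAG_0001" stands in for one of the ~25,000 tag names,
# i.e. line protocol roughly like: TAG_0001 value=42.5 <timestamp>
point = Point("TAG_0001").field("value", 42.5)
write_api.write(bucket="history", record=point)
```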
Looking back on this schema now, it might not follow some of the best practices. However, because we did not see any issues until after all the data had been moved into InfluxDB and the processing of the fields.idxl files was triggered 41 days later, we assumed it was OK.
Inserting data
There were no performance issues while inserting all the data from 2006–2021. InfluxDB stored the data quickly, and I could also query the data with no problems.
The issues started after the first restart of InfluxDB once all of the data had been inserted. When InfluxDB was starting, it would always run out of memory, even after increasing the memory to 128GB, and showed the following in the logs:
ts=2024-01-02T05:11:57.763390Z lvl=info msg="index opened with 8 partitions" log_id=0mU9CmDW000 service=storage-engine index=tsi
ts=2024-01-02T05:11:57.764286Z lvl=info msg="loading changes (start)" log_id=0mU9CmDW000 service=storage-engine engine=tsm1 op_name="field indices" op_event=start
ts=2024-01-02T05:11:57.798411Z lvl=info msg="index opened with 8 partitions" log_id=0mU9CmDW000 service=storage-engine index=tsi
ts=2024-01-02T05:11:57.799376Z lvl=info msg="loading changes (start)" log_id=0mU9CmDW000 service=storage-engine engine=tsm1 op_name="field indices" op_event=start
ts=2024-01-02T05:11:57.808020Z lvl=info msg="index opened with 8 partitions" log_id=0mU9CmDW000 service=storage-engine index=tsi
ts=2024-01-02T05:11:57.809423Z lvl=info msg="loading changes (start)" log_id=0mU9CmDW000 service=storage-engine engine=tsm1 op_name="field indices" op_event=start
ts=2024-01-02T05:11:57.850115Z lvl=info msg="index opened with 8 partitions" log_id=0mU9CmDW000 service=storage-engine index=tsi
ts=2024-01-02T05:11:57.853269Z lvl=info msg="loading changes (start)" log_id=0mU9CmDW000 service=storage-engine engine=tsm1 op_name="field indices" op_event=start
ts=2024-01-02T05:11:57.971738Z lvl=info msg="index opened with 8 partitions" log_id=0mU9CmDW000 service=storage-engine index=tsi
ts=2024-01-02T05:11:57.971738Z lvl=info msg="loading changes (start)" log_id=0mU9CmDW000 service=storage-engine engine=tsm1 op_name="field indices" op_event=start
ts=2024-01-02T05:11:58.006032Z lvl=info msg="index opened with 8 partitions" log_id=0mU9CmDW000 service=storage-engine index=tsi
ts=2024-01-02T05:11:58.007230Z lvl=info msg="loading changes (start)" log_id=0mU9CmDW000 service=storage-engine engine=tsm1 op_name="field indices" op_event=start
ts=2024-01-02T05:13:06.534138Z lvl=info msg="loading changes (end)" log_id=0mU9CmDW000 service=storage-engine engine=tsm1 op_name="field indices" op_event=end op_elapsed=68724.797ms
ts=2024-01-02T05:13:08.743053Z lvl=info msg="Opened file" log_id=0mU9CmDW000 service=storage-engine engine=tsm1 service=filestore path=F:\InfluxData\.influxdbv2\engine\data\990a8c4f7ef61b90\autogen\159\000000838-000000002.tsm id=0 duration=1769.063ms
ts=2024-01-02T05:13:08.753084Z lvl=info msg="Opened shard" log_id=0mU9CmDW000 service=storage-engine service=store op_name=tsdb_open index_version=tsi1 path=F:\InfluxData\.influxdbv2\engine\data\990a8c4f7ef61b90\autogen\159 duration=71956.496ms
ts=2024-01-02T05:13:09.203008Z lvl=info msg="index opened with 8 partitions" log_id=0mU9CmDW000 service=storage-engine index=tsi
ts=2024-01-02T05:13:09.203008Z lvl=info msg="loading changes (start)" log_id=0mU9CmDW000 service=storage-engine engine=tsm1 op_name="field indices" op_event=start
ts=2024-01-02T05:15:03.589323Z lvl=info msg="loading changes (end)" log_id=0mU9CmDW000 service=storage-engine engine=tsm1 op_name="field indices" op_event=end op_elapsed=186218.152ms
ts=2024-01-02T05:15:04.412393Z lvl=info msg="Opened file" log_id=0mU9CmDW000 service=storage-engine engine=tsm1 service=filestore path=F:\InfluxData\.influxdbv2\engine\data\990a8c4f7ef61b90\autogen\113\000000427-000000002.tsm id=0 duration=690.736ms
ts=2024-01-02T05:15:04.452393Z lvl=info msg="Opened shard" log_id=0mU9CmDW000 service=storage-engine service=store op_name=tsdb_open index_version=tsi1 path=F:\InfluxData\.influxdbv2\engine\data\990a8c4f7ef61b90\autogen\113 duration=187904.339ms
ts=2024-01-02T05:15:05.432625Z lvl=info msg="index opened with 8 partitions" log_id=0mU9CmDW000 service=storage-engine index=tsi
ts=2024-01-02T05:15:05.446352Z lvl=info msg="loading changes (start)" log_id=0mU9CmDW000 service=storage-engine engine=tsm1 op_name="field indices" op_event=start
fatal error: runtime: cannot allocate memory
runtime stack:
runtime.throw({0x40ab064?, 0x1194570?})
/go/src/runtime/panic.go:1047 +0x65 fp=0x3f84fc60 sp=0x3f84fc30 pc=0x43b905
runtime.persistentalloc1(0x3fc0, 0xffffffff00012ad4?, 0x5f00a40)
/go/src/runtime/malloc.go:1440 +0x24f fp=0x3f84fca8 sp=0x3f84fc60 pc=0x41030f
runtime.persistentalloc.func1()
After a while spent trying to figure out what the issue was, I believe InfluxDB was trying to load all of the fields.idxl files into memory at once.
Each of the shard folders under autogen had a very large fields.idxl file. I had a search around GitHub to see if I could find any reference to what these files are, although I could not find any details apart from this issue:
https://github.com/influxdata/influxdb/issues/23653
I tried changing some of the configuration options to stop it loading and processing them all simultaneously, although I had no luck.
Because inserting all the data took around 40 days, I wanted to find a way to work around the issue. This bucket will also be cold storage with no new data added, so if I could get it to work, I would save the roughly 40 days it would take to re-insert everything under a new schema.
Working around the issue
By moving all of the shard folders out of the autogen folder and then putting them back one by one, I could force InfluxDB to load and process only one at a time. To save time, I created a script to do this for me, slowly moving them back in, one by one (a rough sketch is below).
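For reference, here is a minimal sketch of the kind of script I mean. It is not the exact script I used; the paths, the Windows service name (influxdb), the one-restart-per-shard approach, and the "is it done yet" size check are all assumptions, so treat it as an outline rather than something to run as-is:

```python
import shutil
import subprocess
import time
from pathlib import Path

# Assumed layout: the shard folders were moved from "autogen" into a holding folder.
AUTOGEN = Path(r"F:\InfluxData\.influxdbv2\engine\data\990a8c4f7ef61b90\autogen")
HOLDING = Path(r"F:\InfluxData\shards_on_hold")
SERVICE = "influxdb"  # hypothetical Windows service name

for shard in sorted(HOLDING.iterdir()):
    # Stop InfluxDB, move one shard folder back, then start it again so that
    # only this shard's fields.idxl gets processed on startup.
    subprocess.run(["net", "stop", SERVICE], check=True)
    shutil.move(str(shard), str(AUTOGEN / shard.name))
    subprocess.run(["net", "start", SERVICE], check=True)

    # Wait until the shard's large fields.idxl has been processed; in my case the
    # file ended up far smaller afterwards, so a simple size check is enough here.
    idxl = AUTOGEN / shard.name / "fields.idxl"
    while idxl.exists() and idxl.stat().st_size > 10 * 1024 * 1024:
        time.sleep(30)
```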
It would be good to know if there was something I missed in the configuration that would have solved this for me.
After each fields.idxl file was processed into fields.idx, it was significantly smaller.
Before: fields.idxl before processing on InfluxDB start: 3.7 GB
After: 519 KB
There were no memory issues once all the idxl files were processed into idx files.
Next time
Once that was sorted, I investigated other schemas, one suggested by InfluxDB.
Instead of having a measurement and a field for every source tag, there would be a single measurement, a tag whose values identify each of the ~25,000 source tags, and fields for the different data types.
Single Measurement / Multiple Tags / Data Type Fields
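As a rough sketch of the new schema (again with made-up measurement, tag, and field names), the same data point would instead be written with the source tag as a tag value and a field per data type; the write call itself is the same as in the earlier example:

```python
from influxdb_client import Point

# New schema: one shared measurement, the source tag name becomes a tag value,
# and each data type gets its own field (float, string, ...).
point_float = Point("history").tag("tag_name", "TAG_0001").field("value_float", 42.5)
point_string = Point("history").tag("tag_name", "TAG_0002").field("value_string", "OPEN")
```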
To confirm that this was a better schema that avoided the memory issues I was seeing, I spun up two identical instances of InfluxDB and inserted identical data.
I found that the new schema did not have issues with large fields.idxl files after inserting the data.
Below are two separate instances with different schemas and identical data. The one on the right used my original schema and has the large idxl file, and the one on the left uses the new schema and has a tiny idxl file after inserting the data.
During the insert of the data, there were no differences in the CPU or memory resources used.
After restarting InfluxDB, which triggered the processing of the fields.idxl files, there was a significant difference: the original schema (in pink) ran out of memory when trying to process fields.idxl.
The orange line is the new schema (Single Measurement / Multiple Tags / Data Type Fields), which had no issues during testing. I have yet to test the new schema with as much data as the previous attempt, although it looks promising based on the testing I did.
TLDR: If you have memory issues and logs similar to the ones posted above, check whether you have some huge fields.idxl files, see if your schema can be improved, or process the idxl files one by one to work around the issue temporarily.
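If you want a quick way to check, something like this (adjust the engine data path to your install) will list the largest fields.idxl files:

```python
from pathlib import Path

# Adjust to your engine data path.
engine = Path(r"F:\InfluxData\.influxdbv2\engine\data")

# Find every fields.idxl file and print the twenty largest.
idxl_files = sorted(engine.rglob("fields.idxl"), key=lambda p: p.stat().st_size, reverse=True)
for f in idxl_files[:20]:
    print(f"{f.stat().st_size / 1024**2:10.1f} MB  {f}")
```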