Out of Memory on Startup 2.7.1

Hey everyone, I’ve been having a significant problem with my InfluxDB Docker setup. Every time I start the container, its memory usage skyrockets and it exits with code 137 (out of memory) before the service ever comes up. We have hundreds of gigabytes of data stored, so loading it all into memory is not an option. What can I do to reduce the startup load and get InfluxDB running without triggering an OOM error?
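
For anyone hitting the same thing, a quick way to confirm that exit code 137 really is the kernel OOM killer (and not something else returning 137) is to check the container state after it dies. The container name here is just what I use in my setup:

docker inspect influxdb --format '{{.State.OOMKilled}} {{.State.ExitCode}}'
# prints "true 137" when the process was killed by the OOM killer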

I’ve found a few resources that mention out-of-memory problems: influxdb out of memory · Issue #13318 · influxdata/influxdb · GitHub and https://github.com/influxdata/influxdb/issues/24128. I’ve tried adjusting the configuration parameters to no avail. I’ve also tried setting GODEBUG=madvdontneed=1, but that didn’t help either.
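
In case it matters, this is roughly how the environment variable was passed to the container (the container name, volume, and image tag are placeholders; adapt to your own setup):

docker run -d --name influxdb -p 8086:8086 \
  -e GODEBUG=madvdontneed=1 \
  -v influxdb2-data:/var/lib/influxdb2 \
  influxdb:2.7.1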

Here is the end of the docker logs:

2023-05-02 10:06:17 ts=2023-05-02T16:06:17.094979Z lvl=info msg="loading changes (start)" log_id=0hZJEXvW000 service=storage-engine engine=tsm1 op_name="field indices" op_event=start
2023-05-02 10:06:17 ts=2023-05-02T16:06:17.105109Z lvl=info msg="Opened file" log_id=0hZJEXvW000 service=storage-engine engine=tsm1 service=filestore path=/var/lib/influxdb2/engine/data/37612e6f882fd383/autogen/109/000000024-000000003.tsm id=0 duration=61.675ms
2023-05-02 10:06:17 ts=2023-05-02T16:06:17.119459Z lvl=info msg="Opened shard" log_id=0hZJEXvW000 service=storage-engine service=store op_name=tsdb_open index_version=tsi1 path=/var/lib/influxdb2/engine/data/37612e6f882fd383/autogen/109 duration=42524.077ms
2023-05-02 10:06:17 ts=2023-05-02T16:06:17.528501Z lvl=info msg="index opened with 8 partitions" log_id=0hZJEXvW000 service=storage-engine index=tsi
2023-05-02 10:06:17 ts=2023-05-02T16:06:17.537640Z lvl=info msg="loading changes (start)" log_id=0hZJEXvW000 service=storage-engine engine=tsm1 op_name="field indices" op_event=start
2023-05-02 10:06:17 ts=2023-05-02T16:06:17.791727Z lvl=info msg="loading changes (end)" log_id=0hZJEXvW000 service=storage-engine engine=tsm1 op_name="field indices" op_event=end op_elapsed=696.751ms
2023-05-02 10:06:17 ts=2023-05-02T16:06:17.847755Z lvl=info msg="Opened file" log_id=0hZJEXvW000 service=storage-engine engine=tsm1 service=filestore path=/var/lib/influxdb2/engine/data/37612e6f882fd383/autogen/292/000000001-000000001.tsm id=0 duration=9.282ms
2023-05-02 10:06:17 ts=2023-05-02T16:06:17.863249Z lvl=info msg="Opened shard" log_id=0hZJEXvW000 service=storage-engine service=store op_name=tsdb_open index_version=tsi1 path=/var/lib/influxdb2/engine/data/37612e6f882fd383/autogen/292 duration=1049.864ms
2023-05-02 10:06:18 ts=2023-05-02T16:06:18.341013Z lvl=info msg="index opened with 8 partitions" log_id=0hZJEXvW000 service=storage-engine index=tsi
2023-05-02 10:06:18 ts=2023-05-02T16:06:18.344178Z lvl=info msg="loading changes (start)" log_id=0hZJEXvW000 service=storage-engine engine=tsm1 op_name="field indices" op_event=start

After some more investigation, this appears to be a WAL issue. Based on some other threads like InfluxDB 1.7.4 fails after 9 months without issues and [0.9.3] WAL gets progressively slower as DB size increases · Issue #3885 · influxdata/influxdb · GitHub, it seems that the WAL files grow progressively larger as new data is added and never get flushed. Then, when the database starts again, it tries to load the entire WAL into memory to compact it, which results in an OOM error.
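
One way to sanity-check this theory (assuming the default 2.x data layout under /var/lib/influxdb2) is to look at how much WAL is actually sitting on disk before startup:

du -sh /var/lib/influxdb2/engine/wal
du -sh /var/lib/influxdb2/engine/wal/*/*/* | sort -h | tail
# each shard has its own WAL directory; if these are huge, startup has to replay them all into the cache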

I’ve tried altering all the storage configurations, but nothing seems to prevent Influx from reading all the data into memory at once. Any help with this?
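
For reference, these are the kinds of storage settings I was experimenting with, passed as INFLUXD_* environment variables (the values shown are illustrative, not my exact ones), and none of them stopped the startup load:

docker run -d --name influxdb \
  -e INFLUXD_STORAGE_CACHE_MAX_MEMORY_SIZE=536870912 \
  -e INFLUXD_STORAGE_CACHE_SNAPSHOT_MEMORY_SIZE=26214400 \
  -e INFLUXD_STORAGE_MAX_CONCURRENT_COMPACTIONS=1 \
  -v influxdb2-data:/var/lib/influxdb2 \
  influxdb:2.7.1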

Hello @Adam_Ten_Hoeve,
Thank you for sharing what you’ve found.
I think you’re right.
Unfortunately, operator control over memory usage is not something these versions of InfluxDB offer. We’re looking to offer that with 3.0.

Same issue here with InfluxDB 2.7.
Even increasing swap to 40GB (!) did not help. The Docker instance crashes because of excessive memory usage at startup. I tried many memory-related environment configs as well, but without success.
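
For completeness, the limits were applied along these lines (numbers illustrative; with docker run, --memory-swap is memory plus swap):

docker run -d --name influxdb \
  --memory=8g --memory-swap=48g \
  -v influxdb2-data:/var/lib/influxdb2 \
  influxdb:2.7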

@Adam_Ten_Hoeve did you find a solution in the meantime?
@Anaisdg Is there nothing we can do about this until 3.0 is released for OSS?
Is there any indication of when 3.0 OSS will be released to the public?

Having the same issue here. Any help would be greatly appreciated.

I don’t suppose we can just delete the WAL files, can we?

I’ve run into the same issue too. Are there any solutions that don’t involve increasing physical memory?