Crash loop with "cannot allocate memory"

I have an InfluxDB database that has recently entered a crash loop, producing the following log messages each time. After the lines pasted here it prints some lengthy stack traces.

The database is admittedly somewhat large (we recently made a change that pushed it from 21 million series to 23 million series, ironically by dropping some tags on new data). Assuming the ultimate cause of the crash is “your database is too large”, do you have any advice on how to recover what we’ve got? We could give up the problematic tags on old data, but as far as I know, the only way to do that is to export, modify, and re-import, which I think would be pretty cumbersome. This is with InfluxDB 1.8.3.
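
For reference, the export / modify / re-import path I'm imagining looks roughly like the sketch below. The database name, paths, and tag keys are placeholders rather than our real ones, and I'd double-check the exact influx_inspect flags against its -help output before trusting this:

#!/usr/bin/env python3
"""Rough sketch: strip unwanted tags from an influx_inspect export.

Assumed surrounding commands (paths and names are placeholders):
    influx_inspect export -datadir /var/lib/influxdb/data \
        -waldir /var/lib/influxdb/wal -database mydb -out export.txt
    python3 strip_tags.py export.txt filtered.txt
    influx -import -path=filtered.txt -precision=ns

The export's "#" header lines (DDL/DML sections) are passed through untouched
so that influx -import still knows which database to write into. This naive
split does not handle commas or spaces escaped with a backslash inside tag
values; real data may need a proper line-protocol parser.
"""
import sys

DROP_TAGS = {"request_id", "host_detail"}     # hypothetical tag keys to drop

def strip_tags(line: str) -> str:
    # Line protocol: measurement[,tag=value...] field=value[,...] [timestamp]
    if not line or line.startswith("#"):
        return line                           # keep comment/header lines as-is
    head, sep, rest = line.partition(" ")
    parts = head.split(",")                   # measurement followed by tags
    kept = [parts[0]] + [p for p in parts[1:] if p.split("=", 1)[0] not in DROP_TAGS]
    return ",".join(kept) + sep + rest

if __name__ == "__main__":
    src, dst = sys.argv[1], sys.argv[2]
    with open(src) as fin, open(dst, "w") as fout:
        for raw in fin:
            fout.write(strip_tags(raw.rstrip("\n")) + "\n")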

Nov 04 09:37:39 xenial-template influxd[12350]: ts=2020-11-04T17:37:39.800669Z lvl=info msg="Error adding new TSM files from snapshot. Removing temp files." log_id=0QGxy7u0000 engine=tsm1 trace_id=0QH5D8V0000 op_name=tsm1_cache_snapshot error="cannot allocate memory"
Nov 04 09:37:39 xenial-template influxd[12350]: ts=2020-11-04T17:37:39.802607Z lvl=info msg="Cache snapshot (end)" log_id=0QGxy7u0000 engine=tsm1 trace_id=0QH5D8V0000 op_name=tsm1_cache_snapshot op_event=end op_elapsed=1373.912ms
Nov 04 09:37:39 xenial-template influxd[12350]: ts=2020-11-04T17:37:39.802628Z lvl=info msg="Error writing snapshot" log_id=0QGxy7u0000 engine=tsm1 error="cannot allocate memory"
Nov 04 09:37:39 xenial-template influxd[12350]: ts=2020-11-04T17:37:39.802641Z lvl=info msg="Cache snapshot (start)" log_id=0QGxy7u0000 engine=tsm1 trace_id=0QH5DDrW000 op_name=tsm1_cache_snapshot op_event=start
Nov 04 09:37:40 xenial-template influxd[12350]: fatal error: runtime: cannot allocate memory
Nov 04 09:37:40 xenial-template influxd[12350]: runtime stack:

Hello @ezquat,
Wow! 23 M series! I’m not sure what the best solution is. I’m sharing your question with the team. Thank you for your patience.

What we have done for the moment is to move half of the shard files out of the data directory and bring Influx back up on only the more recent ones, which we care more about.

This is kind of satisfactory for now, but doesn't feel very good as a general strategy. Our database grows over time and then we hit these cliffs where it starts to misbehave. We have to react quickly, and we have to trust that the server will actually come up on top of shard files that we have moved around by hand.
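
In case it helps anyone else, the "park the older shards" step was essentially the sketch below. The database and retention-policy names and the cutoff are placeholders, it assumes the default 1.x layout under /var/lib/influxdb, influxd must be stopped first, and directory mtime is only a crude stand-in for shard age (SHOW SHARDS gives the authoritative time ranges):

#!/usr/bin/env python3
"""Rough sketch of moving older shards out of the way (nothing is deleted)."""
import shutil, time
from pathlib import Path

DB, RP = "mydb", "autogen"                    # placeholder database / retention policy
CUTOFF = time.time() - 180 * 24 * 3600        # park shards untouched for ~6 months
PARKING = Path("/var/lib/influxdb/parked")    # somewhere on the same filesystem

for root in ("data", "wal"):                  # shards live under both data/ and wal/
    for shard in Path("/var/lib/influxdb", root, DB, RP).iterdir():
        if shard.is_dir() and shard.stat().st_mtime < CUTOFF:
            dest = PARKING / root / DB / RP / shard.name
            dest.parent.mkdir(parents=True, exist_ok=True)
            print(f"parking {shard} -> {dest}")
            shutil.move(str(shard), str(dest))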


@ezquat,
Thank you for sharing. I’ve shared your question with the team, but there might be a longer than usual delay because of the 2.0 GA release and InfluxDays.