InfluxDB v2.7.0 Freezes Time After Time

A very big hug to the community!

I'm having problems with InfluxDB freezing with an “internal error”. I suspect it may be related to reindexing, but so far I haven't been able to pin the problem down.

So I'd be interested to hear from anyone who has resolved a problem like this, or who has ideas on how to work around it. Thanks!

Some logs:

[user@monitoring ~]$ journalctl  --since "2023-10-11 02:07" --until "2023-10-11 02:10" | grep -i 'influxd-systemd-start'
Oct 11 02:08:06 monitoring influxd-systemd-start.sh[3174967]: ts=2023-10-11T02:08:06.254103Z lvl=info msg="index opened with 8 partitions" log_id=0knAdzwG000 service=storage-engine index=tsi
Oct 11 02:08:06 monitoring influxd-systemd-start.sh[3174967]: ts=2023-10-11T02:08:06.254484Z lvl=info msg="loading changes (start)" log_id=0knAdzwG000 service=storage-engine engine=tsm1 op_name="field indices" op_event=start
Oct 11 02:08:06 monitoring influxd-systemd-start.sh[3174967]: ts=2023-10-11T02:08:06.254530Z lvl=info msg="loading changes (end)" log_id=0knAdzwG000 service=storage-engine engine=tsm1 op_name="field indices" op_event=end op_elapsed=0.055ms
Oct 11 02:08:06 monitoring influxd-systemd-start.sh[3174967]: ts=2023-10-11T02:08:06.255025Z lvl=info msg="Reindexing TSM data" log_id=0knAdzwG000 service=storage-engine engine=tsm1 db_shard_id=809
Oct 11 02:08:06 monitoring influxd-systemd-start.sh[3174967]: ts=2023-10-11T02:08:06.255042Z lvl=info msg="Reindexing WAL data" log_id=0knAdzwG000 service=storage-engine engine=tsm1 db_shard_id=809
Oct 11 02:08:06 monitoring influxd-systemd-start.sh[3174967]: ts=2023-10-11T02:08:06.271975Z lvl=info msg="saving field index changes (start)" log_id=0knAdzwG000 service=storage-engine engine=tsm1 op_name=MeasurementFieldSet op_event=start
Oct 11 02:08:06 monitoring influxd-systemd-start.sh[3174967]: ts=2023-10-11T02:08:06.277287Z lvl=info msg="saving field index changes (end)" log_id=0knAdzwG000 service=storage-engine engine=tsm1 op_name=MeasurementFieldSet op_event=end op_elapsed=5.323ms
Oct 11 02:08:06 monitoring influxd-systemd-start.sh[3174967]: ts=2023-10-11T02:08:06.643584Z lvl=warn msg="internal error not returned to client" log_id=0knAdzwG000 handler=error_logger error="context canceled"
Oct 11 02:08:06 monitoring influxd-systemd-start.sh[3174967]: ts=2023-10-11T02:08:06.644776Z lvl=warn msg="internal error not returned to client" log_id=0knAdzwG000 handler=error_logger error="context canceled"
Oct 11 02:08:11 monitoring influxd-systemd-start.sh[3174967]: ts=2023-10-11T02:08:11.634037Z lvl=warn msg="internal error not returned to client" log_id=0knAdzwG000 handler=error_logger error="context canceled"
Oct 11 02:08:11 monitoring influxd-systemd-start.sh[3174967]: ts=2023-10-11T02:08:11.642932Z lvl=warn msg="internal error not returned to client" log_id=0knAdzwG000 handler=error_logger error="context canceled"
Oct 11 02:08:26 monitoring influxd-systemd-start.sh[3174967]: ts=2023-10-11T02:08:26.646839Z lvl=warn msg="internal error not returned to client" log_id=0knAdzwG000 handler=error_logger error="context canceled"
Oct 11 02:08:26 monitoring influxd-systemd-start.sh[3174967]: ts=2023-10-11T02:08:26.650471Z lvl=warn msg="internal error not returned to client" log_id=0knAdzwG000 handler=error_logger error="context canceled"
Oct 11 02:08:56 monitoring influxd-systemd-start.sh[3174967]: ts=2023-10-11T02:08:56.655019Z lvl=warn msg="internal error not returned to client" log_id=0knAdzwG000 handler=error_logger error="context canceled"

Sometimes I also see:

Oct 12 07:21:02 monitoring influxd-systemd-start.sh[3231510]: ts=2023-10-12T07:21:02.395281Z lvl=info msg="Cache snapshot (start)" log_id=0komBSfl000 service=storage-engine engine=tsm1 op_name=tsm1_cache_snapshot op_event=start
Oct 12 07:21:02 monitoring influxd-systemd-start.sh[3231510]: ts=2023-10-12T07:21:02.672531Z lvl=info msg="Snapshot for path written" log_id=0komBSfl000 service=storage-engine engine=tsm1 op_name=tsm1_cache_snapshot path=/opt/influxdb/engine/data/f34d8a2245cccd9f/autogen/812 duration=277.273ms
Oct 12 07:21:02 monitoring influxd-systemd-start.sh[3231510]: ts=2023-10-12T07:21:02.672577Z lvl=info msg="Cache snapshot (end)" log_id=0komBSfl000 service=storage-engine engine=tsm1 op_name=tsm1_cache_snapshot op_event=end op_elapsed=277.325ms
Oct 12 07:22:32 monitoring influxd-systemd-start.sh[3231510]: ts=2023-10-12T07:22:32.198702Z lvl=info msg="index opened with 8 partitions" log_id=0komBSfl000 service=storage-engine index=tsi
Oct 12 07:22:32 monitoring influxd-systemd-start.sh[3231510]: ts=2023-10-12T07:22:32.199096Z lvl=info msg="loading changes (start)" log_id=0komBSfl000 service=storage-engine engine=tsm1 op_name="field indices" op_event=start
Oct 12 07:22:32 monitoring influxd-systemd-start.sh[3231510]: ts=2023-10-12T07:22:32.199180Z lvl=info msg="loading changes (end)" log_id=0komBSfl000 service=storage-engine engine=tsm1 op_name="field indices" op_event=end op_elapsed=0.101ms
Oct 12 07:22:32 monitoring influxd-systemd-start.sh[3231510]: ts=2023-10-12T07:22:32.199927Z lvl=info msg="Reindexing TSM data" log_id=0komBSfl000 service=storage-engine engine=tsm1 db_shard_id=811
Oct 12 07:22:32 monitoring influxd-systemd-start.sh[3231510]: ts=2023-10-12T07:22:32.199945Z lvl=info msg="Reindexing WAL data" log_id=0komBSfl000 service=storage-engine engine=tsm1 db_shard_id=811
Oct 12 07:22:32 monitoring influxd-systemd-start.sh[3231510]: ts=2023-10-12T07:22:32.213748Z lvl=info msg="saving field index changes (start)" log_id=0komBSfl000 service=storage-engine engine=tsm1 op_name=MeasurementFieldSet op_event=start
Oct 12 07:22:32 monitoring influxd-systemd-start.sh[3231510]: ts=2023-10-12T07:22:32.221870Z lvl=info msg="saving field index changes (end)" log_id=0komBSfl000 service=storage-engine engine=tsm1 op_name=MeasurementFieldSet op_event=end op_elapsed=8.132ms
Oct 12 07:22:32 monitoring influxd-systemd-start.sh[3231510]: ts=2023-10-12T07:22:32.221942Z lvl=info msg="Write failed" log_id=0komBSfl000 service=storage-engine service=write shard=811 error="engine: context canceled"

Config:

bolt-path = "/var/lib/influxdb/influxd.bolt"
engine-path = "/opt/influxdb/engine"
query-concurrency = 2048
query-initial-memory-bytes = 0
query-max-memory-bytes = 0
query-memory-bytes = 0
query-queue-size = 2048
storage-compact-throughput-burst = 134217728
storage-series-id-set-cache-size = 1600

Hello @Azimuth,
Hmm, I'm not sure; I haven't encountered this issue myself, but I've seen other users run into it. Have you investigated any of the solutions mentioned here:

Does it happen while you're running queries? What queries are you running?
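One way to narrow that down is to have InfluxDB log each Flux query as it runs, then correlate those entries with the "internal error not returned to client" warnings in journalctl. A minimal config sketch; the option names come from the stock InfluxDB 2.x config, but the timeout values are purely illustrative, not recommendations:

```toml
# Log every executed Flux query so it can be matched against the
# "context canceled" warnings in the journal.
flux-log-enabled = true

# Optional: cap how long the server waits on a slow client read/write.
# The defaults are 0 (no timeout); these values are examples only.
http-read-timeout = "30s"
http-write-timeout = "30s"
```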
Are you using Grafana?
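I ask because "context canceled" usually means the client closed the connection (e.g., a dashboard query timing out and being retried) before the server finished. Separately, since the warnings show up right around shard reindexing, it might also be worth verifying the TSM and WAL files offline. A sketch, assuming the engine-path from your config and that the service is stopped first; the `influxd inspect` subcommands exist in InfluxDB 2.x, but double-check the flags against your version:

```shell
#!/bin/sh
# Offline integrity check: run while the influxdb service is stopped,
# e.g. after `sudo systemctl stop influxdb`.
ENGINE=/opt/influxdb/engine

if command -v influxd >/dev/null 2>&1; then
    # Walk the TSM files under the engine path and report any corruption.
    influxd inspect verify-tsm --engine-path "$ENGINE"
    # Check the WAL entries the same way.
    influxd inspect verify-wal --wal-path "$ENGINE/wal"
else
    echo "influxd not found in PATH"
fi
```

If both checks come back clean, that would point away from on-disk corruption and more toward client-side timeouts.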