Thank you, @anatolijd!
That is very helpful information.
It does appear that I have an accumulation of .tsm.tmp files, and those are the files that vanish when I restart the influxdb service. So your suggestion that shard compaction might be failing does seem likely.
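For anyone who wants to check the same thing on their own system, here is a minimal sketch of how the leftover files could be counted per shard directory. The function name `count_tmp_files` is just illustrative, and the data path in the usage comment is taken from my log paths, so adjust it to your own setup.

```python
import os
from collections import Counter


def count_tmp_files(data_dir):
    """Return a Counter mapping each directory to its number of *.tsm.tmp files."""
    counts = Counter()
    for root, _dirs, files in os.walk(data_dir):
        n = sum(1 for name in files if name.endswith(".tsm.tmp"))
        if n:
            counts[root] = n
    return counts


# Example usage (path is an assumption based on my setup):
# for shard_dir, n in count_tmp_files("/bigdisk0/influxdb/data").most_common():
#     print(n, shard_dir)
```

Running this periodically and comparing the counts should show whether the .tsm.tmp files keep growing in particular shards or across all of them.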
I checked the logs, but could not find anything that looked like an error message. I just see things like this:
Sep 3 04:39:53 hostname-influxdb-1 influxd[10891]: ts=2020-09-03T08:39:53.959708Z lvl=info msg="TSM compaction (start)" log_id=0O~g78PG000 engine=tsm1 tsm1_level=1 tsm1_strategy=level trace_id=0P0mnkPl000 op_name=tsm1_compact_group op_event=start
Sep 3 04:39:53 hostname-influxdb-1 influxd[10891]: ts=2020-09-03T08:39:53.959762Z lvl=info msg="Beginning compaction" log_id=0O~g78PG000 engine=tsm1 tsm1_level=1 tsm1_strategy=level trace_id=0P0mnkPl000 op_name=tsm1_compact_group tsm1_files_n=8
Sep 3 04:39:53 hostname-influxdb-1 influxd[10891]: ts=2020-09-03T08:39:53.959767Z lvl=info msg="Compacting file" log_id=0O~g78PG000 engine=tsm1 tsm1_level=1 tsm1_strategy=level trace_id=0P0mnkPl000 op_name=tsm1_compact_group tsm1_index=0 tsm1_file=/bigdisk0/influxdb/data/telegraf/autogen/1454/000059019-000000001.tsm
Sep 3 04:39:53 hostname-influxdb-1 influxd[10891]: ts=2020-09-03T08:39:53.959772Z lvl=info msg="Compacting file" log_id=0O~g78PG000 engine=tsm1 tsm1_level=1 tsm1_strategy=level trace_id=0P0mnkPl000 op_name=tsm1_compact_group tsm1_index=1 tsm1_file=/bigdisk0/influxdb/data/telegraf/autogen/1454/000059020-000000001.tsm
Sep 3 04:39:53 hostname-influxdb-1 influxd[10891]: ts=2020-09-03T08:39:53.959777Z lvl=info msg="Compacting file" log_id=0O~g78PG000 engine=tsm1 tsm1_level=1 tsm1_strategy=level trace_id=0P0mnkPl000 op_name=tsm1_compact_group tsm1_index=2 tsm1_file=/bigdisk0/influxdb/data/telegraf/autogen/1454/000059021-000000001.tsm
Sep 3 04:39:53 hostname-influxdb-1 influxd[10891]: ts=2020-09-03T08:39:53.959781Z lvl=info msg="Compacting file" log_id=0O~g78PG000 engine=tsm1 tsm1_level=1 tsm1_strategy=level trace_id=0P0mnkPl000 op_name=tsm1_compact_group tsm1_index=3 tsm1_file=/bigdisk0/influxdb/data/telegraf/autogen/1454/000059022-000000001.tsm
Sep 3 04:39:53 hostname-influxdb-1 influxd[10891]: ts=2020-09-03T08:39:53.959786Z lvl=info msg="Compacting file" log_id=0O~g78PG000 engine=tsm1 tsm1_level=1 tsm1_strategy=level trace_id=0P0mnkPl000 op_name=tsm1_compact_group tsm1_index=4 tsm1_file=/bigdisk0/influxdb/data/telegraf/autogen/1454/000059023-000000001.tsm
Sep 3 04:39:53 hostname-influxdb-1 influxd[10891]: ts=2020-09-03T08:39:53.959790Z lvl=info msg="Compacting file" log_id=0O~g78PG000 engine=tsm1 tsm1_level=1 tsm1_strategy=level trace_id=0P0mnkPl000 op_name=tsm1_compact_group tsm1_index=5 tsm1_file=/bigdisk0/influxdb/data/telegraf/autogen/1454/000059024-000000001.tsm
Sep 3 04:39:53 hostname-influxdb-1 influxd[10891]: ts=2020-09-03T08:39:53.959794Z lvl=info msg="Compacting file" log_id=0O~g78PG000 engine=tsm1 tsm1_level=1 tsm1_strategy=level trace_id=0P0mnkPl000 op_name=tsm1_compact_group tsm1_index=6 tsm1_file=/bigdisk0/influxdb/data/telegraf/autogen/1454/000059025-000000001.tsm
Sep 3 04:39:53 hostname-influxdb-1 influxd[10891]: ts=2020-09-03T08:39:53.959799Z lvl=info msg="Compacting file" log_id=0O~g78PG000 engine=tsm1 tsm1_level=1 tsm1_strategy=level trace_id=0P0mnkPl000 op_name=tsm1_compact_group tsm1_index=7 tsm1_file=/bigdisk0/influxdb/data/telegraf/autogen/1454/000059026-000000001.tsm
Sep 3 04:39:56 hostname-influxdb-1 influxd[10891]: ts=2020-09-03T08:39:56.447775Z lvl=info msg="Compacted file" log_id=0O~g78PG000 engine=tsm1 tsm1_level=1 tsm1_strategy=level trace_id=0P0mnkPl000 op_name=tsm1_compact_group tsm1_index=0 tsm1_file=/bigdisk0/influxdb/data/telegraf/autogen/1454/000059026-000000002.tsm.tmp
Sep 3 04:39:56 hostname-influxdb-1 influxd[10891]: ts=2020-09-03T08:39:56.447829Z lvl=info msg="Finished compacting files" log_id=0O~g78PG000 engine=tsm1 tsm1_level=1 tsm1_strategy=level trace_id=0P0mnkPl000 op_name=tsm1_compact_group tsm1_files_n=1
Sep 3 04:39:56 hostname-influxdb-1 influxd[10891]: ts=2020-09-03T08:39:56.447838Z lvl=info msg="TSM compaction (end)" log_id=0O~g78PG000 engine=tsm1 tsm1_level=1 tsm1_strategy=level trace_id=0P0mnkPl000 op_name=tsm1_compact_group op_event=end op_elapsed=2488.134ms
I have increased the log level to debug, so I’ll wait a while and see whether that produces any useful error logs.
Any other ideas on how to investigate why shard compaction might have suddenly started failing with 1.8.0?
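In the meantime, one thing I may try is scanning the log for compactions that log a start event but never a matching end event. Below is a rough sketch of that idea, not a tested tool. It relies only on the `op_name=tsm1_compact_group`, `trace_id=` and `op_event=start`/`op_event=end` fields visible in the log excerpt above, and it assumes each log entry is on a single unwrapped line. The function name `unfinished_compactions` is my own.

```python
import re

TRACE_RE = re.compile(r"\btrace_id=(\S+)")
EVENT_RE = re.compile(r"\bop_event=(start|end)\b")


def unfinished_compactions(lines):
    """Return {trace_id: start_line} for compactions with a start but no end."""
    open_ops = {}
    for line in lines:
        # Only look at compaction-group entries.
        if "op_name=tsm1_compact_group" not in line:
            continue
        trace = TRACE_RE.search(line)
        event = EVENT_RE.search(line)
        if not trace or not event:
            continue  # e.g. "Compacting file" entries carry no op_event
        if event.group(1) == "start":
            open_ops[trace.group(1)] = line.rstrip("\n")
        else:
            open_ops.pop(trace.group(1), None)
    return open_ops


# Example usage (the log path is an assumption; use wherever influxd logs go):
# with open("/var/log/syslog") as f:
#     for trace_id, start_line in unfinished_compactions(f).items():
#         print(trace_id, start_line)
```

Any trace IDs it reports would be compactions that were still running, or that stopped without logging an end, at the point the log was read, which might narrow down which shards to look at.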