Strange disk space usage leak after InfluxDB 1.8.x upgrade

Hello!
So a couple of weeks ago, I upgraded some InfluxDB servers from 1.7.9 to 1.8.0, and a strange disk space usage leak started. (I subsequently upgraded from 1.8.0 to 1.8.2, but it did not make any difference.)

The pattern of metrics ingestion has not changed, nor have my CQs changed, but when I upgraded to 1.8.x, the disk space usage suddenly started growing by somewhere in the range of 2.0 to 3.5 gigabytes PER HOUR.

Note that this growth is happening on the partition where meta, data, and wal are located. The logs, the config, and the rest of the OS (CentOS 7) are all on different partitions and are not affected.

Now here is the really strange part: if I restart the influxdb service, the disk usage suddenly drops back to normal levels. No data is lost, which is why I am describing this rapid disk space growth as a “leak.”

Has anyone else seen behavior like this? Does anyone have any suggestions for debugging this issue?

Check the data directories to identify which DB/RP storage is growing fast:
du -sch /opt/influxdb/data/*/*

I would say that a number of .tsm files are being created, which is rather OK.
But if the disk space is released after a service restart, then it is probably .tsm.tmp files.
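A quick way to check is to look for leftover temp files directly; a sketch, assuming the same data path as the du example above (adjust DATA_DIR to your own [data] dir setting):

```shell
# List any leftover compaction temp files with their sizes, largest first.
# DATA_DIR is an assumption -- point it at your actual data directory.
DATA_DIR="${DATA_DIR:-/opt/influxdb/data}"
find "$DATA_DIR" -name '*.tsm.tmp' -exec du -h {} + 2>/dev/null | sort -rh | head
```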

.tmp files are temporary files created during compaction of two .tsm files, which is also OK; they are automatically deleted after shard compaction.
Depending on how many max-concurrent-compactions are allowed, you may see multiple .tsm.tmp files per directory, but usually it should be fewer than the number of CPU cores.

If there are more .tsm.tmp files than that per shard directory, then there is clearly a problem: it means shard compaction can’t complete successfully, so it is started again, with a new .tmp file created each time.
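To spot which shards are affected, you can count .tsm.tmp files per shard directory; again a sketch, with DATA_DIR as an assumed path:

```shell
# Count .tsm.tmp files per shard directory, worst offenders first.
# Shards with more temp files than max-concurrent-compactions are suspects.
# DATA_DIR is an assumption -- adjust to your data directory.
DATA_DIR="${DATA_DIR:-/opt/influxdb/data}"
find "$DATA_DIR" -name '*.tsm.tmp' 2>/dev/null \
  | sed 's|/[^/]*$||' | sort | uniq -c | sort -rn | head
```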

When you restart the influxdb service, it scans all the shards and automatically removes the .tmp files.
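Since that restart-time scan is what cleans them up, you can also remove the leftovers by hand while the service is stopped; a sketch, with DATA_DIR again an assumed path:

```shell
# Remove leftover compaction temp files. ONLY run this while influxd is
# stopped -- a live compaction is actively writing to its .tmp file.
# DATA_DIR is an assumption -- adjust to your data directory.
DATA_DIR="${DATA_DIR:-/opt/influxdb/data}"
find "$DATA_DIR" -name '*.tsm.tmp' -delete 2>/dev/null || true
```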

Check the influxdb error log for error messages.

Thank you, @anatolijd !
That is very helpful information.

It does appear that I have an accumulation of .tsm.tmp files, and those are the files that vanish when I restart the influxdb service. So your suggestion that shard compaction might be failing does seem likely.

I checked the logs, but could not find anything that looked like an error message. I just see things like this:

Sep  3 04:39:53 hostname-influxdb-1 influxd[10891]: ts=2020-09-03T08:39:53.959708Z lvl=info msg="TSM compaction (start)" log_id=0O~g78PG000 engine=tsm1 tsm1_level=1 tsm1_strategy=level trace_id=0P0mnkPl000 op_name=tsm1_compact_group op_event=start
Sep  3 04:39:53 hostname-influxdb-1 influxd[10891]: ts=2020-09-03T08:39:53.959762Z lvl=info msg="Beginning compaction" log_id=0O~g78PG000 engine=tsm1 tsm1_level=1 tsm1_strategy=level trace_id=0P0mnkPl000 op_name=tsm1_compact_group tsm1_files_n=8
Sep  3 04:39:53 hostname-influxdb-1 influxd[10891]: ts=2020-09-03T08:39:53.959767Z lvl=info msg="Compacting file" log_id=0O~g78PG000 engine=tsm1 tsm1_level=1 tsm1_strategy=level trace_id=0P0mnkPl000 op_name=tsm1_compact_group tsm1_index=0 tsm1_file=/bigdisk0/influxdb/data/telegraf/autogen/1454/000059019-000000001.tsm
Sep  3 04:39:53 hostname-influxdb-1 influxd[10891]: ts=2020-09-03T08:39:53.959772Z lvl=info msg="Compacting file" log_id=0O~g78PG000 engine=tsm1 tsm1_level=1 tsm1_strategy=level trace_id=0P0mnkPl000 op_name=tsm1_compact_group tsm1_index=1 tsm1_file=/bigdisk0/influxdb/data/telegraf/autogen/1454/000059020-000000001.tsm
Sep  3 04:39:53 hostname-influxdb-1 influxd[10891]: ts=2020-09-03T08:39:53.959777Z lvl=info msg="Compacting file" log_id=0O~g78PG000 engine=tsm1 tsm1_level=1 tsm1_strategy=level trace_id=0P0mnkPl000 op_name=tsm1_compact_group tsm1_index=2 tsm1_file=/bigdisk0/influxdb/data/telegraf/autogen/1454/000059021-000000001.tsm
Sep  3 04:39:53 hostname-influxdb-1 influxd[10891]: ts=2020-09-03T08:39:53.959781Z lvl=info msg="Compacting file" log_id=0O~g78PG000 engine=tsm1 tsm1_level=1 tsm1_strategy=level trace_id=0P0mnkPl000 op_name=tsm1_compact_group tsm1_index=3 tsm1_file=/bigdisk0/influxdb/data/telegraf/autogen/1454/000059022-000000001.tsm
Sep  3 04:39:53 hostname-influxdb-1 influxd[10891]: ts=2020-09-03T08:39:53.959786Z lvl=info msg="Compacting file" log_id=0O~g78PG000 engine=tsm1 tsm1_level=1 tsm1_strategy=level trace_id=0P0mnkPl000 op_name=tsm1_compact_group tsm1_index=4 tsm1_file=/bigdisk0/influxdb/data/telegraf/autogen/1454/000059023-000000001.tsm
Sep  3 04:39:53 hostname-influxdb-1 influxd[10891]: ts=2020-09-03T08:39:53.959790Z lvl=info msg="Compacting file" log_id=0O~g78PG000 engine=tsm1 tsm1_level=1 tsm1_strategy=level trace_id=0P0mnkPl000 op_name=tsm1_compact_group tsm1_index=5 tsm1_file=/bigdisk0/influxdb/data/telegraf/autogen/1454/000059024-000000001.tsm
Sep  3 04:39:53 hostname-influxdb-1 influxd[10891]: ts=2020-09-03T08:39:53.959794Z lvl=info msg="Compacting file" log_id=0O~g78PG000 engine=tsm1 tsm1_level=1 tsm1_strategy=level trace_id=0P0mnkPl000 op_name=tsm1_compact_group tsm1_index=6 tsm1_file=/bigdisk0/influxdb/data/telegraf/autogen/1454/000059025-000000001.tsm
Sep  3 04:39:53 hostname-influxdb-1 influxd[10891]: ts=2020-09-03T08:39:53.959799Z lvl=info msg="Compacting file" log_id=0O~g78PG000 engine=tsm1 tsm1_level=1 tsm1_strategy=level trace_id=0P0mnkPl000 op_name=tsm1_compact_group tsm1_index=7 tsm1_file=/bigdisk0/influxdb/data/telegraf/autogen/1454/000059026-000000001.tsm
Sep  3 04:39:56 hostname-influxdb-1 influxd[10891]: ts=2020-09-03T08:39:56.447775Z lvl=info msg="Compacted file" log_id=0O~g78PG000 engine=tsm1 tsm1_level=1 tsm1_strategy=level trace_id=0P0mnkPl000 op_name=tsm1_compact_group tsm1_index=0 tsm1_file=/bigdisk0/influxdb/data/telegraf/autogen/1454/000059026-000000002.tsm.tmp
Sep  3 04:39:56 hostname-influxdb-1 influxd[10891]: ts=2020-09-03T08:39:56.447829Z lvl=info msg="Finished compacting files" log_id=0O~g78PG000 engine=tsm1 tsm1_level=1 tsm1_strategy=level trace_id=0P0mnkPl000 op_name=tsm1_compact_group tsm1_files_n=1
Sep  3 04:39:56 hostname-influxdb-1 influxd[10891]: ts=2020-09-03T08:39:56.447838Z lvl=info msg="TSM compaction (end)" log_id=0O~g78PG000 engine=tsm1 tsm1_level=1 tsm1_strategy=level trace_id=0P0mnkPl000 op_name=tsm1_compact_group op_event=end op_elapsed=2488.134ms

I have increased the log level to debug, so I’ll wait a while and see if that gives me any useful error logs.

Any other ideas on how to investigate why shard compaction might have started failing suddenly with 1.8.0?

Even at debug log level, there is no indication of what might be going wrong in the log file.

Any other ideas of how to troubleshoot this?

I can’t get any debug or trace level logging at all.
My influxdb.conf file has:

trace-logging-enabled = true

in the [data] section and

level = "debug"

in the [logging] section, but in the log file, I am not seeing a single message of any kind for any log level other than “info”

grep -v "lvl=info" influxdb.log

^ that grep command finds zero matches.
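To double-check that nothing above info level is sneaking through anywhere, here is a tally of every lvl= value in the log; the log filename is an assumption, adjust to wherever your influxd output lands:

```shell
# Count log lines by level. If only "lvl=info" appears, debug/trace output
# is genuinely not reaching this file. LOG_FILE is an assumption.
LOG_FILE="${LOG_FILE:-influxdb.log}"
grep -o 'lvl=[a-z]*' "$LOG_FILE" 2>/dev/null | sort | uniq -c
```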

So I am still no closer to debugging why my shard compaction might be failing and leaving behind .tsm.tmp files.

So I wanted to update y’all on this.

I never did get any further debug/trace information to show up in the log.

A couple of days ago, the disk space growth abruptly stopped. I had been restarting the influxdb servers every few days to avoid running out of space.

You can see in this screenshot that in the upper graph, the disk usage drops back to normal after each restart, then grows rapidly, but somewhat unpredictably, until I restart again. At the most recent restart, the growth just stops.

The tiny spikes each night are the nightly backup script. You can see from the bottom graph that I moved the nightly backup files to a different disk. That seems to be the only change that correlates with the growth stopping.

I am struggling to guess how the nightly backup could have triggered it. The backup script ran for over a year without ever causing this problem, and the leak didn’t start until the upgrade to 1.8.x. Also, out of six influxdb servers, all of which were upgraded to 1.8.x at the same time and all of which use the same backup script, only two exhibited the strange disk space leak.

tl;dr: it stopped happening on its own, I can only guess as to why it happened, and I am not likely to figure out why unless it starts happening again :frowning: