Strange disk space usage leak after InfluxDB 1.8.x upgrade

Hello!
So a couple of weeks ago, I upgraded some InfluxDB servers from 1.7.9 to 1.8.0, and a strange disk space usage leak started. (I subsequently upgraded from 1.8.0 to 1.8.2, but it did not make any difference.)

The pattern of metrics ingestion has not changed, nor have my CQs changed, but when I upgraded to 1.8.x, the disk space usage suddenly started growing by somewhere in the range of 2.0 to 3.5 gigabytes PER HOUR.

Note that this growth is happening on the partition where meta, data and wal are located. The logs, the config, and the rest of the OS (CentOS 7) are all on different partitions, and are not affected.

Now here is the really strange part. If I restart the influxdb service, the disk usage suddenly drops back to normal levels. No data is lost, which is why I am describing this rapid disk space growth as a “leak”.

Has anyone else seen behavior like this? Does anyone have any suggestions for debugging this issue?


Check the data directories to identify which DB/RP storage is growing fast:
du -sch /opt/influxdb/data/*/*

I would guess there are a number of .tsm files being created, which is normal.
But if the disk space is released after a service restart, then it is probably .tsm.tmp files.

.tmp files are temporary files created during compaction of two .tsm files, which is also OK; they are automatically deleted after shard compaction.
Depending on how many max-concurrent-compactions are allowed, you may see multiple .tsm.tmp files per directory, but usually this should be fewer than the number of CPU cores.
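(For reference, that limit lives in the [data] section of influxdb.conf; if I remember right, the default of 0 lets InfluxDB pick a limit from the available CPU cores:)

[data]
  # 0 = let InfluxDB choose a concurrency limit based on CPU cores
  max-concurrent-compactions = 0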

If there are more .tsm.tmp files than that per shard directory, then there is clearly a problem: it means shard compaction can't complete successfully and is started again, with a new .tmp file created each time.
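A quick way to spot that, assuming the same data path as above (adjust it to your layout):

# shard directories with the most leftover .tsm.tmp files
find /opt/influxdb/data -name '*.tsm.tmp' -printf '%h\n' | sort | uniq -c | sort -rn | head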

When you restart the influxdb service, it scans all the shards and automatically removes the .tmp files.

Check the influxdb error log for error messages.
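For example, on a systemd-based install something like this should surface them (the unit name may differ on your system):

journalctl -u influxdb --since "24 hours ago" | grep -Ei 'lvl=(warn|error)'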

Thank you, @anatolijd !
That is very helpful information.

It does appear that I have an accumulation of .tsm.tmp files, and those are the files that vanish when I restart the influxdb service. So your suggestion that shard compaction might be failing does seem likely.

I checked the logs, but could not find anything that looked like an error message. I just see things like this:

Sep  3 04:39:53 hostname-influxdb-1 influxd[10891]: ts=2020-09-03T08:39:53.959708Z lvl=info msg="TSM compaction (start)" log_id=0O~g78PG000 engine=tsm1 tsm1_level=1 tsm1_strategy=level trace_id=0P0mnkPl000 op_name=tsm1_compact_group op_event=start
Sep  3 04:39:53 hostname-influxdb-1 influxd[10891]: ts=2020-09-03T08:39:53.959762Z lvl=info msg="Beginning compaction" log_id=0O~g78PG000 engine=tsm1 tsm1_level=1 tsm1_strategy=level trace_id=0P0mnkPl000 op_name=tsm1_compact_group tsm1_files_n=8
Sep  3 04:39:53 hostname-influxdb-1 influxd[10891]: ts=2020-09-03T08:39:53.959767Z lvl=info msg="Compacting file" log_id=0O~g78PG000 engine=tsm1 tsm1_level=1 tsm1_strategy=level trace_id=0P0mnkPl000 op_name=tsm1_compact_group tsm1_index=0 tsm1_file=/bigdisk0/influxdb/data/telegraf/autogen/1454/000059019-000000001.tsm
Sep  3 04:39:53 hostname-influxdb-1 influxd[10891]: ts=2020-09-03T08:39:53.959772Z lvl=info msg="Compacting file" log_id=0O~g78PG000 engine=tsm1 tsm1_level=1 tsm1_strategy=level trace_id=0P0mnkPl000 op_name=tsm1_compact_group tsm1_index=1 tsm1_file=/bigdisk0/influxdb/data/telegraf/autogen/1454/000059020-000000001.tsm
Sep  3 04:39:53 hostname-influxdb-1 influxd[10891]: ts=2020-09-03T08:39:53.959777Z lvl=info msg="Compacting file" log_id=0O~g78PG000 engine=tsm1 tsm1_level=1 tsm1_strategy=level trace_id=0P0mnkPl000 op_name=tsm1_compact_group tsm1_index=2 tsm1_file=/bigdisk0/influxdb/data/telegraf/autogen/1454/000059021-000000001.tsm
Sep  3 04:39:53 hostname-influxdb-1 influxd[10891]: ts=2020-09-03T08:39:53.959781Z lvl=info msg="Compacting file" log_id=0O~g78PG000 engine=tsm1 tsm1_level=1 tsm1_strategy=level trace_id=0P0mnkPl000 op_name=tsm1_compact_group tsm1_index=3 tsm1_file=/bigdisk0/influxdb/data/telegraf/autogen/1454/000059022-000000001.tsm
Sep  3 04:39:53 hostname-influxdb-1 influxd[10891]: ts=2020-09-03T08:39:53.959786Z lvl=info msg="Compacting file" log_id=0O~g78PG000 engine=tsm1 tsm1_level=1 tsm1_strategy=level trace_id=0P0mnkPl000 op_name=tsm1_compact_group tsm1_index=4 tsm1_file=/bigdisk0/influxdb/data/telegraf/autogen/1454/000059023-000000001.tsm
Sep  3 04:39:53 hostname-influxdb-1 influxd[10891]: ts=2020-09-03T08:39:53.959790Z lvl=info msg="Compacting file" log_id=0O~g78PG000 engine=tsm1 tsm1_level=1 tsm1_strategy=level trace_id=0P0mnkPl000 op_name=tsm1_compact_group tsm1_index=5 tsm1_file=/bigdisk0/influxdb/data/telegraf/autogen/1454/000059024-000000001.tsm
Sep  3 04:39:53 hostname-influxdb-1 influxd[10891]: ts=2020-09-03T08:39:53.959794Z lvl=info msg="Compacting file" log_id=0O~g78PG000 engine=tsm1 tsm1_level=1 tsm1_strategy=level trace_id=0P0mnkPl000 op_name=tsm1_compact_group tsm1_index=6 tsm1_file=/bigdisk0/influxdb/data/telegraf/autogen/1454/000059025-000000001.tsm
Sep  3 04:39:53 hostname-influxdb-1 influxd[10891]: ts=2020-09-03T08:39:53.959799Z lvl=info msg="Compacting file" log_id=0O~g78PG000 engine=tsm1 tsm1_level=1 tsm1_strategy=level trace_id=0P0mnkPl000 op_name=tsm1_compact_group tsm1_index=7 tsm1_file=/bigdisk0/influxdb/data/telegraf/autogen/1454/000059026-000000001.tsm
Sep  3 04:39:56 hostname-influxdb-1 influxd[10891]: ts=2020-09-03T08:39:56.447775Z lvl=info msg="Compacted file" log_id=0O~g78PG000 engine=tsm1 tsm1_level=1 tsm1_strategy=level trace_id=0P0mnkPl000 op_name=tsm1_compact_group tsm1_index=0 tsm1_file=/bigdisk0/influxdb/data/telegraf/autogen/1454/000059026-000000002.tsm.tmp
Sep  3 04:39:56 hostname-influxdb-1 influxd[10891]: ts=2020-09-03T08:39:56.447829Z lvl=info msg="Finished compacting files" log_id=0O~g78PG000 engine=tsm1 tsm1_level=1 tsm1_strategy=level trace_id=0P0mnkPl000 op_name=tsm1_compact_group tsm1_files_n=1
Sep  3 04:39:56 hostname-influxdb-1 influxd[10891]: ts=2020-09-03T08:39:56.447838Z lvl=info msg="TSM compaction (end)" log_id=0O~g78PG000 engine=tsm1 tsm1_level=1 tsm1_strategy=level trace_id=0P0mnkPl000 op_name=tsm1_compact_group op_event=end op_elapsed=2488.134ms

I have increased the log level to debug, so I'll wait a while and see if that gives me any useful error messages.

Any other ideas on how to investigate why shard compaction might have suddenly started failing with 1.8.0?

Even at debug log level, there is no indication of what might be going wrong in the log file.

Any other ideas of how to troubleshoot this?

I can’t get any debug or trace level logging at all.
My influxdb.conf file has:

trace-logging-enabled = true

in the [data] section and

level = "debug"

in the [logging] section, but in the log file I am not seeing a single message at any log level other than “info”.

grep -v "lvl=info" influxdb.log

^ that grep command finds zero matches.
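For reference, the relevant sections of my influxdb.conf look like this:

[data]
  trace-logging-enabled = true

[logging]
  level = "debug"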

So I am still no closer to debugging why my shard compaction might be failing and leaving behind .tsm.tmp files.

So I wanted to update y’all on this.

I never did get any further debug/trace information to show up in the log.

A couple of days ago, the disk space growth abruptly stopped. I had been restarting the influxdb servers every few days to avoid running out of space.

You can see in this screenshot, in the upper graph, that the disk usage drops back to normal after each restart, then grows rapidly but somewhat unpredictably until I restart again. After the most recent restart, the growth just stops.

The tiny spikes each night are the nightly backup script. You can see from the bottom graph that I moved the nightly backup files to a different disk. That seems to be the only change that correlates with the growth stopping.

I am struggling to guess how the nightly backup could have triggered it, especially since the backup script ran for over a year without ever causing this problem, and the leak didn't start until the upgrade to 1.8.x. Also, out of six influxdb servers, all of which were upgraded to 1.8.x at the same time and all of which use the same backup script, only two exhibited the strange disk space leak.

tl;dr: it stopped happening on its own, I can only guess as to why it happened, and I am not likely to figure out why unless it starts happening again :frowning:

I spoke too soon. After 10 days, I can see that the disk space is leaking again, more slowly than before, but it is leaking. I still have an accumulation of .tmp files.

I also reverted my backup script changes on just one of the two affected influxdb servers, but it made no difference. The .tmp file leak happens at the same rate on both, so I think the backup changes were a red herring.

So this bug is still happening, I still have no leads, and I don't know why the rate of leakage fluctuates unpredictably (though the rate is the same on both influxdb servers).
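In case it helps anyone else watching for this, I'm totalling the leaked space with roughly the following (using my data path from the logs above; adjust to yours):

# total size of leftover .tsm.tmp files across all shards
find /bigdisk0/influxdb/data -name '*.tsm.tmp' -print0 | du -ch --files0-from=- | tail -n1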

I’m having the same issue, but I’m running InfluxDB v1.8.3 on Windows.

Sometimes the database just grows out of control because compaction does not occur; I don't see any error in the log, even at debug level.

Restarting the InfluxDB service solves the issue by forcing file compaction. I can tell from the log, which shows files being compacted, and not only the latest shards but also older ones that should have been closed and compacted days before.

I’ve never checked for .tmp files in my data folder but I will the next time I see this issue.

Thanks Giovanni! This is such a weird and frustrating issue. Glad I’m not alone :smiley:

So InfluxDB 1.8.3 didn't fix it, huh? That was the next thing on my list to try.

Now I am wondering if I can safely manage a downgrade back to InfluxDB 1.7.x to avoid this?

From the docs, you should be able to downgrade without issue.

Hello!
I’m currently asking for help. I appreciate your patience while I look for an answer.
Thank you.


Just an update on this: I was able to confirm that this bug is definitely happening on ALL of my InfluxDB 1.8.x servers.

I originally only noticed it on some of them because it seems to happen faster on servers with a lot of data.
My servers with 1 TB of data exhibit the tmp file leak within a day or two, but for a server with just a few GB of data, you have to watch it for a couple weeks before the leaked tmp space becomes noticeable.

I am currently working around this bug by downgrading all my influxdb servers back to 1.7.x
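For anyone considering the same, the downgrade on CentOS 7 was roughly this (a sketch, assuming the influxdata yum repo still carries the 1.7.x package; back up your config and data first):

sudo systemctl stop influxdb
sudo yum downgrade influxdb-1.7.9
sudo systemctl start influxdb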

I wonder if the issues in this thread are related to this one: InfluxDB: High CPU utilization in v.1.8.1 & v.1.8.2 · Issue #19543 · influxdata/influxdb · GitHub

I too am having the same issue on v1.8.0, with intermittent excessive CPU usage (>80%) and large disk IO reads (15 MB/s). I have followed the GitHub suggestion of disabling the store-enabled internal monitoring, which helped superbly for a week, but I am now back to overloads, with Grafana and Chronograf not being able to connect (even though InfluxDB is running).
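(For anyone who hasn't tried it yet, that's the [monitor] section of influxdb.conf, if I have it right:)

[monitor]
  # stop writing internal stats to the _internal database
  store-enabled = false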

I am surprised that there are so few support comments and suggestions on this topic, particularly from InfluxData. Surely someone can help us all…??

@Anaisdg Has there ever been any investigation by the InfluxDB team which determined the root cause? Is there any recommendation about which version is not supposed to have the issue?

We have some InfluxDB legacy servers and we recently started running into this issue.