Hello!
So a couple of weeks ago, I upgraded some InfluxDB servers from 1.7.9 to 1.8.0, and a strange disk space usage leak started. (I subsequently upgraded from 1.8.0 to 1.8.2, but it did not make any difference.)
The pattern of metrics ingestion has not changed, nor have my CQs changed, but when I upgraded to 1.8.x the disk space usage suddenly started growing by somewhere in the range of 2.0 to 3.5 gigabytes PER HOUR.
Note that this growth is happening on the partition where meta, data and wal are located. The logs, the config, and the rest of the OS (CentOS 7) are all on different partitions, and are not affected.
Now here is the really strange part. If I restart the influxdb service, the disk usage suddenly drops back to normal levels. No data is lost, which is why I am describing this rapid disk space growth as a “leak”.
Has anyone else seen behavior like this? Does anyone have any suggestions for debugging this issue?
Check the data directories to identify which DB/RP storage is growing fast: du -sch /opt/influxdb/data/*/*
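To drill down one level further, here is a quick sketch (assuming the same /opt/influxdb/data layout) that lists the largest individual shard directories:

# per-shard disk usage, largest last (db/rp/shard_id directory layout assumed)
du -sh /opt/influxdb/data/*/*/* | sort -h | tail -20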
My guess would be that a number of new .tsm files are being created, which by itself is OK.
But if the disk space is released after a service restart, then it is probably .tsm.tmp files.
.tmp files are temporary files created during compaction of two .tsm files, which is also OK; they are automatically deleted after shard compaction.
Depending on how many max-concurrent-compactions are allowed, you may see multiple .tsm.tmp files per directory, but usually it should be less than the number of CPU cores.
If there are more .tsm.tmp files than that per shard directory, then there is clearly a problem: it means shard compaction cannot complete successfully and is started again, with a new .tmp file created each time.
When you restart the influxdb service, it scans all the shards and automatically removes the .tmp files.
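A quick sketch to check for that, assuming the /opt/influxdb/data path from above and a default config location (adjust both for your install):

# count .tsm.tmp files per shard directory; more than a handful in a single shard points to failing compactions
find /opt/influxdb/data -name "*.tsm.tmp" | xargs -r -n1 dirname | sort | uniq -c | sort -rn | head
# compare against the compaction limit set (or defaulted) in the [data] section of the config
grep max-concurrent-compactions /etc/influxdb/influxdb.conf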
Thank you, @anatolijd !
That is very helpful information.
It does appear that I have an accumulation of .tsm.tmp files, and those are the files that vanish when I restart the influxdb service. So your suggestion that shard compaction might be failing does seem likely.
I checked the logs, but could not find anything that looked like an error message. I just see things like this:
You can see in this screenshot, in the upper graph, that the disk usage drops back to normal after each restart, then grows rapidly but somewhat unpredictably until I restart again. After the most recent restart, the growth just stops.
The tiny spikes each night are the nightly backup script. You can see from the bottom graph that I moved the nightly backup files to a different disk. That seems to be the only change that correlates with the growth stopping.
I am struggling to guess how the nightly backup could have triggered it. The backup script never caused this problem for over a year, the leak didn’t start until the upgrade to 1.8.x, and out of six influxdb servers, all of which were upgraded to 1.8.x at the same time and all of which use the same backup script, only two exhibited the strange disk space leak.
tl;dr: it stopped happening on its own, I can only guess as to why it happened, and I am not likely to figure out why unless it starts happening again.
I spoke too soon. After 10 days, I can see that the disk space is leaking again. More slowly than before, but it is leaking, and I still have an accumulation of .tmp files.
I also reverted my backup script changes on just one of the two affected influxdb servers, but it made no difference. The tmp file leak happens at the same rate on both, so I think the backup changes were a red herring.
So this bug is still happening, and I still have no leads, and I don’t know why the rate of leakage fluctuates unpredictably (though it is the same on both influxdb servers).
I’m having the same issue, but I’m running InfluxDB v1.8.3 on Windows.
Sometimes the database just grows out of control because compaction does not occur, and I don’t see any errors in the log, even at debug level.
Restarting the InfluxDB service solves the issue by forcing file compaction. I can tell from the log, which shows the files being compacted, and not only the latest shards but also older ones that by then should have been closed and compacted days ago.
I’ve never checked for .tmp files in my data folder but I will the next time I see this issue.
Just an update on this, I was able to confirm that this bug is definitely happening on ALL of my InfluxDB 1.8.x servers.
I originally only noticed it on some of them because it seems to happen faster on servers with a lot of data.
My servers with 1 TB of data exhibit the tmp file leak within a day or two, but for a server with just a few GB of data, you have to watch it for a couple weeks before the leaked tmp space becomes noticeable.
I am currently working around this bug by downgrading all my influxdb servers back to 1.7.x.
I too am having the same issue on v1.8.0, with intermittent excessive CPU usage (>80%) and large disk IO reads (15 MB/s). I followed the GitHub suggestion of disabling the store-enabled internal monitoring, which helped superbly for one week, but I am now back to overloads, with Grafana and Chronograf unable to connect (even though InfluxDB is running).
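For anyone else trying that workaround, this is roughly what the GitHub suggestion boils down to; a sketch assuming a Linux install with the default config path and a systemd service (paths and service name may differ on your system):

# in /etc/influxdb/influxdb.conf, disable writing runtime stats to the _internal database:
#   [monitor]
#     store-enabled = false
# then restart the service so the change takes effect
sudo systemctl restart influxdb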
I am surprised that there are so few support comments and suggestions on this topic, particularly from InfluxData. Surely someone can help us all…?