I recently updated the version of telegraf running on our nodes from 1.7 to 1.20. Since then there has been an alarming disk space trend on our influxdb servers. We have two of them (one for production machines, another for internal machines), and each shows an increase in the rate of disk consumption that corresponds to the day I released the telegraf upgrade.
Between upgrades we made no changes to the metrics we are collecting, though I did add
UseWildcardExpansion=true to our win_perf_counters partial configs. I made this change because I noticed it affected logging. Without this set, I was getting log lines like
error while getting value for counter "\\Process(*)\\IO Write Bytes/sec", will skip metric: The returned data is not valid
and with the setting, the log changed to
error while getting value for counter "\\\\KTCHVS684\\Process(notepad#1)\\IO Write Bytes/sec", will skip metric: The data is not valid.
Nothing in the description of this setting led me or my coworker to believe we were capturing new metrics by including it, though I am including this detail for completeness.
Here is a graph of disk usage on our production influx server, with the day I released the telegraf changes highlighted in red.
Is there any change in telegraf 1.20 that could account for this increased disk usage?
We have since started monitoring at a more granular level which shards are growing in the influx /data directory, but we do not have historical data on that. Glancing at the servers, I see no obvious growth in the /wal directory or log files. Any suggestions for areas to investigate would be appreciated.
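
A minimal sketch of that kind of per-shard check, in Python. It assumes the InfluxDB 1.x on-disk layout of <data dir>/<database>/<retention policy>/<shard id>/ and a hypothetical data path of /var/lib/influxdb/data. Both are assumptions rather than details of our servers, and the function names shard_sizes and report are made up for illustration.

```python
import os
from collections import defaultdict


def shard_sizes(data_dir):
    """Return {(database, retention_policy, shard_id): total_bytes}.

    Assumes the layout <data_dir>/<database>/<retention_policy>/<shard_id>/...
    Anything else found three levels deep (for example a per-database
    _series directory, if present) is reported the same way, with its
    directory names in the corresponding slots.
    """
    sizes = defaultdict(int)
    for root, _dirs, files in os.walk(data_dir):
        parts = os.path.relpath(root, data_dir).split(os.sep)
        if len(parts) < 3:
            continue  # not yet inside a shard-level directory
        key = tuple(parts[:3])
        for name in files:
            try:
                sizes[key] += os.path.getsize(os.path.join(root, name))
            except OSError:
                pass  # file disappeared while walking, e.g. during compaction
    return dict(sizes)


def report(data_dir, top=10):
    """Print the largest shard directories, biggest first."""
    rows = sorted(shard_sizes(data_dir).items(), key=lambda kv: kv[1], reverse=True)
    for (db, rp, shard), size in rows[:top]:
        print(f"{size / 1024 ** 2:10.1f} MiB  {db}/{rp}/{shard}")
```

Calling report("/var/lib/influxdb/data") on a schedule and keeping the output would give the historical per-shard view that is currently missing.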