Increased rate of disk usage after upgrading telegraf from 1.7 to 1.20

I recently updated the version of Telegraf running on our nodes from 1.7 to 1.20. Since then there has been an alarming disk space trend on our InfluxDB servers. We have two of them (one for production machines, the other for internal machines), and each shows an increased rate of disk consumption starting on the day I rolled out the Telegraf upgrade.

Between the upgrades we made no changes to the metrics we are collecting, though I did add UseWildcardExpansion=true to our win_perf_counters partial configs. I made this change because I noticed it affected logging: without the setting I was getting messages like `error while getting value for counter "\\Process(*)\\IO Write Bytes/sec", will skip metric: The returned data is not valid`, and with the setting the log changed to `error while getting value for counter "\\\\KTCHVS684\\Process(notepad#1)\\IO Write Bytes/sec", will skip metric: The data is not valid`. Nothing in the description of this setting led me or my coworker to believe we would be capturing new metrics by including it, but I am including this detail for completeness.
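For reference, the relevant portion of one of those partial configs now looks roughly like this (the object and counter names are only illustrative, pulled from the log lines above rather than our full config; the option name is written the way we have it in our configs):

```toml
[[inputs.win_perf_counters]]
  # The only config change made alongside the 1.7 -> 1.20 upgrade
  UseWildcardExpansion = true

  [[inputs.win_perf_counters.object]]
    # Example object/counter taken from the log messages above
    ObjectName = "Process"
    Instances  = ["*"]
    Counters   = ["IO Write Bytes/sec"]
```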

Here is a graph of disk usage on our production InfluxDB server, with the day I rolled out the Telegraf changes highlighted in red:

[Graph: Production stats]

Is there any change in Telegraf 1.20 that could account for this increased disk usage?

We have since started monitoring at a more granular level which shards are growing in the InfluxDB /data directory, but we do not have historical data on that. Glancing at the servers, I see no obvious growth in the /wal directory or in the log files. Any suggested areas of investigation would be appreciated.

Telegraf itself generally does not write anything to disk. The metrics Telegraf collects and stores in its metric buffer are all held in memory. If you have logging configured to write to a file, then that is one possible exception.

> from 1.7 to 1.20

While a lot has changed in 3+ years, one possibility is that some of the inputs you are using now collect more data, and hence more data is written to InfluxDB.

> I did add UseWildcardExpansion=true to our win_perf_counters partial configs

As with any wildcard usage, such as in a regex, you may, though not always, collect more metrics once this setting is enabled.

Without knowing more about which metrics you are collecting, how the number of metrics gathered compares between versions and configs, and whether you are writing any files to disk (with a file output, for example), it is hard to know what else might be causing this.

Sorry for the late reply; I thought I had followed up on this weeks ago. Yes, it was the inclusion of UseWildcardExpansion that caused the drastic increase in disk usage. I expected no difference in the number of series getting stats published to them, since the only wildcard usage in our configs is Instances = ["*"], which 'works' without UseWildcardExpansion. But as the docs clearly point out, instance indexes will also be returned in the instance name, so we are now collecting stats on many new series: we get one series per process index whenever multiple processes sharing the same name are running (e.g. w3wp, w3wp#1, w3wp#2).
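To make the effect concrete, here is the object definition from the sort of partial config shown earlier, annotated with what changes once the setting is on (the w3wp instance names are illustrative, and the before/after behavior is my reading of the docs note about instance indexes mentioned above):

```toml
[[inputs.win_perf_counters.object]]
  ObjectName = "Process"
  # With UseWildcardExpansion = false, "*" is passed through as a literal
  # wildcard query and instance indexes are not returned, so stats for
  # duplicate processes end up under the single instance name "w3wp".
  # With UseWildcardExpansion = true, "*" is expanded into every matching
  # instance, indexes included: "w3wp", "w3wp#1", "w3wp#2", ... and each of
  # those instance values becomes its own series in InfluxDB.
  Instances = ["*"]
  Counters  = ["IO Write Bytes/sec"]
```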