How to simply sum these values using a Telegraf aggregator?

I have a simple Telegraf config file which just collects the disk space taken by InfluxDB shards (the data comes from InfluxDB’s /debug/vars endpoint, as recommended in the docs):

[agent]
  hostname = "docker-debian"
  flush_interval = "10s"
  interval = "10s"

# GET THE DISK USAGE OF THE INFLUXDB DATABASES, FROM INFLUXDB'S OWN INTERNAL MONITORING SYSTEM
[[inputs.influxdb]]
  urls = ["http://influxdb_container:8086/debug/vars"]
  namepass = ["influxdb_shard"]
  fieldpass = ["diskBytes"]
  taginclude = ["database","path"]

# SEND ALL THE DATA TO THE REMOTE INFLUXDB INSTANCE
[[outputs.influxdb]]
  database = "telegraf"
  urls = [ "http://influxdb_container:8086" ]

If I look at what data arrives in the database, I can see there are a bunch of values, one for each path and for each database:

> select * from influxdb_shard where time>now()-10s
name: influxdb_shard
time                database      diskBytes path
----                --------      --------- ----
1749720160000000000 homeassistant 495041    /var/lib/influxdb/data/homeassistant/autogen/100
1749720160000000000 homeassistant 548397    /var/lib/influxdb/data/homeassistant/autogen/101
1749720160000000000 homeassistant 461434    /var/lib/influxdb/data/homeassistant/autogen/103
1749720160000000000 homeassistant 324464    /var/lib/influxdb/data/homeassistant/autogen/107
1749720160000000000 homeassistant 455053    /var/lib/influxdb/data/homeassistant/autogen/115
1749720160000000000 telegraf      6674052   /var/lib/influxdb/data/telegraf/autogen/160
1749720160000000000 telegraf      6176482   /var/lib/influxdb/data/telegraf/autogen/170
1749720160000000000 telegraf      6614880   /var/lib/influxdb/data/telegraf/autogen/176
1749720160000000000 telegraf      7104037   /var/lib/influxdb/data/telegraf/autogen/184

Instead of having all the individual paths sent to InfluxDB, I want to first sum over all the paths and then send only a total for each database. Can this be done with Telegraf aggregators/processors?

Thanks

Hello @teeeeee,
You’d have to use the starlark processor plugin or the execd processor plugin to get that type of aggregation or count.
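
If you want a starting point, here is an untested sketch of a purely declarative alternative, using the built-in basicstats aggregator instead of Starlark. Keeping only the database tag via taginclude collapses all shards of a database onto one series, which the aggregator then sums over each period (the option names come from the basicstats docs, but I have not verified this against your exact input):

[[inputs.influxdb]]
  urls = ["http://influxdb_container:8086/debug/vars"]
  namepass = ["influxdb_shard"]
  fieldpass = ["diskBytes"]
  taginclude = ["database"]      # drop "path" so all shards of a database share one series

[[aggregators.basicstats]]
  namepass = ["influxdb_shard"]
  period = "10s"                 # match the collection interval
  drop_original = true           # emit only the aggregate, not the per-shard points
  stats = ["sum"]                # the total arrives as a "diskBytes_sum" field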

Sorry if this is a dumb question, but why do you need to aggregate before they are stored? Wouldn’t it be easier to just aggregate the output in your queries?

Not a dumb question.

As the database size grows, the number of path directories increases. For my setup, it was something like 20 paths. So I had 20 data points each time Telegraf collected data (and you have this for each database). This diskBytes measurement was actually by far the most space-consuming one in my setup.

I am not interested in the individual breakdown of how much disk space is occupied in each path - I am only interested in the total for each database. So it does not make sense for me to store all this unnecessary data.

It’s possible to do that. In fact I was doing it like that for a long time. But you need to do a subquery:

SELECT mean("diskBytes_summed") FROM (
        SELECT sum("diskBytes") AS "diskBytes_summed" FROM "influxdb_shard" 
        WHERE "database"='telegraf' AND $timeFilter GROUP BY time(5s) fill(null) 
) 
GROUP BY time(10m)

The outer “group by” must be longer than the inner one (in my case I chose 10m and 5s), and the inner one must be shorter than your collection interval (mine was 10s). The inner sum() adds up every point that lands in each time window, across all paths, so keeping the window shorter than the collection interval guarantees that each window contains at most one collection’s worth of points and the sum is a true per-collection total. Get the ordering wrong and the sum over all paths comes out incorrect.
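
For example, dropping the subquery and using a single window longer than the collection interval inflates the result:

SELECT sum("diskBytes") FROM "influxdb_shard"
WHERE "database"='telegraf' AND $timeFilter GROUP BY time(1m)

With a 10s collection interval, each 1m window contains six collections’ worth of points, so the reported total comes out roughly six times too large.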

Doing the aggregation in the query like this is cumbersome, non-intuitive, and far worse in terms of performance: my graphs were very slow when loading the last 6 months of data.

Thanks for your answer.

In the end, I decided it was too fiddly to get it working with the influxdb input plugin, and did not investigate any other processors/aggregators.

Instead I am manually running a du command on each database, using the exec plugin:

# GET DISK USAGE OF INDIVIDUAL INFLUXDB DATABASES
[[inputs.exec]]
  commands = [ "/home/scripts/get_disk_usage.sh /home/influxdb_data/data/database1 /home/influxdb_data/data/database2" ]
  timeout = "1m"
  name_override = "diskusage"
  name_suffix = ""
  data_format = "json"
  tag_keys = [ "path" ]

The get_disk_usage.sh script it calls is:

#!/bin/bash
# get_disk_usage.sh: print the size of each directory argument as a JSON array

echo "["
du -s -B1 "$@" | awk '{if (NR!=1) {printf ",\n"};printf "  { \"dir_size_bytes\": "$1", \"path\": \""$2"\" }";}'
echo
echo "]"

This works nicely, and gives the total of the data directory summed over all paths.

(This is not quite the same as the influxdb_shard measurement, which I think also takes into account the WAL directory. But monitoring the data directory is enough, since this is where the majority of the disk space is used.)
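
For what it’s worth, reading the totals back out needs no subquery gymnastics any more; a plain query per directory does it:

> select dir_size_bytes from diskusage where "path"='/home/influxdb_data/data/database1' and time>now()-10s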