Aggregate "sum" from multiple series for Grafana graph panel

My data
I have a measurement that contains the total memory of each VM on each host in megabytes (granularity of 30s). The memory of a VM can change, and the host a VM runs on can change as well.

For example, I have two hosts and four VMs:
_time, host, vm, _value
2021-06-01 10:41:00 GMT+2, FPPW9406, VM01, 4096
2021-06-01 10:41:00 GMT+2, FPPW9406, VM02, 1024
2021-06-01 10:41:00 GMT+2, FPPW9407, VM03, 4096
2021-06-01 10:41:00 GMT+2, FPPW9407, VM04, 1024
2021-06-01 10:40:30 GMT+2, FPPW9406, VM01, 4096
2021-06-01 10:40:30 GMT+2, FPPW9406, VM02, 1024
2021-06-01 10:40:30 GMT+2, FPPW9407, VM03, 4096
2021-06-01 10:40:30 GMT+2, FPPW9407, VM04, 1024
2021-06-01 10:40:00 GMT+2, FPPW9406, VM01, 4096
2021-06-01 10:40:00 GMT+2, FPPW9406, VM02, 1024
2021-06-01 10:40:00 GMT+2, FPPW9407, VM03, 4096
2021-06-01 10:40:00 GMT+2, FPPW9407, VM04, 1024

What I want
I want to create a Grafana graph panel that shows one line.
The line = the sum of all VMs' memory over time (in a graph panel)

The resulting data should look like this:
_time, _value
2021-06-01 10:41:00 GMT+2, 10240
2021-06-01 10:40:30 GMT+2, 10240
2021-06-01 10:40:00 GMT+2, 10240

I tried a lot of queries but found no solution for this; aggregateWindow with fn: sum is not the answer.

I think you should drop "host", group by "vm", and then use sum().
Something like

  |> drop(columns: ["host"])
  |> group(columns: ["vm"], mode:"by")
  |> sum()
  |> group()
  |> drop(columns: ["vm"])

where the last two lines are somehow optional; it is up to you.
Thanks!


Thanks for your help, but as I mentioned I am looking for a Flux query for a graph panel, and with your query I only get a single sum and not the sum over time.

Please don’t delete this post.

Ok. You can do what I suggested with a window() command if you need; see the sketch below.
Why isn't aggregateWindow with a window size of 30s a viable solution?

We want this graph over a big timeframe (for example 90d), and the problem is that with a window size of 30s there are too many data points. The big advantage of aggregateWindow is that normally we can use a dynamic window size (for example with "mean"), which prevents displaying too many points, but because we want the sum in a graph we cannot use a dynamic window.

Ok, that is quite different from what you described in the first post: it is not that the query is impossible, but rather that the result is not appealing. It is more a matter of how you choose to process your data than of the Flux query itself.

Anyway, you can aggregate twice: first aggregate with sum every 30s, then aggregate with mean and a dynamic window size. Of course the results will be somewhat interpolated, but you cannot reduce the cardinality of the data without some compromise.
Or you can use a moving average, … (see the sketch below).
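For the moving-average variant, a rough, untested sketch (the 5m period is only a placeholder):

  |> group()                                                     // one table with all VMs
  |> aggregateWindow(every: 30s, fn: sum, createEmpty: false)    // total memory per sample interval
  |> timedMovingAverage(every: v.windowPeriod, period: 5m)       // smooth to a panel-friendly density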

Thanks for your answer. :)
I tried the two aggregateWindow calls (sum & mean); it works, but I have two problems:

1. The query is very slow (46s for a 24h timeframe)
I think this is because I use multiple aggregateWindow calls and the first one has to process a lot of data.
→ Is there a way to make this query faster?

2. The query is not really "dynamic"
If the granularity/frequency of the data changes, I have to change the query in each dashboard/panel.
→ Is there a way to write a more “dynamic” query?

My query

from(bucket: "telegraf")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "hyperv_dynamic_memory")
  |> filter(fn: (r) => r["_field"] == "Guest_Visible_Physical_Memory")
  |> group()
  |> aggregateWindow(every: 30s, fn: sum, createEmpty: false)
  |> aggregateWindow(every: v.windowPeriod, fn: mean, createEmpty: false)

Can you try to run

from(bucket: "telegraf")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "hyperv_dynamic_memory")
  |> filter(fn: (r) => r["_field"] == "Guest_Visible_Physical_Memory")
  |> group()
  |> count()

and see what the count is? It is very strange to me that the query is so slow.

The fact that it is not really dynamic is due to your constraint: you need to sum the data at the sampling period at which they are recorded, so at some point you need to feed the query the proper time period.
I do not see a reliable way to automatically detect the sampling frequency: you could make a query, calculate the frequency and then use that value in the first aggregateWindow (a rough sketch below), but it might be a little too much.
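Just to illustrate the idea — an untested sketch, assuming the first series returned is evenly sampled; sample and samplePeriod are hypothetical names:

sample = from(bucket: "telegraf")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "hyperv_dynamic_memory")
  |> filter(fn: (r) => r["_field"] == "Guest_Visible_Physical_Memory")
  |> limit(n: 2)                            // two points per series are enough to measure the gap
  |> elapsed(unit: 1s)                      // seconds between consecutive points
  |> findRecord(fn: (key) => true, idx: 0)  // take the first measured gap

samplePeriod = duration(v: sample.elapsed * 1000000000)  // seconds -> nanoseconds -> duration

// then: |> aggregateWindow(every: samplePeriod, fn: sum, createEmpty: false)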

I ran your query and the count is 7436505.

Ok, I hope there will be a solution (a new function) for this in the future.

7M points in 24h? Those are quite a few, in fact.
I guess you might try to change the shard configuration of the database, but I'm no expert on this. Maybe @Anaisdg can point you to the proper person/direction here.

I do not know whether a function to estimate the sampling frequency will ever be implemented, as this is not something that is easy to guess reliably.

Thanks!

Now I query only downsampled data, so I query fewer than 100'000 points, and the performance is ok :)

But I have another similar question:
I want to calculate the sum over time of all the storage we use (for a Grafana graph panel).
The storage LUNs are shared, which is why they are mapped to multiple nodes. Telegraf runs on these nodes, so each node collects the data from the LUNs (for example, with a cluster of 8 nodes and one LUN "LUN01", InfluxDB contains the disk space of "LUN01" 8 times).

My question:
Is there a more efficient query than my query below? (I query 1809191 points / 30 days)
My query below takes approx. 18 seconds.

My Query:
Short explanation:
- The column for the LUN name is "FileSystemLabel"
- The column for the node name is "host"

customsum = (tables=<-, column="_value") =>
  tables
    |> drop(columns: ["host"])            // merge the per-node copies of each LUN
    |> unique(column: "FileSystemLabel")  // keep one sample per LUN per window
    |> drop(columns: ["FileSystemLabel"]) // merge all LUNs into one table
    |> sum(column)                        // total used space

from(bucket: "telegraf_90d")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => 
    r._measurement == "cluster_csv" and
    r._field  == "SizeUsed" and 
    exists r.FileSystemLabel
  )
  |> aggregateWindow(every: 1d, fn: customsum, createEmpty: true)
  |> fill(column: "_value", usePrevious: true)
  |> toInt()

Hello @fluxator,
It looks like you’re trying to ungroup with a drop()
Have you tried just using group()?

Also, you can use the Flux profiler to evaluate the performance of Flux and compare different alternatives, for example:
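A minimal way to enable it (the profiler results come back as extra tables alongside the query results):

import "profiler"

option profiler.enabledProfilers = ["query", "operator"]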

Thanks for your feedback. The problem with group() is that I get the following error:
aggregateWindow: duplicate error: column "_stop" doesn't exist

How would you write the query to get the information I described above?
Is group() faster than drop() for ungrouping?

Thanks for the advice about the profiler. That's a really useful package! :)


Hello @fluxator
To help you better, it would be great if you could help me understand what you're doing here:
Why do you find the unique FileSystemLabel values and then drop the column?

I was thinking you could just use:

customsum = (tables=<-, column="_value") => 
  tables
    |> group()
    |> sum()

from(bucket: "telegraf_90d")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => 
    r._measurement == "cluster_csv" and
    r._field  == "SizeUsed" and 
    exists r.FileSystemLabel
  )
  |> aggregateWindow(every: 1d, fn: customsum, createEmpty: true)
  |> fill(column: "_value", usePrevious: true)
  |> toInt()

As I mentioned, the storage LUNs are shared, which is why they are mapped to multiple nodes. Telegraf runs on these nodes, so each node collects the data from the LUNs (for example, with a cluster of 8 nodes and one LUN "LUN01", InfluxDB contains the disk space of "LUN01" 8 times).
→ That's why I have to use unique() to deduplicate based on the "FileSystemLabel". Maybe something like the sketch below would combine the two ideas.