I gather data from multiple agents into the same InfluxDB to provide an enterprise wide summary. The metrics I have are detailed hypervisor CPU usage gathered by a guest. For a total overview, I want to add up the per-CPU metrics and get a query like this:
select sum(usage) as usage from zvm_lpar_sytcup where time > now() - 5m group by time(1m),seqnr,cputype
This runs fine and completes in 60 ms or so.
Now the challenge is that when two guests happen to run on the same hypervisor, we get the data from that hypervisor twice, and my sum() doubles the amount. So I came up with this:
select max(*) from (select sum(usage) as usage from zvm_lpar_sytcup where time > now() - 5m group by time(1m),seqnr,cputype,lpar) group by time(1m),seqnr,cputype
This gives the correct results, but the query takes 550 ms to run. I tried to add the where clause to the outer select as well, but that did not help much.
My Grafana dashboard gets the data for 3 or 6 hours, which takes 1.2 seconds with just a few systems feeding data. Now I can do a continuous query for that last one, which makes the dashboard quick again, but it does burn 1% of CPU non-stop (even when nobody is looking). I’m concerned about usage when we have a few dozen systems feeding data.