We’ve been using InflxuDB happily for a year or so, visualising in Grafana, but recently as we’ve been seeing more measurements sent with several dimensions and series cardinality in the 100s and 1000s, we’ve encountered unexpectedly poor query performance, to the extent that it causes enough load that point intake is interrupted. We picked up on the advice in Summing multiple series to use subqueries, but maybe there’s a better way to do it?
The effect we want is to take the average of each sample in a bucket per series, and then sum the results. i.e. (if you’ll excuse the shonky Excel graph), from a measurement called “sessions” with three series, this should return the sum of the sessions - at 10s interval, SUM() does the expected; at a 30s aggregation interval (boxed) it’ll total all nine points. We need SUM(MEAN(Windows),MEAN(Android),MEAN(iOS))
.
There’s no construct of SUM(MEAN(value)) ...
as far as I can see. We have queries like
SELECT SUM(value) FROM
(SELECT MEAN(value) AS value FROM "sessions"
WHERE ("platform" =~ /^$platform$/ AND "site" =~ /^$site$/ AND ...)
GROUP BY platform,site,...)
WHERE $timeFilter GROUP BY time($__interval), "platform", fill(null)
At short graph spans of 15 mins Grafana takes maybe 5s to load, which is a bit sluggish but acceptable; at an hour, 20s; at 12h, a minute or more and beyond that everything starts to time out - and throughout, we see CPU (on an otherwise lightly loaded host) spiking to 60,80,100%. Even in our busiest periods, all other ingest/query load has our server(s) running at about 5%.
A that load level, we start losing ingestion (of all measurements, not just the ones being queried). Removing Grafana from the equation by submitting the generated queries through the CLI shows similar performance.
Total cardinality for this measurement is approaching 7000 over four tags of interest.Simple queries on this measurement and others are instant or a few seconds.
This seems like a primary use case of the dimensional nature of the TSDB, it seems bizarre that (a) a complex query construct is needed, such that it doesn’t even now have direct UI support in a popular tool like Grafana, and (b) the TSDB is not optimized for performance of this type of case. What are we missing?
Tangentially …the documentation, which frequently gives examples involving
water_level
measurements, never encounters this case. Taking the average over a time bucket (ha) of water level at a location makes sense, but summing those averages over location wouldn’t give a meaningful figure, so MEAN(everything involved there) is a suitable aggregation. I’d like to see examples for different forms of underlying data, which might bring out some of these issues.