It seems that using the max(), min(), and mean() functions is much faster than using reduce() to calculate the max, min, and mean. I wonder if the performance difference is because reduce() rewrites the result at each iteration, causing it to be slower. Could you explain why there is such a difference?
@han Great question! And yes, your assumption is one of the reasons the reduce() method is so much slower. The other reason is where the computation takes place. Flux and InfluxDB work together to speed up queries by letting Flux “push down” certain operations to the storage tier (closer to where the data lives), where they run much faster. Operations that can’t be pushed down require all of the data they need to be loaded into the Flux memory space and operated on there. Loading that data into memory has some inherent latency, and on top of that, operations performed in Flux memory are slower than the same operations performed by the storage engine.
So structuring the min/max/mean query as shown below leverages pushdowns to calculate the min, max, and mean at the storage tier, and only those aggregated results (not the raw data) get loaded into Flux memory and union()’d together.
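Here’s a minimal sketch of that structure (the bucket name, time range, and filter predicate are placeholders you’d swap for your own):

```
// data() is a function (a "thunk") rather than a plain variable, so each call
// re-expands into a from() |> range() |> filter() chain that the planner can
// push down to the storage tier.
data = () => from(bucket: "example-bucket")
    |> range(start: -1h)
    |> filter(fn: (r) => r._measurement == "example-measurement" and r._field == "example-field")

// Each aggregate runs as its own pushdown in storage, so only one aggregated
// row per series comes back into Flux memory.
minStream = data() |> min()
maxStream = data() |> max()
meanStream = data() |> mean()

// Combine the three aggregated streams.
union(tables: [minStream, maxStream, meanStream])
```

If you need to tell the three results apart in the output, you could optionally tag each stream before the union, for example with `|> set(key: "agg", value: "min")`.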
Also note that I structure the data() “variable” as a function, also known as a “thunk,” to keep the pushdown chain intact across an identifier declaration (more info here).
reduce() can’t be pushed down to storage, so all the queried data has to be loaded into memory and iterated on there. There’s a compounding effect here:
- You have to load all the raw, unaggregated data into memory to aggregate it with reduce().
- Because the data is unaggregated, there are far more rows to iterate over.
- It takes longer to iterate over each row in memory than it does to aggregate values in storage.
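For contrast, here’s a hypothetical reduce()-based version of the same calculation (I don’t know your exact query, so the bucket and filter are placeholders). Nothing in this pipeline can be pushed down, so every raw row is pulled into Flux memory and the accumulator record is rebuilt on every iteration:

```
from(bucket: "example-bucket")
    |> range(start: -1h)
    |> filter(fn: (r) => r._measurement == "example-measurement" and r._field == "example-field")
    // reduce() runs in the Flux memory space, one row at a time,
    // rewriting the accumulator record on each iteration.
    |> reduce(
        identity: {count: 0.0, sum: 0.0, min: 0.0, max: 0.0},
        fn: (r, accumulator) => ({
            count: accumulator.count + 1.0,
            sum: accumulator.sum + r._value,
            min: if accumulator.count == 0.0 then r._value
                 else if r._value < accumulator.min then r._value
                 else accumulator.min,
            max: if accumulator.count == 0.0 then r._value
                 else if r._value > accumulator.max then r._value
                 else accumulator.max
        })
    )
    // The mean still has to be derived afterward from the running sum and count.
    |> map(fn: (r) => ({r with mean: r.sum / r.count}))
```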