Improving downsample/decimation performance

Is there a way to downsample/decimate data that is faster than using aggregateWindow(every: 1000ms, fn: first/mean/etc, …)? I expected that using ‘first’ would vastly outperform ‘mean’, since it can disregard most of the data, but I have found no difference in the time the query takes.
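For reference, the kind of query I’m running looks roughly like this (bucket, measurement, and field names are placeholders; the every and fn parameters are what I’ve been varying):

```flux
// Placeholder bucket/measurement/field names; every and fn are what varies
from(bucket: "sensor-data")
    |> range(start: -1h)
    |> filter(fn: (r) => r._measurement == "telemetry" and r._field == "value")
    |> aggregateWindow(every: 1000ms, fn: first)
```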

At the moment, I would like to avoid the complexity of having to add tasks that downsample the data periodically, especially since our data arrives in bursts and not necessarily in real time.

Thanks in advance!

Hello @twim,
If you’re performing:
from |> range |> filter |> aggwindow
There isn’t anything more performant. :frowning:

Thank you very much for your reply.

I was afraid this would be the case :slight_smile: Would you happen to know why the ‘first’ selector function takes the same amount of time as, for example, ‘mean’? ‘mean’ must evaluate all the data, while ‘first’ can skip most of it, especially with large windows, so you’d expect ‘first’ to be many times faster. Is there an underlying technical reason why this isn’t the case?

Hi @twim – are you experiencing slowness with mean in the aggregateWindow() function, or just testing the difference between that and first? That collection of functions is pushed down to storage, so a mean() in that case should be fast. I think the fact that it’s pushed down simply means that there isn’t much room for speedup either way. They’re ultimately doing similar amounts of work.

If you just want aggregated data in your queries, you could pre-aggregate the data with a Task into another bucket and query out of that bucket directly. That will be fastest for your client(s).
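A minimal sketch of such a task might look like the following (bucket names, schedule, and aggregation interval are all placeholders):

```flux
// Hypothetical downsampling task; bucket names, schedule, and window are placeholders
option task = {name: "downsample-telemetry", every: 1m}

from(bucket: "raw-data")
    |> range(start: -task.every)
    |> filter(fn: (r) => r._measurement == "telemetry")
    |> aggregateWindow(every: 1s, fn: mean)
    |> to(bucket: "downsampled-data")
```

Your clients would then query the "downsampled-data" bucket instead of the raw one.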

Hello @samdillard. Thank you for your reply. I don’t see why mean() and first() would need the same amount of work. I understand that mean() needs to process a lot of data, since it must evaluate every value in order to compute an average. first(), on the other hand, can discard the vast majority of the data (everything but the first value in each window).

I have been thinking about setting up a separate bucket for pre-downsampled data. The problem is that the data is transferred in bursts, so to make a periodic task work, there would either need to be a lot of overlap, or we’d miss datapoints. Another option is to have our devices emit both high-speed and downsampled data directly.
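To illustrate the overlap idea: each task run would have to re-read a window much wider than its schedule interval, along these lines (all names and intervals here are made up):

```flux
// Hypothetical task that re-processes a generous window on each run
// so that late-arriving bursts are still picked up
option task = {name: "downsample-with-overlap", every: 5m}

from(bucket: "raw-data")
    |> range(start: -30m)  // overlap: re-aggregate the last 30m every 5m
    |> filter(fn: (r) => r._measurement == "telemetry")
    |> aggregateWindow(every: 1s, fn: mean)
    |> to(bucket: "downsampled-data")
```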

@twim it’s not the same amount of work. It’s just that the difference in work isn’t meaningful in terms of wall-clock time. Both operations you’re talking about spend most of their time in storage (if not optimized). They are optimized, so most of the time taken is eliminated…in both cases.