I currently have a continuous query that runs and aggregates my raw data (coming in at about six samples per second) using a 1 minute mean aggregation. What I’d like to do though is aggregate not just a simple mean of all the measurements within the window but something like:
Calculate the 3 sigma of all the measurements
Reject outliers outside of the three sigma
Count the number of outliers removed
Recalculate the mean and stddev using only the remaining (un-rejected) measurements
Is this possible with a continuous Flux query and can you point me to resources that cover this kind of more complex aggregation?
That’s a brain teaser, and my Flux knowledge is still a work-in-progress, but borrowing heavily from this well-explained blog post, I think this we can restate your objectives 1 & 2 as “filter out any entries where the Z-score is > 3.0”, since the Z-score tells you how many standard deviations from the mean the entry is.
I am sure your objective #4 is possible, but I need to mull it over. Meanwhile, I think this works for #1, 2, and 3 (I tested it on some sample temperature data that I had access to).
Hi Grant, thank you very much for you response and the blog link. This is extremely helpful. So perhaps for the fourth objective, I would need to create a new column that holds the Z-score for each measurement and then do another “pass” where I recompute avg and stddev using the Z-score column as a filter? I’ll work on it some and let you know what I how it goes.
Maybe get rid of the count() function (unless you really want to know how many values had a ZScore > 3.0), then replace the last filter function with all of this. We are basically filtering out all the “good” (ZScore < 3.0) measurements and writing to a new table: