Down sampling is not foolproof

Topic: Down sampling is not foolproof.

I enthusiastically started with flux, and use it with custom functions in a project that includes:

  • Energy meter readings (smart meter) (1 second sample).

  • Weather data (10 min sample).

  • Solar panels data real time 10 minutes sample.

  • Central heating installation temperature (water-side adjustment temperature control) (30 sec sample).

  • Degree days calculations (KNMI) (custom function flux).

  • Variable hourly rate calculation (custom function flux) source: European Entsoe Database.

all these 6 sources come in through mqtt and are put into a raw_bucket for downsampling.

So far it works perfectly.


Down sample sum up issue.

The following is a Down-Sample issue and is theoretical.

When downsampling, the last value always goes to the next sample.

It is not a problem with a long run, graphics or an average calculation,

As long as my salary doesn’t depend on it

But it has a great effect in a down-sampling chain.



  1. you always miss the last sample

  2. If you down-sampled in a chain from a down-sampled source, the error becomes much larger.

Example worked out

down sample chain 10m => 1h => 1d

There are 1440 minutes in a day

  1. if i have a value 1 representation every minute then i expect 1440 units in a day get 1439 for that and the 1 goes to the next day.

  2. If I first sum up per hour in a new bucket and get the info for summing up to a day from this bucket, then I miss 60 minutes and I get 1380 units . 60 goes to the next day.

  3. If I down-sample from 10 minutes to hours and then from hours to days then I miss 10 minutes and 60 minutes = 70 minutes.


Here’s a flux script to demonstrate this.

Experiment script place or remove comment // and see what happend.

import "generate"

start_proc = 2023-01-01T00:00:00Z
stop_proc = 2023-01-02T00:00:00Z

// for 2 day 's window'
//stop_proc = 2023-01-02T00:00:00Z

//for testing day range
//stop_proc = 2023-01-01T23:59:00Z

minuut_data = 
generate.from(
// 24 * 60 = 1440 minuten
count: 1440, 
// 2 * 24 * 60 =2880 for 2 days window
//count: 2880, 

//fn: (n) => (n/60) ,
fn: (n) => (1) ,
//fn: (n) => (n),
start: start_proc, 
stop: stop_proc
)

m10_data =
    minuut_data
        |> range(start: start_proc , stop: stop_proc)
        |> aggregateWindow(every: 10m, fn: sum, createEmpty: false)
        |> yield(name: "10 Minuten info")


uur_data = 
    m10_data
        |> range(start: start_proc , stop: stop_proc)
        |> aggregateWindow(every: 1h, fn: sum, createEmpty: false)
        |> yield(name: "hour info")


uur_data
    |> range(start: start_proc , stop: stop_proc)
    |> aggregateWindow(every: 1d, fn: sum, createEmpty: false)
    |> keep(columns: ["_time", "_value"])
    |> yield(name: "test_output")

Meter reading with spread demo.

Downsample error with spread or (diff)

The error with downsampling increases if you use spread (diff) to read meter readings.
Example the meter increments 1 unit every minute

  1. Then you have a spread of 1439 units in one day. 1 unit goes to the next day

  2. If you spread on the hour and add that for the whole day you have a total of 1357 units instead of 1440 missing (23 + 60)

  3. If you spread per 10 min and you add that up per hour and you add the hour for the whole day then you have 1233 units, you miss 6 units every hour and the last hour.

Demo fux script:

import "generate"

start_proc = 2023-01-01T00:00:00Z
stop_proc = 2023-01-02T00:00:00Z

minuut_data = 
    generate.from(
    count: 1440, 
    fn: (n) => (n),
    start: start_proc, 
    stop: stop_proc
    )

m10_data =
    minuut_data
        |> range(start: start_proc , stop: stop_proc)
        |> aggregateWindow(every: 10m, fn: spread, createEmpty: false)
        |> yield(name: "10 Minuten info")

uur_data = 
    m10_data
        |> range(start: start_proc , stop: stop_proc)
        |> aggregateWindow(every: 1h, fn: sum, createEmpty: false)
        |> yield(name: "hour info")

uur_data
    |> range(start: start_proc , stop: stop_proc)
    |> aggregateWindow(every: 1d, fn: sum, createEmpty: false)
    |> keep(columns: ["_time", "_value"])
    |> yield(name: "test_output")


This is a learning moment conclusion for me
Downsample as close to the source as possible.

I wish you lots of ‘flux’ fun.

Hello @douwe_steenstra,
Thank you for sharing your solution. Yes I too wish there were more options with windowing functions in flux where you can choose how you want to select your bounds.
If you’re a new Cloud user in aws east or eu central you can now query with SQL. We’re working on creating tutorials to demonstrate how to perform downsampling with Flight SQL, pandas, and AWS lambda so that you can have more control over your downsampling needs.

Hello @Aniasdg,

Thanks for your response, and I look forward to a good tutorial.

but I keep looking and part of the problem lies with the aggregateWindow function itself.
This is a composite function and I found the explanation about it in the internal video.

link: How to Use the Flux Window and Aggregate Functions

7:39 AggregateWindow Function.

This is a serious issue because aggregatewindow is highly promoted .

I modified the aggregateWindow function and it gives the correct result when downsampled. the deviation I have now comes from the spread() function. that has its own internal problem or is not made for this purpose.

        // |> aggregateWindow(every: 1h, fn: sum, createEmpty: false)
        |> window(every: 1h)
        |> sum()
        //|> duplicate(as: "_time", column: "_stop")
        |> duplicate(as: "_time", column: "_start")
        |> window(every: inf, timeColumn: "_time")

solution:

  |> duplicate(as: "_time", column: "_stop") 

is replaced by

  |> duplicate(as: "_time", column: "_start")

This prevents the time from shifting in another time frame, think date is moved forward one day.

gr,