Application note: using integral in continuous queries

I needed to perform an integral on power and availability data in a continuous query, typically used for energy monitoring where you perform an integral on power state information (ie 1 or 0) or power data (consumption/production in W).
In my case, the source is emitting data at 1min intervals, time is not aligned with the server: timestamps are taken from the source.
Take for instance the power state information - as it is easier to calculate. The goal is to count the number of seconds that the state was ‘1’ (numeric).
Example data: (Time > state)
10:00:51 > 1
10:01:51 > 1
10:02:51 > 1
10:03:52 > 1
10:04:52 > 1
10:05:52 > 1
10:06:51 > 1
10:07:51 > 1
10:08:51 > 1
10:09:51 > 1
10:10:51 > 1
10:11:51 > 1
10:12:52 > 1
10:13:52 > 1

The idea is to count the number of seconds that the state is 1 and aggregate this into 5min intervals.
Let’s run the CQ at 10:15.

The obvious thing to do is:

CREATE CONTINUOUS QUERY my_query ON db
BEGIN
SELECT integral(“state”) as “state_on”
INTO db.rp.count
FROM “db.rp.raw”
GROUP BY time(5m)
END

Unfortunately, when running at 10:15, due to lag/unsynchronized clocks, the datapoint which was supposed to come in at about 10:14:52 is not there yet so the computed value is incomplete.
That’s where the advanced CQ syntax comes into play. Right?

CREATE CONTINUOUS QUERY my_query ON db
RESAMPLE FOR 10m
BEGIN
SELECT integral(“state”) as “state_on”
INTO db.rp.count
FROM “db.rp.raw”
GROUP BY time(5m)
END

So now we’re still running the CQ every 5min’s but each time, we’re providing it with a double amount of data. That shoud fix it, right?
Unfortunately, the CQ run at 10:15 is now showing a new problem at the start of the bucket for 10:05. Since the query input is limited to the dataset after 10:05. So for the first 52 seconds of that bucket, there’s no value present and the integral will not count in that interval.

The final solution?

CREATE CONTINUOUS QUERY my_query ON db
RESAMPLE FOR 14m
BEGIN
SELECT integral(“state”) as “state_on”
INTO db.rp.count
FROM “db.rp.raw”
GROUP BY time(5m)
END

By using a RESAMPLE FOR 14m (3*(group by time)-1), you ensure that when the query reruns for the n-1’th bucket, the integral function has enough data to know the initial value but avoid running the query again for the n-2’th bucket, as that would yield the same problem as before.

I hope it makes sense. It took me a little while to get my head around this behaviour and I think/hope it will help others in similar cases.

The question is however: is this recommended practice? Do you have an alternative way to handle this?

Cheers,
Maarten

Sorry, but no, it still doesn’t work :frowning:
Accidentally, the source data was aligned with the resampling, but now it’s shifted again and the first few seconds of each interval are not taken into account in the query.
So it seems that even though you’re limiting the data to 3n-1, it’s still providing output for that bucket?

Hi Maarten ,
Can the Every clause in the advanced syntax help solve your problem ?

Hi Marc,
I’m afraid not.
I have done more analysis and opened a bug report (and proposed a solution) here: https://github.com/influxdata/influxdb/issues/13164