Thinning samples for derivative, when not aligned to bucket window

Hi,

I’m trying to understand how best to use derivative, especially with regard to “thinning” data as might happen when zooming out from a time range. Specifically, I’m looking at SNMP interface counters, and I’m storing the raw counter values in InfluxDB.

To illustrate, here is some test data:

interface,agent_host=test ifHCInOctets=5000i 1527346860000000000
interface,agent_host=test ifHCInOctets=5100i 1527346980000000000
interface,agent_host=test ifHCInOctets=6000i 1527347220000000000
interface,agent_host=test ifHCInOctets=6200i 1527347340000000000
interface,agent_host=test ifHCInOctets=7000i 1527347760000000000
interface,agent_host=test ifHCInOctets=7300i 1527347940000000000

This imports as:

> select ifHCInOctets from interface where agent_host='test'
name: interface
time                 ifHCInOctets
----                 ------------
2018-05-26T15:01:00Z 5000
2018-05-26T15:03:00Z 5100
2018-05-26T15:07:00Z 6000
2018-05-26T15:09:00Z 6200
2018-05-26T15:16:00Z 7000
2018-05-26T15:19:00Z 7300

The basic derivative function works just fine:

> select derivative(ifHCInOctets,1s) from interface where agent_host='test'
name: interface
time                 derivative
----                 ----------
2018-05-26T15:03:00Z 0.8333333333333334
2018-05-26T15:07:00Z 3.75
2018-05-26T15:09:00Z 1.6666666666666667
2018-05-26T15:16:00Z 1.9047619047619047
2018-05-26T15:19:00Z 1.6666666666666667

For example: the counter goes up from 5000 to 5100 between :01:00 and :03:00; that’s an increase of 100 in 120 seconds, so the rate is 100/120 = 0.833 per second. It’s not fazed by the irregular intervals.
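To spell out what I believe derivative(x, 1s) is doing (this is just my mental model, sketched in Python rather than InfluxQL - not InfluxDB’s actual implementation): the pairwise difference divided by the elapsed seconds between consecutive points.

# Pairwise delta / elapsed seconds for the test points above
# (timestamps are the epoch-second equivalents of the times shown).
points = [
    (1527346860, 5000),  # 15:01:00
    (1527346980, 5100),  # 15:03:00
    (1527347220, 6000),  # 15:07:00
    (1527347340, 6200),  # 15:09:00
    (1527347760, 7000),  # 15:16:00
    (1527347940, 7300),  # 15:19:00
]
for (t0, v0), (t1, v1) in zip(points, points[1:]):
    print(t1, (v1 - v0) / (t1 - t0))
# -> 0.833..., 3.75, 1.666..., 1.904..., 1.666..., matching the output above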

Now, I want to thin this out so I’m only looking at one point within each 5 minute period. For example, if I took just the first point in each 5 minute window, I would get:

2018-05-26T15:01:00Z 5000
2018-05-26T15:07:00Z 6000
2018-05-26T15:16:00Z 7000

… and then I could calculate derivatives properly again, just at lower resolution. For example, from 01:00 to 07:00 is 1000 in 6 minutes, so 1000/360 = 2.778; the second interval, 07:00 to 16:00, is 1000 in 9 minutes, so 1000/540 = 1.852.
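Here’s a little Python sketch of that thinning done by hand (client-side, outside InfluxDB), just to show what I mean: assign each point to a 5-minute bucket, keep the first point seen in each bucket along with its original timestamp, and compute rates from the kept points. The epoch-aligned bucketing is my assumption about how the windows should line up.

# Keep the first point in each 5-minute (300 s) bucket, retaining the
# point's real timestamp, then compute rates between the kept points.
points = [
    (1527346860, 5000), (1527346980, 5100), (1527347220, 6000),
    (1527347340, 6200), (1527347760, 7000), (1527347940, 7300),
]
kept = {}
for t, v in points:
    kept.setdefault(t // 300, (t, v))  # first point in each bucket wins

thinned = sorted(kept.values())        # 15:01:00=5000, 15:07:00=6000, 15:16:00=7000
for (t0, v0), (t1, v1) in zip(thinned, thinned[1:]):
    print(t1, (v1 - v0) / (t1 - t0))   # -> 2.777..., 1.851...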

Now, the InfluxDB documentation suggests using group by time(5m) and first(). This gives me:

> select first(ifHCInOctets) from interface where agent_host='test' group by time(5m)
name: interface
time                 first
----                 -----
2018-05-26T15:00:00Z 5000
2018-05-26T15:05:00Z 6000
2018-05-26T15:10:00Z
2018-05-26T15:15:00Z 7000
2018-05-26T15:20:00Z

But this has lost the actual timestamp of each data point, rounding it down to the start of the 5-minute window instead. And hence the calculated rates are wrong:

> select derivative(first(ifHCInOctets),1s) from interface where agent_host='test' and time >= '2018-05-26T15:00:00Z' and time <= '2018-05-26T15:20:00Z' group by time(5m)
name: interface
time                 derivative
----                 ----------
2018-05-26T15:05:00Z 3.3333333333333335
2018-05-26T15:15:00Z 1.6666666666666667

In practice, this might not be too much of a problem as long as (a) the data is sampled at accurate, regular intervals, (b) the sampling interval divides evenly into the aggregation window (fine if you sample every 1 minute and aggregate over 5 minutes, but not if you sample every 2 minutes and aggregate over 5), and (c) the samples don't land too close to a bucket boundary.

However, can anyone suggest a way to get a more accurate answer?

I think I just want to ‘thin’ the sample stream to one sample per bucket - or equivalently, use the timestamp of the row which first() or last() finds, rather than the bucket window boundary.

I have worked with so many systems which don’t accurately convert counts to rates, and it would be great to work with one where I can trust what it shows me :slight_smile:

Thanks… Brian.
