Derivative downsample CQ

jasonmkeller · April 12, 2017, 6:28pm

Very new to InfluxDB but I have what I’d imagine will be a common question for anyone storing network interface counters. Right now I’m pouring in the following fields…

ifHCInOctets
ifHCOutOctets

Obviously, these are cumulative counters per-interface per-device. When downsampling them, I first have to determine the rate (non-negative derivative) per second, then average that derivative over the entire day. I’m trying to make this resistant to counter rolls (i.e. if someone clears/resets the counter in the middle of the day, it shouldn’t ruin the entire day’s worth of derivative). If I simply tell it to take the derivative across the entire day, I’m unsure whether it will withstand this, or whether it will just take the endpoints (which would be wholly wrong in this instance) and attempt a single derivative across them.

Basic gist, I have to downsample all of the data per-interface per-device from minute resolution to daily resolution for trending use, and hold it for possibly years.

Once again, very new to the InfluxQL language, but so far this has been very, very promising, and is one of my last hurdles aside from load testing to get it from POC to PROD. If anyone could lend a hand, would be much appreciated!
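
The counter-reset concern above can be checked by hand. The following is a minimal Python sketch, not InfluxDB's implementation: the sample values and the helper names `non_negative_rates` and `endpoint_rate` are made up for illustration. It compares averaging per-interval non-negative rates against a single derivative taken across the endpoints.

```python
from statistics import mean

def non_negative_rates(samples):
    """Per-second rates between consecutive (seconds, counter) samples,
    dropping any interval where the counter went backwards."""
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        rate = (c1 - c0) / (t1 - t0)
        if rate >= 0:
            rates.append(rate)
    return rates

def endpoint_rate(samples):
    """A single derivative using only the first and last sample."""
    (t0, c0), (t1, c1) = samples[0], samples[-1]
    return (c1 - c0) / (t1 - t0)

# Hypothetical octet counter polled every 60 s at a steady 1000 octets/s,
# cleared to zero just before the poll at t = 300 s.
samples = [
    (0, 500_000), (60, 560_000), (120, 620_000), (180, 680_000),
    (240, 740_000), (300, 0), (360, 60_000), (420, 120_000),
]

print(mean(non_negative_rates(samples)))  # 1000.0, only the reset interval is lost
print(endpoint_rate(samples))             # negative, the reset ruins the whole window
```

With these made-up numbers, the per-interval approach loses only the one interval that spans the reset, while the endpoint approach is wrong for the whole window.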

jackzampolin · April 12, 2017, 7:37pm

@jasonmkeller What I would say to do is downsample the derivative data into hour rollups and then query the data with mean() to produce the full day number. This should be quite performant. Is this an option for you?
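
As a rough illustration of that suggestion, here is a Python sketch of turning hourly rollup values into one number per UTC day. It assumes, as a simplification, that an hour with no value is simply left out of the average; `daily_means` and the sample values are hypothetical and not part of InfluxDB.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone
from statistics import mean

def daily_means(hourly_points):
    """Average hourly values per UTC day, skipping hours with no value."""
    by_day = defaultdict(list)
    for when, value in hourly_points:
        if value is not None:
            by_day[when.date().isoformat()].append(value)
    return {day: mean(values) for day, values in by_day.items()}

# Hypothetical day of hourly rollups: hour 0 is empty, hours 1..23 hold 1.0..23.0.
start = datetime(2017, 4, 25, tzinfo=timezone.utc)
hourly_points = [
    (start + timedelta(hours=h), None if h == 0 else float(h))
    for h in range(24)
]

print(daily_means(hourly_points))  # {'2017-04-25': 12.0}
```

Because every hour carries equal weight, the daily figure is an average of hourly rates; a missing hour shifts the result slightly rather than invalidating the day.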

jasonmkeller · April 12, 2017, 7:41pm

Exactly what I had started doing with this CQ ;). Here’s what I have so far…

CREATE CONTINUOUS QUERY "cq_interface" ON "telegraf" BEGIN 
  SELECT non_negative_derivative(max("ifHCInOctets"), 1s) * 8 AS "in_bandwidth",non_negative_derivative(max("ifHOutOctets"), 1s) * 8 AS "out_bandwidth" 
  INTO "coarse"."interface" 
  FROM "interface" 
  GROUP BY time(1h) 
END

This look correct? The part I’m nervous about is whether it will keep the series per-interface per-device, and whether it has enough yank to store all these hourly points for two years and still stay in top shape with only 8 GB of memory.
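
For intuition about what a query of this shape computes, here is a rough Python model of `non_negative_derivative(max(counter), 1s) * 8` grouped into one-hour buckets. It is a sketch under assumptions, not the engine's actual code: the function name, the sample data, and the choice to label each result with the later bucket's start time are all illustrative.

```python
def hourly_bits_per_second(samples, bucket=3600):
    """Take the max counter per bucket, then the per-second change between
    consecutive bucket maxima, drop negative changes, convert octets to bits."""
    maxima = {}
    for ts, counter in samples:
        start = ts - ts % bucket
        maxima[start] = max(counter, maxima.get(start, counter))
    starts = sorted(maxima)
    result = {}
    for prev, cur in zip(starts, starts[1:]):
        rate = (maxima[cur] - maxima[prev]) / (cur - prev)
        if rate >= 0:
            result[cur] = rate * 8
    return result

# Hypothetical counter polled every 10 minutes at a steady 1000 octets/s,
# reset to zero at t = 9000 s (halfway through the third hour).
samples = [
    (ts, 1000 * ts if ts < 9000 else 1000 * (ts - 9000))
    for ts in range(0, 18000, 600)
]

print(hourly_bits_per_second(samples))
# {3600: 8000.0, 7200: 4000.0, 14400: 8000.0}
```

In this model a reset under-reports the hour in which it happens and drops the following hour (its change is negative), but the hours on either side are unaffected.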

jackzampolin · April 12, 2017, 8:08pm

@jasonmkeller That looks good! A couple of notes:

I would make sure to use "db"."rp"."measurement" syntax on the FROM and INTO clauses.
In order to preserve your tags, make sure to GROUP BY time(1h), *

As far as performance, how many devices are you talking? Also the tsi work should help significantly reduce memory usage for use-cases like this moving forward.

jasonmkeller · April 12, 2017, 8:11pm

Somewhere in the neighborhood of 1600-1800 devices; will be querying every interface (about half of them are switches) and dropping at the telegraf node any that have their ifHighSpeed = 0 (which on our switches means not plugged in).
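
For a sense of scale, a back-of-the-envelope sketch in Python follows. The device count is the upper figure from the post above; the average of 10 active interfaces per device is purely an assumption for illustration, and `rollup_size` is a hypothetical helper. It only counts series and stored values and says nothing about actual memory use.

```python
HOURS_PER_YEAR = 24 * 365

def rollup_size(devices, interfaces_per_device, years, fields=2):
    """Count series, hourly points per series, and total field values."""
    series = devices * interfaces_per_device      # one series per device/interface tag pair
    points_per_series = HOURS_PER_YEAR * years    # one rollup point per hour
    return {
        "series": series,
        "points_per_series": points_per_series,
        "field_values": series * points_per_series * fields,
    }

# 10 interfaces per device is an assumed average, not a figure from the thread.
estimate = rollup_size(devices=1800, interfaces_per_device=10, years=2)
print(estimate)
# {'series': 18000, 'points_per_series': 17520, 'field_values': 630720000}
```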

jackzampolin · April 12, 2017, 8:14pm

@jasonmkeller at that scale you should have no problem.

jasonmkeller · April 12, 2017, 8:29pm

Thanks Jack; I should probably say that this is going to be essentially an ISP portal that many, many entities are going to be reviewing stats on (and if I do this right, they’ll be sitting on it all day). My tentative game plan at this time is to have Telegraf do SNMP gets and forward the results into InfluxDB, and have Logstash ingest Netflow data and forward that into Elasticsearch. Then it all gets graphed in Grafana.

I’m avoiding putting Netflow into InfluxDB due to the extremely high series cardinality of the data, which I’d imagine will choke InfluxDB to death (our Netflow streams will be very, very busy).

jackzampolin · April 12, 2017, 8:32pm

@jasonmkeller Yeah those netflow streams are difficult now. I would be interested to hear if you can test the tsi changes with the netflow data.

That setup seems reasonable. I’ve seen a couple other folks doing similar things.

jasonmkeller · April 12, 2017, 8:55pm

I’ve got a pretty extensive lab where I could probably spin it up for testing (I don’t have Netflow off our very busy 10Gbit links here at work to pipe to my lab, but I can generate a decent stream of Netflow/IPFIX data toward it from other equipment).

Is there a Netflow plugin for Telegraf? If you have some docs to follow for setup I’m sure I could have it crunching/testing it out.

jasonmkeller · April 12, 2017, 9:03pm

So the CQ just fired and for some reason I still don’t see any of the new data in Grafana. I see the new shard created but the fields never show up in Grafana.

Something I’m missing?

jasonmkeller · April 12, 2017, 9:13pm

I see them in the database though…

> SELECT * FROM "coarse"."interface" LIMIT 20
name: interface
time                agent_host    ifDescr                 in_bandwidth
----                ----------    -------                 ------------
1492027200000000000 165.234.97.2  TenGigabitEthernet0/2/0 2.7397862925555557e+08
1492027200000000000 165.234.97.2  TenGigabitEthernet0/0/0 2.1867722045244446e+09
1492027200000000000 165.234.97.1  TenGigabitEthernet0/2/0 4.536673712866667e+08
1492027200000000000 165.234.97.1  TenGigabitEthernet0/0/0 3.6410802055866666e+09

jackzampolin · April 12, 2017, 9:15pm

You need to specify the right RP in Grafana.

jasonmkeller · April 12, 2017, 9:27pm

Should be

non_negative_derivative(max("ifHCOutOctets"), 1s)

Sigh…silly typos raining on my parade

jackzampolin · April 12, 2017, 9:53pm

@jasonmkeller I was saying that in Grafana you need to make sure you are querying the right measurement. If they are named the same in different RPs you need to make sure you specify the right one.

jasonmkeller · April 13, 2017, 1:21pm

And ironically I did have the correct RP (coarse) selected, but Grafana wasn’t populating any of the new values from the interface measurement (however, typing it in manually in the field does pull the data). Probably a Grafana issue as a search on the DB command line shows clearly that the data is there and properly tagged.

I checked some data this morning and I’m crunching through curves by hand to see if they match up (which so far they appear to). That brings me markedly closer to deployment.

jasonmkeller · April 26, 2017, 2:01pm

Jack - hope you have some pointers. I just found an issue with our CQ…we’re getting blank values at midnight UTC every day on our rollups. Is this a bug or is this something messed up in my CQ?

time                 mean
----                 ----
2017-04-24T22:00:00Z 8.087672388333334e+08
2017-04-24T23:00:00Z 5.1333139691333336e+08
2017-04-25T00:00:00Z
2017-04-25T01:00:00Z 3.6771099004888886e+08
2017-04-25T02:00:00Z 2.7155306081333333e+08
2017-04-25T03:00:00Z 2.5967737418e+08
2017-04-25T04:00:00Z 2.2444512318666667e+08
2017-04-25T05:00:00Z 2.631946728888889e+08
2017-04-25T06:00:00Z 2.370487154511111e+08
2017-04-25T07:00:00Z 2.3160151775111112e+08
2017-04-25T08:00:00Z 2.8540324688222224e+08
2017-04-25T09:00:00Z 3.2931515402666664e+08
2017-04-25T10:00:00Z 3.1178820714e+08
2017-04-25T11:00:00Z 3.3946282773777777e+08
2017-04-25T12:00:00Z 1.5727707702222223e+09
2017-04-25T13:00:00Z 4.676712102415556e+09
2017-04-25T14:00:00Z 5.32045180074e+09
2017-04-25T15:00:00Z 5.563749168948889e+09
2017-04-25T16:00:00Z 5.266778042544444e+09
2017-04-25T17:00:00Z 5.04857448816e+09
2017-04-25T18:00:00Z 5.457751590317778e+09
2017-04-25T19:00:00Z 5.464998330106667e+09
2017-04-25T20:00:00Z 3.905074336771111e+09
2017-04-25T21:00:00Z 2.0503537871688888e+09
2017-04-25T22:00:00Z 9.482162149355556e+08
2017-04-25T23:00:00Z 5.871217549577777e+08
2017-04-26T00:00:00Z
2017-04-26T01:00:00Z 3.8481892825333333e+08
2017-04-26T02:00:00Z 2.802886362822222e+08
2017-04-26T03:00:00Z 2.5633334344e+08
2017-04-26T04:00:00Z 2.2594040661333334e+08
2017-04-26T05:00:00Z 2.3159232344444445e+08
2017-04-26T06:00:00Z 2.6096797832e+08
2017-04-26T07:00:00Z 2.3024448323777777e+08
2017-04-26T08:00:00Z 3.0825314184e+08
2017-04-26T09:00:00Z 3.197389336933333e+08
2017-04-26T10:00:00Z 2.874293079688889e+08
2017-04-26T11:00:00Z 3.282012284488889e+08
2017-04-26T12:00:00Z 1.5318391435044444e+09

name         query
----         -----
cq_interface CREATE CONTINUOUS QUERY cq_interface ON telegraf BEGIN SELECT non_negative_derivative(max(ifHCInOctets), 1s) * 8 AS in_bandwidth, non_negative_derivative(max(ifHCOutOctets), 1s) * 8 AS out_bandwidth INTO telegraf.coarse.interface FROM telegraf.fine.interface GROUP BY time(1h), * END

jackzampolin · April 26, 2017, 5:07pm

@jasonmkeller An easy way to get around this might be to add a RESAMPLE and EVERY clause to your CQ. This should ensure that those points get recorded properly.

jasonmkeller · April 27, 2017, 7:58pm

While I think this might induce more load, I’ve changed it according to your suggestion to…

CREATE CONTINUOUS QUERY "cq_interface" ON "telegraf"
RESAMPLE EVERY 30m
BEGIN 
SELECT non_negative_derivative(max("ifHCInOctets"), 1s) * 8 AS "in_bandwidth",non_negative_derivative(max("ifHCOutOctets"), 1s) * 8 AS "out_bandwidth" 
INTO "telegraf"."coarse"."interface" 
FROM "telegraf"."fine"."interface" 
GROUP BY time(1h), *
END

Hopefully this works; I’ll watch it for a couple days and let you know. Does anyone have any idea why it would have been dropping that hour each day?

jasonmkeller · April 28, 2017, 1:11pm

Jack - this failed to change the behavior of the CQ. I still see the gap last night…please advise.

jackzampolin · April 28, 2017, 4:19pm

@jasonmkeller Can you open an issue on InfluxDB?
