Derivative downsample CQ

datalifecycle
influxql
influxdb
#1

Very new to InfluxDB but I have what I’d imagine will be a common question for anyone storing network interface counters. Right now I’m pouring in the following fields…

ifHCInOctets
ifHCOutOctets

Obviously, these are cumulative counters per-interface per-device. When downsampling them, I have to first determine the rate (non-negative-derivative) per second, then I must average that derivative over the entire day. I’m trying to make this resistant to counter rolls (i.e. if someone clears/resets the counter in the middle of the day, it shouldn’t ruin the entire day worth of derivative). I’m unsure if I simply tell it to use the derivative across the entire day if it will withstand this or just take the endpoints (which will be wholly wrong in this instance) and attempt a single derivative across it.

Basic gist, I have to downsample all of the data per-interface per-device from minute resolution to daily resolution for trending use, and hold it for possibly years.

Once again, very new to the InfluxQL language, but so far this has been very, very promising, and is one of my last hurdles aside from load testing to get it from POC to PROD. If anyone could lend a hand, would be much appreciated!

Using Kapacitor to rollup metrics
#2

@jasonmkeller What I would say to do is downsample the derivative data into hour rollups and the query the data with mean() to produce the full day number. This should be quite performant. Is this an option for you?

#3

Exactly what I had started doing with this CQ ;). Here’s what I have so far…

CREATE CONTINUOUS QUERY "cq_interface" ON "telegraf" BEGIN 
  SELECT non_negative_derivative(max("ifHCInOctets"), 1s) * 8 AS "in_bandwidth",non_negative_derivative(max("ifHOutOctets"), 1s) * 8 AS "out_bandwidth" 
  INTO "coarse"."interface" 
  FROM "interface" 
  GROUP BY time(1h) 
END

This look correct? The part I’m nervous about is whether it will keep the series per-interface per-device, and whether it has enough yank to store all these hourly points for two years and still stay in top shape with only 8 GB of memory.

#4

@jasonmkeller That looks good! A couple of notes:

  • I would make sure to use "db"."rp"."measurement" syntax on the FROM and INTO clauses.
  • In order to preserve your tags, make sure to GROUP BY time(1h), *

As far as performance, how many devices are you talking? Also the tsi work should help significantly reduce memory usage for use-cases like this moving forward.

#5

Somewhere in the neighborhood of 1600-1800 devices; will be querying every interface (about half of them are switches) and dropping at the telegraf node any that have their ifHighSpeed = 0 (which on our switches means not plugged in).

#6

@jasonmkeller at that scale you should have no problem.

#7

Thanks Jack; I should probably say that this is going to be essentially an ISP portal that many, many entities are going to be reviewing stats on (and if I do this right, they’ll be sitting on it all day :slight_smile: ). My tentative gameplan now at this time is to have Telegraf doing SNMP gets and forwarding it into InfluxDB, then having Logstash ingesting Netflow data and forwarding that into Elasticsearch. Then it all gets graphed in Grafana.

I’m avoiding putting Netflow into InfluxDB due to the extremely high series cardinality of the data, which I’d imagine will choke InfluxDB to death (our Netflow streams will be very, very busy).

#8

@jasonmkeller Yeah those netflow streams are difficult now. I would be interested to hear if you can test the tsi changes with the netflow data.

That setup seems :ok_hand:. I’ve seen a couple other folks doing similar things.

#9

I’ve got a pretty extensive lab I could probably spin it up for testing (I don’t have Netflow off our very busy 10Gbit links here at work to pipe to my lab, but I can generate a decent stream of Netflow/IPFIX data toward it from other equipment).

Is there a Netflow plugin for Telegraf? If you have some docs to follow for setup I’m sure I could have it crunching/testing it out.

#10

So the CQ just fired and for some reason I still don’t see any of the new data in Grafana. I see the new shard created but the fields never show up in Grafana.

Something I’m missing?

#11

I see them in the database though…

> SELECT * FROM "coarse"."interface" LIMIT 20
name: interface
time                agent_host    ifDescr                 in_bandwidth
----                ----------    -------                 ------------
1492027200000000000 165.234.97.2  TenGigabitEthernet0/2/0 2.7397862925555557e+08
1492027200000000000 165.234.97.2  TenGigabitEthernet0/0/0 2.1867722045244446e+09
1492027200000000000 165.234.97.1  TenGigabitEthernet0/2/0 4.536673712866667e+08
1492027200000000000 165.234.97.1  TenGigabitEthernet0/0/0 3.6410802055866666e+09
#12

You need to specify the right RP in grafana.

#14

Should be

non_negative_derivative(max("ifHCOutOctets"), 1s)

Sigh…silly typos raining on my parade

#15

@jasonmkeller I was saying that in Grafana you need to make sure you are querying the right measurement. If they are in named the same in different RPs you need to make sure you specify the right one.

#16

And ironically I did have the correct RP (coarse) selected, but Grafana wasn’t populating any of the new values from the interface measurement (however, typing it in manually in the field does pull the data). Probably a Grafana issue as a search on the DB command line shows clearly that the data is there and properly tagged.

I checked some data this morning and I’m crunching through curves by hand to see if they match up (which so far they appear to). That brings me markedly closer to deployment :smile:

1 Like
#17

Jack - hope you have some pointers. I just found an issue with our CQ…we’re getting blank values at midnight UTC every day on our rollups. Is this a bug or is this something messed up in my CQ?

> time                 mean
> ----                 ----
> 2017-04-24T22:00:00Z 8.087672388333334e+08
> 2017-04-24T23:00:00Z 5.1333139691333336e+08
> 2017-04-25T00:00:00Z
> 2017-04-25T01:00:00Z 3.6771099004888886e+08
> 2017-04-25T02:00:00Z 2.7155306081333333e+08
> 2017-04-25T03:00:00Z 2.5967737418e+08
> 2017-04-25T04:00:00Z 2.2444512318666667e+08
> 2017-04-25T05:00:00Z 2.631946728888889e+08
> 2017-04-25T06:00:00Z 2.370487154511111e+08
> 2017-04-25T07:00:00Z 2.3160151775111112e+08
> 2017-04-25T08:00:00Z 2.8540324688222224e+08
> 2017-04-25T09:00:00Z 3.2931515402666664e+08
> 2017-04-25T10:00:00Z 3.1178820714e+08
> 2017-04-25T11:00:00Z 3.3946282773777777e+08
> 2017-04-25T12:00:00Z 1.5727707702222223e+09
> 2017-04-25T13:00:00Z 4.676712102415556e+09
> 2017-04-25T14:00:00Z 5.32045180074e+09
> 2017-04-25T15:00:00Z 5.563749168948889e+09
> 2017-04-25T16:00:00Z 5.266778042544444e+09
> 2017-04-25T17:00:00Z 5.04857448816e+09
> 2017-04-25T18:00:00Z 5.457751590317778e+09
> 2017-04-25T19:00:00Z 5.464998330106667e+09
> 2017-04-25T20:00:00Z 3.905074336771111e+09
> 2017-04-25T21:00:00Z 2.0503537871688888e+09
> 2017-04-25T22:00:00Z 9.482162149355556e+08
> 2017-04-25T23:00:00Z 5.871217549577777e+08
> 2017-04-26T00:00:00Z
> 2017-04-26T01:00:00Z 3.8481892825333333e+08
> 2017-04-26T02:00:00Z 2.802886362822222e+08
> 2017-04-26T03:00:00Z 2.5633334344e+08
> 2017-04-26T04:00:00Z 2.2594040661333334e+08
> 2017-04-26T05:00:00Z 2.3159232344444445e+08
> 2017-04-26T06:00:00Z 2.6096797832e+08
> 2017-04-26T07:00:00Z 2.3024448323777777e+08
> 2017-04-26T08:00:00Z 3.0825314184e+08
> 2017-04-26T09:00:00Z 3.197389336933333e+08
> 2017-04-26T10:00:00Z 2.874293079688889e+08
> 2017-04-26T11:00:00Z 3.282012284488889e+08
> 2017-04-26T12:00:00Z 1.5318391435044444e+09

name         query
----         -----
cq_interface CREATE CONTINUOUS QUERY cq_interface ON telegraf BEGIN SELECT non_negative_derivative(max(ifHCInOctets), 1s) * 8 AS in_bandwidth, non_negative_derivative(max(ifHCOutOctets), 1s) * 8 AS out_bandwidth INTO telegraf.coarse.interface FROM telegraf.fine.interface GROUP BY time(1h), * END
#18

@jasonmkeller An easy way to get around this might be to add a RESAMPLE and EVERY clause to your CQ. This should ensure that those points get recorded properly.

#19

While I think this might induce more load, I’ve changed it according to your suggestion to…

CREATE CONTINUOUS QUERY "cq_interface" ON "telegraf"
RESAMPLE EVERY 30m
BEGIN 
SELECT non_negative_derivative(max("ifHCInOctets"), 1s) * 8 AS "in_bandwidth",non_negative_derivative(max("ifHCOutOctets"), 1s) * 8 AS "out_bandwidth" 
INTO "telegraf"."coarse"."interface" 
FROM "telegraf"."fine"."interface" 
GROUP BY time(1h), * END`

Hopefully this works; I’ll watch it for a couple days and let you know. Does anyone have any idea why it would have been dropping that hour each day?

#20

Jack - this failed to change the behavior of the CQ. I still see the gap last night…please advise.

#21

@jasonmkeller Can you open an issue on InfluxDB?