Using Kapacitor to rollup metrics

I’m currently in a quandary: I have to roll up metrics and store them for years at a time, but I only have Elasticsearch as a backend, not InfluxDB. @daniel suggested that Kapacitor could potentially take the stream data from Telegraf, roll it up, and send it back to Telegraf; however, I’m a bit perplexed about how to configure this.

What I’m trying to do is thus…

I have roughly 1600 devices that I’ll be querying many interfaces on, pulling inbound and outbound octets. I’ll be polling them with Telegraf via SNMP every 2.5 minutes, storing this resolution of data for a rolling 30 days in daily indices in Elasticsearch.

My hope is that Kapacitor can then roll the data up into daily-resolution points that I can store long term. To achieve this, I’d need to group it per interface per agent host, first taking derivatives across all the fine-grained points (byte counters are cumulative, as you’re no doubt well aware, and per-point derivatives resist counter rolls/clears) and then calculating the mean of those derivatives to get a single daily mean of bandwidth utilization for each interface on each device.

I’m flailing around in the lab with it, but if anyone has any pointers I’d be grateful.

@jasonmkeller What are you having trouble with?

To start at the top, I’ve installed Kapacitor and tried to enable it to start on system startup, but I get this error…

systemctl enable kapacitor
Failed to execute operation: Too many levels of symbolic links

On the script syntax, I’m completely lost as to how, in TICKscript, to nest the two operations (derivative, then mean) with different time windows. From all outward appearances, even if I manage to crack this, I’ll have to have two TICKscripts (one for each field), right?

@jasonmkeller Hmm… That sounds like a packaging issue. Getting both of those operations into one TICKscript should be doable.

Can you share what OS you are using and the /etc/systemd/system/kapacitor.service or /etc/init.d/kapacitor file?

RHEL7…

# ls -lh /etc/systemd/system
lrwxrwxrwx. 1 root root   41 Apr  7 10:01 kapacitor.service -> /usr/lib/systemd/system/kapacitor.service
-rw-rw-r--. 1 root root  466 Mar 22 22:47 kibana.service
-rw-r--r--. 1 root root  511 Mar 30 13:47 logstash.service

Looks like it probably doesn’t like that the unit file there is itself a symlink; systemctl wants to symlink to the actual service file.

@jasonmkeller Can you open an issue on Kapacitor with this info? We like to get those build issues taken care of. I’m checking on the TICKscript question for you.

RHEL7 failed to enable service #1309 is on GitHub now.


From @jasonmkeller on the GitHub issue:

Oooh! That poke did it. I’d have never guessed to switch an output from Telegraf to the API endpoint of Kapacitor (I thought that was only for commands to it, not data).

$ kapacitor stats ingress
Database    Retention Policy  Measurement  Points Received
_kapacitor  autogen           edges        15
_kapacitor  autogen           ingress      15
_kapacitor  autogen           kapacitor    3
_kapacitor  autogen           nodes        12
_kapacitor  autogen           runtime      3
telegraf                      dhcp         2540
telegraf                      interface    9687
telegraf                      system       12

I’m hoping I’m at least close with this configuration, but I can’t be perfectly sure how it’s going to turn out (if it’s going to be bucketed properly, how it’s rolling up the interval, how much memory it needs, etc.).

Great, glad it’s working.

Memory and CPU are big concerns; right now my production flight-list will have Kapacitor co-located on the same node with Telegraf with 4 vCPU and 8GB of memory (I’m pushing for 16GB, but not sure I’m going to get it). I have 1600 devices I’ll be polling probably every minute.

Yeah, this is going to be the biggest concern, since you will have to buffer all the data in RAM for the aggregation period, which you mentioned is a day.

Honestly, I don’t think you will be happy with this workflow as long as InfluxDB is not in the picture, since any disruption in Kapacitor will mean data loss in your aggregate metrics.

Here is a quick rundown of how I would do this using InfluxDB as a short-term buffer.

Telegraf → InfluxDB → Kapacitor → ES

Then configure the retention policy on InfluxDB to be, say, 2 days, and write all the downsampling tasks as batch tasks that query the buffered data out of InfluxDB for the past day and write it to ES.

This way, if Kapacitor ever has an issue, you have 2 days to make an API call telling Kapacitor to re-run a task for the time period it was down. InfluxDB resource usage should be small, and it will also reduce the amount of RAM Kapacitor needs to consume, since the data is buffered on disk instead.
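
To make that concrete, a minimal sketch of one such batch task is below. The database, retention policy, measurement, tag, and field names (telegraf, autogen, interface, agent_host, ifName, ifHCInOctets) and the 1d period are assumptions standing in for whatever your setup actually uses, and the final write step is left as a comment since getting the result to ES is discussed separately.

batch
    // pull one day of buffered points per interface out of InfluxDB
    |query('SELECT "ifHCInOctets" FROM "telegraf"."autogen"."interface"')
        .period(1d)
        .every(1d)
        .groupBy('agent_host', 'ifName')
    // per-point rate; negative results from counter clears/rolls are dropped
    |derivative('ifHCInOctets')
        .unit(1s)
        .nonNegative()
    // one mean rate per interface per day
    |mean('ifHCInOctets')
        .as('in_bytes_per_sec')
    // write the daily point out (e.g. back through Telegraf to ES)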

That is how I would recommend going forward. But try it out and see how it works.

As for the tasks you want, they sound like they will be straightforward, with the exception of having Kapacitor write to ES. What is your plan there?

Also, if you do run into trouble writing up the tasks, please reach out.

Right now I’m flushing the data from Kapacitor back to Telegraf via an HTTP input to take advantage of the native Elasticsearch output plugin (I built from master to take that for a spin prior to 1.3 dropping, and thus far the performance has been markedly better, roughly 80% better, than the GELF output, which went to Logstash and then to ES). I’m filtering by name via the namepass and namedrop parameters to put the metrics in different indices.

Kind of feels kludgy, but should be OK. We can tolerate some data loss on the aggregate portion that Kapacitor is serving, so I figure if I have enough IOPS and performance out of ES, we could probably do 3-hour aggregations and end up with around 80 million rows per year (which, if my math is right, would be around 4 GB in Elasticsearch with my current template parameters).

My biggest fear at this point (aside from the whole thing bursting into flame :wink: ) is actually how the derivative aggregations are going to be done in Kapacitor and whether it will tolerate a counter clear/roll. If it’s doing a derivative between each pair of rows and then averaging across those, then it shouldn’t miss a beat. However, if it attempts to just take the min/max during the interval and then do a derivative across the whole time interval, a counter roll/clear would ruin that entire interval, which is something I’m trying to avoid.

Ugh. Looks like my Kapacitor output isn’t going anywhere (not getting any data in Telegraf/ES).

# tail -50 /var/log/kapacitor/kapacitor.log
[httpd] 127.0.0.1 - - [11/Apr/2017:16:54:10 -0500] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "-" 64b4765d-1f01-11e7-82ed-000000000000 458
[httpd] 127.0.0.1 - - [11/Apr/2017:16:54:20 -0500] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "-" 6aaa58f3-1f01-11e7-82ee-000000000000 154
[httpd] 127.0.0.1 - - [11/Apr/2017:16:54:30 -0500] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "-" 70a03aa2-1f01-11e7-82ef-000000000000 1020
[httpd] 127.0.0.1 - - [11/Apr/2017:16:54:40 -0500] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "-" 76961ec2-1f01-11e7-82f0-000000000000 1012
[httpd] 127.0.0.1 - - [11/Apr/2017:16:54:50 -0500] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "-" 7c8bfd40-1f01-11e7-82f1-000000000000 106
[httpd] 127.0.0.1 - - [11/Apr/2017:16:55:00 -0500] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "-" 8281f09b-1f01-11e7-82f2-000000000000 223
[httpd] 127.0.0.1 - - [11/Apr/2017:16:55:01 -0500] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "-" 82f5d473-1f01-11e7-82f3-000000000000 12661
[httpd] 127.0.0.1 - - [11/Apr/2017:16:55:01 -0500] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "-" 8323fd2e-1f01-11e7-82f4-000000000000 5154
[httpd] 127.0.0.1 - - [11/Apr/2017:16:55:02 -0500] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "-" 839ceed4-1f01-11e7-82f5-000000000000 8039
[httpd] 127.0.0.1 - - [11/Apr/2017:16:55:02 -0500] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "-" 83bb9cb6-1f01-11e7-82f6-000000000000 10033
[httpd] 127.0.0.1 - - [11/Apr/2017:16:55:03 -0500] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "-" 8423ce7f-1f01-11e7-82f7-000000000000 6578
[httpd] 127.0.0.1 - - [11/Apr/2017:16:55:04 -0500] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "-" 84d30551-1f01-11e7-82f8-000000000000 5024
[httpd] 127.0.0.1 - - [11/Apr/2017:16:55:06 -0500] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "-" 85fedb21-1f01-11e7-82f9-000000000000 9536
[httpd] 127.0.0.1 - - [11/Apr/2017:16:55:06 -0500] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "-" 864ef17f-1f01-11e7-82fa-000000000000 4950
[httpd] 127.0.0.1 - - [11/Apr/2017:16:55:07 -0500] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "-" 86ed953b-1f01-11e7-82fb-000000000000 6224
[httpd] 127.0.0.1 - - [11/Apr/2017:16:55:10 -0500] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "-" 8877d615-1f01-11e7-82fc-000000000000 6255
[httpd] 127.0.0.1 - - [11/Apr/2017:16:55:15 -0500] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "-" 8b9957f1-1f01-11e7-82fd-000000000000 9622
[httpd] 127.0.0.1 - - [11/Apr/2017:16:55:17 -0500] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "-" 8c963fce-1f01-11e7-82fe-000000000000 4666
[httpd] 127.0.0.1 - - [11/Apr/2017:16:55:20 -0500] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "-" 8e6da955-1f01-11e7-82ff-000000000000 6424
[httpd] 127.0.0.1 - - [11/Apr/2017:16:55:30 -0500] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "-" 94638ee4-1f01-11e7-8300-000000000000 4319
[httpd] 127.0.0.1 - - [11/Apr/2017:16:55:40 -0500] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "-" 9a596a1c-1f01-11e7-8301-000000000000 6816
[httpd] 127.0.0.1 - - [11/Apr/2017:16:55:50 -0500] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "-" a04f481b-1f01-11e7-8302-000000000000 3127
[httpd] 127.0.0.1 - - [11/Apr/2017:16:56:00 -0500] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "-" a6452037-1f01-11e7-8303-000000000000 218
[httpd] 127.0.0.1 - - [11/Apr/2017:16:56:10 -0500] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "-" ac3b02d8-1f01-11e7-8304-000000000000 506
[httpd] 127.0.0.1 - - [11/Apr/2017:16:56:30 -0500] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "-" b826cd49-1f01-11e7-8305-000000000000 2069
[httpd] 127.0.0.1 - - [11/Apr/2017:16:56:40 -0500] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "-" be1ca73a-1f01-11e7-8306-000000000000 841
[httpd] 127.0.0.1 - - [11/Apr/2017:16:56:50 -0500] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "-" c41285e0-1f01-11e7-8307-000000000000 113
[httpd] 127.0.0.1 - - [11/Apr/2017:16:57:00 -0500] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "-" ca0867fd-1f01-11e7-8308-000000000000 186
[httpd] 127.0.0.1 - - [11/Apr/2017:16:57:10 -0500] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "-" cffe47d1-1f01-11e7-8309-000000000000 438
[httpd] 127.0.0.1 - - [11/Apr/2017:16:57:20 -0500] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "-" d5f4295e-1f01-11e7-830a-000000000000 105
[httpd] 127.0.0.1 - - [11/Apr/2017:16:57:30 -0500] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "-" dbeac94d-1f01-11e7-830b-000000000000 4209
[httpd] 127.0.0.1 - - [11/Apr/2017:16:57:31 -0500] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "-" dc5cefe5-1f01-11e7-830c-000000000000 5061
[httpd] 127.0.0.1 - - [11/Apr/2017:16:57:31 -0500] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "-" dc771858-1f01-11e7-830d-000000000000 5947
[httpd] 127.0.0.1 - - [11/Apr/2017:16:57:31 -0500] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "-" dcc840bb-1f01-11e7-830e-000000000000 10306
[httpd] 127.0.0.1 - - [11/Apr/2017:16:57:32 -0500] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "-" dcfcf13a-1f01-11e7-830f-000000000000 4893
[httpd] 127.0.0.1 - - [11/Apr/2017:16:57:32 -0500] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "-" dd227206-1f01-11e7-8310-000000000000 10708
[httpd] 127.0.0.1 - - [11/Apr/2017:16:57:32 -0500] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "-" dd61d9aa-1f01-11e7-8311-000000000000 5136
[httpd] 127.0.0.1 - - [11/Apr/2017:16:57:33 -0500] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "-" dddc4ce7-1f01-11e7-8312-000000000000 8443
[httpd] 127.0.0.1 - - [11/Apr/2017:16:57:33 -0500] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "-" ddf2d503-1f01-11e7-8313-000000000000 4512
[httpd] 127.0.0.1 - - [11/Apr/2017:16:57:34 -0500] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "-" de5f391b-1f01-11e7-8314-000000000000 8489
[httpd] 127.0.0.1 - - [11/Apr/2017:16:57:36 -0500] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "-" df68f329-1f01-11e7-8315-000000000000 4720
[httpd] 127.0.0.1 - - [11/Apr/2017:16:57:36 -0500] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "-" dfabe53d-1f01-11e7-8316-000000000000 8176
[httpd] 127.0.0.1 - - [11/Apr/2017:16:57:38 -0500] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "-" e0d4c833-1f01-11e7-8317-000000000000 5016
[httpd] 127.0.0.1 - - [11/Apr/2017:16:57:40 -0500] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "-" e1dff157-1f01-11e7-8318-000000000000 3005
[httpd] 127.0.0.1 - - [11/Apr/2017:16:57:44 -0500] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "-" e46e86d8-1f01-11e7-8319-000000000000 5806
[httpd] 127.0.0.1 - - [11/Apr/2017:16:57:45 -0500] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "-" e51cc47f-1f01-11e7-831a-000000000000 7294
[httpd] 127.0.0.1 - - [11/Apr/2017:16:57:50 -0500] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "-" e7d75070-1f01-11e7-831b-000000000000 2969
[httpd] 127.0.0.1 - - [11/Apr/2017:16:58:00 -0500] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "-" edcbbf43-1f01-11e7-831c-000000000000 3562
[httpd] 127.0.0.1 - - [11/Apr/2017:16:58:10 -0500] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "-" f3c19b69-1f01-11e7-831d-000000000000 8909
[httpd] 127.0.0.1 - - [11/Apr/2017:16:58:30 -0500] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "-" ffad5fc5-1f01-11e7-831e-000000000000 1214

For the derivative calculation it should be something like this:

stream
    |from()
        .measurement('m')
        .groupBy(*)
    |derivative('value')
        .unit(1s)
        // This will handle the clear/rollover of the counter.
        // The derivative operation compares consecutive pairs of points and computes the derivative;
        // with .nonNegative set, if the result comes out negative because of a clear or overflow,
        // that point is simply dropped (since we can't reliably determine what the value should be).
        // The end result is that all derivative values are non-negative, eliminating the large
        // negative spike a clear/overflow would otherwise cause.
        .nonNegative()
    // Perform any aggregation here
    // send to Telegraf/ES
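
Filling in the aggregation and output placeholders, a complete daily rollup task might look roughly like the sketch below. This is only a sketch: the database, measurement, tag, and field names (telegraf, interface, agent_host, ifName, ifHCInOctets, ifHCOutOctets) are assumptions standing in for whatever the SNMP input actually emits, and the window size matches the daily rollup described earlier. It also shows that both fields can live in one TICKscript by branching two pipelines off the same from() node; the final write step is left as a comment since getting the data into ES (e.g. back through Telegraf) depends on the rest of the setup.

// source stream, grouped per interface per device
var raw = stream
    |from()
        .database('telegraf')
        .measurement('interface')
        .groupBy('agent_host', 'ifName')

// inbound octets: per-point rate, then a single daily mean
raw
    |derivative('ifHCInOctets')
        .unit(1s)
        .nonNegative()
    |window()
        .period(1d)
        .every(1d)
    |mean('ifHCInOctets')
        .as('in_bytes_per_sec')
    // send to Telegraf/ES

// outbound octets: same pipeline off the same source,
// so both fields are handled in one script
raw
    |derivative('ifHCOutOctets')
        .unit(1s)
        .nonNegative()
    |window()
        .period(1d)
        .every(1d)
    |mean('ifHCOutOctets')
        .as('out_bytes_per_sec')
    // send to Telegraf/ES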

Thanks @nathaniel; unfortunately, due to shifting requirements here, I’m having to pivot away from using Kapacitor for this and instead store all the SNMP data in InfluxDB, simply doing a CQ in InfluxDB for the rollup aggregation to keep the stack as simple and stateful as possible (which, ironically, we managed to knock out in a single day yesterday!)

@jackzampolin worked out a CQ with me here to perform the aggregation, and it looks like it’s working quite well (Derivative downsample CQ).

Thank you for all your help and quick responses! I greatly appreciate it :smile:


@jasonmkeller Glad I could help! :smile: