Kapacitor: Watch instances for downtime

I have a measurement where several instances of the same service have their uptime reported by a global monitor service. Every five minutes that monitor asks each instance for its uptime and forwards it to an InfluxDB measurement, which ends up looking like this (say the instances were started ten minutes ago):

service,serviceName=ServiceA uptime=600 <timestamp+0>
service,serviceName=ServiceB uptime=600 <timestamp+0>
service,serviceName=ServiceC uptime=600 <timestamp+0>
service,serviceName=ServiceA uptime=900 <timestamp+5m>
service,serviceName=ServiceB uptime=900 <timestamp+5m>
service,serviceName=ServiceC uptime=900 <timestamp+5m>
service,serviceName=ServiceA uptime=1200 <timestamp+10m>
service,serviceName=ServiceB uptime=1200 <timestamp+10m>
service,serviceName=ServiceC uptime=123 <timestamp+10m>

Now, every once in a while one of these instances will die, restart and reset its uptime to zero. Clients connected to such an instance will be redistributed among the remaining instances. That means that at some point the difference between the current uptime and the previously recorded uptime will be negative. In the example above, ServiceC died and was restarted 123 seconds before the last uptime was collected.

I was able to come up with a simple alert that triggers if one service goes down by using a difference node:

stream
    |from()
        .measurement('service')
        .groupBy('serviceName')
    |difference('uptime')
    |alert()
        .warn(lambda: "difference" < 0)
        .message('Service {{ index .Tags "serviceName" }} went down')

This sends a WARN-level alert if one instance dies, which is inconvenient but not a big deal. So far so good.

What I want to do now is send a critical alert when at least 20% of all available services die at the same time. I’ve been racking my brain over this problem for the last couple of hours. I can easily figure out

  1. the total number of services available by using |count('uptime') on an ungrouped stream (see the sketch after this list)
  2. if an individual instance died with the TICKscript above
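
For point 1, something along these lines seems to give me the total (I’m windowing over the five-minute reporting interval so count has something to aggregate over):

stream
    |from()
        .measurement('service')
    // no groupBy, so all instances end up in one series
    |window()
        .period(5m)
        .every(5m)
    // one uptime point per instance and window -> number of instances
    |count('uptime')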

But I can’t figure out how to bring those two things together. I’m thinking about something along the lines of

var data = // something

data
    |eval(lambda: float("instancesDied") / float("instancesTotal"))
        .as('percentageDied')
    |alert()
        .crit(lambda: "percentageDied" >= 0.2)
        .message('More than 20% of instances died')

but I’m not even sure anymore whether that’s the right approach.

Any help is appreciated.


I solved a similar problem by alerting when the uptime counter goes below 30s. I did this for both the JVMs on the boxes and the boxes themselves.
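
Against the measurement from the question, a minimal sketch of that idea could look like this (the threshold is a placeholder; it needs to be at least as large as the reporting interval to catch every restart, and 30s only works for me because my polling is much more frequent):

stream
    |from()
        .measurement('service')
        .groupBy('serviceName')
    // a freshly restarted instance reports an uptime below the threshold
    |alert()
        .warn(lambda: "uptime" < 600.0)
        .message('Service {{ index .Tags "serviceName" }} appears to have restarted')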

I don’t think this is possible with Kapacitor, because once you use the GroupBy node, according to the documentation, “Each group is then processed independently for the rest of the pipeline.” There is no “UnGroupBy” node that would join the individual service series back together.

This is not possible with Batch tasks either, because they only support simple queries without nesting.

What is possible is to create an InfluxQL query with a nested subquery. The inner query would group by instance (and interval) and filter out the instances that reset; the outer query would count the number of reset instances in each interval (grouping only by time and applying the count aggregator to the series).
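
Roughly something of this shape (names, thresholds and time ranges are placeholders; a reset shows up as a negative difference() of the uptime, as in the question):

SELECT count("delta") AS "instancesDied"
FROM (
    -- per-instance change of the uptime counter; negative means a restart
    SELECT difference("uptime") AS "delta"
    FROM "service"
    WHERE time > now() - 1h
    GROUP BY "serviceName"
)
WHERE "delta" < 0 AND time > now() - 1h
GROUP BY time(5m)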

Getting a % change is more difficult (because differencePercent is also missing): wrap that count in an additional difference() call and divide by the count. Ideally you would use the denominator (the count you divide by) from the previous interval, so you would probably need an additional query, time-shifted by one interval, joined to the first, already complex, query by matching on all the GROUP BY tags…

Alternatively, you can write the intermediate count result back into InfluxDB (perhaps with a continuous query) and have a separate, simpler query calculate the percentage of reset instances.
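
Purely as an illustration, assuming the counts were written back as fields instancesDied and instancesTotal of a hypothetical measurement service_resets, the follow-up query could then be as simple as:

SELECT "instancesDied" * 100.0 / "instancesTotal" AS "percentageDied"
FROM "service_resets"
WHERE time > now() - 1h

An alert would then only have to check percentageDied >= 20.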

I suppose the same could be expressed with Flux in a more structured way, but it might take up a full page of code. Ultimately Flux will run from what are now Kapacitor tasks and feed data into alerts, so I think it is worth trying out 2.0, although I’m not sure when it will be released, maybe 2021…