Kapacitor: Watch instances for downtime


I have a measurement to which a global monitor service reports the uptime of several instances of the same service. Every five minutes the monitor asks each instance for its uptime and forwards it to an InfluxDB measurement, which ends up looking like this (say the instances were started ten minutes ago):

service,serviceName=ServiceA uptime=600 <timestamp+0>
service,serviceName=ServiceB uptime=600 <timestamp+0>
service,serviceName=ServiceC uptime=600 <timestamp+0>
service,serviceName=ServiceA uptime=900 <timestamp+5m>
service,serviceName=ServiceB uptime=900 <timestamp+5m>
service,serviceName=ServiceC uptime=900 <timestamp+5m>
service,serviceName=ServiceA uptime=1200 <timestamp+10m>
service,serviceName=ServiceB uptime=1200 <timestamp+10m>
service,serviceName=ServiceC uptime=123 <timestamp+10m>

Now, every once in a while one of these instances will die, restart and reset its uptime to zero. Clients connected to such an instance will be redistributed among the remaining instances. That means at some point the difference between the current uptime and the previously recorded uptime will be negative. In the example above, ServiceC died and was restarted 123 seconds before the last uptime was collected.
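To make the "negative difference" idea concrete, here is a small Python sketch (Python rather than TICKscript, purely to illustrate the logic) that replays the sample points above and flags any service whose latest uptime is lower than the one before it. The function name `restarted_services` and the in-memory `samples` list are made up for this illustration; they are not part of Kapacitor or InfluxDB.

```python
from collections import defaultdict

# Sample points from the example above, in time order:
# (serviceName, uptime in seconds).
samples = [
    ("ServiceA", 600), ("ServiceB", 600), ("ServiceC", 600),
    ("ServiceA", 900), ("ServiceB", 900), ("ServiceC", 900),
    ("ServiceA", 1200), ("ServiceB", 1200), ("ServiceC", 123),
]


def restarted_services(points):
    """Return the services whose latest uptime is lower than the previous one."""
    history = defaultdict(list)
    for name, uptime in points:
        history[name].append(uptime)
    # A negative difference between the last two readings means a restart.
    return sorted(
        name
        for name, values in history.items()
        if len(values) >= 2 and values[-1] - values[-2] < 0
    )


print(restarted_services(samples))
```

For the data above only ServiceC has a negative difference (123 - 900), so it is the only service reported.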

I was able to come up with a simple alert that triggers if one service goes down by using a difference node:

        .warn(lambda: "difference" < 0)
        .message('Service {{ index .Tags "serviceName" }} went down')

This sends a WARN-level alert if one instance dies, which is inconvenient but not a big deal. So far so good.

What I want to do now is send a critical alert when at least 20% of all available services die at the same time. I’ve been racking my brain on this problem for the last couple of hours. I can easily figure out

  1. the total number of services available by using |count('uptime') on an ungrouped stream
  2. if an individual instance died with the TICKscript above

But I can’t figure out how to bring those two things together. I’m thinking about something along the lines of

var data = // something

data
    |eval(lambda: float("instancesDied") / float("instancesTotal"))
        .as('percentageDied')
    |alert()
        .crit(lambda: "percentageDied" >= 0.2)
        .message('At least 20% of instances died')

but I’m no longer sure that’s even the right approach.
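The arithmetic I have in mind, written out as a Python sketch (again only to illustrate the logic, not a TICKscript solution): count the instances whose latest uptime difference is negative, divide by the total number of instances, and compare against 0.2. The function `fraction_died` and the `latest` mapping are hypothetical names for this example; the values are the differences from the last collection in the sample data.

```python
def fraction_died(differences):
    """Fraction of instances whose latest uptime difference is negative.

    `differences` maps serviceName -> (current uptime - previous uptime).
    """
    if not differences:
        return 0.0
    died = sum(1 for diff in differences.values() if diff < 0)
    return died / len(differences)


# Differences at <timestamp+10m> in the sample data:
# ServiceA and ServiceB advanced by 300s, ServiceC dropped from 900 to 123.
latest = {"ServiceA": 300, "ServiceB": 300, "ServiceC": -777}

ratio = fraction_died(latest)
level = "CRITICAL" if ratio >= 0.2 else "OK"
print(f"{ratio:.2f} {level}")
```

With one of three instances restarted, the ratio is about 0.33, which is above the 20% threshold. What I can't work out is how to get both numbers into the same Kapacitor pipeline.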

Any help is appreciated.


I solved a similar problem by alerting when the uptime counter goes below 30s. I did this for both the JVMs on the boxes and the boxes themselves.