Generic Deadman Alerts On Sparse Events



Hi,

I’m recording events (e.g. "backup job completed", with a timestamp) in InfluxDB and would like to alert when the number of events in a given time period is below a threshold, or zero (e.g. a node went down and didn’t report at all). I think I need a batch TICKscript, because the start and end of each period matter and should not depend on when the *.tick file happened to be added. I’d also like the alerting to be resilient to Kapacitor crashes: if Kapacitor crashes and is restarted an hour later, it should not lose track of the groupBy() tags and cause me to miss alerts.
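To make the "not arbitrary" requirement concrete, here is a rough Python sketch of the window alignment I have in mind. It is not Kapacitor code; the 12-hour period and the function name `last_complete_window` are just assumptions for illustration. The idea is that the window boundaries depend only on the clock, so a restart would evaluate the same windows as before.

```python
from datetime import datetime, timedelta, timezone

PERIOD = timedelta(hours=12)  # assumed evaluation period
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def last_complete_window(now, period=PERIOD):
    """Return (start, end) of the most recent fully elapsed period.

    Boundaries are multiples of `period` counted from the Unix epoch,
    so they do not depend on when the job was deployed or restarted.
    """
    completed = (now - EPOCH) // period  # number of whole periods elapsed
    end = EPOCH + completed * period
    return end - period, end


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 13, 30, tzinfo=timezone.utc)
    start, end = last_complete_window(now)
    print(start.isoformat(), end.isoformat())
```

With a 12-hour period, any call made between 12:00 and 23:59 UTC on a given day would evaluate the 00:00 to 12:00 window of that day.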

Can someone help with a TICKscript, or suggest how to approach this type of problem with Kapacitor and InfluxDB? Or maybe even with a different system?

The first issue is that count() does not return 0 when there are no events in the period (https://github.com/influxdata/influxdb/issues/6412), so I need a combination of deadman() and alert().

The second issue is writing the alerts generically with groupBy() rather than once per tag value (e.g. per node). If I add a Kapacitor batch job now with a 12-hour period, I will miss nodes that have not reported anything for 12h+1s, because they never appear in the query results and so are not part of any group. I tried to write a script that counts events over the past 30 days, grouped into 12-hour buckets, and then takes the last count. That way I would get every tag value seen in the last 30 days. However, my attempts failed: the last value came out as 0, apparently because the timestamp of each group is the beginning of its time period.
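In case it helps, this is the logic I am trying to express, sketched in plain Python rather than TICKscript. Everything here is hypothetical: the function names, the `(node, timestamp)` event tuples, and the integer timestamps are only for illustration. The set of known nodes comes from the long lookback (e.g. 30 days), while counts come from the current window only, so a silent node shows up with a count of 0 instead of disappearing.

```python
from collections import Counter


def counts_with_zeros(lookback_events, window_start, window_end):
    """Count events per node in [window_start, window_end).

    `lookback_events` is an iterable of (node, timestamp) pairs covering
    the long lookback. Every node seen anywhere in the lookback is
    returned, with 0 if it reported nothing inside the window.
    """
    events = list(lookback_events)
    known_nodes = {node for node, _ in events}
    in_window = Counter(
        node for node, ts in events if window_start <= ts < window_end
    )
    # Counter returns 0 for nodes that have no events in the window.
    return {node: in_window[node] for node in sorted(known_nodes)}


def nodes_below_threshold(lookback_events, window_start, window_end, threshold=1):
    """Return the nodes whose event count in the window is below `threshold`."""
    counts = counts_with_zeros(lookback_events, window_start, window_end)
    return [node for node, n in counts.items() if n < threshold]


if __name__ == "__main__":
    events = [("a", 100), ("b", 100), ("a", 5000), ("c", 5100), ("c", 5200)]
    print(nodes_below_threshold(events, 5000, 6000, threshold=2))
```

In the example, node "b" only reported before the window, so it is flagged together with node "a", which reported once against a threshold of 2.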

Thank you.