Kapacitor: Processing and Alerting on Sparse Data

As always, any and all help is greatly appreciated.

I have a somewhat strange situation where I need to alert on
measurements that send data to Influx on an irregular basis. As a
result, the absence of data points should be treated as OK. I am having
trouble figuring out how to write the correct TICKscript to do what I
need.

I know that sparse data is not what Kapacitor is best at, but I am still
trying to use it to get this job done.

A Kapacitor issue
(https://github.com/influxdata/kapacitor/issues/1039) describes a
similar problem, and a few partial workarounds were offered there. I am
trying to implement one of them.

We have a set of jobs that run periodically and take a variable amount
of time to complete. When a job hits a failure condition, it emits a
metric to Influx. We end up with a time series like this:

app.foobarbaz.failures,name=app1 value=1 01-01-2018@01:30
app.foobarbaz.failures,name=app1 value=3 01-01-2018@02:00
app.foobarbaz.failures,name=app1 value=1 01-01-2018@02:30
app.foobarbaz.failures,name=app1 value=2 01-01-2018@03:30
app.foobarbaz.failures,name=app1 value=4 01-01-2018@05:30

app.foobarbaz.failures,name=app1 value=4 01-01-2018@10:30
app.foobarbaz.failures,name=app1 value=4 01-01-2018@11:30
app.foobarbaz.failures,name=app1 value=4 01-01-2018@12:00
app.foobarbaz.failures,name=app1 value=4 01-01-2018@12:30
app.foobarbaz.failures,name=app1 value=4 01-01-2018@13:00

app.foobarbaz.failures,name=app1 value=1 01-02-2018@01:30
app.foobarbaz.failures,name=app1 value=3 01-02-2018@02:00
app.foobarbaz.failures,name=app1 value=1 01-02-2018@02:30
app.foobarbaz.failures,name=app1 value=2 01-02-2018@03:30
app.foobarbaz.failures,name=app1 value=4 01-02-2018@05:30

Every hour we look back over the last 6 hours of "value" for this
measurement, and if the sum over that period is greater than x, we alert.

The problem is that we can go more than 6 hours without any points being
sent into the system. So if we enter a failure state and no further
failures happen for more than 6 hours, we stay in the failure state,
because Kapacitor has no points to process. The Default and Fill nodes
don't help here, because they only act on points that actually arrive.

In our case it would also be acceptable to insert a point with value=0
to kick-start the system. I could use a secondary TICKscript with a
deadman node feeding an InfluxDBOut node, but I really, really do not
want to deal with two tasks for every job.

Does anyone have suggestions on how I might handle this?

This is the TICKscript we have. It works fine as long as points keep coming into the system.

var db          = 'mydb'
var rp          = 'autogen'
var measurement = 'app.foobarbaz.failures'  // was misspelled 'failues', which matches no data
var groupBy     = []
var whereFilter = lambda: ("name" == 'app1')
var period      = 6h
var every       = 1h
var crit        = 5

var data = stream
    |from()
        .database(db)
        .retentionPolicy(rp)
        .measurement(measurement)
        .groupBy(groupBy)
        .where(whereFilter)
    |window()
        .period(period)
        .every(every)
        .align()
    |sum('value')
        .as('value')

var trigger = data
    |alert()
        .crit(lambda: "value" > crit)
        .stateChangesOnly()
        .log('/tmp/kapacitor/fbb.log')
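One way to apply the value=0 idea without a second task might be to add a second branch to the same task. The sketch below is untested and rests on a few assumptions: that Kapacitor's `stats` node keeps emitting on its wall-clock interval even when no points arrive (the behaviour `deadman` is built on); that the job writes `value` as a float, so the heartbeat writes `0.0` to avoid an InfluxDB field-type conflict; and that the task receives its data through a subscription to `mydb`, so the heartbeat written back to InfluxDB re-enters this task's `from()`. The hard-coded `name=app1` heartbeat tag is hypothetical and only there to match the `where` filter.

```tickscript
// Hypothetical, untested sketch: the original alert plus a heartbeat branch
// in the SAME task. The heartbeat writes value=0.0 every hour; zeros never
// change the 6h sum, but they give the window a point to act on.
var db          = 'mydb'
var rp          = 'autogen'
var measurement = 'app.foobarbaz.failures'
var groupBy     = []
var whereFilter = lambda: ("name" == 'app1')
var period      = 6h
var every       = 1h
var crit        = 5

// Shared source: real failure points plus the heartbeat points written below
var source = stream
    |from()
        .database(db)
        .retentionPolicy(rp)
        .measurement(measurement)
        .groupBy(groupBy)
        .where(whereFilter)

// Branch 1: the original 6h rolling sum and alert, unchanged
source
    |window()
        .period(period)
        .every(every)
        .align()
    |sum('value')
        .as('value')
    |alert()
        .crit(lambda: "value" > crit)
        .stateChangesOnly()
        .log('/tmp/kapacitor/fbb.log')

// Branch 2: stats emits on a wall-clock ticker whether or not data arrives.
// eval drops the stats fields and keeps only value=0.0 (float, to match
// the job's field type), which is written back to the watched measurement.
source
    |stats(every)
        .align()
    |eval(lambda: 0.0)
        .as('value')
    |influxDBOut()
        .database(db)
        .retentionPolicy(rp)
        .measurement(measurement)
        // hypothetical static tag so the heartbeat passes whereFilter
        .tag('name', 'app1')
```

With the sample data, the last failure on 01-01-2018 is at 13:00. If the heartbeat arrives hourly as assumed, the 6-hour sum should fall to 4 or less once the 10:30 to 12:30 points leave the window (around 19:00), so the alert should return to OK then instead of staying CRIT until the next failure at 01:30 on 01-02. Injecting zeros unconditionally is simpler than injecting them only when the job goes quiet, because a zero can never push a sum over the threshold.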