TICK Alert frequency issues

philb · October 6, 2017, 10:46am

Hello,

I have written a TICKscript to check free disk space on my hosts. It uses a window node to collect an hour's worth of data every 10 minutes. The problem is that the window appears to ignore these settings, and I get an alert every minute. I've been getting alerts every minute since 11am.

TICK script:

var db = 'telegraf'
var rp = 'autogen'
var measurement = 'win_disk'
var name = 'Stream Free Disk Space2'
var groupBy = ['host']
var whereFilter = lambda: TRUE //("Percent_Free_Space" < crit)
var outputDB = 'chronograf'
var outputRP = 'autogen'
var outputMeasurement = 'alerts'
//var period = 60m
//var every = 10m
var crit = 95

var diskSpace = stream
    |from()
        .database(db)
        .measurement(measurement)
        .groupBy(groupBy)
    |window()
        .period(60m)
        .every(10m)
        .align()
    |mean('Percent_Free_Space')
        .as('Free_Space_Mean')
    |eval(lambda: "Free_Space_Mean")
        .as('value')

var trigger = diskSpace
    |alert()
        .all()
        .crit(lambda: "value" < crit)
        .message('A Message')
        .email('myself@myemail.co.uk')
        .stateChangesOnly(10m)
        .noRecoveries()

I’ve tried setting the times via variables, but that throws a ‘name not in scope’ error even though they are defined at the top (I’ve since commented them out). I’ve also tried putting the values directly into .period and .every.

Nothing seems to work, though. Have I missed something?

Can anyone point me in the right direction?

PhilB

nathaniel · October 6, 2017, 3:31pm

Your script looks right. Can you share the output of kapacitor show TASK? That might give us a hint as to what's going on.
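
On the ‘name not in scope’ error: durations can generally be passed to window() through variables, provided the var declarations are active (not commented out) and appear before the node that uses them. The following is a minimal sketch under that assumption, reusing the database, measurement, and field names from the script above. It is not a confirmed fix for the over-alerting itself.

```tickscript
// Duration literals can be assigned to variables like any other value.
// These must be declared (and not commented out) before they are used.
var period = 60m
var every = 10m

stream
    |from()
        .database('telegraf')
        .measurement('win_disk')
        .groupBy('host')
    |window()
        // Each emitted window holds `period` of data and is emitted every `every`
        .period(period)
        .every(every)
        .align()
    |mean('Percent_Free_Space')
        .as('value')
```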

dave.bender · June 5, 2018, 9:32pm

Can you share what the solution was? We have an over-alerting problem, too.

philb · June 6, 2018, 11:09am

Hi Dave,

In the end I added a state count, but this was also noisy. Ultimately I went with state duration and state changes only. I also tried using a batch script but needed more real-time results.

My current iteration is as follows:
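
For reference, a state-count approach can be sketched roughly as below. This is an illustration only, not the script that was actually tried: the threshold of 10 consecutive points, the field name CritCount, and the log path are all assumptions. stateCount counts consecutive points for which the expression is true and resets to -1 when it is false, so how long the condition has to hold depends on how often points arrive.

```tickscript
var crit = 15

stream
    |from()
        .measurement('win_disk')
        .groupBy('host', 'instance')
    // Number of consecutive points below the threshold; -1 when the point is not in the state
    |stateCount(lambda: "Percent_Free_Space" < crit)
        .as('CritCount')
    |alert()
        // Hypothetical threshold: 10 consecutive points below crit
        .crit(lambda: "CritCount" >= 10)
        .stateChangesOnly()
        // Hypothetical handler and path, used here only to keep the sketch self-contained
        .log('/tmp/disk_alerts.log')
```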

var db = 'database'
var rp = 'retention'
var measurement = 'win_disk'
var groupBy = ['host', 'instance']
// Parentheses group the OR so the isPresent check applies to both drives
var whereFilter = lambda: isPresent("Percent_Free_Space") AND ("instance" == 'C:' OR "instance" == 'D:')

var name = 'Windows Disk Space'
var idVar = name + ':{{.Group}}'
var message = '{{ if eq .Level "CRITICAL" }} {{ .Level }} - {{.TaskName}} - {{ index .Fields "Percent_Free_Space" | printf "%0.2f"}}%   {{else if eq .Level "INFO"}} {{.Level}} - {{.TaskName}} - Deadman Switch Triggered! {{ else if eq .Level "WARNING" }}{{ .Level }} - {{.TaskName}} - {{ index .Fields "Percent_Free_Space" | printf "%0.2f"}}%  {{else}} {{.TaskName}} Back to normal! {{end}} '
var idTag = 'alertID'
var levelTag = 'level'
var messageField = 'message'
var durationField = 'duration'
var outputDB = 'chronograf'
var outputRP = 'autogen'
var outputMeasurement = 'alerts'
var triggerType = 'threshold'

var warn = 20
var crit = 15
var multiple = 1024
var data = stream
    |from()
        .database(db)
        .retentionPolicy(rp)
        .measurement(measurement)
        .groupBy(groupBy)
        .where(whereFilter)
 

    |stateDuration(lambda: "Percent_Free_Space" < crit )
        .unit(1m)
        .as('CritDuration')
    
    |stateDuration(lambda: "Percent_Free_Space" < warn)
        .unit(1m)
        .as('WarnDuration')
    //|log()
var trigger = data
    |alert()
        //state duration crit
        .crit(lambda: "CritDuration" > 10)
        //state duration warning
        .warn(lambda: "WarnDuration" > 10)
        .stateChangesOnly()
        .message(message)
        .id(idVar)
        .idTag(idTag)
        .levelTag(levelTag)
        .messageField(messageField)
        .durationField(durationField)
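        // `details` is not defined in the script as posted; declare it with the other vars or remove this line
        .details(details)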
        .email('email@address.com')

trigger
    |influxDBOut()
        .create()
        .database(outputDB)
        .retentionPolicy(outputRP)
        .measurement(outputMeasurement)
        .tag('alertName', name)
        .tag('triggerType', triggerType)

trigger
    |httpOut('output')

You should only get an alert if it has been in the given state for more than 10 minutes. I found that one of the problems I was having was that some of my drives would dip between normal, warning, and critical, generating a lot of alerts.

I added .noFlapping but in the end went to state duration.

You’ll need to use a state duration node for each alert level though.

Hope that helps
