TICK Alert frequency issues

Hello,

I have written a script to check free disk space on my hosts and have set it to get an hour's worth of data every 10 minutes in a window. The problem is that the window appears to be ignoring these time frames and alerts me every minute. I’ve been getting alerts every minute since 11am.

TICK script:

var db = 'telegraf'
var rp = 'autogen'
var measurement = 'win_disk'
var name = 'Stream Free Disk Space2'
var groupBy = ['host']
var whereFilter = lambda: TRUE //("Percent_Free_Space" < crit)
var outputDB = 'chronograf'
var outputRP = 'autogen'
var outputMeasurement = 'alerts'
//var period = 60m
//var every = 10m
var crit = 95

var diskSpace = stream
    |from()
        .database(db)
        .measurement(measurement)
        .groupBy(groupBy)
    |window()
        .period(60m)
        .every(10m)
        .align()
    |mean('Percent_Free_Space')
        .as('Free_Space_Mean')
    |eval(lambda: "Free_Space_Mean")
        .as('value')

var trigger = diskSpace
    |alert()
        .all()
        .crit(lambda: "value" < crit)
        .message('A Message')
        .email('myself@myemail.co.uk')
        .stateChangesOnly(10m)
        .noRecoveries()

I’ve tried adding the times using variables, but that throws a ‘name not in scope’ error even though they are defined up top (I’ve commented them out since), and I’ve also tried adding them directly into .period and .every.
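For reference, here is a minimal sketch of the variable form described above, assuming the two declarations are uncommented and sit above the pipeline that uses them. The database, retention policy and measurement names are taken from the script above; nothing here is confirmed as the cause of the ‘name not in scope’ error.

```
// Durations declared as variables; they must be declared (not commented out)
// before the node that references them.
var period = 60m
var every = 10m

var diskSpace = stream
    |from()
        .database('telegraf')
        .retentionPolicy('autogen')
        .measurement('win_disk')
        .groupBy('host')
    |window()
        // Variables are passed without quotes; a quoted 'period' would be a string.
        .period(period)
        .every(every)
        .align()
    |mean('Percent_Free_Space')
        .as('value')
```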

Nothing seems to work though. Have I missed something?

Can anyone point me in the right direction?

PhilB

Your script looks right. Can you share the output of kapacitor show TASK? That might give us a hint as to what's going on.

Can you share what the solution was? We have an over-alerting problem, too.

Hi Dave,

In the end I added a state count, but this was also noisy. Ultimately I went with state duration and state changes only. I also tried using a batch script but needed more real-time results.

My current iteration is as follows:

var db = 'database'

var rp = 'retention'

var measurement = 'win_disk'

var groupBy = ['host', 'instance']

var whereFilter = lambda: isPresent("Percent_Free_Space") AND ("instance" == 'C:' OR "instance" == 'D:')

var name = 'Windows Disk Space'

var idVar = name + ':{{.Group}}'

var message = '{{ if eq .Level "CRITICAL" }} {{ .Level }} - {{.TaskName}} - {{ index .Fields "Percent_Free_Space" | printf "%0.2f"}}%   {{else if eq .Level "INFO"}} {{.Level}} - {{.TaskName}} - Deadman Switch Triggered! {{ else if eq .Level "WARNING" }}{{ .Level }} - {{.TaskName}} - {{ index .Fields "Percent_Free_Space" | printf "%0.2f"}}%  {{else}} {{.TaskName}} Back to normal! {{end}} '

var idTag = 'alertID'

var levelTag = 'level'

var messageField = 'message'

var durationField = 'duration'

var outputDB = 'chronograf'

var outputRP = 'autogen'

var outputMeasurement = 'alerts'

var triggerType = 'threshold'

var warn = 20
var crit = 15
var multiple = 1024
var data = stream
    |from()
        .database(db)
        .retentionPolicy(rp)
        .measurement(measurement)
        .groupBy(groupBy)
        .where(whereFilter)
 

    |stateDuration(lambda: "Percent_Free_Space" < crit )
        .unit(1m)
        .as('CritDuration')
    
    |stateDuration(lambda: "Percent_Free_Space" < warn)
        .unit(1m)
        .as('WarnDuration')
    //|log()
var trigger = data
    |alert()
        //state duration crit
        .crit(lambda: "CritDuration" > 10)
        //state duration warning
        .warn(lambda: "WarnDuration" > 10)
        .stateChangesOnly()
        .message(message)
        .id(idVar)
        .idTag(idTag)
        .levelTag(levelTag)
        .messageField(messageField)
        .durationField(durationField)
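        // NOTE: 'details' is not declared in this snippet; declare it with the other vars or drop this line
        .details(details)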
        .email('email@address.com')

trigger
    |influxDBOut()
        .create()
        .database(outputDB)
        .retentionPolicy(outputRP)
        .measurement(outputMeasurement)
        .tag('alertName', name)
        .tag('triggerType', triggerType)

trigger
    |httpOut('output')

You should only get an alert if it's in the given state for more than 10 minutes. I found one of the problems I was having is that some of my drives would dip between normal/warn/crit, generating a lot of alerts.

I added .noFlapping but in the end went to state duration.
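As a rough sketch of the flapping approach: the Kapacitor alert node documents a .flapping(low, high) property, and I am assuming that is the setting meant here. The 0.25 and 0.5 thresholds are illustrative values rather than tuned ones, and the crit/warn levels are copied from the script above.

```
var warn = 20
var crit = 15

stream
    |from()
        .measurement('win_disk')
        .groupBy('host', 'instance')
    |alert()
        .crit(lambda: "Percent_Free_Space" < crit)
        .warn(lambda: "Percent_Free_Space" < warn)
        // Flap detection based on how often the alert level has changed recently;
        // low/high are fractions of state changes (illustrative values).
        .flapping(0.25, 0.5)
        .email('email@address.com')
```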

You’ll need to use a state duration node for each alert level though.
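For comparison, the state count variant mentioned earlier might look roughly like this. It is a sketch only: the field name 'CritCount' and the threshold of 10 consecutive points are made up for illustration. stateCount counts consecutive points matching the expression, so the time this corresponds to depends on how often Telegraf reports, whereas stateDuration measures elapsed time directly.

```
var crit = 15

stream
    |from()
        .measurement('win_disk')
        .groupBy('host', 'instance')
    // Counts consecutive points where the expression is true; resets to -1 otherwise.
    |stateCount(lambda: "Percent_Free_Space" < crit)
        .as('CritCount')
    |alert()
        // Fire once at least 10 consecutive points have been below the threshold.
        .crit(lambda: "CritCount" >= 10)
        .stateChangesOnly()
        .email('email@address.com')
```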

Hope that helps
