Hi,
I have an alert defined in a TICK script. The script alerts when the number of failures is above a threshold.
The script alerts on failures correctly. However, I found that the alert never returns to OK once the failures stop.
The alert script uses a query similar to the one below.
SELECT count(responsetime) FROM transactions_records WHERE responsecode = 123 AND $timeFilter
So when no failures are happening, this query will not return any data and the TICK script pipeline will not execute.
This leaves the alert active for a long time.
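A minimal sketch of the kind of alert described above may make the problem easier to see. The database, retention policy, period, and threshold here are illustrative assumptions, not the actual script.

```
// Illustrative sketch only: db/rp, period, and threshold are assumed values.
var threshold = 10

batch
    |query('SELECT count(responsetime) FROM "telegraf"."two_months"."transactions_records" WHERE responsecode = 123')
        .period(5m)
        .every(1m)
    |alert()
        // count() without an alias produces a field named "count".
        .crit(lambda: "count" > threshold)
        .stateChangesOnly()
```

When no rows match the WHERE clause, the query returns an empty result, so no data reaches the alert node and the alert level is not re-evaluated.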
To fix this, I saved the alert details (alert level, alert id, etc.) into InfluxDB so that they can be used to reset the alert.
I then wrote a new TICK script, shown below.
//template_id alerts_app_alert_reset
var period = 1d
var offset = 30m
var groupBy = ['id_tag', 'env', 'qcinstance', 'host', 'alert_channel', 'alert_name', 'alert_type']
var db = 'telegraf'
var retention = 'two_months'
var topic = 'qcalerts_withoutok'
var alertmessage = 'Alert Manually reset'
var data = batch
    |query('SELECT last(level) AS last_level, merchant, dashboard, responsecode, txntype FROM ' + db + '.' + retention + '.activealerts WHERE alert_type=\'app\'')
        .period(period)
        .every(1m)
        .groupBy(groupBy)
        .offset(offset)
    |where(lambda: "last_level" != 'OK')
    |alert()
        .crit(lambda: "last_level" == 'OK')
        .warn(lambda: "last_level" == 'OK')
        .stateChangesOnly()
        .message(alertmessage)
        .topic(topic)
        .id('{{index .Tags "id_tag"}}')
        .idTag('id_tag')
        .idField('id')
        .levelTag('level_tag')
        .levelField('level')
        .durationField('duration')
    |delete()
        .field('count')
        .field('last_level')
    |influxDBOut()
        .database('telegraf')
        .retentionPolicy('two_months')
        .measurement('activealerts')
What I'm not sure about is this: if I use the original alert's id as the alert id in this script, will it reset the original alert?
Even though this TICK script sets the alert to the OK level, when a new CRITICAL/WARNING alert is generated its duration is not 0.
It seems the alert reset by the above TICK script is treated as a different alert by Kapacitor.
Can anyone help me identify the unique identifier that I should use across scripts to reset an alert? Is it the alert id?
Thanks,
Robert