Hey guys,
We’re trying to write our first TICKscript and we’re getting an alert every 1m with nonsensical values. We’re being defeated by what seems to be the simplest of tasks.
batch
    |query('SELECT "http_response.5xx" as "internal_server_errors" FROM "telegraf"."autogen"."haproxy"')
        .period(1m)
        .every(1m)
        .align()
    |count('internal_server_errors')
        .as('count_internal_server_errors')
    |alert()
        .message('More than 5 5xx errors in the last 1m')
        .crit(lambda: "count_internal_server_errors" > 5)
First, we’re certain we’re not getting more than 5 5xxs; we’re getting 0 according to the canned haproxy dashboard. When the email goes out, it gives a nonsensical value like "count_internal_server_errors":133 or 119.
A shove in the correct direction would be greatly appreciated! We will definitely be open sourcing a lot of these if we can figure out how to do them.
After kicking ourselves for a while, we realized that http_response.5xx is a literal count, so counting the counts isn’t going to help you much: count() only returns the number of points in the window, not the number of errors.
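For anyone landing here with the same problem, here is a minimal sketch of alerting on the increase of the counter instead of the number of points. It assumes http_response.5xx is a running total that only goes up, and it uses InfluxQL's spread() (max minus min) as a stand-in for "new errors in the window". The field name new_5xx, the groupBy(*) and the threshold are illustrative assumptions, not something we have verified against this data.

```
// Sketch only: assumes "http_response.5xx" is a monotonically increasing counter.
// spread() = max - min over the 1m window, i.e. how much the counter grew.
batch
    |query('SELECT spread("http_response.5xx") AS "new_5xx" FROM "telegraf"."autogen"."haproxy"')
        .period(1m)
        .every(1m)
        // Group by all tags so each series (counter) is measured on its own.
        .groupBy(*)
    |alert()
        .message('{{ index .Fields "new_5xx" }} new 5xx errors in the last 1m')
        .crit(lambda: "new_5xx" > 5)
        .stateChangesOnly()
```

One caveat with this approach: if the counter resets (for example after an haproxy restart), spread() will report a large jump for that window.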
With that realization in mind, we used the alert rules to try and generate a proper TICKscript using the relative alert type.
This works… too well. Unfortunately, it emails a lot. I think it sends us an email for every 5xx error that comes in during the 1m period. We tried adding 1m to stateChangesOnly but that didn’t help. Here’s the script:
var db = 'telegraf'
var rp = 'autogen'
var measurement = 'haproxy'
var groupBy = []
var whereFilter = lambda: ("proxy" == 'xxx')
var name = 'HaProxyAlertOn5xx-xxx'
var idVar = name
var message = ''
var idTag = 'alertID'
var levelTag = 'level'
var messageField = 'message'
var durationField = 'duration'
var outputDB = 'chronograf'
var outputRP = 'autogen'
var outputMeasurement = 'alerts'
var triggerType = 'relative'
var details = 'The current 1m period contained more than 4 more 5xx errors than the preceding 1m period'
var shift = 1m
var crit = 4
var data = stream
    |from()
        .database(db)
        .retentionPolicy(rp)
        .measurement(measurement)
        .groupBy(groupBy)
        .where(whereFilter)
    |eval(lambda: "http_response.5xx")
        .as('value')

var past = data
    |shift(shift)

var current = data

var trigger = past
    |join(current)
        .as('past', 'current')
    |eval(lambda: float("current.value" - "past.value"))
        .keep()
        .as('value')
    |alert()
        .crit(lambda: "value" > crit)
        .message(message)
        .id(idVar)
        .idTag(idTag)
        .levelTag(levelTag)
        .messageField(messageField)
        .durationField(durationField)
        .details(details)
        .stateChangesOnly()
        .email()

trigger
    |eval(lambda: float("value"))
        .as('value')
        .keep()
    |influxDBOut()
        .create()
        .database(outputDB)
        .retentionPolicy(outputRP)
        .measurement(outputMeasurement)
        .tag('alertName', name)
        .tag('triggerType', triggerType)

trigger
    |httpOut('output')
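A note on the email volume, with a hedged sketch. As far as I understand the Kapacitor docs, the duration passed to stateChangesOnly() is a maximum interval: an unchanged state is re-sent once that much time has passed, so stateChangesOnly(1m) can add emails rather than remove them. If the level is bouncing between OK and CRITICAL (an assumption, we have not confirmed this from logs), every flip also sends a mail. The variation below drops recovery mails and adds flapping detection; the 0.25 and 0.5 thresholds are illustrative, not tuned values.

```
// Sketch only: same relative comparison as above, with a quieter alert node.
var crit = 4

var data = stream
    |from()
        .database('telegraf')
        .retentionPolicy('autogen')
        .measurement('haproxy')
        .where(lambda: "proxy" == 'xxx')
    |eval(lambda: "http_response.5xx")
        .as('value')

var past = data
    |shift(1m)

past
    |join(data)
        .as('past', 'current')
    |eval(lambda: float("current.value" - "past.value"))
        .as('value')
    |alert()
        .id('HaProxyAlertOn5xx-xxx')
        .crit(lambda: "value" > crit)
        // No duration argument: only send when the level actually changes.
        .stateChangesOnly()
        // Do not send a mail when the alert returns to OK.
        .noRecoveries()
        // Suppress alerts while the level is changing too often (illustrative thresholds).
        .flapping(0.25, 0.5)
        .email()
```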
Still no luck; any help appreciated. Here’s our latest attempt:
var name = 'HaProxyAlertOn5xx-appname'
var idVar = name
var message = ''
var idTag = 'alertID'
var levelTag = 'level'
var messageField = 'message'
var durationField = 'duration'
var outputDB = 'chronograf'
var outputRP = 'autogen'
var outputMeasurement = 'alerts'
var triggerType = 'relative'
var details = 'The current 1m period contained more than 4 more 5xx errors than the preceding 1m period'
var shift = 1m
var crit = 4
var current = batch
    |query('SELECT last("http_response.4xx") as value FROM "telegraf"."autogen"."haproxy" WHERE "proxy"=\'https-in\'')
        .every(1m)

var past = batch
    |query('SELECT "http_response.4xx" as value FROM "telegraf"."autogen"."haproxy" WHERE time > now() - 2m AND "proxy"=\'https-in\' order by time asc limit 1')
        .every(1m)

var trigger = past
    |join(current)
        .as('past', 'current')
    |eval(lambda: float("current.value" - "past.value"))
        .keep()
        .as('value')
    |alert()
        .crit(lambda: "value" > crit)
        .message(message)
        .id(idVar)
        .email()

trigger
    |eval(lambda: float("value"))
        .as('value')
        .keep()
    |influxDBOut()
        .create()
        .database(outputDB)
        .retentionPolicy(outputRP)
        .measurement(outputMeasurement)
        .tag('alertName', name)
        .tag('triggerType', triggerType)

trigger
    |httpOut('output')
Hi,
Try this last approach with shift and offset; that combo should do the job!
…
var past = batch
    |query('SELECT "http_response.4xx" as value FROM "telegraf"."autogen"."haproxy" WHERE "proxy"=\'https-in\'')
        .every(1m)
        .offset(1m)

var trigger = past
    |shift(1m)
    |join(current)
        .as('past', 'current')
…
This way, you take the previous data, and align it with the current one.
Try this and upload the logs (include a log after the join) to see how it's doing its job!
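In case it helps, here is a rough sketch of how the pieces could fit together, with a log node placed right after the join. The .period(1m) on both queries, the use of last() in the past query, and the 'after-join' prefix are my assumptions and not part of the snippet above; whether the timestamps of the two batches line up exactly depends on the query, which is what the log should show.

```
// Sketch only: .period(1m), last() and the log prefix are assumptions.
var current = batch
    |query('SELECT last("http_response.4xx") AS value FROM "telegraf"."autogen"."haproxy" WHERE "proxy"=\'https-in\'')
        .period(1m)
        .every(1m)

// Query the window that ended 1m ago (offset), then move its
// timestamps forward by 1m (shift) so they match the current window.
var past = batch
    |query('SELECT last("http_response.4xx") AS value FROM "telegraf"."autogen"."haproxy" WHERE "proxy"=\'https-in\'')
        .period(1m)
        .every(1m)
        .offset(1m)
    |shift(1m)

past
    |join(current)
        .as('past', 'current')
    // Write every joined point to the Kapacitor log for inspection.
    |log()
        .prefix('after-join')
    |eval(lambda: float("current.value" - "past.value"))
        .as('value')
    |httpOut('output')
```

If nothing shows up in the log after the join, the two sides are probably not matching on time, and that is the first thing to look at.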