Bumbling idiots need help with first TickScript: Alert on more than 5 5xx errors

kapacitor
#1

Hey guys,

We’re trying to write our first TICKscript, and we’re getting an alert every 1m with nonsensical values. We’re being defeated by what seems to be the simplest of tasks.

batch
|query('SELECT "http_response.5xx" as "internal_server_errors" FROM "telegraf"."autogen"."haproxy"')
    .period(1m)
    .every(1m)
    .align()
|count('internal_server_errors')
    .as('count_internal_server_errors')
|alert()
    .message('More than 5 5xx errors in the last 1m')
    .crit(lambda: "count_internal_server_errors" > 5)

First, we’re certain we’re not getting more than 5 5xx errors; according to the canned haproxy dashboard we’re getting 0. When the email goes out, it gives a nonsensical value like "count_internal_server_errors":133 or 119.

A shove in the correct direction would be greatly appreciated! We will definitely be open sourcing a lot of these if we can figure out how to do them.

#2

After kicking ourselves for a while, we realized that http_response.5xx is already a literal count… so counting the counts isn’t going to help much.
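
For anyone landing here with the same symptom: count() returns the number of points in the window, not the value stored in them, so values like 133 or 119 are most likely just how many haproxy points arrived during that minute. A minimal sketch of one alternative follows, assuming http_response.5xx is a counter that only grows until haproxy restarts (that is an assumption, so check it against your own data). It asks InfluxDB for the spread (max minus min) of the counter in each 1m window. The field alias new_errors and the groupBy(*) are illustrative choices, not something taken from the original script.

```
// Sketch only: assumes "http_response.5xx" is a cumulative counter.
batch
    |query('SELECT spread("http_response.5xx") AS "new_errors" FROM "telegraf"."autogen"."haproxy"')
        .period(1m)
        .every(1m)
        // Keep every series (proxy, server, ...) separate so that
        // counters from different series are never compared.
        .groupBy(*)
    |alert()
        .message('More than 5 5xx errors in the last 1m')
        .crit(lambda: "new_errors" > 5)
```

Two caveats: a counter reset inside the window inflates the spread, and the spread only covers the time between the first and last point in the window, so it can slightly undercount.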

With that realization in mind, we used the alert rules to try to generate a proper TICKscript using the relative alert type.

This works… too well. Unfortunately, it emails a lot. I think it sends us an email for every 5xx error that comes in during the 1m period. We tried adding 1m to stateChangesOnly, but that didn’t help. Here’s the script:

var db = 'telegraf'

var rp = 'autogen'

var measurement = 'haproxy'

var groupBy = []

var whereFilter = lambda: ("proxy" == 'xxx')

var name = 'HaProxyAlertOn5xx-xxx'

var idVar = name

var message = ''

var idTag = 'alertID'

var levelTag = 'level'

var messageField = 'message'

var durationField = 'duration'

var outputDB = 'chronograf'

var outputRP = 'autogen'

var outputMeasurement = 'alerts'

var triggerType = 'relative'

var details = 'The previous 1m period contained more than 4 more 5xx errors than the preceding 1m period '

var shift = 1m

var crit = 4

var data = stream
    |from()
        .database(db)
        .retentionPolicy(rp)
        .measurement(measurement)
        .groupBy(groupBy)
        .where(whereFilter)
    |eval(lambda: "http_response.5xx")
        .as('value')

var past = data
    |shift(shift)

var current = data

var trigger = past
    |join(current)
        .as('past', 'current')
    |eval(lambda: float("current.value" - "past.value"))
        .keep()
        .as('value')
    |alert()
        .crit(lambda: "value" > crit)
        .message(message)
        .id(idVar)
        .idTag(idTag)
        .levelTag(levelTag)
        .messageField(messageField)
        .durationField(durationField)
        .details(details)
        .stateChangesOnly()
        .email()

trigger
    |eval(lambda: float("value"))
        .as('value')
        .keep()
    |influxDBOut()
        .create()
        .database(outputDB)
        .retentionPolicy(outputRP)
        .measurement(outputMeasurement)
        .tag('alertName', name)
        .tag('triggerType', triggerType)

trigger
    |httpOut('output')
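
A note on why a script like the one above can be so chatty, with a hedged sketch of one way to calm it down. In a stream task the alert node evaluates every point that comes out of the join, so the level can flip between OK and CRITICAL many times a minute, and stateChangesOnly() lets every one of those flips through. As far as the Kapacitor AlertNode documentation describes it, passing a duration such as stateChangesOnly(1m) does not add a quiet period; it re-sends the alert if that long has passed without a state change. The sketch below assumes the counter behaves as described in this thread and evaluates the alert once per 1m window instead of once per point. The use of groupBy(*) and noRecoveries() is an illustrative choice.

```
// Sketch only: names and the threshold follow the script above.
var data = stream
    |from()
        .database('telegraf')
        .retentionPolicy('autogen')
        .measurement('haproxy')
        .groupBy(*)
        .where(lambda: "proxy" == 'xxx')
    // Collect one minute of points, then emit them as a single batch.
    |window()
        .period(1m)
        .every(1m)
    // max - min of the counter in the window, one point per minute.
    |spread('http_response.5xx')
        .as('value')

data
    |alert()
        .crit(lambda: "value" > 4)
        // Only notify when the level changes...
        .stateChangesOnly()
        // ...and skip the "back to OK" emails.
        .noRecoveries()
        .email()
```

With this shape the alert can change state at most once per minute per series, which puts an upper bound on the number of emails.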

#3

Still no luck; any help would be appreciated.

var name = 'HaProxyAlertOn5xx-appname'

var idVar = name

var message = ''

var idTag = 'alertID'

var levelTag = 'level'

var messageField = 'message'

var durationField = 'duration'

var outputDB = 'chronograf'

var outputRP = 'autogen'

var outputMeasurement = 'alerts'

var triggerType = 'relative'

var details = 'The previous 1m period contained more than 4 more 5xx errors than the preceding 1m period '

var shift = 1m

var crit = 4

var current = batch
    |query('SELECT last("http_response.4xx") as value FROM "telegraf"."autogen"."haproxy" WHERE "proxy"=\'https-in\'')
        .every(1m)

var past = batch
    |query('SELECT "http_response.4xx" as value FROM "telegraf"."autogen"."haproxy" WHERE time > now() - 2m AND "proxy"=\'https-in\' order by time asc limit 1')
        .every(1m)

var trigger = past
    |join(current)
        .as('past', 'current')
    |eval(lambda: float("current.value" - "past.value"))
        .keep()
        .as('value')
    |alert()
        .crit(lambda: "value" > crit)
        .message(message)
        .id(idVar)
        .email()

trigger
    |eval(lambda: float("value"))
        .as('value')
        .keep()
    |influxDBOut()
        .create()
        .database(outputDB)
        .retentionPolicy(outputRP)
        .measurement(outputMeasurement)
        .tag('alertName', name)
        .tag('triggerType', triggerType)

trigger
    |httpOut('output')

#4

Hi,

Try this last approach with shift and offset; that combo should do the job!


var past = batch
    |query('SELECT "http_response.4xx" as value FROM "telegraf"."autogen"."haproxy" WHERE "proxy"=\'https-in\'')
        .every(1m)
        .offset(1m)

var trigger = past
    |shift(1m)
    |join(current)
        .as('past', 'current')

This way, you take the previous data, and align it with the current one.
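
Put together with the current query from the earlier post, the idea could look roughly like the sketch below. It is an untested outline: the period(1m) and align() calls, the last() selector on the past query, and the |log() node are assumptions added so the two branches query comparable windows and the joined result can be inspected. Whether the timestamps of the two branches line up exactly depends on your data.

```
// Sketch only: "current" looks at the latest minute.
var current = batch
    |query('SELECT last("http_response.4xx") AS value FROM "telegraf"."autogen"."haproxy" WHERE "proxy"=\'https-in\'')
        .period(1m)
        .every(1m)
        .align()

// "past" runs the same query one minute further back (offset),
// then moves its timestamps forward by one minute (shift) so
// they can meet the current points in the join.
var past = batch
    |query('SELECT last("http_response.4xx") AS value FROM "telegraf"."autogen"."haproxy" WHERE "proxy"=\'https-in\'')
        .period(1m)
        .every(1m)
        .align()
        .offset(1m)
    |shift(1m)

past
    |join(current)
        .as('past', 'current')
    // Growth of the counter between the two windows.
    |eval(lambda: float("current.value" - "past.value"))
        .as('value')
        .keep()
    // Write the joined points to the Kapacitor log for inspection.
    |log()
```

If nothing appears in the log after the join, the two branches are probably not producing matching timestamps, and that is the first thing to check.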

Try this and upload the logs (include a log after the join) to see how it’s doing its job!
