Process down for X minutes - HOW?!

telegraf
kapacitor

#1

Hey,
I have been trying to solve this problem for two days.
All I want is to create a template that alerts if a process is down for X minutes or more!
This is my latest try:

var db = 'telegraf'

var rp = 'autogen'

var measurement string

var groupBy = []

var whereFilter = lambda: TRUE

var period = 5m

var interval = 1m

var name string

var idVar = name

var message = '{{.ID}} is {{.Level}} value:  {{ index .Tags "host" }}'

var idTag = 'alertID'

var levelTag = 'level'

var messageField = 'message'

var durationField = 'duration'

var topic string

var threshold = 0.0

//var outputDB = 'chronograf'

//var outputRP = 'autogen'

//var outputMeasurement = 'alerts'

//var triggerType = 'deadman'



var data = stream
    |from()
        .database(db)
        .retentionPolicy(rp)
        .measurement(measurement)
        .groupBy(groupBy)
        .where(whereFilter)
    |window()
        .period(period)
        .every(interval)
    |mean('memory_rss')

var trigger = data
    |deadman(threshold, interval)
        .message(message)
        .id(idVar)
        .idTag(idTag)
        .levelTag(levelTag)
        .messageField(messageField)
        .durationField(durationField)
        .stateChangesOnly()
        .topic(topic)

trigger
    |eval(lambda: "emitted")
        .as('value')
        .keep('value', messageField, durationField)
    |eval(lambda: float("value"))
        .as('value')
        .keep()

Please help me, I am so frustrated with that deadman node.


#2

Hi @gaizeror

Would the stateDuration node help to achieve what you want, if you write the deadman logic out yourself?

Something like this?

// Deadman switch
// Monitor SQL Prod for activity. If none detected then Deadman switch alert.
var deadman = data
    |stats(10m)
        .align()
    |derivative('emitted')
        .unit(10m)
        .nonNegative()
    |stateDuration(lambda: "emitted" <= 0.0)
        .unit(1m)
        .as('DeadDuration')
    |alert()
        .warn(lambda: "DeadDuration" >= 3)
        .stateChangesOnly()
        .details(details) // NOTE: 'details' must be defined as a var earlier in the script
        .message(message)
        .id(idVar)
        .idTag(idTag)
        .levelTag(levelTag)
        .messageField(messageField)
        .durationField(durationField)
        .email('you@youremail.com')

That should count how long "emitted" has been <= 0.0; then, in the warn lambda, specify the duration you want. I do something similar on a SQL server I monitor: if the state duration is over 3 minutes, it alerts me.
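For example, the 3-minute limit could be pulled out into a variable. This is only a sketch: deadMinutes is a hypothetical name, and it works because stateDuration with .unit(1m) emits the duration as a float number of minutes:

var deadMinutes = 3.0

// ... same stats/derivative/stateDuration chain as above, then:
    |alert()
        // fires once the process has emitted nothing for deadMinutes minutes
        .warn(lambda: "DeadDuration" >= deadMinutes)
        .stateChangesOnly()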

Does that help? If not, then I've misunderstood the issue, in which case, apologies.

PhilB

EDIT: I haven't tested the above; it was adapted from a script of mine that uses stateCount. The rest of that script monitors disk latency; I included the deadman switch in the same script.


#3

Hi @philb, thanks for your answer. Why aren't you using the deadman node? I can see your variable is named deadman, but the deadman node itself is never used.


#4

Hi @gaizeror ,

The script above was originally written with the stateCount node. I used stateCount because sometimes I would get deadman alerts while the service/app was still running; only a data point was missing. By counting the state I was able to wait until I had 5 consecutive "criticals" before triggering the alert.
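A sketch of that stateCount variant (untested here; DeadCount is just an illustrative alias) would swap the stateDuration node for stateCount and alert on consecutive matches instead of elapsed time:

// ... same stats/derivative chain as above, then:
    |stateCount(lambda: "emitted" <= 0.0)
        .as('DeadCount')
    |alert()
        // 5 consecutive windows with no emitted points
        .crit(lambda: "DeadCount" >= 5)
        .stateChangesOnly()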

I just swapped the stateCount node for the stateDuration node to post here.

If I recall correctly, the deadman node and the stateCount node would not work together, so the alternative was to write the deadman part of the script using the stats and derivative nodes as above, which I believe is what the deadman node itself is based on. You can see here: Deadman node
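For reference, the Kapacitor documentation describes deadman(threshold, interval) as shorthand for writing the stats/derivative/alert chain out by hand, roughly like this (a sketch, not copied verbatim from the docs):

// data|deadman(threshold, interval) is approximately:
data
    |stats(interval)
        .align()
    |derivative('emitted')
        .unit(interval)
        .nonNegative()
    |alert()
        // point count per interval fell to or below the threshold
        .crit(lambda: "emitted" <= threshold)

Writing it out yourself is what lets you chain in stateCount or stateDuration before the alert node.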

I was then able to use the duration value to generate the alerts. Another reason to separate them, for me, was that I already use the CRIT and WARN levels in the full script to monitor disk latency. That only left me with the INFO level, which I didn't like; doing it this way I could assign WARN and CRIT levels to both alert nodes.

I suppose in reality I don't actually use the deadman node as intended. Instead of alerting on the actual throughput (or lack of), I alert on the number of times the deadman alert would trigger in a set time frame.

I hope that makes sense; I tend to ramble a little when I get going.

EDIT: I just read your code again, having seen the topic on Stack Overflow. I notice you are using {{ index .Tags "host" }}; this won't work unless you change your groupBy variable to

var groupBy = ['host']

In all honesty, that might actually fix the issue you were having. Either way, your tag values will be empty unless you group by them.
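To make that concrete, in your script the change lands in the from() node, since that is where the grouping is applied (same variables as in your original script):

var groupBy = ['host']

var data = stream
    |from()
        .database(db)
        .retentionPolicy(rp)
        .measurement(measurement)
        // grouping by 'host' attaches the tag to each group,
        // so {{ index .Tags "host" }} resolves in the alert message
        .groupBy(groupBy)
        .where(whereFilter)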


#5

@philb Thanks, I will try that.
And I have added groupBy since then, thanks!