Process down for X minutes - HOW?!

telegraf
kapacitor

#1

Hey,
I have been trying to solve this problem for two days.
All I want is to create a template that alerts if a process is down for X minutes or more!
This is my latest try:

var db = 'telegraf'

var rp = 'autogen'

var measurement string

var groupBy = []

var whereFilter = lambda: TRUE

var period = 5m

var interval = 1m

var name string

var idVar = name

var message = '{{.ID}} is {{.Level}} value:  {{ index .Tags "host" }}'

var idTag = 'alertID'

var levelTag = 'level'

var messageField = 'message'

var durationField = 'duration'

var topic string

var threshold = 0.0

//var outputDB = 'chronograf'

//var outputRP = 'autogen'

//var outputMeasurement = 'alerts'

//var triggerType = 'deadman'



var data = stream
    |from()
        .database(db)
        .retentionPolicy(rp)
        .measurement(measurement)
        .groupBy(groupBy)
        .where(whereFilter)
    |window()
        .period(period)
        .every(interval)
    |mean('memory_rss')

var trigger = data
    |deadman(threshold, interval)
        .message(message)
        .id(idVar)
        .idTag(idTag)
        .levelTag(levelTag)
        .messageField(messageField)
        .durationField(durationField)
        .stateChangesOnly()
        .topic(topic)

trigger
    |eval(lambda: "emitted")
        .as('value')
        .keep('value', messageField, durationField)
    |eval(lambda: float("value"))
        .as('value')
        .keep()

Please help me, I am so frustrated with that deadman node.


#2

Hi @gaizeror

Would the stateDuration node help to achieve what you want, if you write the deadman logic out yourself?

Something like this?

// Deadman switch
// Monitor SQL Prod for activity. If none detected then Deadman switch alert.
var deadman = data
    |stats(10m)
        .align()
    |derivative('emitted')
        .unit(10m)
        .nonNegative()
    |stateDuration(lambda: "emitted" <= 0.0)
        .unit(1m)
        .as('DeadDuration')
    |alert()
        .warn(lambda: "DeadDuration" >= 3)
        .stateChangesOnly()
        .details(details) // NOTE: 'details' must be defined as a var earlier in the script
        .message(message)
        .id(idVar)
        .idTag(idTag)
        .levelTag(levelTag)
        .messageField(messageField)
        .durationField(durationField)
        .email('you@youremail.com')

That should count how long "emitted" has been <= 0.0; then, in the warn lambda, specify the duration you want. I do something similar on a SQL server I monitor: if the state duration is over 3 minutes, it alerts me.
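For example, the 3-minute limit could be pulled out into a variable. This is only a sketch: deadMinutes is a hypothetical name, and it works because stateDuration with .unit(1m) emits the duration as a float number of minutes:

var deadMinutes = 3.0

// ... same stats/derivative/stateDuration chain as above, then:
    |alert()
        // fires once the process has emitted nothing for deadMinutes minutes
        .warn(lambda: "DeadDuration" >= deadMinutes)
        .stateChangesOnly()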

Does that help? If not, then I've misunderstood the issue, in which case, apologies.

PhilB

EDIT: I haven't tested the above; it was adapted from a script of mine that uses stateCount. The rest of that script monitors disk latency; I included the deadman switch in the same script.


#3

Hi @philb, thanks for your answer. Why aren't you using the deadman node? I can see your variable is named deadman, but the deadman node itself is never used.


#4

Hi @gaizeror ,

The script above was originally written with the stateCount node. I used stateCount because sometimes I would get deadman alerts while the service/app was still running; only a data point was missing. By counting the state I was able to wait until I had 5 consecutive "criticals" before triggering the alert.
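A sketch of that stateCount variant (untested here; DeadCount is just an illustrative alias) would swap the stateDuration node for stateCount and alert on consecutive matches instead of elapsed time:

// ... same stats/derivative chain as above, then:
    |stateCount(lambda: "emitted" <= 0.0)
        .as('DeadCount')
    |alert()
        // 5 consecutive windows with no emitted points
        .crit(lambda: "DeadCount" >= 5)
        .stateChangesOnly()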

I just swapped the stateCount node for the stateDuration node to post here.

If I recall correctly, the deadman node and the stateCount node would not work together, so the alternative was to write the deadman part of the script using the stats and derivative nodes as above, which I believe is what the deadman node itself is based on. You can see here: Deadman node
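For reference, the Kapacitor documentation describes deadman(threshold, interval) as shorthand for writing the stats/derivative/alert chain out by hand, roughly like this (a sketch, not copied verbatim from the docs):

// data|deadman(threshold, interval) is approximately:
data
    |stats(interval)
        .align()
    |derivative('emitted')
        .unit(interval)
        .nonNegative()
    |alert()
        // point count per interval fell to or below the threshold
        .crit(lambda: "emitted" <= threshold)

Writing it out yourself is what lets you chain in stateCount or stateDuration before the alert node.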

I was then able to use the duration value to generate the alerts. Another reason to separate them, for me, was that I already use the CRIT and WARN levels in the full script to monitor disk latency. That only left me with the INFO level, which I didn't like; doing it this way I could assign WARN and CRIT levels to both alert nodes.

I suppose in reality I don't actually use the deadman node as intended. Instead of alerting on the actual throughput (or lack of), I alert on the number of times the deadman alert would trigger in a set time frame.

I hope that makes sense; I tend to ramble a little when I get going.

EDIT: I just read your code again, having seen the topic on Stack Overflow. I notice you are using {{ index .Tags "host" }}; this won't work unless you change your groupBy variable to

var groupBy = ['host']

In all honesty, that might actually fix the issue you were having. Either way, your tag values will be empty unless you group by them.
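To make that concrete, in your script the change lands in the from() node, since that is where the grouping is applied (same variables as in your original script):

var groupBy = ['host']

var data = stream
    |from()
        .database(db)
        .retentionPolicy(rp)
        .measurement(measurement)
        // grouping by 'host' attaches the tag to each group,
        // so {{ index .Tags "host" }} resolves in the alert message
        .groupBy(groupBy)
        .where(whereFilter)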


#5

@philb Thanks, I will try that.
And I have added groupBy since then, thanks!