Hey,
I've been trying to solve this problem for two days.
All I want is to create a template that alerts when a process is down for X minutes or more!
This is my latest attempt:
var db = 'telegraf'
var rp = 'autogen'
var measurement string
var groupBy = []
var whereFilter = lambda: TRUE
var period = 5m
var interval = 1m
var name string
var idVar = name
var message = '{{.ID}} is {{.Level}} value: {{ index .Tags "host" }}'
var idTag = 'alertID'
var levelTag = 'level'
var messageField = 'message'
var durationField = 'duration'
var topic string
var threshold = 0.0
//var outputDB = 'chronograf'
//var outputRP = 'autogen'
//var outputMeasurement = 'alerts'
//var triggerType = 'deadman'
var data = stream
    |from()
        .database(db)
        .retentionPolicy(rp)
        .measurement(measurement)
        .groupBy(groupBy)
        .where(whereFilter)
    |window()
        .period(period)
        .every(interval)
    |mean('memory_rss')

var trigger = data
    |deadman(threshold, interval)
        .message(message)
        .id(idVar)
        .idTag(idTag)
        .levelTag(levelTag)
        .messageField(messageField)
        .durationField(durationField)
        .stateChangesOnly()
        .topic(topic)

trigger
    |eval(lambda: "emitted")
        .as('value')
        .keep('value', messageField, durationField)
    |eval(lambda: float("value"))
        .as('value')
        .keep()
Would the stateDuration node help achieve what you want, if you write the deadman node out manually? Something like this?
// Deadman switch
// Monitor SQL Prod for activity. If none is detected, trigger the deadman switch alert.
var deadman = data
    |stats(10m)
        .align()
    |derivative('emitted')
        .unit(10m)
        .nonNegative()
    |stateDuration(lambda: "emitted" <= 0.0)
        .unit(1m)
        .as('DeadDuration')
    |alert()
        .warn(lambda: "DeadDuration" >= 3)
        .stateChangesOnly()
        // note: .details() needs a 'details' variable defined above
        .details(details)
        .message(message)
        .id(idVar)
        .idTag(idTag)
        .levelTag(levelTag)
        .messageField(messageField)
        .durationField(durationField)
        .email('you@youremail.com')
That should count how long "emitted" <= 0.0 has been true; then, in the warn lambda, you specify the duration you want. I do something similar on a SQL Server instance I monitor: if the state duration exceeds 3 minutes, it alerts me.
Does that help? If not, then I've misunderstood the issue, in which case: apologies.
PhilB
EDIT: I haven't tested the above; it was adapted from another script of mine that uses stateCount. The rest of that script monitors disk latency, and I included the deadman switch in the same script.
Hi @philb, thanks for your answer. Why aren't you using the deadman node? I can see your var is named deadman, but the deadman node itself is never used.
The script above originally used the stateCount node. I'd used stateCount because sometimes I would get deadman alerts while the service/app was still running; only a data point was missing. By counting the state I was able to wait until I had 5 consecutive "criticals" before triggering the alert.
I just swapped the stateCount node for the stateDuration node before posting it here.
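For reference, the stateCount variant would look something like this. This is an untested sketch (field names and windows mirror the script above), with the threshold of 5 matching the "5 consecutive criticals" I mentioned:

```
// stateCount variant: count consecutive windows with no data
var deadman = data
    |stats(10m)
        .align()
    |derivative('emitted')
        .unit(10m)
        .nonNegative()
    |stateCount(lambda: "emitted" <= 0.0)
        .as('DeadCount')
    |alert()
        // alert once five consecutive windows have reported no throughput
        .warn(lambda: "DeadCount" >= 5)
        .stateChangesOnly()
```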
If I recall correctly, the deadman node and the stateCount node would not work together, so the alternative was to write the deadman part of the script using the derivative and stats nodes as above, which I think is what the deadman node itself is based on. You can see here: Deadman node
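To illustrate, per the Kapacitor documentation the deadman helper is roughly shorthand for the following stats/derivative/alert chain (a simplified sketch, not the exact generated id/message properties):

```
// What data|deadman(threshold, 10m) expands to, roughly:
data
    |stats(10m)
        .align()
    |derivative('emitted')
        .unit(10m)
        .nonNegative()
    |alert()
        .crit(lambda: "emitted" <= threshold)
```

Writing it out by hand like this is what lets you swap the alert node for stateCount or stateDuration.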
I was then able to use the value from the duration to generate the alerts. Another reason to separate them, for me, was that I already use the CRIT and WARN levels in the full script to monitor disk latency. That only left me with the INFO level, which I didn't like; doing it this way I could assign WARN and CRIT levels to both alert nodes.
I suppose in reality I don't actually use the deadman node as intended. Instead of alerting on the actual throughput (or lack of it), I alert on the number of times the deadman alert would trigger in a set time frame.
I hope that makes sense; I tend to ramble a bit once I get going.
EDIT: I just read your code again after seeing the topic on Stack Overflow. I notice you are using {{ index .Tags "host" }}; this won't work unless you change your groupBy variable to
var groupBy = ['host']
In all honesty, that might actually fix the issue you were having. Either way, your tag values will be empty unless you group by them.
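Concretely, only the groupBy line in the variable block at the top of your script needs to change for the message template to resolve the tag:

```
// Group by the 'host' tag so {{ index .Tags "host" }} has a value to render
var groupBy = ['host']
var message = '{{.ID}} is {{.Level}} value: {{ index .Tags "host" }}'
```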