Hi,
I am facing a strange problem. Right now I can trigger alerts through a “StateDuration” node, but I am not able to reset the alert through a second “StateDuration” node. From my point of view everything looks OK, but the reset is never triggered.
Here is my TICKscript:
var db = 'db_cmc'
var rp = 'autogen'
var measurement = 'ipmi_sensor'
var groupBy = ['server', 'name', 'status_desc']
var whereFilter = lambda: ("name" == 'psu1' OR "name" == 'psu2')
var name = 'CMC PSU Fault Check'
var alert_psu = '{{ index .Tags "server"}}-{{ index .Tags "name"}}'
// var idVar = name + '-{{.Group}}'
var idVar = name + alert_psu
var message = '
ID {{.ID}}
Name {{.Name}}
TaskName {{.TaskName}}
Level {{.Level}}
GroupBy {{.Group}}
Tags {{.Tags}}
Server {{ index .Tags "server" }}
PSU {{ index .Tags "name" }}
Fault {{ index .Tags "status_desc" }}
Time {{.Time}}
'
var idTag = 'alertID'
var levelTag = 'level'
var messageField = 'message'
var durationField = 'duration'
var outputDB = 'chronograf'
var outputRP = 'autogen'
var outputMeasurement = 'alerts'
var triggerType = 'threshold'
var data = stream
    |from()
        .database(db)
        .retentionPolicy(rp)
        .measurement(measurement)
        .groupBy(groupBy)
        .where(whereFilter)
    |stateDuration(lambda: ("status_desc" == 'presence_detected,_power_supply_ac_lost' OR "status_desc" == 'no_reading'))
        .unit(1m)
        .as('CritDuration')
    |stateDuration(lambda: ("status_desc" == 'presence_detected,_power_supply_ac_lost' OR "status_desc" == 'no_reading'))
        .unit(24h)
        .as('CritDurationReminder')
    |stateDuration(lambda: ("status_desc" == 'presence_detected'))
        .unit(1m)
        .as('OkDuration')
    |eval(lambda: ceil("CritDurationReminder"))
        .as('iCritDurationReminder')
        .keep()

var trigger = data
    |alert()
        // state duration crit
        .crit(lambda: "CritDuration" > 5.0)
        // set state to OK
        .critReset(lambda: "OkDuration" > 5.0)
        .message(message)
        .id(idVar)
        .idTag(idTag)
        .levelTag(levelTag)
        .messageField(messageField)
        .stateChangesOnly()
        .durationField(durationField)
        .log('/etc/kapacitor/templates/alert_logs/psu-fault.log')

trigger
    |eval(lambda: string("status_desc"))
        .as('value')
        .keep()
    |eval(lambda: string("server" + '-' + "name"))
        .as('host')
        .keep()
    |influxDBOut()
        .create()
        .database(outputDB)
        .retentionPolicy(outputRP)
        .measurement(outputMeasurement)
        .tag('alertName', name)
        .tag('triggerType', triggerType)

trigger
    |httpOut('output')

data
    |derivative('iCritDurationReminder')
        .as('fire_alert')
        .unit(10m)
        .nonNegative()
    |alert()
        .crit(lambda: "fire_alert" > 1.0)
        .stateChangesOnly()
        .noRecoveries()
        .message(message)
        .id(idVar + '-reminder')
        .idTag(idTag + '-reminder')
        .levelTag(levelTag)
        .messageField(messageField)
        .log('/etc/kapacitor/templates/alert_logs/psu-ac-fault-reminder.log')

In general, we check the status description of the BMC readings collected through the Telegraf collector.
If we pull the power cable, the alert fires after 5 minutes as expected. But when we replug the cable, the alert is never reset to the OK state, neither in the log file nor in the Chronograf db.
In the Chronograf alerts measurement I see only a single point, written when the alert fires, with “CritDuration = 5.5”, “OkDuration = -1” and “iCritDurationReminder = 0”.
So the second entry, which should appear when “status_desc” goes back to ‘presence_detected’, is missing. The strange thing is that we can see in the InfluxDB database db_cmc that status_desc changes as expected.
From my point of view this is a real problem, since we would not see any further PSU failures for a node whose alert is not able to recover.
I already realized that I have to remove the status_desc tag from the idVar variable, but now I really have no idea what could cause the problem. It would be great to get some help, or even a hint on how to debug the problem further.
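For further debugging, a minimal sketch that could be run as a separate task is shown below. It is not part of the task above; it assumes Kapacitor's `log` node is available in the installed version, and the prefix 'psu-debug' is only an illustrative name. It writes every point, with its group and the computed “OkDuration”, to the Kapacitor log, which could help to check whether points with different status_desc values end up in separate groups.

```
// Debugging sketch (assumption: run as its own task, not the original one).
stream
    |from()
        .database('db_cmc')
        .retentionPolicy('autogen')
        .measurement('ipmi_sensor')
        .groupBy('server', 'name', 'status_desc')
        .where(lambda: "name" == 'psu1' OR "name" == 'psu2')
    |stateDuration(lambda: "status_desc" == 'presence_detected')
        .unit(1m)
        .as('OkDuration')
    // Log each point together with its group, tags and fields.
    |log()
        .prefix('psu-debug')
        .level('INFO')
```

Running `kapacitor show` with the task name should additionally print per-node statistics, which may show whether points reach the alert node at all.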
Best Regards,
Stephan