State duration doesn't trigger a critReset within a TICKscript

Hi,

I am facing a strange problem. I can trigger alerts through a “StateDuration” node, but I am not able to reset the alert through a second “StateDuration” node. From my point of view everything looks OK, but the reset is never triggered.

My TICKscript follows:

var db = 'db_cmc'

var rp = 'autogen'

var measurement = 'ipmi_sensor'

var groupBy = ['server', 'name', 'status_desc']

var whereFilter = lambda: ("name" == 'psu1' OR "name" == 'psu2')

var name = 'CMC PSU Fault Check'

var alert_psu = '{{ index .Tags "server"}}-{{ index .Tags "name"}}'

// var idVar = name + '-{{.Group}}'
var idVar = name + alert_psu

var message = '
        ID  {{.ID}}
        Name  {{.Name}}
        TaskName  {{.TaskName}}
        Level {{.Level}}
        GroupBy  {{.Group}}
        Tags  {{.Tags}}
        Server  {{ index .Tags "server" }}
        PSU {{ index .Tags "name" }}
        Fault {{ index .Tags "status_desc" }}
        Time  {{.Time}}
'

var idTag = 'alertID'

var levelTag = 'level'

var messageField = 'message'

var durationField = 'duration'

var outputDB = 'chronograf'

var outputRP = 'autogen'

var outputMeasurement = 'alerts'

var triggerType = 'threshold'

var data = stream
    |from()
        .database(db)
        .retentionPolicy(rp)
        .measurement(measurement)
        .groupBy(groupBy)
        .where(whereFilter)
    |stateDuration(lambda: ("status_desc" == 'presence_detected,_power_supply_ac_lost' OR "status_desc" == 'no_reading'))
        .unit(1m)
        .as('CritDuration')
    |stateDuration(lambda: ("status_desc" == 'presence_detected,_power_supply_ac_lost' OR "status_desc" == 'no_reading'))
        .unit(24h)
        .as('CritDurationReminder')
    |stateDuration(lambda: ("status_desc" == 'presence_detected'))
        .unit(1m)
        .as('OkDuration')
    |eval(lambda: ceil("CritDurationReminder"))
        .as('iCritDurationReminder')
        .keep()

var trigger = data
    |alert()
        // state duration crit
        .crit(lambda: "CritDuration" > 5.0)
        // set state to OK
        .critReset(lambda: "OkDuration" > 5.0)
        .message(message)
        .id(idVar)
        .idTag(idTag)
        .levelTag(levelTag)
        .messageField(messageField)
        .stateChangesOnly()
        .durationField(durationField)
        .log('/etc/kapacitor/templates/alert_logs/psu-fault.log')

trigger
    |eval(lambda: string("status_desc"))
        .as('value')
        .keep()
    |eval(lambda: string("server" + '-' + "name"))
        .as('host')
        .keep()
    |influxDBOut()
        .create()
        .database(outputDB)
        .retentionPolicy(outputRP)
        .measurement(outputMeasurement)
        .tag('alertName', name)
        .tag('triggerType', triggerType)

trigger
    |httpOut('output')

data
    |derivative('iCritDurationReminder')
        .as('fire_alert')
        .unit(10m)
        .nonNegative()
    |alert()
        .crit(lambda: "fire_alert" > 1.0)
        .stateChangesOnly()
        .noRecoveries()
        .message(message)
        .id(idVar + '-reminder')
        .idTag(idTag + '-reminder')
        .levelTag(levelTag)
        .messageField(messageField)
        .log('/etc/kapacitor/templates/alert_logs/psu-ac-fault-reminder.log')

In general, we check the status description of the BMC readings collected by the Telegraf collector.

If we pull the power cable, the alert fires after 5 minutes as expected. But when we replug the cable, the alert is never reset to the OK state, neither in the log file nor in the Chronograf database.

In the Chronograf alerts measurement I only see that the alert fires with a single point with “CritDuration = 5.5”, “OkDuration = -1” and “iCritDurationReminder = 0”.

So the second entry, which should appear when “status_desc” goes back to ‘presence_detected’, is missing. The strange thing is that we can see in the InfluxDB database db_cmc that status_desc changes as expected.

From my point of view this is a real problem, since we will not see any further PSU failures for a node whose alert is not able to recover.

I already realized that I have to remove the status_desc tag from the idVar variable, but now I really have no idea what could cause the problem. It would be great to get some help, or even a hint on how to debug the problem further.
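One thing that might be worth checking here, purely as an assumption and not a confirmed cause: status_desc is one of the groupBy dimensions, so points with a different status_desc belong to a different group, and each stateDuration node tracks its state per group. A minimal, untested sketch that groups by server and name only, so that both states arrive in the same group, could look like this. It assumes that status_desc stays available to the lambda expressions as a tag on each point of a stream task.

```
// Hypothetical variant of the 'data' pipeline (untested sketch):
// group only by server and PSU name, so that points with different
// status_desc values are evaluated within the same group.
var data = stream
    |from()
        .database('db_cmc')
        .retentionPolicy('autogen')
        .measurement('ipmi_sensor')
        .groupBy('server', 'name')
        .where(lambda: "name" == 'psu1' OR "name" == 'psu2')
    // Minutes spent in the fault state, -1 while the expression is false
    |stateDuration(lambda: "status_desc" == 'presence_detected,_power_supply_ac_lost' OR "status_desc" == 'no_reading')
        .unit(1m)
        .as('CritDuration')
    // Minutes spent in the healthy state, -1 while the expression is false
    |stateDuration(lambda: "status_desc" == 'presence_detected')
        .unit(1m)
        .as('OkDuration')
```

If the assumption holds, a single group would see CritDuration drop to -1 and OkDuration start counting once the cable is replugged, which is the transition the critReset expression waits for.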
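As a possible starting point for debugging, here is a minimal sketch (untested, and only one of several options) that places a log node directly after the stateDuration nodes, so the CritDuration and OkDuration values can be followed per group in the Kapacitor log. The prefix 'psu-debug' is a made-up label for illustration.

```
// Debugging sketch: log every point after the stateDuration nodes.
stream
    |from()
        .database('db_cmc')
        .retentionPolicy('autogen')
        .measurement('ipmi_sensor')
        .groupBy('server', 'name', 'status_desc')
        .where(lambda: "name" == 'psu1' OR "name" == 'psu2')
    |stateDuration(lambda: "status_desc" == 'presence_detected,_power_supply_ac_lost' OR "status_desc" == 'no_reading')
        .unit(1m)
        .as('CritDuration')
    |stateDuration(lambda: "status_desc" == 'presence_detected')
        .unit(1m)
        .as('OkDuration')
    // Each logged point shows its group, tags and both duration fields
    |log()
        .level('INFO')
        .prefix('psu-debug')
```

Comparing the logged group and tags of the points before and after replugging the cable should show whether the OK points ever reach the same group that raised the critical alert.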

Best Regards,

Stephan

A small update.

After we removed the reminder alert from the TICKscript, we also see the recovery from the fault state.

We have absolutely no idea why the reminder alert causes this, so it would be nice if somebody could explain the root cause and how to implement this kind of requirement.

Best Regards,

Stephan