State duration doesn't trigger a CritReset within a TICKscript


I am facing a strange problem. Right now, I can trigger alerts through the “StateDuration” node, but I am not able to reset the alert through a second “StateDuration” node. From my point of view everything looks OK, but the reset is never triggered.

Here is my TICKscript:

var db = 'db_cmc'

var rp = 'autogen'

var measurement = 'ipmi_sensor'

var groupBy = ['server', 'name', 'status_desc']

var whereFilter = lambda: ("name" == 'psu1' OR "name" == 'psu2')

var name = 'CMC PSU Fault Check'

var alert_psu = '{{ index .Tags "server"}}-{{ index .Tags "name"}}'

// var idVar = name + '-{{.Group}}'
var idVar = name + alert_psu

var message = '
        ID  {{.ID}}
        Name  {{.Name}}
        TaskName  {{.TaskName}}
        Level {{.Level}}
        GroupBy  {{.Group}}
        Tags  {{.Tags}}
        Server  {{ index .Tags "server" }}
        PSU {{ index .Tags "name" }}
        Fault {{ index .Tags "status_desc" }}
        Time  {{.Time}}'

var idTag = 'alertID'

var levelTag = 'level'

var messageField = 'message'

var durationField = 'duration'

var outputDB = 'chronograf'

var outputRP = 'autogen'

var outputMeasurement = 'alerts'

var triggerType = 'threshold'

var data = stream
    |stateDuration(lambda: ("status_desc" == 'presence_detected,_power_supply_ac_lost' OR "status_desc" == 'no_reading'))
    |stateDuration(lambda: ("status_desc" == 'presence_detected,_power_supply_ac_lost' OR "status_desc" == 'no_reading'))
    |stateDuration(lambda: ("status_desc" == 'presence_detected'))
    |eval(lambda: ceil("CritDurationReminder"))

var trigger = data
        // state duration crit
        .crit(lambda: "CritDuration" > 5.0)
        // set state to OK
        .critReset(lambda: "OkDuration" > 5.0)

    |eval(lambda: string("status_desc"))
    |eval(lambda: string("server" + '-' + "name"))
        .tag('alertName', name)
        .tag('triggerType', triggerType)


        .crit(lambda: "fire_alert" > 1.0)
        .id(idVar + '-reminder')
        .idTag(idTag + '-reminder')

In general, we check the status description of the BMC readings collected by the Telegraf collector.

If we pull the power cable, the alert fires after 5 minutes as expected. But when we replug the cable, the alert is never reset to the OK state, neither in the log file nor in the Chronograf db.

In the Chronograf alerts measurement I only see that the alert fires with a single entry: “CritDuration = 5.5”, “OkDuration = -1” and “iCritDurationReminder = 0”.

So the second entry, the one for “status_desc” going back to ‘presence_detected’, is missing. The strange thing is that we can see in the InfluxDB database db_cmc that status_desc changes as expected.

From my point of view this is a real problem, since we won’t see any further PSU failures for a node whose alert is not able to recover.

I already realized that I have to remove the status_desc tag from the idVar variable, but now I really have no idea what could cause the problem. It would be great to get some help, or even a hint on how to debug the problem further.
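
One way to narrow this down might be a separate debug task that only computes the OK duration and logs every point, so it becomes visible whether “OkDuration” ever rises above 5 after the cable is replugged. This is only a minimal sketch, not a confirmed fix. It assumes that the duration is stored under the field name “OkDuration” and measured in minutes; the `.as()` and `.unit()` settings and the log prefix are my assumptions, and the database, measurement, grouping and filter are copied from the script above.

```
// Debug sketch: watch the OK-state duration per group in the Kapacitor log.
var debug = stream
    |from()
        .database('db_cmc')
        .retentionPolicy('autogen')
        .measurement('ipmi_sensor')
        // same grouping as the original task
        .groupBy('server', 'name', 'status_desc')
        .where(lambda: "name" == 'psu1' OR "name" == 'psu2')
    // the duration is -1 while the expression is false and counts up
    // from 0 once it becomes true
    |stateDuration(lambda: "status_desc" == 'presence_detected')
        .as('OkDuration')   // assumed field name
        .unit(1m)           // assumed unit: minutes
    // write each point, including its group tags, to the Kapacitor log
    |log()
        .prefix('psu-ok-duration')
        .level('INFO')
```

Because the log output includes the tags of each point, it should also show whether the ‘presence_detected’ points arrive in the same group as the fault points or in a different one. `kapacitor show` on the task prints per-node counters, which can help to see at which node the points stop flowing.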

Best Regards,


A small update.

After we removed the reminder alert from the TICKscript, we also see the recovery from the fault state.

We have absolutely no idea why the reminder alert is a problem, so it would be nice if somebody could explain the root cause and how to handle this kind of requirement.

Best Regards,