Hello,
I am using the stateDuration node to alert us when the CPU usage on a server has been above a certain level for a certain duration. For example, “WARNING” when greater than 60%, for 5 minutes. This is working fine. However, as soon as the CPU usage drops below 60% again, the alert immediately returns to the “OK” state.
I can see this is the expected behaviour as per the docs - “When a point evaluates as false, the state duration is reset.”
Is it possible to set a duration required before the alert returns to the “OK” state? For example: “WARNING when CPU usage is greater than 60% for 5 minutes. Return to OK when CPU usage has been below 60% for 5 minutes.”
Here is an example of the current TICK scripts we’re using (slightly slimmed down):
stream
    |from()
        .measurement('cpu')
    |where(lambda: ("cpu" == 'cpu-total') AND ("host" == 'ubuntu-xenial'))
    |groupBy('host')
    |stateDuration(lambda: "usage_idle" <= 40)
        .unit(1m)
        .as('warn_duration')
    |stateDuration(lambda: "usage_idle" <= 20)
        .unit(1m)
        .as('crit_duration')
    |alert()
        // Warn after 2 minutes
        .warn(lambda: "warn_duration" >= 2)
        // Crit after 5 minutes
        .crit(lambda: "crit_duration" >= 5)
        // Only alert when an alert is triggered or returns to normal
        .stateChangesOnly()
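To make the question more concrete, here is a rough sketch of the behaviour I am after. It is untested and rests on two assumptions: that the alert node's .warnReset() property keeps the alert at WARNING until its expression evaluates to true, and that a further stateDuration node can track how long usage has been back under the threshold. The field name 'ok_duration' is one I made up, and the 5-minute values are placeholders. The crit level is left out for brevity.

```
stream
    |from()
        .measurement('cpu')
    |where(lambda: ("cpu" == 'cpu-total') AND ("host" == 'ubuntu-xenial'))
    |groupBy('host')
    // How long CPU usage has been above 60% (idle at or below 40%)
    |stateDuration(lambda: "usage_idle" <= 40)
        .unit(1m)
        .as('warn_duration')
    // Hypothetical: how long CPU usage has been back below 60%
    |stateDuration(lambda: "usage_idle" > 40)
        .unit(1m)
        .as('ok_duration')
    |alert()
        // Warn after 5 minutes above the threshold
        .warn(lambda: "warn_duration" >= 5)
        // Assumption: stay at WARNING until usage has been low for 5 minutes
        .warnReset(lambda: "ok_duration" >= 5)
        .stateChangesOnly()
```

If the reset properties do not work this way, or there is a more idiomatic approach, I would be glad to hear it.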
Thanks in advance for any help. Please let me know if you need any further information.