Well, I’ve seen the task list check script from the link you sent me dozens of times, but I’m not sure how it relates to what I’m saying. That script doesn’t do any flap detection, so if the task runs fine but the check fails 30 times in a row, you’re going to get 30 notifications, which is obviously not what I want.
This is the CPU check task I’m currently using with stateDuration():
import "influxdata/influxdb/monitor"
option task = {name: "CPU usage", every: 15s}
check = {
_check_id: "cpu_usage_idle",
_check_name: "CPU Usage",
_type: "custom",
tags: {},
}
input = from(bucket: "system")
|> range(start: -15m)
|> filter(fn: (r) =>
(r["_measurement"] == "cpu" and r["_field"] == "usage_idle"))
|> group(columns: ["host", "_measurement"])
|> aggregateWindow(every: 1m, fn: min, createEmpty: false)
|> filter(fn: (r) =>
(exists r._value))
|> map(fn: (r) =>
({r with _value: 100.0 - r._value}))
|> stateDuration(fn: (r) =>
(r._value >= 90), column: "crit_duration", unit: 1m)
|> stateDuration(fn: (r) =>
(r._value >= 80 and r._value < 90), column: "warn_duration", unit: 1m)
|> stateDuration(fn: (r) =>
(r._value < 75), column: "ok_duration", unit: 1m)
crit = (r) =>
(r["crit_duration"] > 10)
warn = (r) =>
(r["warn_duration"] > 10)
ok = (r) =>
(r["ok_duration"] > 30)
messageFn = (r) =>
(if r._level == "crit" or r._level == "warn" then "${r.host}: High CPU usage (${string(v: int(v: r._value))}%)" else "${r.host}: CPU usage back to normal (${string(v: int(v: r._value))}%)")
input
|> monitor.check(
crit: crit,
warn: warn,
ok: ok,
messageFn: messageFn,
data: check,
)
Basically, I’m taking the maximum CPU usage value within each 1-minute aggregate across all cores (the cpu tag), and using stateDuration() to add hysteresis against flapping values and avoid alert spam. I want to trigger the CRIT/WARN notification only when the value has been within the CRIT/WARN range for 10 minutes, and the OK/recovery notification only when it has been OK for 30 minutes.
The problem with stateDuration() is that the matching values must be consecutive, otherwise the duration counter resets to -1. So the alert only recovers when ALL the values in the last 30 minutes are within the OK range. If an alert is at CRIT level because the CPU has been at 100% for a long time, and now the usage values are 1%, 2%, 0% all the time but every now and then I get a single 95% value, the max (or quantile) selector ruins the aggregate and the alert never recovers. It goes to UNKNOWN level, and when it goes back to OK/WARN/CRIT it triggers the notification again (this is why I receive new notifications with the same level).
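To make the reset behaviour concrete, here is a tiny synthetic example (just a sketch with made-up timestamps and values, using array.from() to fake the data):

import "array"

// Five 1-minute samples that are all "ok" (< 75) except for one spike
array.from(
    rows: [
        {_time: 2024-01-01T00:00:00Z, _value: 2.0},
        {_time: 2024-01-01T00:01:00Z, _value: 1.0},
        {_time: 2024-01-01T00:02:00Z, _value: 95.0},
        {_time: 2024-01-01T00:03:00Z, _value: 0.0},
        {_time: 2024-01-01T00:04:00Z, _value: 3.0},
    ],
)
    |> stateDuration(fn: (r) => r._value < 75.0, column: "ok_duration", unit: 1m)
// ok_duration comes out as 0, 1, -1, 0, 1: the single spike resets the counter,
// so it never reaches the 30 minutes needed for the recovery notification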
Now, this is pretty much the same task, but using aggregateWindow()+reduce()
import "influxdata/influxdb/monitor"
option task = {name: "CPU usage", every: 15s}
check = {
_check_id: "cpu_usage_idle",
_check_name: "CPU Usage",
_type: "custom",
tags: {},
}
crit_threshold = 90.0
warn_threshold = 80.0
ok_threshold = 75.0
input = from(bucket: "system")
|> range(start: -20m)
|> filter(fn: (r) =>
(r["_measurement"] == "cpu" and r["_field"] == "usage_idle"))
|> group(columns: ["host", "_measurement"])
|> aggregateWindow(every: 1m, fn: min, createEmpty: false)
|> filter(fn: (r) =>
(exists r._value))
|> map(fn: (r) =>
({r with _value: 100.0 - r._value}))
|> reduce(identity: {
total_count: 1.0,
crit_count: 0.0,
warn_count: 0.0,
ok_count: 0.0,
crit_idx: 0.0,
warn_idx: 0.0,
ok_idx: 0.0,
}, fn: (r, accumulator) =>
({
crit_count: if r._value >= crit_threshold then accumulator.crit_count + 1.0 else accumulator.crit_count + 0.0,
warn_count: if r._value < crit_threshold and r._value >= warn_threshold then accumulator.warn_count + 1.0 else accumulator.warn_count + 0.0,
ok_count: if r._value < ok_threshold then accumulator.ok_count + 1.0 else accumulator.ok_count + 0.0,
crit_idx: accumulator.crit_count / accumulator.total_count,
warn_idx: accumulator.warn_count / accumulator.total_count,
ok_idx: accumulator.ok_count / accumulator.total_count,
total_count: accumulator.total_count + 1.0,
}))
crit = (r) =>
(r["total_count"] > 10 and r["crit_idx"] >= 0.75)
warn = (r) =>
(r["total_count"] > 10 and r["warn_idx"] >= 0.75)
ok = (r) =>
(r["total_count"] > 10 and r["ok_idx"] >= 0.9)
messageFn = (r) =>
(if r._level == "crit" or r._level == "warn" then "${r.host}: High CPU usage (${string(v: int(v: r._value))}%)" else "${r.host}: CPU usage back to normal (${string(v: int(v: r._value))}%)")
input
|> monitor.check(
crit: crit,
warn: warn,
ok: ok,
messageFn: messageFn,
data: check,
)
Basically, I’m reducing each table to the number of rows meeting the crit/warn/ok conditions plus a count/total ratio for each one, and I trigger the alert when those ratios are high enough (0.75 means 75% of the rows are in the “warn” range).
Both scripts work fine, but in both cases I get a lot of repeated notifications, for example a WARNING a couple of minutes after the same WARNING (the statuses in _monitoring show it passes through the UNKNOWN state in between).
So, my main question is: how can I avoid alert spam due to flapping values? How can I get the OK/recovery notification only when things have been OK for, say, 20 minutes?
I also tried another approach: ignoring notifications if one has already been sent in the last X minutes. I tried reading from statuses in _monitoring and joining it to the CPU values, but I couldn’t manage to make it work, and it felt like I was reinventing the wheel.
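For the record, this is roughly the shape of what I was attempting: read the last status per host back with monitor.from(), grab the latest CPU value per host, and join the two on host. This is only a sketch, not my exact query; the -30m/-5m ranges and the final filtering idea are illustrative.

import "influxdata/influxdb/monitor"

// Last status written by this check for each host in the last 30 minutes
last_status = monitor.from(start: -30m, fn: (r) => r._check_id == "cpu_usage_idle")
    |> group(columns: ["host"])
    |> last(column: "_time")
    |> keep(columns: ["host", "_time", "_level"])

// Latest CPU usage sample per host
latest_cpu = from(bucket: "system")
    |> range(start: -5m)
    |> filter(fn: (r) => r["_measurement"] == "cpu" and r["_field"] == "usage_idle")
    |> group(columns: ["host"])
    |> last()

// The idea was to join on host and then only let rows through when the
// previous status is older than X minutes (or at a different level)
join(tables: {cpu: latest_cpu, status: last_status}, on: ["host"])

I never got the filtering/suppression part right, which is why it felt like reinventing the wheel.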