Flapping values / Hysteresis

I’m using InfluxDB 2.0.4 and I’m trying to set up a monitoring task that notifies me when any CPU core has high usage for a sustained period of time.

I can easily transform the data to get the output table I want, either with stateDuration() or with a combination of aggregateWindow() and reduce(). When I check in the Data Explorer, everything looks fine. That’s not a problem. The problem comes with the monitor/notification part: when the values are flapping near a threshold, I get a lot of unwanted (even repeated) notifications. Regardless of whether I set the notification rule condition to “when status changes from ANY to CRIT” or “when status is equal to CRIT”, I receive a lot of repeated CRIT notifications.

I also tried stateChangesOnly(). I’ve been using that function in my TICK alerts (InfluxDB 1.x) for years, but it doesn’t work in InfluxDB 2.x, and there’s very little documentation about it.

Also, there’s no flapping() function in Flux.

How can I deal with flapping values and hysteresis in InfluxDB 2.x?

Hello @vvilaplana,
Welcome! I’m sorry you’re having trouble with Flux. I would love to help!
Can you please share some input data and your Flux script?
You can always transform a task into a custom check and alert by adding the alert functionality into the task itself.
In this post, we learn how to use tasks in combination with checks for monitoring with InfluxDB:

This post talks about monitoring states with InfluxDB and the monitor.stateChangesOnly() function. This TL;DR assumes that you already know how to create a check:

Here’s an example of an alert task script:
https://docs.influxdata.com/influxdb/v2.0/monitor-alert/custom-checks/#example-alert-task-script
https://docs.influxdata.com/influxdb/v2.0/monitor-alert/custom-checks/
Here’s an example of using the Telegram Flux package to send alerts to Telegram:

Well, I’ve seen the task list check script from the link you sent me dozens of times, but I’m not sure how it’s related to what I’m saying. The script from your link doesn’t do any flap detection. So if a task runs fine and then fails 30 times in a row, you’re going to get 30 notifications, which obviously is not what I want.

This is the CPU check task I’m currently using with stateDuration():

import "influxdata/influxdb/monitor"

option task = {name: "CPU usage", every: 15s}

check = {
	_check_id: "cpu_usage_idle",
	_check_name: "CPU Usage",
	_type: "custom",
	tags: {},
}

input = from(bucket: "system")
	|> range(start: -45m) // widened from -15m: ok_duration > 30 (one-minute steps) cannot be reached in a 15-minute range
	|> filter(fn: (r) =>
		(r["_measurement"] == "cpu" and r["_field"] == "usage_idle"))
	|> group(columns: ["host", "_measurement"])
	|> aggregateWindow(every: 1m, fn: min, createEmpty: false)
	|> filter(fn: (r) =>
		(exists r._value))
	|> map(fn: (r) =>
		({r with _value: 100.0 - r._value}))
	|> stateDuration(fn: (r) =>
		(r._value >= 90), column: "crit_duration", unit: 1m)
	|> stateDuration(fn: (r) =>
		(r._value >= 80 and r._value < 90), column: "warn_duration", unit: 1m)
	|> stateDuration(fn: (r) =>
		(r._value < 75), column: "ok_duration", unit: 1m)

crit = (r) =>
	(r["crit_duration"] > 10)
warn = (r) =>
	(r["warn_duration"] > 10)
ok = (r) =>
	(r["ok_duration"] > 30)
messageFn = (r) =>
	(if r._level == "crit" or r._level == "warn" then "${r.host}: High CPU usage (${string(v: int(v: r._value))}%)" else "${r.host}: CPU usage back to normal (${string(v: int(v: r._value))}%)")

input
	|> monitor.check(
		crit: crit,
		warn: warn,
		ok: ok,
		messageFn: messageFn,
		data: check,
	)

Basically, I’m getting the maximum cpu usage value within a 1-min aggregate across all cores (cpu tag), and using stateDuration() to avoid spamming alerts, hysteresis and flapping values. I want to trigger CRIT/WARN notifications when they’ve been within the CRIT/WARN ranges for 10 mins, and I want to get the ok/recovery notification when it’s been 30 mins ok.

The problem with stateDuration() is that all values must be consecutive, otherwise the duration counter resets to -1. The alert will only recover when ALL the values for the last 30 mins are within the OK range. So if an alert is at CRIT level because the CPU has been at 100% for a long time, and now the CPU usage values are 1%, 2%, 0% almost all the time but every now and then I get a single 95% value, the max (or quantile) selector will ruin the aggregate and the alert will never recover. It will go to UNKNOWN level, and when it goes back to OK/WARN/CRIT it will trigger the notification (this is why I receive new notifications with the same level).
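To make the reset behavior concrete, here is a minimal sketch in Python rather than Flux, so it can be run on its own. The function state_duration is a hypothetical stand-in modeled on the behavior described above (-1 while the condition is false, a consecutive count otherwise); it is not the Flux implementation, and the sample values are made up.

```python
def state_duration(values, predicate):
    """Hypothetical model of a consecutive-run counter.

    Returns -1 for rows where the predicate is false, otherwise the
    number of steps since the current run started (0 on its first row).
    """
    durations = []
    run_start = None
    for i, value in enumerate(values):
        if predicate(value):
            if run_start is None:
                run_start = i  # a new run starts here
            durations.append(i - run_start)
        else:
            run_start = None  # any non-matching row resets the run
            durations.append(-1)
    return durations


# Made-up data: 41 one-minute samples, all idle except one 95% spike.
usage = [1.0] * 20 + [95.0] + [2.0] * 20
ok_duration = state_duration(usage, lambda v: v < 75)
longest_ok_run = max(ok_duration)
```

In this sample 40 of the 41 values are under 75%, yet the single spike splits the run in two, so a rule such as ok_duration > 30 never becomes true.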

Now, this is pretty much the same task, but using aggregateWindow() + reduce():

import "influxdata/influxdb/monitor"

option task = {name: "CPU usage", every: 15s}

check = {
	_check_id: "cpu_usage_idle",
	_check_name: "CPU Usage",
	_type: "custom",
	tags: {},
}
crit_threshold = 90.0
warn_threshold = 80.0
ok_threshold = 75.0
input = from(bucket: "system")
	|> range(start: -20m)
	|> filter(fn: (r) =>
		(r["_measurement"] == "cpu" and r["_field"] == "usage_idle"))
	|> group(columns: ["host", "_measurement"])
	|> aggregateWindow(every: 1m, fn: min, createEmpty: false)
	|> filter(fn: (r) =>
		(exists r._value))
	|> map(fn: (r) =>
		({r with _value: 100.0 - r._value}))
	|> reduce(identity: {
		total_count: 1.0,
		crit_count: 0.0,
		warn_count: 0.0,
		ok_count: 0.0,
		crit_idx: 0.0,
		warn_idx: 0.0,
		ok_idx: 0.0,
	}, fn: (r, accumulator) =>
		({
			crit_count: if r._value >= crit_threshold then accumulator.crit_count + 1.0 else accumulator.crit_count + 0.0,
			warn_count: if r._value < crit_threshold and r._value >= warn_threshold then accumulator.warn_count + 1.0 else accumulator.warn_count + 0.0,
			ok_count: if r._value < ok_threshold then accumulator.ok_count + 1.0 else accumulator.ok_count + 0.0,
			crit_idx: accumulator.crit_count / accumulator.total_count,
			warn_idx: accumulator.warn_count / accumulator.total_count,
			ok_idx: accumulator.ok_count / accumulator.total_count,
			total_count: accumulator.total_count + 1.0,
		}))
crit = (r) =>
	(r["total_count"] > 10 and r["crit_idx"] >= 0.75)
warn = (r) =>
	(r["total_count"] > 10 and r["warn_idx"] >= 0.75)
ok = (r) =>
	(r["total_count"] > 10 and r["ok_idx"] >= 0.9)
messageFn = (r) =>
	(if r._level == "crit" or r._level == "warn" then "${r.host}: High CPU usage (${string(v: int(v: r._value))}%)" else "${r.host}: CPU usage back to normal (${string(v: int(v: r._value))}%)")

input
	|> monitor.check(
		crit: crit,
		warn: warn,
		ok: ok,
		messageFn: messageFn,
		data: check,
	)

Basically, I’m reducing the table to the number of rows meeting the crit/warn/ok checks and the ratio of each count to the total, and I trigger the alert when those ratios are high enough (0.75 means 75% of the rows are in the “warn” range).
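The ratio idea can be sketched outside Flux like this. It is a simplified Python model, not a translation of the task: classify_window is a hypothetical name, the ratios are computed over the whole window at once rather than row by row, and the thresholds simply mirror the ones in the script.

```python
def classify_window(values, crit=90.0, warn=80.0, ok=75.0, min_rows=10):
    """Return "crit", "warn", "ok", or None for one window of CPU usage.

    A level is reported only when enough rows exist and a large enough
    share of them falls inside that level's range.
    """
    total = len(values)
    if total <= min_rows:
        return None  # not enough data to decide
    crit_idx = sum(v >= crit for v in values) / total
    warn_idx = sum(warn <= v < crit for v in values) / total
    ok_idx = sum(v < ok for v in values) / total
    if crit_idx >= 0.75:
        return "crit"
    if warn_idx >= 0.75:
        return "warn"
    if ok_idx >= 0.9:
        return "ok"
    return None  # no condition met (what ends up as UNKNOWN)


mostly_busy = classify_window([95.0] * 16 + [10.0] * 4)
one_spike = classify_window([1.0] * 19 + [95.0])
split = classify_window([95.0] * 10 + [10.0] * 10)
```

Unlike a consecutive-run counter, a single spike does not reset anything here. The split case still returns None, though, which is where the repeated notifications come from.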

Both scripts work fine, but in both cases I get a lot of repeated notifications: for example, a WARNING a couple of minutes after the same WARNING (_monitoring.statuses shows an UNKNOWN state in between).

So, my main question is: how can I avoid alert spamming due to flapping values? How can I get the OK/recovery notification only when things have been OK for, say, 20 mins?

I also tried another approach: ignoring a notification if one has already been sent in the last X mins. I tried to do this by reading from _monitoring.statuses and joining it to the CPU values, but I couldn’t make it work, and it seemed to me I was trying to reinvent the wheel.
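For reference, the “ignore if already sent in the last X mins” idea amounts to a per-level cooldown. The sketch below is plain Python with made-up timestamps; apply_cooldown is a hypothetical helper and does not read from _monitoring.statuses or any InfluxDB API.

```python
def apply_cooldown(events, cooldown_s):
    """Drop events whose level was already sent within cooldown_s seconds.

    events: list of (timestamp_seconds, level) tuples, sorted by time.
    Returns the events that would actually be sent.
    """
    last_sent = {}  # level -> timestamp of the last notification sent
    sent = []
    for timestamp, level in events:
        previous = last_sent.get(level)
        if previous is None or timestamp - previous >= cooldown_s:
            sent.append((timestamp, level))
            last_sent[level] = timestamp
    return sent


# Made-up status events, 10-minute cooldown per level.
events = [(0, "warn"), (120, "warn"), (300, "crit"), (700, "warn"), (720, "warn")]
sent = apply_cooldown(events, cooldown_s=600)
```

With these inputs only the events at 0, 300 and 700 seconds are sent. This hides repeats but does not stop the underlying state from flapping.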

@vvilaplana,
Do you think you could create those alerts, and then create a separate task drawing from “_monitoring” bucket and consolidate redundant/flapping alerts before sending a notification?
You can include the cpu level, timestamp, and any other information in the message so you don’t have to perform a join but can more easily narrow in on the times when you get a “true” alert.

Could you count the number of OK, CRIT, and WARN values for the past 30 min, calculate the percentage of OK, and send a notification if that percentage exceeds a threshold you set and X consecutive levels have been OK? I agree you’re basically going to have to create your own version of the stateDuration() function. I think this use case would be good to share with the Flux team. Also, I appreciate your patience while I try to write some Flux examples for your use case.

Yeah, I can remove the monitor part of those tasks and send the output to the _monitoring bucket, and then have a single “alerting” task that processes the _monitoring bucket and triggers the alerts. I don’t have any problem with that; my problem is “how do I avoid flapping values?” or “how do I keep the previous _level state instead of getting an UNKNOWN state?”

How? I don’t know how to do that and I haven’t seen any documentation about how to avoid spamming alerts due to flapping values or hysteresis.

Yeah, I tried that (it’s the first task I pasted here) but it has a defect: stateDuration() only counts how long values have consecutively met a condition. So if I have an alert that triggers when the CPU is >90% for a minute, and I get hundreds of values >90% in a single minute but A SINGLE value <90%, the alert won’t trigger. It’s better to use stateCount() to count how many values meet that condition and then calculate the percentage.

That problem is already solved. Regardless of whether I use stateDuration(), stateCount() or aggregateWindow() + reduce(), I’m hitting the same problem: what happens when the outcome is not Critical/Warning/Info/Ok? InfluxDB 1.x keeps the previous _level. InfluxDB 2.x sets _level to “UNKNOWN”, which is completely useless to me.

Please, don’t send me your link “How to Monitor States with InfluxDB” from Aug 2020, because I’ve studied it many times and it doesn’t address the issue I have.

I think this example perfectly explains my problem:
“A High CPU alert triggers CRIT when >90% for 10 mins, WARN when between 80% and 90% for 10 mins and OK when <75 for 20 mins.”

Let’s suppose I’m having values of 10% (OK)… after 20 mins, I’m getting the OK notification. Fine.
Now the CPU gets stressed (100%) and we receive the first value. Because it’s >90%, it won’t trigger WARN or OK, and because it hasn’t been >90% for 10 mins, it won’t trigger CRIT either, so the alert will be UNKNOWN, which is wrong. If no condition is met, I want it to keep the previous state.
Correct me if I’m wrong, but as far as I know, there is no monitor.xxxx() function that does that.
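The behavior I’m after can be sketched as follows, in Python rather than Flux so it can be run on its own. hold_previous and transitions are hypothetical helpers, not monitor package functions, and None stands for “no crit/warn/ok condition met”.

```python
def hold_previous(levels, initial="unknown"):
    """Replace None (no condition met) with the last defined level."""
    held = []
    current = initial
    for level in levels:
        if level is not None:
            current = level  # a condition fired: adopt the new level
        held.append(current)  # otherwise keep the previous one
    return held


def transitions(held):
    """Return (index, level) for each row where the held level changes."""
    events = []
    previous = None
    for i, level in enumerate(held):
        if level != previous:
            events.append((i, level))
        previous = level
    return events


# Made-up check results: None marks rows where nothing was decided yet.
raw = ["ok", None, None, "crit", None, "crit", None, "ok"]
held = hold_previous(raw)
notifications = transitions(held)
```

With the undecided rows filled from the previous level, the sequence yields three notifications (ok, crit, ok) instead of one per change in and out of UNKNOWN.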

So I guess the only option is to avoid monitor.check() and either:

  • feed the _monitoring bucket only when CRIT/WARN/INFO/OK conditions are met (this way we avoid populating _monitoring.statuses with UNKNOWN values and it will only contain “defined” states)

or

  • read the previous _level, put it in some $prevState variable and, if no condition is met, put the $prevState on _monitoring.statuses.

Both solutions seem to me to be working a bit too much “under the hood”, because we’re not using UI checks and we’re not using monitor tasks; we’re directly touching the internal _monitoring bucket. I’m not sure if that’s going to be performant/scalable. What is the best way to do that? Is there a full example?

Hello @vvilaplana,
Touching the _monitoring internal bucket is not too much under the hood. Not using UI checks likewise is not a problem. The UI checks are fairly limited in what they can do as you see.

I would write a custom check that meets your requirements (a High CPU alert triggers CRIT when >90% for 10 mins, WARN when between 80% and 90% for 10 mins, and OK when <75% for 20 mins, filling in the previous level otherwise) and writes those statuses to a new bucket (“my custom alerts”). Then you can create another custom task/alert to send notifications for the statuses you want. This is similar to the first bullet you suggested.

Let me try to get someone from the Flux team to weigh in on this to make sure though. I feel your pain and I appreciate your patience.

@Anaisdg, @vvilaplana I have a potential solution (I qualify it with “potential” because I find the behavior of monitor.stateChanges() to be somewhat erratic).

This uses map() to manually assign a _level to each row based on the CPU usage (you have to have a _level column to use monitor.stateChanges()). It then uses monitor.stateChanges() to (in theory) return only rows where the state changes. It then uses the events.duration() function (an awesome function contributed by a community member) to calculate the duration of each state. Finally, it uses the state duration and the _level to determine the alert level before writing it to the _monitoring bucket.

import "influxdata/influxdb/monitor"
import "contrib/tomhollingworth/events"

option task = {name: "CPU usage", every: 15s}

check = {
	_check_id: "cpu_usage_idle",
	_check_name: "CPU Usage",
	_type: "custom",
	tags: {},
}

input = from(bucket: "default")
	|> range(start: -15m)
	|> filter(fn: (r) =>
		(r["_measurement"] == "cpu" and r["_field"] == "usage_idle"))
	|> group(columns: ["host", "_measurement"])
	|> aggregateWindow(every: 1m, fn: min, createEmpty: false)
	|> filter(fn: (r) => (exists r._value))
	|> map(fn: (r) => ({r with _value: 100.0 - r._value}))
	|> map(fn: (r) => ({r with _level:
		if r._value >= 90 then "crit"
		else if r._value >= 80 then "warn"
		else "ok"
	}))
	|> monitor.stateChanges()
	|> events.duration(unit: 1m)

crit = (r) =>	(r._level == "crit" and r.duration > 10)
warn = (r) =>	(r._level == "warn" and r.duration > 10)
ok = (r) =>	(r._level == "ok" and r.duration > 30)
messageFn = (r) =>	(if r._level == "crit" or r._level == "warn" then "${r.host}: High CPU usage (${string(v: int(v: r._value))}%)" else "${r.host}: CPU usage back to normal (${string(v: int(v: r._value))}%)")

input
  |> monitor.check(
    crit: crit,
    warn: warn,
    ok: ok,
    messageFn: messageFn,
    data: check,
  )

Nah, it doesn’t work. I just tested it and it writes a lot of _level = UNKNOWN rows to _monitoring.statuses when none of the conditions is met (for example, CPU values >90% while the duration is still <10m). Is there any way to keep the previous _level when no crit/warn/ok condition is met instead of populating it with “UNKNOWN”? Anyway, that monitor.stateChanges() call runs on the “data” bucket (so it will only detect changes in the last 15 mins), not on the _monitoring bucket.