Hi,
Is there a way to create a check/task that doesn't trigger on the first threshold breach, but only after some amount of time X?
For example, I want to run the check every minute, and only trigger the event once the value has been above the threshold 5 times in a row, not before.
Hello @Suhanbongo,
Thanks for asking this question. I’ve been meaning to write up a tutorial on this so thank you for the inspiration.
If you’ve set up a check with the UI then you could add a task like this:
import "date"
import "influxdata/influxdb/tasks"
today = date.truncate(t: now(), unit: 1d)
cutoff = tasks.lastSuccess(orTime: -1m) // runs every minute
from(bucket: "_monitoring")
|> range(start: today)
|> filter(fn: (r) => r["_measurement"] == "statuses")
|> filter(fn: (r) => r["_check_name"] == "Query Rate Limit Check")
|> filter(fn: (r) => r["_field"] == "_message")
|> filter(fn: (r) => r["_level"] == "crit" and exists r._value)
|> stateCount(fn: (r) => r._level == "crit", column: "crit_query_counter")
|> filter(fn: (r) => r.crit_query_counter % 5 == 0 and r._time >= cutoff)
|> map(fn:(r) => ({ r with crit_query_counter: string(v: r.crit_query_counter)}))
|> map(fn:(r) => ({r with _measurement: "statuses_counter"}))
|> to(bucket: "_monitoring")
It checks the _monitoring bucket to see if you have any crit levels for the entire day. Then it uses the stateCount() function to count the number of consecutive crit levels. It filters for any counter value that is divisible by 5 and that was reached in the last minute (or since the last successful task run). We store this count in a new column as a string so we can more easily include it in an alert message without having to perform interpolation (optional; you can interpolate with ${}). Then we change the _measurement name so we can write this value back into the _monitoring bucket. We can now set up a notification based on the "statuses_counter" measurement in the UI.
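If you'd rather not use the UI for that last step, a notification task on the new measurement could look roughly like this. This is a minimal sketch: the task name, schedule, and endpoint URL are placeholders for whatever your ticketing or chat system expects.

import "http"
import "json"

option task = {name: "statuses_counter notifier", every: 1m, offset: 10s}

from(bucket: "_monitoring")
    |> range(start: -task.every)
    |> filter(fn: (r) => r._measurement == "statuses_counter")
    |> map(
        fn: (r) => ({r with sent: http.post(
            // Placeholder endpoint; replace with your webhook or ticketing URL.
            url: "https://example.com/ticket",
            headers: {"Content-Type": "application/json"},
            // _value holds the _message field written by the task above.
            data: json.encode(v: {check: r._check_name, message: r._value}),
        )}),
    )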
Lmk if this makes sense. I will also include an example of incremental stateCount that isn't limited to a single day, but I need to get to that tomorrow. Please tag me if I haven't responded with a full answer by end of day tomorrow. Thank you!
Hi @Anaisdg
Thanks for the explanation, it helped, but I think it's not suitable for our needs.
Let me explain our situation/requirements:
We're getting performance data from an enterprise storage system into our InfluxDB.
We need to monitor these metrics against predefined thresholds with two levels (warn and crit).
In some cases the requirement is to check the metrics:
- every minute, and only create a ticket after the 5th threshold breach
- every minute, and only create a ticket after the 10th threshold breach
- every minute, and only create a ticket after the 30th threshold breach
And there is a twist: we need to read some "static" data from a second bucket where the storage class information is stored (every storage class has different thresholds).
Btw, I noticed that if I create a check and the first outcome after it's enabled is not OK, then a notification rule set to trigger from ANY to CRIT is not fired. Is that normal behavior?
I also noticed that if our check returns 5, 8, 10, 12, however many results after an execution, the notification rule does not pick up all of them, only one or a few, so we lose those events and they are not ticketed.
@Suhanbongo,
Thanks for explaining in more detail. I will write a task that meets your needs and get back to you.
Btw, I noticed that if I create a check and the first outcome after it's enabled is not OK, then a notification rule set to trigger from ANY to CRIT is not fired. Is that normal behavior?
No, that sounds like a bug.
I also noticed that if our check returns 5, 8, 10, 12, however many results after an execution, the notification rule does not pick up all of them, only one or a few, so we lose those events and they are not ticketed.
Do you mean in between task runs?
@Anaisdg Thanks!
Yes. The check/task runs every minute and the notification rule every two minutes, so let's say there are 8 results (threshold breaches) from the check, but only a few of them are sent out by the notification rule (ticketed).
Hello @Suhanbongo,
For this reason, I sometimes prefer to write custom alert tasks where the checks and notifications live in the same task. Here's an example:
import "array"
import "slack"
option task = { name: "Event Alert", every: 1h0m0s, offset: 5m0s }
alert = (eventValue, threshold) =>
(if eventValue >= threshold then slack.message(
url: "https://hooks.slack.com/services/####/####/####",
text: "An alert event has occurred! The number of field values= \"${string(v: eventValue)}\".",
color: "warning",
) else 0)
data = from(bucket: "bucket1")
|> range(start: -task.every, stop: now())
|> filter(fn: (r) =>
(r._measurement == "measurement1" and r._field == "field1" and exists r._value))
|> sum()
data_0 = array.from(rows: [{_value: 0}])
events = union(tables: [data_0, data])
|> group()
|> sum()
|> findRecord(fn: (key) =>
(true), idx: 0)
eventTotal = events._value
data_0
|> yield(name: "ignore")
alert(eventValue: eventTotal, threshold: 1)
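One note on the pattern above: union() merges the real query with the one-row data_0 table so that findRecord() always has a record to read, which means eventTotal falls back to 0 instead of producing a runtime error when the query returns no rows. The trailing yield() just gives the task a stream to output.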
@Anaisdg Thanks, we will experiment with these…
But our main issue at the moment is that if we integrate http.post into the task and skip the _monitoring bucket, and the (complex) query returns more than one result, only one is sent out to the HTTP endpoint.
This is the puzzle we want to solve:
- bucket1: query the performance data (storage latency), which changes constantly
- bucket2: query the system classification data, which is more or less static
- assign different thresholds (at least two levels) based on the system class
- incorporate the requirement that an alert should only fire after 10 or more consecutive threshold breaches
- then forward all alerts to an HTTP endpoint that creates an incident in the ticketing system
Hello @Suhanbongo,
If you call the http.post() function as part of a map() you will notify on every record that meets your alert criteria (or every remaining record).
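For example, a minimal sketch (the bucket, measurement, tag, threshold, and endpoint are all placeholders):

import "http"
import "json"

from(bucket: "bucket1")
    |> range(start: -1m)
    |> filter(fn: (r) => r._measurement == "latency" and r._value > 100.0)
    |> map(
        fn: (r) => ({r with statusCode: http.post(
            // One POST is issued per record that passed the filter above.
            url: "https://example.com/incident",
            headers: {"Content-Type": "application/json"},
            data: json.encode(v: {host: r.host, value: r._value}),
        )}),
    )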
Here is a fuller example, taken from the TICKscript to Flux blog:
// Step 1: import Flux packages
import "influxdata/influxdb/monitor"
import "influxdata/influxdb/schema"
import "math"
import "slack"
// Step 2: define your task options.
// Always include an offset to avoid read and write conflicts. Period and every are defined by the every parameter.
option task = {
name: "generic",
every: 10s,
offset: 2s,
}
// Step 3: Define your thresholds.
infoVal = <info_level>
okVal = <ok_level>
warnVal = <warn_level>
critVal = <crit_level>
infoSig = 1.0
warnSig = 2.0
critSig = 3.0
// Step 4: Query for data.
// Data is grouped by tags or host by default, so there is no need to groupBy('host') as with line 28 in generic_batch_example.tick.
data = from(bucket: "<bucket>")
    |> range(start: -task.every)
    |> filter(fn: (r) => r._measurement == "<measurement>")
    |> filter(fn: (r) => r.host == "hostValue1" or r.host == "hostValue2")
    |> filter(fn: (r) => r._field == "stat")
// Step 5: Calculate the mean and standard deviation instead of .sigma and extract the scalar value.
// Calculate mean from sample and extract the value with findRecord()
mean_val = (data
    |> mean(column: "_value")
    // Insert yield() statements to visualize how your data is being transformed.
    // |> yield(name: "mean_val")
    |> findRecord(fn: (key) => true, idx: 0))._value

// Calculate standard deviation from sample and extract the value with findRecord()
stddev_val = (data
    |> stddev()
    // Insert yield() statements to visualize how your data is being transformed.
    // |> yield(name: "stddev")
    |> findRecord(fn: (key) => true, idx: 0))._value
// Step 6: Create a custom message to alert on data
alert = (level, type, eventValue) => {
    slack.message(
        // Will send alerts to the #notifications-testing channel in the InfluxData Slack Community
        url: "https://hooks.slack.com/services/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",
        text: "An alert \"${string(v: type)}\" event has occurred! The number of field values = \"${string(v: eventValue)}\".",
        color: "warning",
    )
    return level
}
data
    // Step 7: Map across values and return the number of stddevs as the level, sending the custom Slack message defined in the alert() function along the way.
    |> map(
        fn: (r) => ({r with level:
            if r._value < mean_val + math.abs(x: stddev_val) and r._value > mean_val - math.abs(x: stddev_val) or r._value > infoVal then
                alert(level: 1, type: "info", eventValue: r._value)
            else if r._value < mean_val + math.abs(x: stddev_val) * float(v: 2) and r._value > mean_val - math.abs(x: stddev_val) * float(v: 2) or r._value > okVal then
                alert(level: 2, type: "ok", eventValue: r._value)
            else if r._value < mean_val + math.abs(x: stddev_val) * float(v: 3) and r._value > mean_val - math.abs(x: stddev_val) * float(v: 3) or r._value > warnVal then
                alert(level: 3, type: "warn", eventValue: r._value)
            else
                alert(level: 4, type: "crit", eventValue: r._value)}),
    )
    // Use the to() function to write the level created by the map() function if you desire. This is not shown.
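To tie this back to your list above, here is a rough sketch of the whole pipeline. Every bucket, measurement, tag, class name, threshold, and endpoint below is an invented placeholder you would need to adapt: it joins the latency data from bucket1 with the storage-class data from bucket2, assigns a per-class threshold, counts consecutive breaches with stateCount(), and posts every 10th consecutive breach to an HTTP endpoint.

import "http"
import "json"
import "join"

option task = {name: "latency tickets", every: 1m, offset: 10s}

// Latency samples; assumes a "system" tag identifying each storage system.
latency = from(bucket: "bucket1")
    |> range(start: -1h) // must cover enough history for the consecutive count
    |> filter(fn: (r) => r._measurement == "performance" and r._field == "latency")
    |> group(columns: ["system"])

// "Static" classification data; assumes a storage_class field keyed by system.
classes = from(bucket: "bucket2")
    |> range(start: -30d)
    |> filter(fn: (r) => r._measurement == "classification" and r._field == "storage_class")
    |> last()
    |> group(columns: ["system"])

join.left(
    left: latency,
    right: classes,
    on: (l, r) => l.system == r.system,
    as: (l, r) => ({l with storage_class: r._value}),
)
    // Invented per-class thresholds; substitute your real warn/crit values.
    |> map(fn: (r) => ({r with crit: if exists r.storage_class and r.storage_class == "gold" then 5.0 else 20.0}))
    |> stateCount(fn: (r) => r._value > r.crit, column: "breaches")
    // stateCount() resets to -1 when the predicate is false, so this keeps only the 10th, 20th, ... consecutive breach.
    |> filter(fn: (r) => r.breaches >= 10 and r.breaches % 10 == 0)
    |> map(
        fn: (r) => ({r with sent: http.post(
            url: "https://example.com/ticket", // placeholder ticketing endpoint
            headers: {"Content-Type": "application/json"},
            data: json.encode(v: {system: r.system, latency: r._value, class: r.storage_class}),
        )}),
    )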