Trigger an alert only after the threshold has been breached X times

Hi,

Is there a way to create a check/task which does not trigger on the first threshold breach, but only after X consecutive breaches?
For example, I want to run the check every minute and only trigger the event once the value has been above the threshold 5 times in a row, not before.

Hello @Suhanbongo,
Thanks for asking this question. I’ve been meaning to write up a tutorial on this so thank you for the inspiration.

If you’ve set up a check with the UI then you could add a task like this:

import "date"
import "influxdata/influxdb/tasks"

today = date.truncate(t: now(), unit: 1d)
cutoff = tasks.lastSuccess(orTime: -1m) // runs every minute

from(bucket: "_monitoring")
    |> range(start: today)
    |> filter(fn: (r) => r["_measurement"] == "statuses")
    |> filter(fn: (r) => r["_check_name"] == "Query Rate Limit Check")
    |> filter(fn: (r) => r["_field"] == "_message")
    |> filter(fn: (r) => r["_level"] == "crit" and exists r._value)
    |> stateCount(fn: (r) => r._level == "crit", column: "crit_query_counter")
    |> filter(fn: (r) => r.crit_query_counter % 5 == 0 and r._time >= cutoff)
    |> map(fn:(r) => ({ r with crit_query_counter: string(v: r.crit_query_counter)}))
    |> map(fn:(r) => ({r with _measurement: "statuses_counter"}))
    |> to(bucket: "_monitoring")

It checks the _monitoring bucket to see whether you have any crit levels for the current day. Then it uses the stateCount() function to count consecutive crit levels. It filters for any counter value that is divisible by 5 and was reached in the last minute or since the last successful task run. We store this "5 times in a row" count in a new column as a string so we can more easily send an alert without having to perform interpolation (this is optional; you can interpolate with ${}). Then we change the _measurement name so we can write this value back into the _monitoring bucket. We can now set up a notification based on the “statuses_counter” measurement in the UI.
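If you'd rather keep the notification in Flux as well, here's a minimal sketch of a companion task that posts each "statuses_counter" record to a webhook. The endpoint URL is hypothetical, and I'm using the _message field value we wrote above as the alert text:

import "http"
import "json"

option task = {name: "statuses_counter notification", every: 1m, offset: 10s}

from(bucket: "_monitoring")
    |> range(start: -task.every)
    |> filter(fn: (r) => r._measurement == "statuses_counter")
    |> filter(fn: (r) => r._field == "_message")
    // http.post() returns the response status code; calling it inside map()
    // sends one request per "5 in a row" event
    |> map(
        fn: (r) => ({r with sent:
            http.post(
                url: "https://example.com/notify",  // hypothetical endpoint
                headers: {"Content-Type": "application/json"},
                data: json.encode(v: {text: "Check message: ${r._value}"}),
            )
        })
    )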

Lmk if this makes sense. I will also include an example for incremental stateCount, one that isn't limited to a single day, but I need to get to that tomorrow. Please tag me if I haven’t responded with a full answer by end of day tomorrow. Thank you!

Hi @Anaisdg

Thanks for the explanation, it helped, but I think it's not suitable for our needs.
Let me explain our situation/requirements.

We are getting performance data from an enterprise storage system into our InfluxDB.
We need to monitor these metrics against predefined thresholds with two levels (warn and crit).
In some cases the requirement is to check the metrics

  • every minute, and after the 5th threshold breach we need to create a ticket
  • every minute, and after the 10th threshold breach we need to create a ticket
  • every minute, and after the 30th threshold breach we need to create a ticket

And there is a twist: we need to read some “static” data from a second bucket where the storage class information is stored (every storage class has different thresholds).
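Roughly, this is the kind of two-bucket lookup we have in mind; just a sketch, the bucket, measurement, field, and tag names are made up:

latency = from(bucket: "bucket1")
    |> range(start: -1m)
    |> filter(fn: (r) => r._measurement == "storage_latency" and r._field == "latency_ms")

classes = from(bucket: "bucket2")
    |> range(start: -30d)  // the classification data rarely changes
    |> filter(fn: (r) => r._measurement == "storage_class" and r._field == "crit_threshold")
    |> last()

// Join on the storage system tag, then compare each value against its class-specific threshold
join(tables: {metric: latency, class: classes}, on: ["system"])
    |> map(fn: (r) => ({r with _level: if r._value_metric > r._value_class then "crit" else "ok"}))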

Btw I noticed that if I create a check and, once it's enabled, the first outcome is not OK, then a notification rule set to trigger from ANY to CRIT is not fired. Is that normal behavior?
Besides, I also noticed that if our check returns 5, 8, 10, 12, or however many results after execution, the notification rule does not pick up all of them, only one or a few, so we lose the events and they are not ticketed.

@Suhanbongo,
Thanks for explaining in more detail. I will write a task that meets your needs and get back to you.

Btw I noticed that if I create a check and, once it's enabled, the first outcome is not OK, then a notification rule set to trigger from ANY to CRIT is not fired. Is that normal behavior?

No, that sounds like a bug.

Besides, I also noticed that if our check returns 5, 8, 10, 12, or however many results after execution, the notification rule does not pick up all of them, only one or a few, so we lose the events and they are not ticketed.

Do you mean in between task runs?

@Anaisdg Thanks!

Yes, the check/task runs every minute and the notification rule every two, so let's say there are 8 results (threshold breaches) from the check, but only a few of them are sent out by the notification rule (ticketed).

Hello @Suhanbongo,
For this reason, I sometimes prefer to write custom alert tasks where the checks and notifications are in the same task. Here are some examples:

import "array"
import "slack"

option task = { name: "Event Alert", every: 1h0m0s, offset: 5m0s }

// Send a Slack message when the event count reaches the threshold;
// slack.message() returns the HTTP status code, so return 0 otherwise
alert = (eventValue, threshold) =>
    (if eventValue >= threshold then
        slack.message(
            url: "https://hooks.slack.com/services/####/####/####",
            text: "An alert event has occurred! The number of field values= \"${string(v: eventValue)}\".",
            color: "warning",
        )
    else
        0)

data = from(bucket: "bucket1")
    |> range(start: -task.every, stop: now())
    |> filter(fn: (r) => r._measurement == "measurement1" and r._field == "field1" and exists r._value)
    |> sum()

// A default table with a 0 count so the union below is never empty,
// even when the query above returns no results
data_0 = array.from(rows: [{_value: 0}])

events = union(tables: [data_0, data])
    |> group()
    |> sum()
    |> findRecord(fn: (key) => true, idx: 0)

eventTotal = events._value

// Tasks must output a stream of tables, so yield the dummy table...
data_0
    |> yield(name: "ignore")

// ...and fire the alert if the total crossed the threshold
alert(eventValue: eventTotal, threshold: 1)

@Anaisdg Thanks, we will experiment with these…
but our main issue at the moment is that if we integrate http.post() into the task and skip the _monitoring bucket, then when the (complex) query returns more than one result, only one is sent out to the HTTP endpoint.

This is the puzzle we want to solve:

  • bucket1: query the data (storage latency) - this changes constantly
  • bucket2: query system classification data - more or less static
  • assign different thresholds (at least two levels) based on the system class
  • incorporate the requirement that an alert should only fire after 10 or more consecutive threshold breaches
  • then forward all alerts to an HTTP endpoint which creates an incident in the ticketing system

Hello @Suhanbongo,
If you call the http.post() function as part of a map(), you will send a notification for every record that meets your alert criteria (or every remaining record).
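For instance, here is a rough sketch that also folds in your consecutive-breach requirement with stateCount(); the endpoint, the threshold, and all names are made up:

import "http"
import "json"

from(bucket: "bucket1")
    |> range(start: -15m)
    |> filter(fn: (r) => r._measurement == "storage_latency" and r._field == "latency_ms")
    // Count consecutive records above the (made-up) threshold
    |> stateCount(fn: (r) => r._value > 100.0, column: "breaches")
    |> filter(fn: (r) => r.breaches >= 10)
    // One POST per remaining record, so no event is dropped between runs
    |> map(
        fn: (r) => ({r with status:
            http.post(
                url: "https://ticketing.example.com/incidents",  // hypothetical endpoint
                headers: {"Content-Type": "application/json"},
                data: json.encode(v: {system: r.system, latency: r._value, breaches: r.breaches}),
            )
        })
    )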
A fuller example of this per-record pattern, using slack.message() instead, appears in the TICKscript to Flux blog:

// Step 1: import Flux packages
import "influxdata/influxdb/monitor"
import "influxdata/influxdb/schema"
import "math"
import "slack"

// Step 2: define your task options. 
// Always include an offset to avoid read and write conflicts. Period and every are defined by the every parameter.
option task = {
    name: "generic",
    every: 10s,
    offset: 2s,
}

// Step 3: Define your thresholds.
infoVal = <info_level>
okVal = <ok_level>
warnVal = <warn_level>
critVal = <crit_level>
infoSig = 1.0
warnSig = 2.0
critSig = 3.0

// Step 4: Query for data.
// Data is grouped by tags or host by default, so there's no need for groupBy('host') as with line 28 in generic_batch_example.tick.
data = from(bucket: "<bucket>")
   |> range(start: -task.every)
   |> filter(fn: (r) => r._measurement == "<measurement>")
   |> filter(fn: (r) => r.host == "hostValue1" or r.host == "hostValue2")
   |> filter(fn: (r) => r._field == "stat")

// Step 5: Calculate the mean and standard deviation instead of .sigma and extract the scalar value. 

// Calculate mean from sample and extract the value with findRecord()
mean_val = (data
   |> mean(column: "_value")
   // Insert yield() statements to visualize how your data is being transformed. 
   // |> yield(name: "mean_val")
   |> findRecord(fn: (key) => true, idx: 0))._value

// Calculate standard deviation from sample and extract the value with findRecord()
stddev_val = (data
   |> stddev()
   // Insert yield() statements to visualize how your data is being transformed. 
   // |> yield(name: "stddev")
   |> findRecord(fn: (key) => true, idx: 0))._value

// Step 6: Create a custom message to alert on data
alert = (level, type, eventValue)  => {
slack.message(
      // Will send alerts to the #notifications-testing channel in the InfluxData Slack Community
      url: "https://hooks.slack.com/services/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 ",
      text: "An alert \"${string(v: type)}\" event has occurred! The number of field values= \"${string(v: eventValue)}\".",
      color: "warning",
      )
      return level
      }
data
   // Step 7: Map across values and return the number of stddev to the level as well as a custom slack message defined in the alert() function.
   |> map(
       fn: (r) => ({r with
level: if r._value < mean_val + math.abs(x: stddev_val) and r._value > mean_val - math.abs(x: stddev_val) or r._value > infoVal then
             alert(level: 1, type: info, eventValue: r._value)
           else if r._value < mean_val + math.abs(x: stddev_val) * float(v: 2) and r.airTemperature > mean_val - math.abs(x: stddev_val) * float(v: 2) or r._value > okVal then
             alert(level: 2, type: ok, eventValue: r._value)
           else if r._value < mean_val + math.abs(x: stddev_val) * float(v: 3) and r.airTemperature > mean_val - math.abs(x: stddev_val) * float(v: 3) or r._value > warnVal then
             alert(level: 3, type: warn, eventValue: r._value)
           else
              alert(level: 4, type: crit, eventValue: r._value)
)

   // Use the to() function to write the level created by the map() function if you desire.
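For completeness, that optional step might look roughly like this, appended to the map() pipeline above (the bucket and field names here are placeholders):

   // Step 8 (optional): persist the computed level so other tasks or notification rules can read it
   |> to(bucket: "alert_levels", fieldFn: (r) => ({"level": r.level}))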