Create alert to check that the number of running instances is not less than 3

dogan · April 28, 2023, 2:40pm

Hello, I have the following data coming into InfluxDB, and I want an alert notification when the number of instances that are up and running is less than 3.

_measurement        _field   _value                       cnpg.io/instanceName
cnpg_collector_up   gauge    0 or 1 (0 = down, 1 = up)    instance-1, …, instance-n

For this, I thought about using a threshold check that sums the instances that were up in the last minute and sends a warning notification if the total is less than 3. However, I am having a hard time writing the query for the threshold check. The UI is quite limiting (for example, it does not let me use sum or group operations), and when I create the check via the JavaScript API, it fails without giving much information: the UI just shows "Last Run Status: Completed(failed)" without any details.

The only errors in the logs are:
ts=2023-04-28T14:33:15.017598Z lvl=info msg="Error exhausting result iterator" log_id=0hP2dMN0000 service=task-executor error="unknown column \"_source_measurement\"" name=wide-to19
ts=2023-04-28T14:33:15.022706Z lvl=debug msg="Execution failed" log_id=0hP2dMN0000 service=task-executor error="could not execute task run: unknown column \"_source_measurement\"" taskID=0b1e0c524ee96000

And this is the query I wrote to get the number of instances that are up:

from(bucket: "telegraf")
    |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
    |> filter(fn: (r) => r["_measurement"] == "cnpg_collector_up")
    |> filter(fn: (r) => r["_field"] == "gauge")
    |> group(columns: ["cnpg.io/instanceName", "_field", "_measurement"], mode: "by")
    |> aggregateWindow(every: 1m, fn: mean)
    |> drop(columns: ["cnpg.io/instanceName"])
    |> group(columns: ["_time", "_field", "_measurement"])
    |> sum()
    |> group()
So I wonder whether this is the right approach and what is failing here. Do checks not support more complex queries?

Anaisdg · May 3, 2023, 3:52pm

Hello @dogan,
Did you create a separate threshold check through the UI and then query that?
I'd recommend creating one task that does it all.
Something like:

import "array"
import "slack"

// Run every minute so that each run covers a single 1m window.
option task = {name: "Alert on instances", every: 1m, offset: 10s}

// Send a Slack warning when fewer than `threshold` instances are up.
alert = (eventValue, threshold) =>
    (if eventValue < threshold then
        slack.message(
            url: "https://hooks.slack.com/services/####/####/####",
            text: "An alert event has occurred! The number of instances up = \"${string(v: eventValue)}\".",
            color: "warning",
        )
    else
        0)

// v.timeRangeStart and v.timeRangeStop are UI variables and are not defined
// in tasks, so the range is derived from the task interval instead.
data = from(bucket: "telegraf")
    |> range(start: -task.every)
    |> filter(fn: (r) => r["_measurement"] == "cnpg_collector_up")
    |> filter(fn: (r) => r["_field"] == "gauge")
    |> group(columns: ["cnpg.io/instanceName", "_field", "_measurement"], mode: "by")
    |> aggregateWindow(every: 1m, fn: mean)
    |> drop(columns: ["cnpg.io/instanceName"])
    |> group(columns: ["_time", "_field", "_measurement"])
    |> sum()
    |> group()

// Placeholder row so the total is 0 when no data arrives.
// mean() returns floats, so 0.0 keeps the _value column type consistent.
data_0 = array.from(rows: [{_value: 0.0}])
events = union(tables: [data_0, data])
    |> group()
    |> sum()
    |> findRecord(fn: (key) => true, idx: 0)
eventTotal = events._value

// A task needs at least one yielded stream; this one is only a placeholder.
data_0
    |> yield(name: "ignore")
alert(eventValue: eventTotal, threshold: 3.0)

It’s taken from

dogan · May 3, 2023, 9:53pm

Hi Anaisdg,
Thanks for the informative answer.

I created a threshold check via the UI, but it did not let me write the query. I then created another one programmatically using ChecksAPI from @influxdata/influxdb-client-apis, and that one fails with the generic error message shown in the UI.

I am a bit confused by your recommendation here. Do you recommend implementing my own alerting function instead of using NotificationEndpointsAPI and NotificationRulesAPI? Is there an example of how to send an HTTP alert from a task run in the same format as NotificationEndpointsAPI?
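
For reference, one way to keep using notification endpoints and rules is to let a task act as a custom check: monitor.check() writes statuses to the _monitoring bucket, and notification rules can then act on them. The following is only a minimal sketch under assumptions, not a tested setup. The _check_id, the check name, the schedule, and the 3.0 threshold are placeholders, and it assumes the gauge field can be converted to float.

```flux
import "influxdata/influxdb/monitor"
import "influxdata/influxdb/schema"

option task = {name: "Instances up check", every: 1m, offset: 10s}

// Placeholder check metadata; _check_id is a made-up 16-character ID.
check = {
    _check_id: "0000000000000001",
    _check_name: "Instances up check",
    _type: "custom",
    tags: {},
}

// After fieldsAsCols(), the summed value lives in the "gauge" column.
ok = (r) => r["gauge"] >= 3.0
warn = (r) => r["gauge"] < 3.0
messageFn = (r) => "Instances up: ${string(v: r.gauge)} (level: ${r._level})"

from(bucket: "telegraf")
    |> range(start: -task.every)
    |> filter(fn: (r) => r["_measurement"] == "cnpg_collector_up")
    |> filter(fn: (r) => r["_field"] == "gauge")
    // Latest 0/1 state per instance, converted to float for a stable type.
    |> last()
    |> toFloat()
    // Collapse all instances into one row holding the number that are up.
    |> group(columns: ["_measurement", "_field"])
    |> sum()
    // sum() drops _time, which fieldsAsCols() and monitor.check() need.
    |> map(fn: (r) => ({r with _time: now()}))
    |> schema["fieldsAsCols"]()
    |> monitor["check"](data: check, messageFn: messageFn, ok: ok, warn: warn)
```

With this approach, a notification rule that matches the warn level would send the alert through an existing endpoint. If no rows arrive in the window, the sketch writes no status at all, so the case where every instance stops reporting would likely need a separate deadman check.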
