Hello, I have the following data coming into InfluxDB, and I want an alert notification when the number of instances that are up and running is less than 3.
_measurement | _field | _value | cnpg.io/instanceName |
---|---|---|---|
cnpg_collector_up | gauge | 0 or 1 (0: down, 1: up) | instance-1, …, instance-n |
For this, I thought about using a threshold check that sums the instances that are up over the last minute and sends a warning notification if the sum is less than 3. However, I am having a hard time writing the query for the threshold check. The UI is quite limiting (for example, it does not let me use sum or group operations), and when I create the check via the JavaScript API, it fails without giving much information: the UI just says "Last Run Status: Completed(failed)" without any details.
The only errors in the logs are:
ts=2023-04-28T14:33:15.017598Z lvl=info msg="Error exhausting result iterator" log_id=0hP2dMN0000 service=task-executor error="unknown column \"_source_measurement\"" name=wide-to19
ts=2023-04-28T14:33:15.022706Z lvl=debug msg="Execution failed" log_id=0hP2dMN0000 service=task-executor error="could not execute task run: unknown column \"_source_measurement\"" taskID=0b1e0c524ee96000
This is the query I wrote to get the number of instances that are up:
from(bucket: "telegraf")
    |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
    |> filter(fn: (r) => r["_measurement"] == "cnpg_collector_up")
    |> filter(fn: (r) => r["_field"] == "gauge")
    |> group(columns: ["cnpg.io/instanceName", "_field", "_measurement"], mode: "by")
    |> aggregateWindow(every: 1m, fn: mean)
    |> drop(columns: ["cnpg.io/instanceName"])
    |> group(columns: ["_time", "_field", "_measurement"])
    |> sum()
    |> group()
Is this the right approach, and what is failing here? Do checks not support more complex queries?
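For reference, the intent described above (count the instances that are up, warn below 3) could be sketched as a standalone task that calls `monitor.check()` directly. This is only a rough sketch under several assumptions, not a confirmed fix for the error: the check ID, check name, and task name are placeholders; a relative `range(start: -1m)` replaces the dashboard variables `v.timeRangeStart` and `v.timeRangeStop`; `last()` replaces the windowed mean; and `_measurement`, `_field`, and `_time` are deliberately kept on the rows passed to the check, on the assumption that the check pipeline expects those columns.

```flux
import "influxdata/influxdb/monitor"
import "influxdata/influxdb/schema"

// Hypothetical task name; runs once a minute.
option task = {name: "cnpg-instances-up", every: 1m}

// Placeholder check metadata (the ID and name are made up for illustration).
check = {
    _check_id: "0000000000000001",
    _check_name: "cnpg instances up",
    _type: "custom",
    tags: {},
}

// Warn when fewer than 3 instances report up.
// float() is used because the stored type of `gauge` is not known here.
warn = (r) => float(v: r.gauge) < 3.0
messageFn = (r) => "${r._check_name}: ${string(v: r.gauge)} instance(s) up"

from(bucket: "telegraf")
    |> range(start: -1m)
    |> filter(fn: (r) => r._measurement == "cnpg_collector_up" and r._field == "gauge")
    // Most recent value per instance (each instance is its own series).
    |> last()
    // Merge all instances into one table, keeping _measurement and _field.
    |> group(columns: ["_measurement", "_field"])
    |> sum()
    // sum() drops _time because it is not in the group key; add it back.
    |> map(fn: (r) => ({r with _time: now()}))
    // Pivot so the field value is available as r.gauge.
    |> schema.fieldsAsCols()
    |> monitor.check(data: check, messageFn: messageFn, warn: warn)
```

Compared with the query above, the main difference is that there is no final `group()` and no windowing, so each run produces a single row that still carries `_measurement` and `_time`. An instance that stops reporting entirely is simply absent from the sum, so it counts as not up.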