Bug in DeadMan checks

Nam_Giang · June 4, 2021, 9:18pm

Hi there,

My deadman checks work when I set the “When values are not reporting for” period to the default of 90s.

But when I change it to a longer period, such as 5m, it fails to set the status.

FixTestRepeat · June 4, 2021, 10:07pm

Can you share your:

InfluxDB version
The monitoring rule code (paste it inside a </> code block)
A sample/description of your data, how often records are inserted, etc.

Nam_Giang · June 4, 2021, 11:18pm

Hi there,

I’m using InfluxDB Cloud, so I suppose it’s the latest version?

I suppose you’re asking for the check code? (Sorry, I’m new to InfluxDB.) Here is what I get when I click Edit task for the check:

package main


import "influxdata/influxdb/monitor"
import "experimental"
import "influxdata/influxdb/v1"

data = from(bucket: "host-statuses")
	|> range(start: -15s)
	|> filter(fn: (r) =>
		(r["_measurement"] == "statuses"))
	|> filter(fn: (r) =>
		(r["_field"] == "rtt"))

option task = {name: "host down", every: 1m, offset: 0s}

check = {
	_check_id: "079e31334c73c000",
	_check_name: "host down",
	_type: "deadman",
	tags: {},
}
crit = (r) =>
	(r["dead"])
messageFn = (r) =>
	("host:${r.host} 25m no see")

data
	|> v1["fieldsAsCols"]()
	|> monitor["deadman"](t: experimental["subDuration"](from: now(), d: 25m))
	|> monitor["check"](data: check, messageFn: messageFn, crit: crit)

My check: [screenshot of the check configuration]

Nam_Giang · June 4, 2021, 11:20pm

I have ~100 hosts that each send a network-quality status (e.g. ping RTT to google.com) to InfluxDB every 30 seconds. Each data point is named or tagged with the host’s ID (host=1), which is why r.host appears in the message.

I want to be notified when any of those hosts hasn’t sent its stats for the last 25 minutes.

FixTestRepeat · June 5, 2021, 11:01pm

Try expanding the query’s selection window (the range in your code). If it only covers 15 seconds but your hosts report every 30 seconds, it won’t give a meaningful result. Try 2x the interval at which hosts write their data. That should catch hosts that have been “dead” for two expected samples.

The other thing I noticed is that your check has “and stop after 15 seconds.” That might also contribute to issues when you set a longer deadman period such as 25 minutes.

Last thing: you may need to alter the rule. As written, it looks like it will be satisfied as long as it sees any RTT row at all, rather than requiring at least one RTT row per host. Full disclosure: I haven’t tried this exact scenario with the data I have on hand.
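A minimal sketch of the widened window, reusing the task posted above. The only change is the range start. The 30m value is an assumption, picked only because it is longer than the 25m deadman threshold. As far as I understand it, monitor.deadman() looks at the newest point in each group and sets dead to true when that point is older than t, so a host can only be flagged if at least one of its points still falls inside the queried range. With a 15s range, every row is at most 15 seconds old and can never be older than 25 minutes.

```flux
import "influxdata/influxdb/monitor"
import "experimental"
import "influxdata/influxdb/v1"

option task = {name: "host down", every: 1m, offset: 0s}

// Assumption: a 30m lookback, chosen only so that it exceeds the 25m deadman period.
data = from(bucket: "host-statuses")
    |> range(start: -30m)
    |> filter(fn: (r) => r["_measurement"] == "statuses")
    |> filter(fn: (r) => r["_field"] == "rtt")

check = {
    _check_id: "079e31334c73c000",
    _check_name: "host down",
    _type: "deadman",
    tags: {},
}
crit = (r) => r["dead"]
messageFn = (r) => "host:${r.host} 25m no see"

data
    |> v1["fieldsAsCols"]()
    // dead becomes true when a group's newest point is older than now() - 25m
    |> monitor["deadman"](t: experimental["subDuration"](from: now(), d: 25m))
    |> monitor["check"](data: check, messageFn: messageFn, crit: crit)
```

If the task is regenerated from the GUI, the range may be overwritten, so the equivalent GUI setting would need to be raised as well.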

Nam_Giang · June 6, 2021, 12:04am

Hi,

Do you know how I would make sure that at least one row per host is required? That makes sense to me. Worst case, I might need 100 checks, one for each host, which is not feasible with a growing number of hosts.

Also, the filter criteria are set by InfluxDB Cloud; I just copied the code from the GUI. Do you know how to adjust the parameters to cover a longer period?

Thanks

Nam_Giang · June 8, 2021, 4:36pm

Actually, I thought about this today, and this should not be the problem: when I set the “not reporting for” period to something short such as 90s, it works for all the hosts. That is, if any of them stops sending an RTT, the check raises the status.
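To see whether the query window is what hides the silent hosts, a query like the following could be run in the Data Explorer. It is only a sketch: the 30m lookback is an assumed value, and the bucket, measurement, field, and tag names are taken from the check posted above. It lists the newest rtt timestamp per host. A host that has been silent for longer than the lookback does not appear at all, which would also keep it out of a deadman check that uses the same range.

```flux
from(bucket: "host-statuses")
    // Assumption: 30m lookback; shrink it to 15s to compare with the generated check.
    |> range(start: -30m)
    |> filter(fn: (r) => r["_measurement"] == "statuses" and r["_field"] == "rtt")
    // One table per host, so each host is evaluated separately.
    |> group(columns: ["host"])
    // Keep the row with the newest timestamp in each host's table.
    |> max(column: "_time")
    |> keep(columns: ["host", "_time"])
    // Merge the per-host tables into one list for easier reading.
    |> group()
```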
