Bug in DeadMan checks

Hi there,

My deadman checks work when I set the “When values are not reporting for” period to the default of 90s.

But when I change it to a longer period, such as 5m, it fails to set the status.

Can you share your:

  1. InfluxDB version
  2. The monitoring rule code (paste it inside a </> code block)
  3. A sample/description of your data, how often records are inserted, etc.

Hi there,

I’m using InfluxDB Cloud, so I suppose it’s the latest version?

I suppose you’re asking for the Check code? (Sorry, I’m new to InfluxDB.) Here is what I get from clicking on Edit task for the Check:

```
package main

import "influxdata/influxdb/monitor"
import "experimental"
import "influxdata/influxdb/v1"

data = from(bucket: "host-statuses")
	|> range(start: -15s)
	|> filter(fn: (r) =>
		(r["_measurement"] == "statuses"))
	|> filter(fn: (r) =>
		(r["_field"] == "rtt"))

option task = {name: "host down", every: 1m, offset: 0s}

check = {
	_check_id: "079e31334c73c000",
	_check_name: "host down",
	_type: "deadman",
	tags: {},
}
crit = (r) =>
	(r["dead"])
messageFn = (r) =>
	("host:${r.host} 25m no see")

data
	|> v1["fieldsAsCols"]()
	|> monitor["deadman"](t: experimental["subDuration"](from: now(), d: 25m))
	|> monitor["check"](data: check, messageFn: messageFn, crit: crit)
```

My Check:

[Two screenshots of the check configuration]

I have ~100 hosts that are sending a network quality (e.g. ping RTT to google.com) status to influx every 30 seconds. Each data point is named or tagged with the host’s id (host=1), so I have r.host in the message.

I want to be notified when any of those hosts hasn’t sent its stats for the last 25 minutes.

Try widening the time range the check queries. If it only looks back 15 seconds but your hosts report every 30 seconds, it’s not going to give a meaningful result. Try at least 2x the interval at which hosts write their data. This should catch hosts that have been “dead” for 2 expected samples.

The other thing I noticed is that you have “and stop after 15 seconds.” That might also contribute to issues when you set a longer deadman period such as 25 minutes.
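To make the point about the window concrete, here is a minimal sketch of the same task with the lookback widened. The bucket, measurement, field, and check record are copied from the code posted earlier in the thread; the `-30m` lookback is only an assumed value, chosen to be longer than the 25m deadman threshold. The reasoning is that `monitor.deadman()` compares the newest point in each table against the threshold time, so a host whose last point falls outside the `range()` window produces no table at all and can never be marked dead.

```flux
import "influxdata/influxdb/monitor"
import "experimental"
import "influxdata/influxdb/v1"

option task = {name: "host down", every: 1m, offset: 0s}

// ASSUMPTION: -30m is an illustrative lookback, not a recommended value.
// It only needs to be longer than the 25m deadman threshold below.
data = from(bucket: "host-statuses")
    |> range(start: -30m)
    |> filter(fn: (r) => r["_measurement"] == "statuses")
    |> filter(fn: (r) => r["_field"] == "rtt")

check = {
    _check_id: "079e31334c73c000",
    _check_name: "host down",
    _type: "deadman",
    tags: {},
}
crit = (r) => r["dead"]
messageFn = (r) => "host:${r.host} 25m no see"

data
    |> v1.fieldsAsCols()
    // dead is true when the newest point in a table is older than now() - 25m
    |> monitor.deadman(t: experimental.subDuration(from: now(), d: 25m))
    |> monitor.check(data: check, messageFn: messageFn, crit: crit)
```

With these assumed values, a host would be flagged while its last point is between 25 and 30 minutes old. Once it has been silent for more than 30 minutes it drops out of the query window and stops being reported, so the lookback may need to be longer depending on how long you want the alert to persist.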

Last thing, you may need to alter the rule. It looks like it will be satisfied as long as it sees any rtt row, and will then think everything is fine, rather than requiring at least one rtt row per host. But full disclosure: I haven’t played around with this exact scenario myself with data I have on hand.
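One way to check the per-host concern is to list the newest rtt point for each host and see whether every host shows up. This is a rough sketch for the Data Explorer, assuming `host` is a tag on the points (as the `r.host` in the message template suggests); the `-1h` lookback is an arbitrary value picked for illustration.

```flux
from(bucket: "host-statuses")
    // ASSUMPTION: 1h is an arbitrary lookback for inspection only
    |> range(start: -1h)
    |> filter(fn: (r) => r["_measurement"] == "statuses" and r["_field"] == "rtt")
    // one table per host, so last() returns the newest point of each host
    |> group(columns: ["host"])
    |> last()
    |> keep(columns: ["host", "_time"])
```

If each host appears with its own last-seen time, the data can be evaluated per host. A host that is missing from the result has no points inside the lookback at all.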

Hi,

Do you know how I would make sure at least one row per host is required? That makes sense to me. Worst case scenario, I might need 100 checks, one for each host, which is not workable for a growing number of hosts.

Also, the filter criteria are set by InfluxDB Cloud; I just copied them from the GUI. Do you know how to adjust the parameters to get a longer time window?

Thanks

Actually, I’ve just thought about this today, and this should not be the problem: when I set the “not reporting for” period to a short value such as 90s, it works for all the hosts. That is, if any of them doesn’t send an rtt, it will raise the status.