The deadman documentation gives this example:
data
    |deadman(100.0, 10s)
The documentation states that this is equivalent to:
data
    |stats(10s)
        .align()
    |derivative('emitted')
        .unit(10s)
        .nonNegative()
    |alert()
        .id('node \'stream0\' in task \'{{ .TaskName }}\'')
        .message('{{ .ID }} is {{ if eq .Level "OK" }}alive{{ else }}dead{{ end }}: {{ index .Fields "emitted" | printf "%0.3f" }} points/10s.')
        .crit(lambda: "emitted" <= 100.0)
My application involves measurements that should be recorded every 90 minutes. If several consecutive measurements are missing, there is a problem and we need to take action. However, we want to avoid alerts when a single measurement is missed, because that is within the system's tolerance.
So, my question concerns the alert period and the sample rate: does this sampling occur at fixed, aligned intervals?
We decided that 5 hours after the last recorded data point is an appropriate time to wait before alerting. However, if the last data point arrives just after the beginning of a fixed 5-hour interval, then (if I understand correctly) we would effectively have to wait two intervals, nearly 10 hours, before the alert fires. That additional 5-hour wait would be a problem for us.
Ideally, I would like the deadman to take its derivative check over the previous 5 hours, but to perform that check more often.
One option I have been trying is to use different durations for stats(1m) and the derivative's .unit(5h).
However, in my preliminary testing this does not seem to work. I'm curious whether anyone has experience resolving this sort of issue.
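For reference, this is roughly the variant I have been testing, modeled on the expanded deadman example above. It is only a sketch: the zero-rate threshold in the .crit lambda and the simplified .id/.message strings are my placeholders, not taken from the documentation.

```
data
    // emit pipeline stats every minute instead of every 5 hours
    |stats(1m)
        .align()
    // rate of emitted points, scaled to a 5h unit
    |derivative('emitted')
        .unit(5h)
        .nonNegative()
    |alert()
        .id('deadman in task \'{{ .TaskName }}\'')
        .message('{{ .ID }} is {{ if eq .Level "OK" }}alive{{ else }}dead{{ end }}')
        // placeholder threshold: alert when no points were emitted
        .crit(lambda: "emitted" <= 0.0)
```

My hope was that this would evaluate a 5-hour rate every minute, giving a sliding check rather than a fixed 5-hour interval.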