I created the following TICKscript to get notified whenever a Windows host stops sending data. It works fine.
dbrp "telegraf"."autogen"

var data = stream
    |from()
        .database('telegraf')
        .retentionPolicy('autogen')
        .measurement('win_cpu')

data
    |deadman(0.0, 2m)
        .stateChangesOnly()
        .log('/tmp/deadman-alerts.log')
I’m facing two problems and any advice would help.
1. Is there a more generic measurement I can use for the “heartbeat”? The one above only works for Windows hosts, because it depends on “win_cpu”. Can Telegraf send a generic ping or heartbeat that works on any operating system?
2. How can I get the hostname of the dead host?
The example above notifies me that some host is not sending data, but not which one. That’s reasonable, because Kapacitor does not know which hosts are connected and which of them must be treated as “active”.
My idea was to create a rule for each host I’m expecting data from.
dbrp "telegraf"."autogen"

var data = stream
    |from()
        .database('telegraf')
        .retentionPolicy('autogen')
        .measurement('win_cpu')
        .where(lambda: "host" == 'webserver01.example.local')

data
    |deadman(0.0, 2m)
        .stateChangesOnly()
        .message('webserver01.example.local dead')
        .log('/tmp/deadman-alerts.log')
That works fine. The drawback is that I have to create a rule for each host. Even if the rules can be generated by a script, it doesn’t seem like a smart solution.
Is there a better way to get notified which host is not sending data?
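One direction that might avoid per-host rules, shown here only as an untested sketch: group the stream by the host tag, so that the deadman check is evaluated per group, and put the tag value into the message. This assumes every host reports a "host" tag (Telegraf adds it by default) and that a host has sent data at least once, since Kapacitor can only track groups it has already seen.

```
dbrp "telegraf"."autogen"

stream
    |from()
        .database('telegraf')
        .retentionPolicy('autogen')
        .measurement('win_cpu')
        // Assumption: each point carries a "host" tag; one group per host
        .groupBy('host')
    |deadman(0.0, 2m)
        .stateChangesOnly()
        // Insert the group's host tag into the alert message
        .message('{{ index .Tags "host" }} dead')
        .log('/tmp/deadman-alerts.log')
```

If this behaves as expected, a single task would cover all hosts that report "win_cpu", and the log entry would name the host that went silent.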
I have 500-700 hosts per InfluxDB instance. If I create a deadman TICKscript for each of them, does it impact the performance of Kapacitor?
Thanks for any hints.