Hi all,
I created the following TICKscript to get notified whenever a Windows host stops sending data. It works fine.
dbrp "telegraf"."autogen"

var data = stream
    |from()
        .database('telegraf')
        .retentionPolicy('autogen')
        .measurement('win_cpu')

data
    |deadman(0.0, 2m)
        .stateChangesOnly()
        .log('/tmp/deadman-alerts.log')
I’m facing two problems and any advice would help.
1. Is there a more generic measurement I can use for the “heartbeat”? The one above only works for Windows hosts because it depends on “win_cpu”. Can Telegraf send a generic ping or heartbeat that can be used on any operating system?
2. How can I get the hostname of the dead host?
The example above notifies me that some host is not sending data, but not which one. That’s understandable, because Kapacitor does not know which hosts are connected and which of them must be treated as “active”.
My idea was to create a rule for each host I’m expecting data from.
dbrp "telegraf"."autogen"

var data = stream
    |from()
        .database('telegraf')
        .retentionPolicy('autogen')
        .measurement('win_cpu')
        .where(lambda: "host" == 'webserver01.example.local')

data
    |deadman(0.0, 2m)
        .stateChangesOnly()
        .message('webserver01.example.local dead')
        .log('/tmp/deadman-alerts.log')
That works fine, but the drawback is that I have to create a rule for each host. Even if the rules can be generated by a script, it doesn’t seem like a smart solution.
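To illustrate what generating the per-host rules might look like, here is a minimal, untested sketch of the same rule written as a Kapacitor template, so that only the host name differs between tasks. The variable names `host` and `msg` are hypothetical, and I am assuming the database and retention policy are passed when each task is defined from the template rather than declared with a dbrp statement in the script.

```
// Template sketch: the host name is supplied per task instead of being hard-coded.
// "host" is a hypothetical template variable; its value comes from a vars file.
var host string

// Build the alert message once from the template variable.
var msg = host + ' dead'

var data = stream
    |from()
        .database('telegraf')
        .retentionPolicy('autogen')
        .measurement('win_cpu')
        // Unquoted host refers to the template variable, "host" to the tag.
        .where(lambda: "host" == host)

data
    |deadman(0.0, 2m)
        .stateChangesOnly()
        .message(msg)
        .log('/tmp/deadman-alerts.log')
```

Each task would then be defined from this template with its own vars file, for example one containing {"host": {"type": "string", "value": "webserver01.example.local"}}. This still means one task per host, so it only reduces the duplication and does not remove the per-host bookkeeping.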
Is there a better way to get notified which host is not sending data?
I have 500-700 hosts per InfluxDB instance. If I create a deadman TICKscript for each of them, will it impact the performance of Kapacitor?
Thanks for any hints.