Deadman alerts including the hostname of the dead host


#1

Hi all,

I created the following TICK script to get notified whenever a Windows host stops sending data. It’s working fine.

dbrp "telegraf"."autogen"

var data = stream
    |from()
        .database('telegraf')
        .retentionPolicy('autogen')
        .measurement('win_cpu')
data
    |deadman(0.0, 2m)
        .stateChangesOnly()
        .log('/tmp/deadman-alerts.log')

I’m facing two problems and any advice would help.

1. Is there a more generic measurement I can use for the “heartbeat”? The one above only works for Windows hosts, because it depends on “win_cpu”. Can Telegraf send a generic ping or heartbeat that can be used on any operating system?

2. How can I get the hostname of the dead host?
The example above will notify that some host is not sending data, but I don’t know which one it is. That’s understandable, because Kapacitor does not know which hosts are connected and which of them should be treated as “active”.

My idea was to create a rule for each host I’m expecting data from.

dbrp "telegraf"."autogen"

var data = stream
    |from()
        .database('telegraf')
        .retentionPolicy('autogen')
        .measurement('win_cpu')
        .where(lambda: "host" == 'webserver01.example.local')

data
    |deadman(0.0, 2m)
        .stateChangesOnly()
        .message('webserver01.example.local dead')
        .log('/tmp/deadman-alerts.log')

That works fine, but the drawback is that I have to create a rule for each host. Even if the rules can be generated by a script, it doesn’t seem like a smart solution.
Is there a better way to get notified which host is not sending data?
I have 500-700 hosts per InfluxDB. If I create a deadman TICK script for each of them, does that impact the performance of Kapacitor?

Thanks for any hints.


Generic Deadman Alerts On Sparse Events
#2

I was also interested in this kind of problem.

1. I didn’t find any existing Telegraf plugin that provides a heartbeat, and was thinking that I would need to implement one myself and contribute it.
Although instead of a heartbeat I would prefer an uptime value, which would also indicate whether Telegraf was restarted. Ideally it would report system uptime in the same measurement across OS types.
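
For what it’s worth, Telegraf’s stock `system` input plugin already reports an `uptime` field (in seconds) in the `system` measurement, and it works across OS types since it is built on gopsutil. Note that this is the host’s uptime, not the Telegraf process’s uptime, so it covers the “system uptime with the same measurement across OS types” part but not Telegraf restarts. A minimal config sketch, assuming a default Telegraf installation:

```toml
# Enable the cross-platform "system" input plugin. It emits the
# "system" measurement with fields such as uptime (seconds since the
# host booted), which can double as a generic heartbeat.
[[inputs.system]]
```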

2. For a per-host deadman you would need to group by the host tag. Each group alerts independently. Kapacitor starts a new group when data for a new host comes in and watches for that group going dead.
Note: in addition, I would recommend alerting on a specific host only when that host alone is dead, not when there is a problem with the ingest pipeline. For example, if all data stops coming in because of a network issue, you don’t want alerts for every single host. To prevent that, you would need an ungrouped stream watched by a deadman with a shorter duration, then joined with the stream grouped by host.
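
A sketch of the two pieces described above, using the style of the earlier scripts. Assumptions: the `system` measurement exists for every host and each point carries a `host` tag (the Telegraf default); thresholds and durations are illustrative only.

```
dbrp "telegraf"."autogen"

// Per-host deadman: grouping by the "host" tag makes each host's
// group alert independently, and the host name is available in the
// alert templates via the group's tags.
stream
    |from()
        .database('telegraf')
        .retentionPolicy('autogen')
        .measurement('system')
        .groupBy('host')
    |deadman(0.0, 2m)
        .id('deadman:{{ index .Tags "host" }}')
        .message('{{ index .Tags "host" }} is not sending data')
        .stateChangesOnly()
        .log('/tmp/deadman-alerts.log')

// Ungrouped pipeline watchdog with a shorter window: if *all* data
// stops arriving, this fires first, pointing at an ingest problem
// rather than hundreds of dead hosts. Suppressing the per-host alerts
// while this one is firing would require joining the two streams;
// that part is left out of this sketch.
stream
    |from()
        .database('telegraf')
        .retentionPolicy('autogen')
        .measurement('system')
    |deadman(0.0, 30s)
        .id('deadman:ingest-pipeline')
        .message('no data arriving from any host')
        .stateChangesOnly()
        .log('/tmp/deadman-alerts.log')
```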


#3

Thanks Igor for your reply.

The second point sounds logical. Can you provide an example of what the TICK script could look like?


#4

Here is an example script from another thread:

var data = stream
    |from()
        .database('telegraf')
        .retentionPolicy('autogen')
        .measurement('system')
        .groupBy(*)
    |deadman(1.0, 10s)
        .id('{{ index .Tags "node" }}')
        .message('Server {{ .ID }} is OFFLINE')
        .slack()
        .stateChangesOnly()

Group multiple messages into one
#5

Hi,

If Kapacitor crashes, then a different monitored node crashes, and Kapacitor comes back up in 20 seconds (your deadman period is 10s), does that mean you miss notifications for that node being down because the stream missed the group?