Deadman alerts including the hostname of the dead host


#1

Hi all,

I created the following TICK script to get notified whenever a Windows host stops sending data. It’s working fine.

dbrp "telegraf"."autogen"

var data = stream
    |from()
        .database('telegraf')
        .retentionPolicy('autogen')
        .measurement('win_cpu')
data
    |deadman(0.0, 2m)
        .stateChangesOnly()
        .log('/tmp/deadman-alerts.log')

I’m facing two problems and any advice would help.

1. Is there a more generic measurement I can use for the “heartbeat”? The one above only works for Windows hosts, because it depends on “win_cpu”. Can Telegraf send a generic ping or heartbeat that can be used on any operating system?

2. How can I get the hostname of the dead host?
The example above will notify that some host is not sending data, but I don’t know which one it is. That’s understandable, because Kapacitor does not know which hosts are connected and which of them should be treated as “active”.

My idea was to create a rule for each host I’m expecting data from.

dbrp "telegraf"."autogen"

var data = stream
    |from()
        .database('telegraf')
        .retentionPolicy('autogen')
        .measurement('win_cpu')
        .where(lambda: "host" == 'webserver01.example.local')

data
    |deadman(0.0, 2m)
        .stateChangesOnly()
        .message('webserver01.example.local dead')
        .log('/tmp/deadman-alerts.log')

That works fine, but the drawback is that I have to create a rule for each host. Even if the rules can be generated by a script, it doesn’t seem like a smart solution.
Is there a better way to get notified which host is not sending data?
I have 500-700 hosts per InfluxDB. If I create a deadman TICK script for each of them, does that impact the performance of Kapacitor?

Thanks for any hints.


Generic Deadman Alerts On Sparse Events
#2

I was also interested in this kind of problem.

1. I didn’t find any existing Telegraf plugin that provides a heartbeat, and was thinking that I would need to implement one myself and contribute it.
Although instead of a heartbeat I would prefer an uptime value, which would also indicate whether Telegraf was restarted. Ideally it would report system uptime in the same measurement across OS types.
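
For what it’s worth, Telegraf’s stock `system` input plugin already reports an `uptime` field (in seconds) in the `system` measurement, and it works across OS types since it is built on gopsutil. Note that this is the host’s uptime, not the Telegraf process’s uptime, so it covers the “system uptime with the same measurement across OS types” part but not Telegraf restarts. A minimal config sketch, assuming a default Telegraf installation:

```toml
# Enable the cross-platform "system" input plugin. It emits the
# "system" measurement with fields such as uptime (seconds since the
# host booted), which can double as a generic heartbeat.
[[inputs.system]]
```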

2. For a per-host deadman you would need to group by the host tag. Each group alerts independently. Kapacitor starts a new group when data for a new host comes in and watches for that group going dead.
Note: in addition, I would recommend alerting on a specific host only when that host alone is dead, not when there is a problem with the ingest pipeline. For example, if all data stops coming in because of a network issue, you don’t want alerts for every single host. To prevent that, you would need an ungrouped stream watched by a deadman with a shorter duration, then joined with the stream grouped by host.
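
A sketch of the two pieces described above, using the style of the earlier scripts. Assumptions: the `system` measurement exists for every host and each point carries a `host` tag (the Telegraf default); thresholds and durations are illustrative only.

```
dbrp "telegraf"."autogen"

// Per-host deadman: grouping by the "host" tag makes each host's
// group alert independently, and the host name is available in the
// alert templates via the group's tags.
stream
    |from()
        .database('telegraf')
        .retentionPolicy('autogen')
        .measurement('system')
        .groupBy('host')
    |deadman(0.0, 2m)
        .id('deadman:{{ index .Tags "host" }}')
        .message('{{ index .Tags "host" }} is not sending data')
        .stateChangesOnly()
        .log('/tmp/deadman-alerts.log')

// Ungrouped pipeline watchdog with a shorter window: if *all* data
// stops arriving, this fires first, pointing at an ingest problem
// rather than hundreds of dead hosts. Suppressing the per-host alerts
// while this one is firing would require joining the two streams;
// that part is left out of this sketch.
stream
    |from()
        .database('telegraf')
        .retentionPolicy('autogen')
        .measurement('system')
    |deadman(0.0, 30s)
        .id('deadman:ingest-pipeline')
        .message('no data arriving from any host')
        .stateChangesOnly()
        .log('/tmp/deadman-alerts.log')
```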


#3

Thanks Igor for your reply.

The second point sounds logical. Can you provide an example of what the TICK script could look like?


#4

Here is an example script from another thread:

var data = stream
    |from()
        .database('telegraf')
        .retentionPolicy('autogen')
        .measurement('system')
        .groupBy(*)
    |deadman(1.0, 10s)
        .id('{{ index .Tags "node" }}')
        .message('Server {{ .ID }} is OFFLINE')
        .slack()
        .stateChangesOnly()

Group multiple messages into one
#5

Hi,

If Kapacitor crashes, then a different monitored node crashes, and Kapacitor comes back up in 20 seconds (your deadman period is 10s), does that mean you miss notifications for that node being down because the stream missed the group?