Kapacitor Deadman not giving the host information on STDIN/Alert

kapacitor
telegraf

#1

Hello,
I have an environment with pool of servers being monitored for various metrics. I have wrote a kapacitor script using deadman to check whether the telegraf is sending metrics to the Influx. I use a custom script in kapacitor deadman to check whether “telegraf is running/not” in the hosts using SNMP.
When I disable the metrics being sent from the servers, Deadman works like a charm. When I stop kapacitor, Deadman sends weird alerts like “0/5m task Metrics_Deadman_Res_metrics is Down”. It would really appreciable to have the hostname added. At times, I receive the same when the services and metrics are up too.
Can someone help on this

Thanks!!


#2

Can you share your TICKscript?


#3

Here is the script I use:

var data = stream
    |from()
        .measurement(measurement)
        // .where(where_filter)
        .groupBy(*)

// Handle issue #2 where data stops arriving altogether
data
    |deadman(0.0, 5m)
        .id('{{ index .Tags "host" }}')
        .message('(5m) {{ .ID }} / Telegraf in {{ index .Tags "host" }} with task {{ .TaskName }} is {{ if eq .Level "OK" }} UP {{ else }} DOWN {{ end }}: {{ index .Fields "emitted" }}/5m')
        .details(deadman_details)
        // .stateChangesOnly()
        .log(deadman_log_file)
        .exec(custom_script)
        .email()


#4

Is the host information missing from all messages, or just some of them?

If it's just some, then it's possible you are running into a nil group alert. Since the deadman's job is to trigger an alert when no data has arrived, it will create a nil alert if no data has arrived for any host. As a result, the alert has no information about which host is down, since it hasn't received data from any host: a nil alert means that all hosts are down. You could update the logic in your message template to check whether the host tag is empty and, if so, change the message to say that all hosts are down.
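For example, the message template could fall back to an "all hosts" message whenever the host tag is empty. A sketch based on the script above (the fallback wording is illustrative, not from Kapacitor itself):

data
    |deadman(0.0, 5m)
        // An empty "host" tag means the nil group: no host sent any data.
        .id('{{ if index .Tags "host" }}{{ index .Tags "host" }}{{ else }}all-hosts{{ end }}')
        .message('{{ if index .Tags "host" }}Telegraf on {{ index .Tags "host" }} is {{ if eq .Level "OK" }}UP{{ else }}DOWN{{ end }}{{ else }}No data received from ANY host in 5m{{ end }}')
        .email()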


#5

Hi, I have exactly the same issue, and it looks like a Kapacitor bug.
Here's an example message from Kapacitor:
CRITICAL - Node has not responded in 120s!

And here's the TICKscript:

var period = 120s
var every = 60s

var sys_data = stream
    |from()
      .database('telegraf')
      .measurement('system')
      .groupBy('host')
    |window()
      .period(period)
      .every(every)

sys_data|deadman(1.0, period)
    .message('{{ .Level }} - Node {{ index .Tags "host" }} {{ if ne .Level "OK" }}has not responded in 120s!{{ else }}is back online.{{ end }}')
    .stateChangesOnly()
    .slack()

That message was produced when I tested by stopping telegraf on one server. Can you please explain how to get the host field in such a case?

Please include an example of a correct deadman check in the documentation.
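In the meantime, one workaround is to handle the empty host tag inside the message template itself. A sketch adapted from the script above (the "ALL nodes" fallback wording is my own, not from Kapacitor):

sys_data|deadman(1.0, period)
    // Fall back to an "ALL nodes" message when the nil group fires and
    // no host tag is available.
    .message('{{ .Level }} - {{ if index .Tags "host" }}Node {{ index .Tags "host" }}{{ else }}ALL nodes{{ end }} {{ if ne .Level "OK" }}not responding in 120s!{{ else }}back online.{{ end }}')
    .stateChangesOnly()
    .slack()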


#6

Hello,
Thanks for the reply. I tried it with a bunch of hosts running and stopped telegraf on one of them. I got an alert stating “5m Deadman_alert/Telegraf in with task is DOWN : 0/5m”.

I then stopped telegraf on another node and received the same alert.

Kindly help
Thanks!!


#7

Both of your examples look correct. It is possible to get a single message without the host tag if the task starts and receives no data from any host. After data has been received for at least one host, all subsequent messages will include host tag information.

Try this test:

  1. Start telegraf on the hosts, writing to Kapacitor.
  2. Start the task and wait to make sure the alert is green and not triggering.
  3. Stop telegraf on the first host.

At this point you should get a message that includes the host tag.
If that is not the behavior you are seeing, can you provide the exact steps you used to reproduce the error?


#8

Hello Nathenial,

Sorry for the delayed reply.

I tried the steps below, as you suggested, on a test machine where telegraf, Kapacitor, and InfluxDB all run on the same host.

  1. Made sure telegraf was up and running
  2. Enabled the task
  3. Stopped telegraf

The alert I received:

subject: (5m) / in with task self_monitoring_metrics_NN is DOWN : 0/10m
"Trigger_source":deadman ; "Level":CRITICAL ; "Time":2017-09-11 15:15:00 +0000 UTC ; "Trigger_condition":For test: Service failure ; "Field_name":pid ; "Custom_tag":self-monitoring ; "Detail_tag":_pid ; "Severity":CRITICAL ; "Msg_id":005-000-015-247-149-066-000 ; "Assignment_group":XXXX ; "Host": ; "Tag_name": ; "Measurement":stats ; "Tags":map[] ; "Field":map[emitted:0] ; "TaskName":self_monitoring_metrics_NN ; "ID": ;