Kapacitor Deadman not giving the host information on STDIN/Alert

kapacitor
telegraf

#1

Hello,
I have an environment with pool of servers being monitored for various metrics. I have wrote a kapacitor script using deadman to check whether the telegraf is sending metrics to the Influx. I use a custom script in kapacitor deadman to check whether “telegraf is running/not” in the hosts using SNMP.
When I disable the metrics being sent from the servers, Deadman works like a charm. When I stop kapacitor, Deadman sends weird alerts like “0/5m task Metrics_Deadman_Res_metrics is Down”. It would really appreciable to have the hostname added. At times, I receive the same when the services and metrics are up too.
Can someone help on this

Thanks!!


#2

Can you share your TICKscript?


#3

Here is the script I use:

var data = stream
    |from()
        .measurement(measurement)
        // .where(where_filter)
        .groupBy(*)

// Handle issue #2 where data stops arriving altogether
data
    |deadman(0.0, 5m)
        .id('{{ index .Tags "host" }}')
        .message('(5m) {{ .ID }} / Telegraf in {{ index .Tags "host" }} with task {{ .TaskName }} is {{ if eq .Level "OK" }} UP {{ else }} DOWN {{ end }}: {{ index .Fields "emitted" }}/5m')
        .details(deadman_details)
        // .stateChangesOnly()
        .log(deadman_log_file)
        .exec(custom_script)
        .email()


#4

Is the host information missing from all messages, or just some of them?

If it's just some, then it's possible you are running into a nil group alert. Since the deadman's job is to trigger an alert when no data has arrived, it will create a nil alert if no data has arrived for any host. As a result, the alert has no information about which host is down, since it hasn't received data from any host: a nil alert means that all hosts are down. You could update the logic in your message template to check whether the host tag is empty and, if so, change the message to say that all hosts are down.
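For example, the message template could fall back to an "all hosts" message whenever the host tag is empty. A sketch based on the script above (the fallback wording is illustrative, not from Kapacitor itself):

data
    |deadman(0.0, 5m)
        // An empty "host" tag means the nil group: no host sent any data.
        .id('{{ if index .Tags "host" }}{{ index .Tags "host" }}{{ else }}all-hosts{{ end }}')
        .message('{{ if index .Tags "host" }}Telegraf on {{ index .Tags "host" }} is {{ if eq .Level "OK" }}UP{{ else }}DOWN{{ end }}{{ else }}No data received from ANY host in 5m{{ end }}')
        .email()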


#5

Hi, I have exactly the same issue, and it looks like a Kapacitor bug.
Here's an example message from Kapacitor:
CRITICAL - Node has not responded in 120s!

And here's the TICKscript:

var period = 120s
var every = 60s

var sys_data = stream
    |from()
      .database('telegraf')
      .measurement('system')
      .groupBy('host')
    |window()
      .period(period)
      .every(every)

sys_data|deadman(1.0, period)
    .message('{{ .Level }} - Node {{ index .Tags "host" }} {{ if ne .Level "OK" }}has not responded in 120s!{{ else }}is back online.{{ end }}')
    .stateChangesOnly()
    .slack()

That message was produced when I tested by stopping telegraf on one server. Can you please explain how to get the host field in such a case?

Please include an example of a correct deadman check in the documentation.
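In the meantime, one workaround is to handle the empty host tag inside the message template itself. A sketch adapted from the script above (the "ALL nodes" fallback wording is my own, not from Kapacitor):

sys_data|deadman(1.0, period)
    // Fall back to an "ALL nodes" message when the nil group fires and
    // no host tag is available.
    .message('{{ .Level }} - {{ if index .Tags "host" }}Node {{ index .Tags "host" }}{{ else }}ALL nodes{{ end }} {{ if ne .Level "OK" }}not responding in 120s!{{ else }}back online.{{ end }}')
    .stateChangesOnly()
    .slack()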


#6

Hello,
Thanks for the reply. I tried it with a bunch of hosts running and stopped telegraf on one of them. I got an alert stating “5m Deadman_alert/Telegraf in with task is DOWN : 0/5m”.

I then stopped telegraf on another node and received the same alert.

Kindly help
Thanks!!


#7

Both of your examples look correct. It is possible to get a single message without the host tag if the task starts and receives no data from any host. After data has been received for at least one host, all subsequent messages will include host tag information.

Try this test:

  1. Start telegraf on the hosts, writing to Kapacitor.
  2. Start the task and wait to make sure the alert is green and not triggering.
  3. Stop telegraf on the first host.

At this point you should get a message that includes the host tag.
If that is not the behavior you are seeing, can you provide the exact steps you used to reproduce the error?


#8

Hello Nathenial,

Sorry for the delayed reply.

I tried the steps below, as you suggested, on a test machine where telegraf, Kapacitor, and InfluxDB all run on the same host.

  1. Made sure telegraf was up and running
  2. Enabled the task
  3. Stopped telegraf

The alert I received:

subject: (5m) / in with task self_monitoring_metrics_NN is DOWN : 0/10m
"Trigger_source":deadman ; "Level":CRITICAL ; "Time":2017-09-11 15:15:00 +0000 UTC ; "Trigger_condition":For test: Service failure ; "Field_name":pid ; "Custom_tag":self-monitoring ; "Detail_tag":_pid ; "Severity":CRITICAL ; "Msg_id":005-000-015-247-149-066-000 ; "Assignment_group":XXXX ; "Host": ; "Tag_name": ; "Measurement":stats ; "Tags":map[] ; "Field":map[emitted:0] ; "TaskName":self_monitoring_metrics_NN ; "ID": ;