Hello,
I have an environment with a pool of servers being monitored for various metrics. I have written a Kapacitor TICK script using deadman to check whether Telegraf is sending metrics to InfluxDB. The deadman alert runs a custom script that checks, via SNMP, whether Telegraf is running on the hosts.
When I disable the metrics being sent from the servers, deadman works like a charm. But when I stop Kapacitor, deadman sends weird alerts like "0/5m task Metrics_Deadman_Res_metrics is Down". It would be really helpful to have the hostname included. At times I receive the same alert even when the services and metrics are up.
Can someone help with this?
// (measurement, deadman_details, deadman_log_file and custom_script are vars defined elsewhere in the task)
var data = stream
    |from()
        .measurement(measurement)
        // .where(where_filter)
        .groupBy(*)

// Handle issue #2 where data stops arriving altogether
data
    |deadman(0.0, 5m)
        .id('{{ index .Tags "host" }}')
        .message('(5m) {{ .ID }} / Telegraf in {{ index .Tags "host" }} with task {{ .TaskName }} is {{ if eq .Level "OK" }} UP {{ else }} DOWN {{ end }}: {{ index .Fields "emitted" }}/5m')
        .details(deadman_details)
        // .stateChangesOnly()
        .log(deadman_log_file)
        .exec(custom_script)
        .email()
Is the host information missing from all messages or just some of them?
If it's just some, then it's possible you are running into the nil group alert. Since the deadman's job is to trigger an alert when no data has arrived, it will create a nil alert if no data for any host has arrived. As a result the alert has no information about which host is down, since it hasn't received data from any host. A nil alert means that all hosts are down. You could update the logic in your message to check whether the host tag is empty and change the message to say something about all hosts being down, as in the sketch below.
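For example, something along these lines (a sketch only, not tested against your task; the database, measurement, threshold/period and log path are placeholders to adapt):

stream
    |from()
        .database('telegraf')
        .measurement('system')
        .groupBy('host')
    |deadman(0.0, 5m)
        .id('{{ index .Tags "host" }}')
        // An empty "host" tag means this is the nil group, i.e. no host has
        // reported any data at all during the period.
        .message('{{ if eq (index .Tags "host") "" }}No data received from any host in the last 5m{{ else }}Telegraf on {{ index .Tags "host" }} is {{ if eq .Level "OK" }}UP{{ else }}DOWN{{ end }}{{ end }} ({{ index .Fields "emitted" }} points/5m)')
        .stateChangesOnly()
        .log('/tmp/deadman.log')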
Hi, I have exactly the same issue, and it looks like a Kapacitor bug.
Here's an example message from Kapacitor: CRITICAL - Node has not responded in 120s!
And here's the TICK script:
var period = 120s
var every = 60s
var sys_data = stream
    |from()
        .database('telegraf')
        .measurement('system')
        .groupBy('host')
    |window()
        .period(period)
        .every(every)

sys_data
    |deadman(1.0, period)
        .message('{{ .Level }} - Node {{ index .Tags "host" }} {{ if ne .Level "OK" }}has not responded in 120s!{{ else }}is back online.{{ end }}')
        .stateChangesOnly()
        .slack()
That message was produced when I tested by stopping Telegraf on one server. Can you please explain how to get the host field in such a case?
Please include an example of a correct deadman check in the documentation.
Hello,
Thanks for the reply. I tried it with a bunch of hosts running and stopped Telegraf on one of them. I got an alert stating "5m Deadman_alert/Telegraf in with task is DOWN : 0/5m".
I tried stopping Telegraf on one other node and received the same alert.
Both of your examples look correct. It is possible to get a single message that does not have the host tag if the task starts and receives no data from any host. Once data has been received for at least one host, all subsequent messages will have the host tag information.
Try this test:
Start Telegraf on the hosts, writing to Kapacitor.
Start the task, and wait to make sure the alert is green and not triggering.
Stop Telegraf on the first host.
At this point you should get a message including the host tag.
If that is not the behavior you are seeing, can you provide the exact steps you used to reproduce the error?
I have this same issue. I have many servers sending CollectD information to InfluxDB. I have all of the servers grouped in a deadman script, and first of all, I am not receiving any alerts for the servers that have no data at all because they quit running or collecting some time ago. And I get nil servers in alerts.
Kapacitor / Chronograf must know the server host name, because it is the one alerting and it is specified in the script. Couldn't the tag be retrieved from the script that is generating the nil condition, instead of having to get the tag from the InfluxDB data?
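One possible workaround in that direction (a sketch only; the host value, database, measurement and handler below are hypothetical placeholders) is to define the deadman check per host, for example via a Kapacitor template task, so the host name is baked into the script itself instead of being read from the incoming tags:

// The host name comes from the task definition (e.g. a template task var),
// not from the data, so it is available even when no data arrives at all.
var host = 'server01'
var id = 'deadman_' + host
var msg = '{{ .Level }} - Telegraf on ' + host + ' has stopped reporting'

stream
    |from()
        .database('telegraf')
        .measurement('system')
        .where(lambda: "host" == host)
    |deadman(0.0, 5m)
        .id(id)
        .message(msg)
        .stateChangesOnly()
        .slack()

The trade-off is one task (or one template instance) per host instead of a single grouped task.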