TICK Script for alert Host down

gon · May 18, 2017, 1:24pm

Hi,

I'm new to Kapacitor and InfluxDB, but I created 3 alerts (CPU, memory, and disk) for different hosts, and they are working fine.
Now I want to create an alert that notifies me if a host/server is down or offline.
I read about deadman and created an alert, but I think it doesn't work:

|from()
 .database('telegraf')
 .retentionPolicy('autogen')
 .measurement('system')
|deadman(1.0, 10s)
 .message('Server {{ index .Tags "node" }} DOWN!')
 .hipChat()
 .stateChangesOnly()

I think that something is wrong, but I can't find an example for my case.

Any ideas?

michael · May 18, 2017, 5:45pm

Hey @gon, can you include a bit more information? In particular, the output of

kapacitor show <task>

gon · May 19, 2017, 7:32am

Hi @michael,

Thanks for answering.
Yes, this is the output of kapacitor show deadman_alert_stream:

root@52d4eb22efe9:~# kapacitor show deadman_alert_stream
ID: deadman_alert_stream
Error:
Template:
Type: stream
Status: enabled
Executing: true
Created: 18 May 17 11:56 UTC
Modified: 18 May 17 15:08 UTC
LastEnabled: 18 May 17 15:08 UTC
Databases Retention Policies: ["telegraf"."autogen"]
TICKscript:
var data = stream
    |from()
        .database('telegraf')
        .retentionPolicy('autogen')
        .measurement('system')
    |deadman(0.0, 10s)
        .id('{{ index .Tags "node" }}')
        .message('Server {{ .ID }} is OFFLINE')
        .hipChat()
        .stateChangesOnly()

DOT:
digraph deadman_alert_stream {
graph [throughput="0.00 points/s"];

stream0 [avg_exec_time_ns="0s" ];
stream0 -> from1 [processed="138890"];

from1 [avg_exec_time_ns="9.168µs" ];
from1 -> noop3 [processed="138890"];

noop3 [avg_exec_time_ns="0s" ];

stats2 [avg_exec_time_ns="26.445µs" ];
stats2 -> derivative4 [processed="5787"];

derivative4 [avg_exec_time_ns="6.432µs" ];
derivative4 -> alert5 [processed="5786"];

alert5 [alerts_triggered="0" avg_exec_time_ns="77.149µs" crits_triggered="0" infos_triggered="0" oks_triggered="0" warns_triggered="0" ];
}

Do you need more info?

michael · May 19, 2017, 2:17pm

Try changing the deadman to the following

...
  |deadman(1.0, 10s)
        .id('{{ index .Tags "node" }}')
        .message('Server {{ .ID }} is OFFLINE')
        .hipChat()
        .stateChangesOnly()

gon · May 22, 2017, 9:39am

Hi again @michael

I modified the deadman TICKscript as you showed me. Then I stopped a machine that sends data to InfluxDB via Telegraf, but the deadman alert doesn't trigger.

root@52d4eb22efe9:~# kapacitor show deadman_alert_stream
ID: deadman_alert_stream
Error:
Template:
Type: stream
Status: enabled
Executing: true
Created: 18 May 17 11:56 UTC
Modified: 22 May 17 08:50 UTC
LastEnabled: 22 May 17 08:50 UTC
Databases Retention Policies: ["telegraf"."autogen"]
TICKscript:
var data = stream
    |from()
        .database('telegraf')
        .retentionPolicy('autogen')
        .measurement('system')
    |deadman(1.0, 10s)
        .id('{{ index .Tags "node" }}')
        .message('Server {{ .ID }} is OFFLINE')
        .hipChat()
        .stateChangesOnly()

DOT:
digraph deadman_alert_stream {
graph [throughput="0.00 points/s"];

stream0 [avg_exec_time_ns="0s" ];
stream0 -> from1 [processed="6764"];

from1 [avg_exec_time_ns="1.213µs" ];
from1 -> noop3 [processed="6764"];

noop3 [avg_exec_time_ns="0s" ];

stats2 [avg_exec_time_ns="24.136µs" ];
stats2 -> derivative4 [processed="279"];

derivative4 [avg_exec_time_ns="6.478µs" ];
derivative4 -> alert5 [processed="278"];

alert5 [alerts_triggered="0" avg_exec_time_ns="75.662µs" crits_triggered="0" infos_triggered="0" oks_triggered="0" warns_triggered="0" ];
}

Any idea?
Thanks for your help

michael · May 22, 2017, 2:26pm

Ah, you don't have a .groupBy(*), so all hosts are counted as a single group and the deadman still thinks it's receiving data. Try the following:

var data = stream
    |from()
        .database('telegraf')
        .retentionPolicy('autogen')
        .measurement('system')
        .groupBy(*)
    |deadman(1.0, 10s)
        .id('{{ index .Tags "node" }}')
        .message('Server {{ .ID }} is OFFLINE')
        .hipChat()
        .stateChangesOnly()

gon · May 22, 2017, 2:49pm

It works! Thanks a lot!

fchiorascu · January 6, 2018, 5:31pm

Hi,

Can I have something similar for host UP/DOWN using Telegraf + InfluxDB + Grafana, without Kapacitor?
Could I do something with Grafana “Alert”?

Kind Regards,

veymarjoe · August 23, 2018, 2:24pm

Hello, I have the same problem. I added the groupBy(*) statement, but nothing seems to happen. This is my task:

ID: webserver_health
Error:
Template:
Type: stream
Status: enabled
Executing: true
Created: 16 Aug 18 12:25 UTC
Modified: 23 Aug 18 13:46 UTC
LastEnabled: 23 Aug 18 13:46 UTC
Databases Retention Policies: ["telegraf.dev"."dev"]
TICKscript:
dbrp "telegraf.dev"."dev"

stream
    |from()
        .measurement('httpjson_webserver_stats')
        .groupBy(*)
    |deadman(1.0, 10s)
        .message('Be-services is offline')
        .stateChangesOnly()
        .slack()
        .channel('#kapacitortests')

DOT:
digraph webserver_health {
graph [throughput="0.00 points/s"];

stream0 [avg_exec_time_ns="0s" errors="0" working_cardinality="0" ];
stream0 -> from1 [processed="166"];

from1 [avg_exec_time_ns="11.396µs" errors="0" working_cardinality="0" ];
from1 -> noop3 [processed="166"];

noop3 [avg_exec_time_ns="0s" errors="0" working_cardinality="0" ];

stats2 [avg_exec_time_ns="37.684µs" errors="0" working_cardinality="0" ];
stats2 -> derivative4 [processed="190"];

derivative4 [avg_exec_time_ns="6.752µs" errors="0" working_cardinality="2" ];
derivative4 -> alert5 [processed="188"];

alert5 [alerts_inhibited="0" alerts_triggered="0" avg_exec_time_ns="39.059µs" crits_triggered="0" errors="0" infos_triggered="0" oks_triggered="0" warns_triggered="0" working_cardinality="1" ];
}

Do you have any idea? Thanks.

virajrathod · April 16, 2019, 12:45pm

Hey @michael
I am using Kapacitor v1.5 and I want an alert to fire on a metric whenever the “Active” value changes.

This is the code I am using.
var info = 1
var warn = 2
var crit = 3
var period = 15s
var every = 15s

// Dataframe
var data = batch
    |query('SELECT ActiveBaselineNodes AS Active FROM "telegraf_ignite_sit"."autogen"."kernel_cluster_metrics"')
        .groupBy('host')
        .period(period)
        .every(every)

// Thresholds
var alert = data
    |alert()
        .id('{{ index .Tags "host"}}/baselinenodes')
        .message('{{ .ID }}:{{ index .Fields "Active" }}')
        // .info(lambda: "Active" > info)
        // .warn(lambda: "Active" > warn)
        // .crit(lambda: "Active" > crit)
        .stateChangesOnly()
        .slack()
        .log('/data/kapacitor_alerts/ignite/activebaseline.txt')

How should I use stateChangesOnly()? I don’t want to statically define the comparing parameters.

MarcV · April 16, 2019, 12:52pm

Hi @virajrathod, welcome,

I think you can use the ChangeDetectNode …

followed by the alertNode …
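
A minimal sketch of that ChangeDetectNode-plus-alert chain, assuming a stream task instead of the batch query above. The database, measurement, and field names are taken from the posted script. The `.crit(lambda: TRUE)` condition is an assumption: changeDetect only forwards points whose value differs from the previous one, so every point reaching the alert is treated as critical.

```
dbrp "telegraf_ignite_sit"."autogen"

stream
    |from()
        .measurement('kernel_cluster_metrics')
        .groupBy('host')
    // Forward a point only when ActiveBaselineNodes differs from the previous point.
    |changeDetect('ActiveBaselineNodes')
    |alert()
        .id('{{ index .Tags "host" }}/baselinenodes')
        .message('{{ .ID }}: {{ index .Fields "ActiveBaselineNodes" }}')
        // Assumption: any point that reaches this node counts as a change worth alerting on.
        .crit(lambda: TRUE)
        .slack()
```

Note that this fires on increases as well as decreases, since it reacts to any change in the value.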

virajrathod · April 16, 2019, 1:22pm

Hi @MarcV,
I need to raise the alert only when the number of nodes goes down, not on every state change.

MarcV · April 16, 2019, 1:29pm

Hi,

You want an alert when ActiveBaselineNodes is lower than the previous value?

virajrathod · April 16, 2019, 1:32pm

Yes
Correct.

I want Active to raise an alert when the ActiveBaselineNodes value is lower than the previously checked value.

MarcV · April 16, 2019, 1:34pm

Okay,

then maybe this one is for you: difference.
If the difference is negative, ActiveBaselineNodes is lower than the previous value.
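
As a rough sketch of the difference approach, assuming a stream task and the database, measurement, and field names from the script posted earlier. The field name `delta` is made up for illustration. Each point's value is compared with the previous point for the same host, and the alert goes critical only when the result is negative.

```
dbrp "telegraf_ignite_sit"."autogen"

stream
    |from()
        .measurement('kernel_cluster_metrics')
        .groupBy('host')
    // Current value minus previous value, per host; 'delta' is a hypothetical field name.
    |difference('ActiveBaselineNodes')
        .as('delta')
    |alert()
        .id('{{ index .Tags "host" }}/baselinenodes')
        .message('{{ .ID }}: ActiveBaselineNodes changed by {{ index .Fields "delta" }}')
        // Negative delta means the node count dropped since the previous point.
        .crit(lambda: "delta" < 0)
        .slack()
```

With this shape no static thresholds are needed. The next point with a non-negative delta should return the alert to OK, so a recovery event would follow each drop.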

ram0973 · July 23, 2019, 1:49pm

Good day. Can you please explain how to set up an alert like Nagios does:
check the host every 10 seconds and, if the host is down, send the message “Host is down”;
keep checking the host every 10 seconds, but repeat the “Host is down” message ONLY ONCE every 1-2 hours, or even just ONCE. When the host comes back up, send the message “Host is up”. Thanks in advance.
When I tried to do this, messages arrived every 10 seconds, even with .stateChangesOnly() set.
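
One possible sketch for this kind of behaviour, under stated assumptions. `.stateChangesOnly()` accepts an optional maximum interval, so `.stateChangesOnly(2h)` should send an event on each state change and repeat it only if 2 hours pass without one. The `system` measurement and Slack handler are borrowed from earlier posts in this thread, and grouping by the `host` tag assumes Telegraf's default tagging; adjust these to your setup.

```
stream
    |from()
        .database('telegraf')
        .retentionPolicy('autogen')
        .measurement('system')
        // Assumption: each server is identified by Telegraf's default 'host' tag.
        .groupBy('host')
    // Trigger when a host sends 0 points within a 10s interval.
    |deadman(0.0, 10s)
        .id('{{ index .Tags "host" }}')
        .message('Host {{ .ID }} is {{ if eq .Level "OK" }}up{{ else }}down{{ end }}')
        // Send on state change, and re-send at most once every 2h while the state is unchanged.
        .stateChangesOnly(2h)
        .slack()
```

Calling `.stateChangesOnly()` with no argument instead would drop the periodic reminder and leave one “down” and one “up” message per outage.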
