Node down alert with TICK stack

Hi,

I am new to the TICK stack. We have deployed it in our environment and are currently monitoring CPU and memory usage. I need help setting up node down alerts. Can someone guide me through a script to monitor node down/up alerts?

Regards
Kumaresan

Are you using Kapacitor?

A Kapacitor script can raise an alert when no data is coming from a node.

Hi Pawan,

Thanks for responding. I am new to the TICK stack; currently we are monitoring CPU and memory.

We have Kapacitor but are not sure how to implement node down alerting. Could you help me with it? Is there any script or JSON for monitoring node down?

Regards
Kumaresan

Does this help?

Kapacitor Deadman Node

It processes data points in your measurement. Once throughput drops below your defined threshold it will trigger an alert. When a data point is received again it will recover.

Example: this processes data on the “mem” measurement; when throughput drops below 100.0 points per minute, an alert is generated and written to a log file. Bear in mind this is from a template I use, so some values are defined in settings files and will need updating.

The log file is created using the name variable, so the file will be named whatever you name your alert.

var data = stream
    |from()
        .measurement('mem')
        .groupBy(*)
        .where(whereFilter)

var trigger = data
    |deadman(100.0, 1m)
        .message(message)
        .id(idVar)
        .idTag(idTag)
        .levelTag(levelTag)
        .messageField(messageField)
        .durationField(durationField)
        .log('/tmp/' + name + '.log')

trigger
    |influxDBOut()
        .create()
        .database(outputDB)
        .retentionPolicy(outputRP)
        .measurement(outputMeasurement)
        .tag('alertName', name)
        .tag('triggerType', triggerType)

trigger
    |httpOut('output')

I would personally suggest dropping in InfluxData’s “Chronograf” package; you can define alerts from there. They are very basic but give you a good starting point/template for your general alerts. Unfortunately it doesn’t use all the possible Kapacitor nodes.

Once you’ve got the hang of that, then you can start editing the scripts in an IDE and defining them in Kapacitor yourself.

Hi Phlip,

Thanks for the update. Is there a way to get server down alerts monitored through the TICK stack?

Could you provide some example scripts that will help me?

Regards
Kumare

Hey @Kumaresan78, sorry, I was out of the office for the last week.

The above script is a TICK script, so it would run on the TICK stack. Chronograf (the C in TICK) will visualise and graph your data. But, if you have it installed, you can go to

“tasks” > “manage tasks” > “Build new rule”

From there, you have a few options for the threshold type. If you select “deadman” you can pick the measurement you want to monitor for throughput. Saving the rule will let you go back to the manage tasks section; from there you can view the TICK script itself.

The deadman node detects throughput for a specific measurement. So if you were monitoring Linux machines you could target the mem measurement and alert when no data is being inserted into that measurement.

I’ve generated a basic deadman alert in Chronograf and attached the code. I’ve removed my database references.

This version will check the mem measurement for zero throughput.

var db = 'your_data_source'

var rp = 'your_RP'

var measurement = 'mem'

var groupBy = ['host']

var whereFilter = lambda: TRUE

var period = 10m

var name = 'DeadManMem'

var idVar = name + '-{{.Group}}'

var message = ''

var idTag = 'alertID'

var levelTag = 'level'

var messageField = 'message'

var durationField = 'duration'

var outputDB = 'chronograf'

var outputRP = 'autogen'

var outputMeasurement = 'alerts'

var triggerType = 'deadman'

var threshold = 0.0

var data = stream
    |from()
        .database(db)
        .retentionPolicy(rp)
        .measurement(measurement)
        .groupBy(groupBy)
        .where(whereFilter)

var trigger = data
    |deadman(threshold, period)
        .stateChangesOnly()
        .message(message)
        .id(idVar)
        .idTag(idTag)
        .levelTag(levelTag)
        .messageField(messageField)
        .durationField(durationField)
        .log('/tmp/deadmanlogs.log')

trigger
    |eval(lambda: "emitted")
        .as('value')
        .keep('value', messageField, durationField)
    |eval(lambda: float("value"))
        .as('value')
        .keep()
    |influxDBOut()
        .create()
        .database(outputDB)
        .retentionPolicy(outputRP)
        .measurement(outputMeasurement)
        .tag('alertName', name)
        .tag('triggerType', triggerType)

trigger
    |httpOut('output')

As mentioned, Chronograf is great for building the basics of a script or some templates but lacks the functionality of some of the Kapacitor nodes. Once you get the hang of the TICK scripts generated by Chronograf you can start to modify the code to use different Kapacitor nodes such as |stateDuration and |stateCount.
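To give a rough idea of what that looks like, here is a minimal sketch (not from my running config) using |stateDuration to alert when memory usage stays high for a sustained period; the measurement, field and thresholds below are just placeholders:

stream
    |from()
        .measurement('mem')
    // how long "used_percent" has continuously been above 90, expressed in minutes
    |stateDuration(lambda: "used_percent" > 90)
        .unit(1m)
    |alert()
        // fire once the state has held for 5 minutes or more
        .crit(lambda: "state_duration" >= 5)
        .log('/tmp/mem_state.log')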

Hope that helps a little.

Thanks Philb.

We are not using Chronograf; rather we use Grafana. We have Telegraf, Kapacitor and InfluxDB installed on all servers; from Grafana we add the data source, import via JSON and then create the dashboard. The above TICK script was built from Chronograf; is there a way to achieve something similar with Grafana?

Regards
Kumaresan

Hi @Kumaresan78

I’m afraid I don’t know how to configure alerts from within Grafana; we were unable to include some of the Grafana template tags in our alerts, so we use Grafana purely for our dashboards.

Our alerts are handled by Kapacitor; the TICK script itself is generated by Chronograf, but the code would be the same whether generated from Chronograf or written in an IDE.

A quick question: are you wanting to alert when the node running the TICK stack itself goes offline? If so, then the following won’t work, as Kapacitor won’t be running if that node is offline. In that case you might need to ask on the Grafana forums about setting an alert up for this, maybe based on throughput; I’m unsure.

Otherwise, if you monitor multiple nodes per TICK server (as in 3 or 4 nodes reporting into one InfluxDB server) and want to alert on one of those nodes going offline, the following should work.

Chronograf is just an easy way to get to grips with TICK script in general; if you don’t want to install Chronograf in your stack, then I’d recommend having a look through some of the sample TICK scripts.

From there, you can create a base template to work from.

Regarding the deadman script above, if you update the db and rp variables to use your database and retention policy and upload the script to one of your servers, you can enable it using the Kapacitor CLI.

Upload the script; you can leave it wherever you choose. I move mine into /etc/kapacitor/tick, but running it from the home directory would suffice for now.

Once the script is on the server, you can use the following to define and enable it (assuming you are using Linux):

sudo kapacitor define deadman_alert -type stream -tick /path/to/script -dbrp database.retention_policy

then sudo kapacitor enable deadman_alert
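If it helps, once it is enabled you can sanity-check the task with the standard Kapacitor CLI:

sudo kapacitor list tasks

sudo kapacitor show deadman_alert

The first shows whether the task is enabled and executing; the second prints the task’s TICK script, DBRP and execution stats.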

When a node goes offline, the alert should trigger. But, as I say, if you want to monitor the TICK server itself then Kapacitor won’t be able to help you, unless you monitor your TICK nodes with a separate instance dedicated to “watching the watcher”.

We use a heartbeat API through OpsGenie to alert us if our TICK server is offline; that might be something worth looking into.

It may be, though, that you are wanting to alert when your TICK node itself is offline. In that instance, you might need to ask on the Grafana forums about configuring this.

Let me know if there’s anything else.

Hi Philb,

Thanks for the update. Is there a way we can implement ping requests, so that if any node does not respond to a ping it triggers an alert?

Regards
Kumar

Hi @Kumaresan78

What I would do is create a TICK instance independent from the nodes you want to monitor. From there, using Telegraf on this new instance, you can configure ping sensors.

Ping Plugin

Then, using that instance, you could create a TICK script which would basically alert when packet loss was over a set threshold, say 75%. If the alert triggers it’s an indicator that the node is in trouble, although it isn’t a perfect solution. If you set it to 100% then you can be fairly certain the server is offline, even if it is still switched on.

We use ping sensors (among other alerts) to monitor ESX hosts.
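In case you haven’t used the plugin before, a minimal Telegraf config for it looks something like this; the hostnames below are placeholders, and the full option list is in the plugin documentation linked above.

[[inputs.ping]]
  ## Hosts to ping - replace these placeholders with your own nodes
  urls = ["node1.example.com", "node2.example.com"]
  ## Number of pings to send per collection interval
  count = 3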

This script should help. You’ll need to add your DB and RP:

var db = ''

var rp = ''

var measurement = 'ping'

var groupBy = ['host', 'name', 'url']

var whereFilter = lambda: TRUE

var period = 60s

var every = 60s

var name = 'packetLoss'

var idVar = name + ':{{.Group}}'

var message = ' {{.Level}} Over 50% Packet Loss {{ index .Tags "name" }} '

var idTag = 'alertID'

var levelTag = 'level'

var messageField = 'message'

var durationField = 'duration'

var outputDB = 'chronograf'

var outputRP = 'autogen'

var outputMeasurement = 'alerts'

var triggerType = 'threshold'

var data = stream
    |from()
        .database(db)
        .retentionPolicy(rp)
        .measurement(measurement)
        .groupBy(groupBy)
        .where(whereFilter)
    |window()
        .period(period)
        .every(every)
        .align()
    |mean('percent_packet_loss')
        .as('percent_packet_loss')
   
var trigger = data
    |alert()
        .crit(lambda: "percent_packet_loss" > 50)
        .stateChangesOnly()
        .message(message)
        .id(idVar)
        .idTag(idTag)
        .levelTag(levelTag)
        .messageField(messageField)
        .durationField(durationField)
        .log('/tmp/packetloss.log')

trigger
    |eval(lambda: "percent_packet_loss")
        .as('percent_packet_loss')
        .keep()
    |influxDBOut()
        .create()
        .database(outputDB)
        .retentionPolicy(outputRP)
        .measurement(outputMeasurement)
        .tag('alertName', name)
        .tag('triggerType', triggerType)

trigger
    |httpOut('output')

So, in short:

  1. Install a new instance of InfluxDB, Kapacitor and Telegraf on a fresh server. If you’re only collecting pings it shouldn’t need too much resource.
  2. Configure the ping plugin on this new server with Telegraf, filling in the config as shown in the plugin example (above link).
  3. Confirm you are receiving ping results from your nodes.
  4. Add that script through Kapacitor: FTP it to the server, then define and enable it with the commands mentioned in my previous post.

Change one of the URLs or IPs in the config to one you know will not resolve; after about 60 seconds you should have an alert.
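You can also tail the log file the |alert node writes to in the script above, which gives you a quick way to watch the alert fire:

tail -f /tmp/packetloss.log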

You could also use an additional script to run the deadman node to monitor for throughput.

You could also set up a dashboard in Grafana with repeating panels based on the ping results, showing packet loss for the last X amount of time.

Hi Philb,

Thanks for the update. I achieved node down alerts with Grafana.

In short, I used the Telegraf ping plugin, where the list of nodes is specified. From Grafana I used the queries below; basically they look for packet loss > 80 and result_code > 0, and if both conditions match, a node down alert triggers.

Queries:

SELECT mean("result_code") AS "result" FROM "ping" WHERE ("url" = 'node1') AND $timeFilter GROUP BY time($__interval) fill(null)

SELECT "percent_packet_loss" FROM "ping" WHERE ("url" = 'node1') AND $timeFilter

Regards
Kumar
