State not changing to OK for all hosts



I have a 3-host setup: a management server running the TICK stack and 2 Linux nodes reporting to it.

I am just getting started and trying to see how well this will work for me, and I’m having a bit of trouble. I am targeting the cpu_idle metric and manually changing its threshold between 10% and 100% for testing purposes.

When I set the threshold to kick off at less than 100% CPU idle, this triggers a critical alert in the dashboard and I get an e-mail for all 3 hosts.

When I set it back to less than 10%, I only get an OK alert for 1 host; the other 2 hosts remain in a critical status. This behavior is the same no matter what I do, so I’m looking for help.

I have tried a few things, including wiping out my Telegraf and Chronograf databases, with no joy.

Here’s the information on this particular check:

[root@xss tmp]# kapacitor show cpu_idle
ID: cpu_idle
Type: stream
Status: enabled
Executing: true
Created: 08 May 18 15:52 EDT
Modified: 08 May 18 17:06 EDT
LastEnabled: 08 May 18 17:06 EDT
Databases Retention Policies: ["telegraf"."autogen"]
var db = 'telegraf'
var rp = 'autogen'
var measurement = 'cpu'
var groupBy = ['host']
var whereFilter = lambda: ("cpu" == 'cpu-total')
var name = 'cpu_idle'
var idVar = name
var message = ''
var idTag = 'alertID'
var levelTag = 'level'
var messageField = 'message'
var durationField = 'duration'
var outputDB = 'chronograf'
var outputRP = 'autogen'
var outputMeasurement = 'alerts'
var triggerType = 'threshold'

var crit = 10
var warn = 20
var info = 30

var data = stream
    |from()
        .database(db)
        .retentionPolicy(rp)
        .measurement(measurement)
        .groupBy(groupBy)
        .where(whereFilter)
    |eval(lambda: "usage_idle")
        .as('value')

var trigger = data
    |alert()
        .crit(lambda: "value" < crit)
        // .warn(lambda: "value" < warn)
        // .info(lambda: "value" < info)
        // .stateChangesOnly()
        .message(message)
        .id(idVar)
        .idTag(idTag)
        .levelTag(levelTag)
        .messageField(messageField)
        .durationField(durationField)

trigger
    |eval(lambda: float("value"))
        .as('value')
        .keep()
    |influxDBOut()
        .create()
        .database(outputDB)
        .retentionPolicy(outputRP)
        .measurement(outputMeasurement)
        .tag('alertName', name)
        .tag('triggerType', triggerType)

trigger
    |httpOut('output')


digraph cpu_idle {
graph [throughput="127.98 points/s"];

stream0 [avg_exec_time_ns="0s" errors="0" working_cardinality="0" ];
stream0 -> from1 [processed="15232"];

from1 [avg_exec_time_ns="3.957µs" errors="357" working_cardinality="0" ];
from1 -> eval2 [processed="357"];

eval2 [avg_exec_time_ns="6.827µs" errors="0" working_cardinality="3" ];
eval2 -> alert3 [processed="357"];

alert3 [alerts_triggered="1" avg_exec_time_ns="21.168µs" crits_triggered="0" errors="0" infos_triggered="0" oks_triggered="1" warns_triggered="0" working_cardinality="3" ];
alert3 -> http_out6 [processed="1"];
alert3 -> eval4 [processed="1"];

http_out6 [avg_exec_time_ns="0s" errors="0" working_cardinality="1" ];

eval4 [avg_exec_time_ns="0s" errors="0" working_cardinality="1" ];
eval4 -> influxdb_out5 [processed="1"];

influxdb_out5 [avg_exec_time_ns="0s" errors="0" points_written="1" working_cardinality="0" write_errors="0" ];
}


I re-installed the solution on brand-new hosts with a brand-new configuration, and I’m getting the same behavior. I don’t see any similar bug reports, so I must be doing something wrong. Any ideas?


OK, I figured this out. I am now including the host tag in the alert ID, and that fixes it.

For anyone else who encounters a similar problem and googles it, here’s the fix:

var idVar = name + '{{ index .Tags "host"}}'
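For context on why this works (my understanding, not an authoritative statement): alert status is keyed by the alert ID, and with `var idVar = name` every host’s series collapses into the single ID `cpu_idle`. The first host to recover emits the one OK event for that shared ID, and the remaining hosts never get an OK transition of their own. Embedding the host tag in the ID gives each host an independent alert event. A minimal sketch of how the per-host ID feeds into the alert node, reusing the `data` pipeline and vars from the script above:

```
// With the host tag in the ID, the resulting IDs look like
// "cpu_idle host1", "cpu_idle host2", ... -- one alert event per
// host, so each host transitions CRITICAL -> OK independently.
var idVar = name + ' {{ index .Tags "host" }}'

var trigger = data
    |alert()
        .crit(lambda: "value" < crit)
        .id(idVar)
        .idTag(idTag)
        .levelTag(levelTag)
```

Since the stream is grouped by `host`, the `{{ index .Tags "host" }}` template resolves per group, which is what makes the per-host IDs possible.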