State not changing to OK for all hosts

#1

Hello,

I have a 3-host system: a management server running the TICK stack and 2 Linux nodes reporting to it.

I am just getting started and testing how well this will work for me, but I’m having a bit of trouble. I am targeting the cpu_idle metric and manually changing its threshold between 10% and 100% for testing purposes.

When I set the threshold to kick off at less than 100% CPU idle, this triggers a critical alert in the dashboard and I get an e-mail for all 3 hosts.

When I set it back to less than 10%, I only get an OK alert for 1 host, and the other 2 hosts still report a critical status. This behavior is the same no matter what I do, so I’m looking for help.

I have tried a few things, including wiping out my telegraf and chronograf databases, with no joy.

Here’s the information on this particular check:

[root@xss tmp]# kapacitor show cpu_idle
ID: cpu_idle
Error:
Template:
Type: stream
Status: enabled
Executing: true
Created: 08 May 18 15:52 EDT
Modified: 08 May 18 17:06 EDT
LastEnabled: 08 May 18 17:06 EDT
Databases Retention Policies: ["telegraf"."autogen"]
TICKscript:
var db = 'telegraf'

var rp = 'autogen'

var measurement = 'cpu'

var groupBy = ['host']

var whereFilter = lambda: ("cpu" == 'cpu-total')

var name = 'cpu_idle'

var idVar = name

var message = ''

var idTag = 'alertID'

var levelTag = 'level'

var messageField = 'message'

var durationField = 'duration'

var outputDB = 'chronograf'

var outputRP = 'autogen'

var outputMeasurement = 'alerts'

var triggerType = 'threshold'

var crit = 10

var warn = 20

var info = 30

var data = stream
    |from()
        .database(db)
        .retentionPolicy(rp)
        .measurement(measurement)
        .groupBy(groupBy)
        .where(whereFilter)
    |eval(lambda: "usage_idle")
        .as('value')

var trigger = data
    |alert()
        .crit(lambda: "value" < crit)
        // .warn(lambda: "value" < warn)
        // .info(lambda: "value" < info)
        .message(message)
        .id(idVar)
        .idTag(idTag)
        .levelTag(levelTag)
        .messageField(messageField)
        .durationField(durationField)
        // .stateChangesOnly()
        .email()
        .to('testjddb@XXXXXXXXXXXXXXXX')
        .log('/tmp/cpu_alert_test.txt')

trigger
    |eval(lambda: float("value"))
        .as('value')
        .keep()
    |influxDBOut()
        .create()
        .database(outputDB)
        .retentionPolicy(outputRP)
        .measurement(outputMeasurement)
        .tag('alertName', name)
        .tag('triggerType', triggerType)

trigger
    |httpOut('output')

DOT:
digraph cpu_idle {
graph [throughput="127.98 points/s"];

stream0 [avg_exec_time_ns="0s" errors="0" working_cardinality="0" ];
stream0 -> from1 [processed="15232"];

from1 [avg_exec_time_ns="3.957µs" errors="357" working_cardinality="0" ];
from1 -> eval2 [processed="357"];

eval2 [avg_exec_time_ns="6.827µs" errors="0" working_cardinality="3" ];
eval2 -> alert3 [processed="357"];

alert3 [alerts_triggered="1" avg_exec_time_ns="21.168µs" crits_triggered="0" errors="0" infos_triggered="0" oks_triggered="1" warns_triggered="0" working_cardinality="3" ];
alert3 -> http_out6 [processed="1"];
alert3 -> eval4 [processed="1"];

http_out6 [avg_exec_time_ns="0s" errors="0" working_cardinality="1" ];

eval4 [avg_exec_time_ns="0s" errors="0" working_cardinality="1" ];
eval4 -> influxdb_out5 [processed="1"];

influxdb_out5 [avg_exec_time_ns="0s" errors="0" points_written="1" working_cardinality="0" write_errors="0" ];
}

#2

I re-installed the solution on brand new hosts with a brand new configuration, and I’m getting the same behavior. I don’t see any similar bugs reported, so I must be doing something wrong. Any ideas?

#3

OK, I figured this out. The alert ID was the same for every host (idVar was just the alert name). Including the host tag in the alert ID seems to fix it.

For anyone else who runs into a similar problem and finds this through a search, here is the change:

var idVar = name + '{{ index .Tags "host" }}'
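
A slightly fuller sketch of the same change, in case the one-liner is hard to place. It assumes the rest of the script above stays as it is. The ':' separator is my own addition for readability and is not required. The commented-out {{ .Group }} variant is an alternative that I believe does the same job here because the script groups only by host, but I have not tested it. My understanding, which I have not confirmed, is that Kapacitor tracks alert state per ID, so when all three host groups share the ID 'cpu_idle' their states overwrite each other and only one OK gets through.

```
var name = 'cpu_idle'

// Before: every host group produced the same alert ID, 'cpu_idle'.
// var idVar = name

// After: append the host tag so each group gets its own alert ID,
// for example 'cpu_idle:<hostname>'. The ':' is only a separator.
var idVar = name + ':{{ index .Tags "host" }}'

// Untested alternative (assumption): use the group instead of the tag.
// var idVar = name + ':{{ .Group }}'
```

After updating the task, running kapacitor show cpu_idle again should show the new idVar line in the TICKscript section.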