Kapacitor - One rule to alert for multiple hosts CPU usage?

crow_spotx · November 7, 2017, 10:59pm

I’m trying to figure out Kapacitor alerts, so I’ve set up a TICK stack in my lab. I’ve deployed Telegraf 1.4, Kapacitor 1.3, Chronograf 1.3, and InfluxDB 1.2.2. The older version of InfluxDB is because it’s what I’m working with in production at the moment.

Everything appears to be communicating. The problem is that I’m trying to create alert rules that act on all hosts, but I’m seeing odd behaviour with the following configuration.

From a database that contains telegraf data I create a rule that watches:
cpu > cpu-total > usage_user

The Alert GUI reads: Send Alert where usage_user is equal to or greater 90

I then run stress -c 2 on the server named jump.crow.lab to spike the CPU.

If I do not select a host under the cpu > host tags, jump.crow.lab alerts CRITICAL at 99%, then the next server that gets checked recovers that alert with an OK even though it is not jump.crow.lab, and jump.crow.lab goes CRITICAL again right after.

If I select the single host, jump.crow.lab, I get the CRITICAL and OK alerts as expected, because only the CPU on that single host is tested.

If I select multiple hosts, I get a problem similar to having no hosts selected, except that only the selected hosts cause the errant ‘OK’ recovery.

For example, the following query shows output from the chronograf.alerts measurement. It begins with the jump server CRITICAL alert, then I stopped stress and it recovered. This was with only jump.crow.lab selected in the alert. I then went back, selected both jump and ipa, and saved the rule. You’ll see that jump alerted, then ipa cleared it with an OK, and then that repeated.

Is there a way to have a CPU alert for all hosts based on a single rule?

SELECT * FROM "alerts"
name: alerts
time                 alertID                  alertName            cpu        duration      host           level     message  triggerType  value

1510093620000000000  Host User CPU Usage:nil  Host User CPU Usage  cpu-total  0             jump.crow.lab  CRITICAL           threshold    99.70000000000255
1510093770000000000  Host User CPU Usage:nil  Host User CPU Usage  cpu-total  150000000000  jump.crow.lab  OK                 threshold    0.2002002001978718
1510094500000000000  Host User CPU Usage:nil  Host User CPU Usage  cpu-total  0             jump.crow.lab  CRITICAL           threshold    99.50000000116106
1510094500000000000  Host User CPU Usage:nil  Host User CPU Usage  cpu-total  0             ipa.crow.lab   OK                 threshold    0.7000000000016371
1510094510000000000  Host User CPU Usage:nil  Host User CPU Usage  cpu-total  0             jump.crow.lab  CRITICAL           threshold    99.69969969979302
1510094520000000000  Host User CPU Usage:nil  Host User CPU Usage  cpu-total  10000000000   ipa.crow.lab   OK                 threshold    0.7028112449796263
1510094520000000000  Host User CPU Usage:nil  Host User CPU Usage  cpu-total  0             jump.crow.lab  CRITICAL           threshold    99.699999999998
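One way to read the time and duration columns, sketched in Python. This assumes both columns are in nanoseconds, which fits the data: the 150000000000 duration on the first OK row equals the gap since the preceding CRITICAL row. The row tuples are copied from the output above; the helper name is made up for illustration.

```python
NS_PER_SECOND = 1_000_000_000

# (time, host, level, duration) copied from the alerts output above.
rows = [
    (1510093620000000000, "jump.crow.lab", "CRITICAL", 0),
    (1510093770000000000, "jump.crow.lab", "OK", 150000000000),
    (1510094510000000000, "jump.crow.lab", "CRITICAL", 0),
    (1510094520000000000, "ipa.crow.lab", "OK", 10000000000),
]


def seconds_since_previous(rows):
    """Gap in whole seconds between each row and the row before it."""
    return [(b[0] - a[0]) // NS_PER_SECOND for a, b in zip(rows, rows[1:])]


gaps = seconds_since_previous(rows)
durations = [r[3] // NS_PER_SECOND for r in rows]

print("gaps between rows (s):", gaps)
print("duration column (s):  ", durations)
```

The last row is the interesting one: the OK for ipa.crow.lab carries a 10 second duration, the same as the gap since the CRITICAL for jump.crow.lab. That is consistent with both hosts sharing one alert state, although the output alone does not prove it.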

crow_spotx · November 9, 2017, 6:28pm

I found the Group By button on the tags in the Chronograf web UI. I think grouping by host gets me what I need. I just missed it in the UI when trying to follow along with the tutorial.
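A rough way to picture why grouping by host helps is the small Python model below. It is not Kapacitor code. It assumes the rule only emits an alert when the level changes, and that without a group-by every host feeds one shared alert state. The function name, the "all" key, and the rounded usage values (taken from the alerts output earlier in the thread) are illustrative only.

```python
from collections import defaultdict

THRESHOLD = 90.0  # the rule fires when usage_user >= 90


def evaluate(points, group_by_host):
    """Return the (host, level) events emitted on each state change.

    points: iterable of (host, usage_user) in arrival order.
    group_by_host: False models one alert state shared by all hosts,
    True models a separate alert state per host.
    """
    state = defaultdict(lambda: "OK")  # every state starts as OK
    events = []
    for host, usage in points:
        key = host if group_by_host else "all"
        level = "CRITICAL" if usage >= THRESHOLD else "OK"
        if level != state[key]:  # emit only on a state change
            events.append((host, level))
            state[key] = level
    return events


# Usage values rounded from the alerts output in the first post.
points = [
    ("jump.crow.lab", 99.5),
    ("ipa.crow.lab", 0.7),
    ("jump.crow.lab", 99.7),
    ("ipa.crow.lab", 0.7),
    ("jump.crow.lab", 99.7),
]

ungrouped = evaluate(points, group_by_host=False)
grouped = evaluate(points, group_by_host=True)

print("ungrouped:", ungrouped)
print("grouped:  ", grouped)
```

In this model the ungrouped run flips between CRITICAL and OK on every point, much like the jump/ipa sequence in the alerts measurement, while the grouped run emits a single CRITICAL for jump.crow.lab and nothing for ipa.crow.lab.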

-Chris
