I’m trying to figure out Kapacitor alerts, so I’ve set up the TICK stack in my lab. I’ve deployed Telegraf 1.4, Kapacitor 1.3, Chronograf 1.3, and InfluxDB 1.2.2. The older version of InfluxDB is because it’s what I’m working with in production at the moment.
Everything appears to be communicating. The problem is that I’m trying to create alert rules that act on all hosts, but I’m seeing odd behaviour with the following configuration.
From a database that contains telegraf data I create a rule that watches:
cpu > cpu-total > usage_user
The Alert GUI reads: Send Alert where usage_user is equal to or greater than 90
I then run stress -c 2 on the server named jump.crow.lab to spike the CPU.
If I do not select a host under the cpu > host tags, jump.crow.lab alerts CRITICAL at 99%. The next server that gets checked then recovers that alert with an OK, even though it is not jump.crow.lab, and jump.crow.lab goes CRITICAL again right after.
If I select the single host, jump.crow.lab, I get the expected CRITICAL and OK alerts as the CPU on that one host is tested.
If I select multiple hosts, I see the same problem as with no hosts selected, except that only the selected hosts cause the errant ‘OK’ recovery.
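To illustrate what I think I am seeing, here is a minimal Python sketch. It is not Kapacitor code, and I have not confirmed this is how Kapacitor behaves. It assumes the rule only emits an alert when the level changes, and compares one alert state shared by every host against a separate state per host. The `evaluate` function and the sample points are hypothetical, made up for the illustration.

```python
THRESHOLD = 90.0  # same threshold as the rule above


def evaluate(points, per_host):
    """Return (host, level) events, emitted only when the tracked level changes.

    Assumption: with per_host=False a single state is shared by all hosts;
    with per_host=True each host has its own state.
    """
    state = {}
    events = []
    for host, usage_user in points:
        key = host if per_host else "shared"
        level = "CRITICAL" if usage_user >= THRESHOLD else "OK"
        # Every state starts out as OK, so a first OK reading emits nothing.
        if state.get(key, "OK") != level:
            events.append((host, level))
        state[key] = level
    return events


# Hypothetical interleaved readings: jump is stressed, ipa is idle.
points = [
    ("jump.crow.lab", 99.5),
    ("ipa.crow.lab", 0.7),
    ("jump.crow.lab", 99.7),
    ("ipa.crow.lab", 0.7),
]

shared_events = evaluate(points, per_host=False)
per_host_events = evaluate(points, per_host=True)
print(shared_events)
print(per_host_events)
```

With the shared state, every reading flips the level, so jump goes CRITICAL and ipa "recovers" it each time. With per-host state, only a single CRITICAL for jump is emitted. The first pattern looks like what I get from my rule.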
For example, the following query shows output from the chronograf.alerts measurement. It begins with the jump server's CRITICAL alert; then I stopped stress and it recovered. This was with only jump.crow.lab selected in the alert. I then went back, selected both jump and ipa, and saved the rule. You’ll see that jump alerted, then ipa cleared it with an OK, and then that repeated.
Is there a way to have a CPU alert for all hosts based on a single rule?
SELECT * FROM "alerts"
time                 alertID                  alertName            cpu        duration      host           level     message  triggerType  value
1510093620000000000  Host User CPU Usage:nil  Host User CPU Usage  cpu-total  0             jump.crow.lab  CRITICAL           threshold    99.70000000000255
1510093770000000000  Host User CPU Usage:nil  Host User CPU Usage  cpu-total  150000000000  jump.crow.lab  OK                 threshold    0.2002002001978718
1510094500000000000  Host User CPU Usage:nil  Host User CPU Usage  cpu-total  0             jump.crow.lab  CRITICAL           threshold    99.50000000116106
1510094500000000000  Host User CPU Usage:nil  Host User CPU Usage  cpu-total  0             ipa.crow.lab   OK                 threshold    0.7000000000016371
1510094510000000000  Host User CPU Usage:nil  Host User CPU Usage  cpu-total  0             jump.crow.lab  CRITICAL           threshold    99.69969969979302
1510094520000000000  Host User CPU Usage:nil  Host User CPU Usage  cpu-total  10000000000   ipa.crow.lab   OK                 threshold    0.7028112449796263
1510094520000000000  Host User CPU Usage:nil  Host User CPU Usage  cpu-total  0             jump.crow.lab  CRITICAL           threshold    99.699999999998
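To make the pattern easier to see, here is a small Python sketch that takes the last five rows of the output above (the ones recorded after I selected both jump and ipa) and collects the alert levels seen per host. The values are copied from the query output; the variable names are just mine for the illustration.

```python
from collections import defaultdict

# (time, host, level, value) copied from the rows recorded after the rule
# was changed to select both jump.crow.lab and ipa.crow.lab.
rows = [
    (1510094500000000000, "jump.crow.lab", "CRITICAL", 99.50000000116106),
    (1510094500000000000, "ipa.crow.lab", "OK", 0.7000000000016371),
    (1510094510000000000, "jump.crow.lab", "CRITICAL", 99.69969969979302),
    (1510094520000000000, "ipa.crow.lab", "OK", 0.7028112449796263),
    (1510094520000000000, "jump.crow.lab", "CRITICAL", 99.699999999998),
]

# Collect the distinct alert levels recorded for each host.
levels_by_host = defaultdict(set)
for _time, host, level, _value in rows:
    levels_by_host[host].add(level)

for host, levels in sorted(levels_by_host.items()):
    print(host, sorted(levels))
```

Each host only ever shows one level in this window: jump is always CRITICAL and ipa is always OK. Neither host changed level on its own, yet a new alert row was written every time the other host was checked.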