We use templates to generate our alerts. We used to have hundreds of individual alerts, but that became unmanageable. Unfortunately, we've put a lot of in-house development into how we use templating, so I can't share much more than what is in the documentation.
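For anyone who hasn't seen Kapacitor templates, here is a minimal sketch of the general idea. It is not our production setup, and the variable names (measurement, field, warn, crit, window) are made up for illustration.

```
// Variables declared without a value must be supplied by each task.
var measurement string
var field string
var warn lambda
var crit lambda
// Variables declared with a value act as defaults a task may override.
var window = 5m

stream
    |from()
        .measurement(measurement)
        .groupBy(*)
    |window()
        .period(window)
        .every(window)
    |mean(field)
        .as('mean')
    |alert()
        .warn(warn)
        .crit(crit)
        .stateChangesOnly()
```

The values are passed in as a JSON vars file when a task is defined from the template, so one script like this can back many tasks, each with its own thresholds (for example a warn expression such as lambda: "mean" > 80).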
Flapping works. There is stateChangesOnly, which only generates an alert when the state changes; that works well unless a system is going up and down like a yo-yo.
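As a rough sketch of how the two alert properties can be combined: the measurement, field, threshold and log path below are assumptions for illustration (they match Telegraf's default cpu input), and the flapping thresholds are arbitrary examples, not recommended values.

```
stream
    |from()
        .measurement('cpu')
        .groupBy('host')
    |alert()
        .crit(lambda: "usage_idle" < 5)
        // Only emit an event when the alert level changes (e.g. OK -> CRITICAL).
        .stateChangesOnly()
        // Nagios-style flap detection: treat the alert as flapping once the
        // share of recent state changes rises above 50%, until it falls
        // back below 25%.
        .flapping(0.25, 0.5)
        .log('/tmp/cpu_flapping_example.log')
```

stateChangesOnly on its own still fires on every ok > critical > ok transition, so flapping is the piece that helps with the yo-yo case.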
Push notifications: we have been using OpsGenie to manage our alerts, but we're moving to a bespoke Microsoft Flow and PowerApp to handle all our alerts because of limitations in the OpsGenie integration. That said, if you don't have many people receiving alerts, I think there is a free tier of OpsGenie that supports up to 5 people. It will help to deduplicate noisy alerts and route them to where you need them (be careful though, some of the options only appear in the paid tiers).
Most of my templates have been converted for our use case, but for CPU and memory I'd recommend using the stateDuration node. Instead of triggering as soon as a threshold is crossed, that node counts how long the data has been in a given state, so you can alert only after X minutes. Triggering immediately could result in a lot of warning > ok > warning > ok > warning.
You could use warn/crit as the minutes to wait and fix the usage threshold at 99%, for example, or you could set warn/crit as the % thresholds and use the state duration to alert when either condition has been true for the defined time. This snippet alerts when the threshold has been exceeded for either the warn or the crit time.
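|eval(lambda: "cpu_usage" / 10000)
    .as('cpu_used')
|stateDuration(lambda: "cpu_used" > threshold)
where threshold is 99%. Then the alert node:
|alert()
    .crit(lambda: "state_duration" > crit)
    .warn(lambda: "state_duration" > warn)
    .log('/tmp/' + name + '.log')
where warn and crit would be something like 10 minutes and 15 minutes. Note that state_duration is reported in seconds by default, so either set .unit(1m) on the stateDuration node or express warn/crit in seconds.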
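Put together as a standalone script, the idea looks roughly like this. It is a sketch under assumptions rather than one of our real templates: it assumes Telegraf's default cpu measurement with its usage_idle field and cpu-total series, and the name, threshold, warn and crit values are placeholders.

```
// Illustrative values; in a template these would be template variables.
var name = 'cpu_state_duration'
var threshold = 99.0
var warn = 10
var crit = 15

stream
    |from()
        .measurement('cpu')
        .where(lambda: "cpu" == 'cpu-total')
        .groupBy('host')
    // Assumes Telegraf's usage_idle field (percent); derive percent used.
    |eval(lambda: 100.0 - "usage_idle")
        .as('cpu_used')
    // Track how long cpu_used has been above the threshold, in minutes.
    |stateDuration(lambda: "cpu_used" > threshold)
        .unit(1m)
        .as('state_duration')
    |alert()
        .warn(lambda: "state_duration" >= warn)
        .crit(lambda: "state_duration" >= crit)
        .stateChangesOnly()
        .log('/tmp/' + name + '.log')
```

While the condition is false, state_duration is -1, so neither expression matches and the alert returns to OK as soon as usage drops back under the threshold.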
I'll see if I can find a couple of normal TICK scripts on my system to post.
No, we don't use UDFs. We just send data from Telegraf > Influx; Kapacitor sits in the middle and processes the data as it arrives.
We decided against Grafana for alerting because we were having trouble passing template variables through in the alert messages (things like host name, data centre, cluster and so on). Kapacitor + TICK scripts are enough for alerting IMO, but we do use Grafana for our dashboards.
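For comparison, this is roughly how tag values can be pulled into a Kapacitor alert message. Treat it as a sketch: the tag names datacenter and cluster are hypothetical and need to exist on your data, and the threshold and log path are placeholders.

```
stream
    |from()
        .measurement('cpu')
        // Group by the tags you want available in the message.
        .groupBy('host', 'datacenter', 'cluster')
    |alert()
        .crit(lambda: "usage_idle" < 5)
        // One alert ID per task and host.
        .id('{{ .TaskName }}/{{ index .Tags "host" }}')
        .message('{{ .Level }}: CPU on {{ index .Tags "host" }} (dc {{ index .Tags "datacenter" }}, cluster {{ index .Tags "cluster" }})')
        .log('/tmp/cpu_message_example.log')
```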
No other solution from me; Kapacitor and TICK scripts do most of what we need.
Also, a good starting point for at least generating TICK scripts is Chronograf. It only gives you basic alerts, but if you build a rule in there and then go to edit it, you can copy the script out and work on it in an IDE. As I say, it's basic, so things like stateDuration aren't included; you need to add them to the script yourself.