Need help with TICKScripts and Alerting

Hi, I need help with alerting in Kapacitor in a DevOps environment.

I have searched the internet, this community and GitHub, yet I couldn’t find a good source for TICKscripts. The examples are too simple, and an alert written with a simple threshold flaps too frequently. I considered using the flapping property, but then I saw this

  • Can you share your strategy on alerting?
  • Are you using push notifications? How are you handling the incoming alert rate?
  • Can you share your scripts if it’s ok? Essential scripts like CPU usage, CPU load, memory usage?
  • Are you using UDFs? I am trying to keep things simple to start with.
  • Have you used Grafana for alerting? How has that worked out for you?
  • If you have a solution other than Kapacitor, please share.

Best regards,

Hi,

  1. We use templates to generate our alerts. We had hundreds of individual alerts before, but that was unmanageable. See Templating in the Kapacitor documentation. Unfortunately we’ve put a lot of development into how we use them, so I can’t share much more than the documentation.

  2. Flapping works. There is also stateChangesOnly, which only generates an alert when the state changes. That works well unless a system is up and down like a yo-yo.
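For reference, here is a minimal sketch of how these two properties sit on an alert node. It assumes Telegraf’s default cpu measurement with its usage_idle field and a host tag; the thresholds and the log path are placeholders, not values from our setup.

```
// Sketch only: the measurement, field, thresholds and log path are assumptions.
stream
    |from()
        .measurement('cpu')
        .groupBy('host')
    |alert()
        .warn(lambda: "usage_idle" < 20)
        .crit(lambda: "usage_idle" < 10)
        // Only emit an event when the alert level changes.
        .stateChangesOnly()
        // Flap detection: enter the flapping state when more than 50% of
        // recent points changed state, leave it again below 25%.
        .flapping(0.25, 0.5)
        .log('/tmp/cpu_flapping.log')
```

The values 0.25 and 0.5 are the typical low and high settings mentioned in the Kapacitor documentation; they would need tuning against real data.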

Push notifications: we have been using OpsGenie to manage our alerts, but we’re moving to a bespoke Microsoft Flow and PowerApp to handle all our alerts due to limitations in the OpsGenie integration. That said, if you don’t have a lot of people receiving alerts, I think there is a free tier of OpsGenie that supports up to 5 people. It will help to deduplicate noisy alerts and route them to where you need them (be careful though, some of the options only appear in the paid tiers).

  1. Most of my templates have been converted for our use case, but for CPU and memory I’d recommend using the stateDuration node. Instead of triggering as soon as a threshold is crossed, that node counts how long the data has been in a given state, so you can alert after X minutes. Triggering immediately could result in a lot of warning > ok > warning > ok > warning.

You could assign warn/crit thresholds as the minutes to wait and set the usage threshold at 99% for example, or you could set warn/crit as the % thresholds and set the state duration to alert when either condition is true for the defined time. This snippet will alert when the threshold is exceeded for either the warn or crit time.

    |eval(lambda: "cpu_usage" / 10000)
        .as('cpu_used')
        .keep('cpu_usage', 'cpu_used')
    |stateDuration(lambda: "cpu_used" > threshold)
        .as('state_duration')

Here threshold is 99%.

Then the alert node:

|alert()
        .crit(lambda: "state_duration" > crit)
        .warn(lambda: "state_duration" > warn)
        .stateChangesOnly()
        .message(message)
        .id(idVar)
        .idTag(idTag)
        .levelTag(levelTag)
        .messageField(messageField)
        .durationField(durationField)
        .log('/tmp/' + name + '.log')

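Where warn/crit would be something like 10 minutes and 15 minutes. Note that stateDuration reports its value in seconds by default, so either set .unit(1m) on the node or express warn/crit in seconds (600 and 900).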
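Put together as one task, the two snippets above might look like the sketch below. It is not our production script: it assumes Telegraf’s default cpu measurement (usage_idle field, cpu and host tags) rather than the measurement used above, and the threshold, wait times, id, message and log path are all placeholders.

```
// Sketch only: measurement, field and tag names assume Telegraf defaults;
// threshold, wait times, id, message and log path are placeholders.
var threshold = 99.0

stream
    |from()
        .measurement('cpu')
        .where(lambda: "cpu" == 'cpu-total')
        .groupBy('host')
    // Telegraf reports idle time, so derive the used percentage.
    |eval(lambda: 100.0 - "usage_idle")
        .as('cpu_used')
    // Count how long cpu_used has stayed above the threshold, in minutes.
    // The field is set to -1 whenever the condition is false.
    |stateDuration(lambda: "cpu_used" > threshold)
        .unit(1m)
        .as('state_duration')
    |alert()
        .id('{{ index .Tags "host" }}/cpu')
        .message('{{ .ID }} is {{ .Level }}: cpu_used={{ index .Fields "cpu_used" }}')
        // Warn after 10 minutes above the threshold, go critical after 15.
        .warn(lambda: "state_duration" >= 10)
        .crit(lambda: "state_duration" >= 15)
        .stateChangesOnly()
        .log('/tmp/cpu_alert.log')
```

Because state_duration drops back to -1 as soon as the condition is false, a short spike never reaches the warn level, which is what stops the warning > ok > warning churn.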

I’ll see if I can find a couple of normal TICK scripts on my system to post.

  1. No, we don’t use UDFs. We just send data from Telegraf to InfluxDB; Kapacitor sits in the middle and processes the data as it arrives.

  2. We decided against Grafana as we were having trouble passing template variables through in the alert messages (things like host name, data centre, cluster and so on). Kapacitor + TICK scripts are enough for alerting IMO but we do use Grafana for our dashboards.

  3. No other solution from me. Kapacitor and TICK scripts do most of what we need.

Also, a good starting point for at least generating TICK scripts is Chronograf. It only gives you basic alerts, but if you build a rule in there and then go to edit it, you can copy the script out and work on it in an IDE. As I say, it’s basic, so things like stateDuration aren’t included. You need to add them to the script yourself.

@philb Thanks for the effort and detailed answer.

Why did you divide cpu_usage by 10000? I’d really like to see more examples if you can find any on your system.

I’m already using Chronograf. It works well for writing TICKscripts. I’ll try the stateDuration node.

Thanks again,

Hi @Mert, no worries. Glad to help where I can.

Sorry, that was a little misleading. The measurement I’m looking at in that script is actually a Nutanix measurement. To get the right value for CPU usage we have to divide by 10000. I think the CPU usage values for Windows and Linux metrics are normal values like 98.00, so you should be able to get away with:

    |eval(lambda: "cpu_usage")
        .as('cpu_usage')
        .keep('cpu_usage')
    |stateDuration(lambda: "cpu_usage" > threshold)
        .as('state_duration')

I’ll certainly have a look for some of the original scripts I had. We moved to templating pretty quickly and most of the scripts have been tweaked there. I’ve definitely got some kicking around somewhere though. I’ll get back to you after the weekend.

stateDuration is probably the way to go for metrics like CPU and memory usage. It would be quite tiresome getting an alert every time CPU hit 99% for a few seconds.
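The same pattern carries over to memory. A minimal sketch, assuming Telegraf’s default mem measurement with its used_percent field and a host tag; the 90% threshold, the wait times and the log path are placeholders:

```
// Sketch only: measurement and field names assume Telegraf defaults;
// the threshold, wait times and log path are placeholders.
stream
    |from()
        .measurement('mem')
        .groupBy('host')
    // Minutes that memory usage has stayed above 90%.
    |stateDuration(lambda: "used_percent" > 90.0)
        .unit(1m)
        .as('state_duration')
    |alert()
        .id('{{ index .Tags "host" }}/mem')
        .message('{{ .ID }} is {{ .Level }}: used_percent={{ index .Fields "used_percent" }}')
        .warn(lambda: "state_duration" >= 10)
        .crit(lambda: "state_duration" >= 15)
        .stateChangesOnly()
        .log('/tmp/mem_alert.log')
```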

I actually love Chronograf for developing new scripts and working with data. It’s simple and clean.

Anyway, happy scripting, and have a good weekend.

Hi @philb, have you had time to check your scripts?
Also, I wonder if you use the window() function?

Hey @Mert, sorry I’ve not been on much. I’ve got some annual leave coming up, so I’ve been trying to clear the backlog before I go. Will have a look this afternoon for some templates.

Regarding the |window node, we did use it, but I encountered an issue when trying to create new tag values and send them to our MS Flow API. They were being converted to fields, which are harder to parse in MS Flow. Dropping the |window() node resolved that.
