Kapacitor - if value exceeds threshold for specified duration

Hi guys,

As I’m diving into Kapacitor more and more, I’m trying to refine my checks and alerts. One thing I’d like to check is whether a value meets a criterion (e.g. exceeds a value, or falls within a range) for a given duration. This will help me exclude flapping and get a better idea of whether a metric has spiked for an extended period.

Currently I’m doing this by using timeShift a few times and comparing, but it’s messy and I’m sure there must be a better way. Is there a nice easy way to do this?

One example would be CPU usage > 90% for 1 minute or, even better, using one of the queries Chronograf generated for me (:heart:️): CPU usage 50% higher than the last 10-minute period, for at least 5 minutes.

Cheers!

To get to an exact answer, you need to share a bit more information, but the basics probably involve using a batch job like the example in the QueryNode documentation (InfluxData Documentation Archive).

You would set the period to 10m and every to the frequency you want the job to run.

If you are sending the cpu_usage stats every minute, then you could simply count the number of instances over 50% using the WHERE clause. If you are sending the stats more or less frequently, then you would do the counting of values in a later function. Afterward, I would assume you would use AlertNode to see if the counted value exceeds the threshold and then output some message.
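A minimal sketch of that batch approach, not a tested script. The database, retention policy, measurement and field names ("telegraf"."autogen"."cpu", "usage_user"), the `n_high` alias, the threshold of 5 points, and the log path are all assumptions for illustration:

```tickscript
// Sketch only: every name and threshold below is an assumption.
batch
    // Count the points above 50% in each 10-minute window, per host.
    |query('SELECT count("usage_user") AS "n_high" FROM "telegraf"."autogen"."cpu" WHERE "usage_user" > 50')
        .period(10m)
        .every(1m)
        .groupBy('host')
    |alert()
        // With one point per minute, 5 points is roughly 5 minutes above 50%.
        .crit(lambda: "n_high" >= 5)
        .message('{{ index .Tags "host" }}: {{ index .Fields "n_high" }} points above 50% in the last 10m')
        .log('/tmp/cpu_high.log')
```

One caveat I believe applies: when no points match the WHERE clause the query returns nothing, so the alert may not see any data to recover on. Counting in a later node instead of in the query is one way around that.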

Please share more details about your data and its frequency if you want some specific query help.

Thanks for your reply.

I’ve spent a bit of time on this today, trying to figure out the best way to process this. In the end, I settled for a ‘critical rate’ per time window (e.g. alert if 30% of points for CPU usage are above 90% utilisation). Not sure how well this will work out, but I’ll give it a go.

So far I have a script (shared via Hastebin) which pretty much does the job. The only issue I’m having is that the final measurement does not retain any of the tags (so, when I alert to Slack, I can’t pass the host’s name). Any suggestions on how I can streamline this a bit?

EDIT: As per influxdata/kapacitor issue #1078 on GitHub ("replay-live lacks tags data"), it looks like the initial recording/query must contain a GROUP BY "tag" statement to retain tags. :+1:

Regards,
James.
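For readers who can no longer open the Hastebin link, here is one way the ‘critical rate’ idea could be sketched. This is not the original script. The database, measurement and field names, the 15m window, and the `is_crit` and `crit_rate` field names are assumptions; the 90% and 30% thresholds come from the post above. The `.groupBy('host')` on the query is what keeps the host tag available to the alert:

```tickscript
// Sketch only: not the original script; names and window size are assumptions.
batch
    |query('SELECT "usage_user" FROM "telegraf"."autogen"."cpu"')
        .period(15m)
        .every(1m)
        // Grouping by the tag keeps it on the data all the way to the alert.
        .groupBy('host')
    // Mark each point as critical (1.0) or not (0.0).
    |eval(lambda: if("usage_user" > 90.0, 1.0, 0.0))
        .as('is_crit')
    // The mean of the 0/1 markers is the fraction of critical points in the window.
    |mean('is_crit')
        .as('crit_rate')
    |alert()
        .crit(lambda: "crit_rate" > 0.3)
        .message('{{ index .Tags "host" }}: critical rate {{ index .Fields "crit_rate" }} over the last 15m')
        .log('/tmp/cpu_crit_rate.log')
```

Averaging a 0/1 marker avoids joining two separate counts, and a window with no critical points still yields a rate of 0, so the alert has something to recover on.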

As an aside, is it possible to utilise the following from an AlertNode?

Available Statistics:
alerts_triggered – Total number of alerts triggered
oks_triggered – Number of OK alerts triggered
infos_triggered – Number of Info alerts triggered
warns_triggered – Number of Warn alerts triggered
crits_triggered – Number of Crit alerts triggered

Because I could use that node to trigger my warns and alerts and use the value later, if I can get the values out somehow.

I’ve seen there’s a StatsNode which I might play with later.

So far my ‘x% of critical alerts in a time window’ approach works out nicely for seeing whether a host is persistently providing critical metrics :+1:

@absolutejam Hey, I’m trying to do something very similar to what you are, but the hastebin link no longer works. Can you re-post your script? Or did you find a better way to do this?

Hey @alexphillips, not sure if it’s the right one, but I have https://gitlab.com/absolutejam/tickscripts/blob/master/cpu_crit_rate_15m.tick saved which I think rings a bell.

Would the stateDuration node do what you want?

|stateDuration(lambda: "thing you want to count" > crit)
    .unit(1m)
    .as('critDuration')

Then in your alert node:

.crit(lambda: "critDuration" > 5)

There is a little bit more to it, though; see the stateDuration docs.

That’s how I count it anyway.

Hope that helps.
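Putting the stateDuration pieces together for the original "CPU usage > 90% for 1 minute" example, a fuller sketch might look like the following. The measurement and field names and the log path are assumptions, and this has not been run against real data:

```tickscript
// Sketch only: measurement, field and path names are assumptions.
stream
    |from()
        .measurement('cpu')
        .groupBy('host')
    // Track how long the condition has been continuously true, in minutes.
    |stateDuration(lambda: "usage_user" > 90.0)
        .unit(1m)
        .as('critDuration')
    |alert()
        // Fire once the value has stayed above 90% for at least 1 minute.
        .crit(lambda: "critDuration" >= 1)
        .message('{{ index .Tags "host" }}: CPU above 90% for {{ index .Fields "critDuration" }} minutes')
        .log('/tmp/cpu_state_duration.log')
```

As far as I understand the node, the duration field is -1 while the condition is false and starts counting from 0 at the first point where it becomes true, so a single spike never reaches the threshold and flapping resets the counter.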