Hi,
I’m having a very weird problem with my stream task.
TLDR: the task should emit 12 INFO alerts per hour (one every 5 minutes), but it only alerts 11 times, and one of those alerts always carries a wrong value > 0 and therefore fires as CRITICAL.
Here’s the main code:
    var field_lambda = lambda: "dropped"

    var data = stream
        |from()
            .database(db)
            .retentionPolicy(rp)
            .measurement(measurement)
            .groupBy(groupBy)
            //.where(whereFilter)
        |window()
            .align()
            .period(5m)
            .every(5m)
        |eval(field_lambda)
            .as('value')
        |sum('value')
            .as('sum_value')
        |derivative('sum_value')
            .unit(5m) // match .every(5m)
            .nonNegative()
            .as('final_value')
I’ve left out the alert node because it’s generic: it only checks whether final_value is > 0 (alert CRITICAL) or not (alert INFO). The task always alerts, and I treat INFO as “ok”.
The metric points are generated by a script that a cron job calls every minute. The script takes 1-2 seconds to run and generates multiple series. All the series are pushed with the same timestamp, for example hh:mm:01 or hh:mm:02, with the seconds depending on how long the script takes to run. The series are always the same for each host where the script runs. The relevant field is “dropped”. This value is cumulative: it only increases, and it may reset to 0 if the process from which these metrics are collected is restarted (hence the use of derivative.nonNegative()). So on each host the script will output something like:
stats_destinations,host=host123,destination=dstA dropped=123 timestampX
stats_destinations,host=host123,destination=dstB dropped=0 timestampX
The idea of this task is: for a given window of time, for any given host, if the sum of drops for all destinations of the host increases, then the value of “dropped” for at least one destination of the host has increased, and that should generate a critical alert.
Again, the metrics are generated by a script that runs every minute, and takes 1-2 seconds to output.
I configured the task to create 5-minute windows and to emit every 5 minutes as well (so it sums the points from the last 5 minutes).
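To make the windowing concrete, here is a minimal Python sketch of what I expect the window + sum stage to compute for one host. It is only a model of my understanding, not Kapacitor's actual implementation; the function names (`window_start`, `sum_per_window`) and the sample values are made up for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(minutes=5)
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def window_start(ts):
    """Floor a timestamp to the start of its aligned 5-minute window."""
    return ts - ((ts - EPOCH) % WINDOW)


def sum_per_window(points):
    """points: iterable of (timestamp, dropped). Returns {window_start: sum}."""
    sums = defaultdict(int)
    for ts, dropped in points:
        sums[window_start(ts)] += dropped
    return dict(sums)


# Hypothetical data for one host: two destinations, one sample per minute
# at hh:mm:01, counters not changing (dstA=123, dstB=0) for 10 minutes.
start = datetime(2024, 5, 2, 22, 0, 1, tzinfo=timezone.utc)
points = []
for minute in range(10):
    ts = start + timedelta(minutes=minute)
    points.append((ts, 123))  # destination=dstA
    points.append((ts, 0))    # destination=dstB

sums = sum_per_window(points)
for w in sorted(sums):
    print(w.isoformat(), sums[w])
```

Since the sum adds every point that falls in the window, each window here holds 5 samples per destination, and with unchanged counters every window should produce the same total.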
Now, the problem! Consider the situation where the values of dropped for a given host do not change. Over an hour, the task should alert 12 times for each host, all INFO. Instead, it alerts INFO 10 times and CRITICAL once, because one time “final_value” is somehow computed as 8x the current sum of the dropped values. After that CRITICAL it only alerts again 10 minutes later (skipping one alert), and then emits 10 INFO’s.
For example (notice how it goes from minute 10 to 20, skipping 15):
"time": "2024-05-02T22:10:00Z",
119104
"time": "2024-05-02T22:20:00Z",
0
"time": "2024-05-02T22:25:00Z",
0
"time": "2024-05-02T22:30:00Z",
0
"time": "2024-05-02T22:35:00Z",
0
"time": "2024-05-02T22:40:00Z",
0
"time": "2024-05-02T22:45:00Z",
0
"time": "2024-05-02T22:50:00Z",
0
"time": "2024-05-02T22:55:00Z",
0
"time": "2024-05-02T23:00:00Z",
0
"time": "2024-05-02T23:05:00Z",
0
where 119104 is 8*14888, for some odd reason. And this repeats over and over.
I don’t understand why this is happening, because the sum value is always 14888, so the difference should always be 0, but every 11th time it decides it’s 8x that value?!
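For reference, here is a small Python model of the derivative step, assuming the documented formula (current - previous) / (time_difference / unit) and assuming that nonNegative() skips negative results instead of emitting them. The input sequence is hypothetical: I don't know what the sum node really emits, I only picked values that would reproduce the output above (one window reporting 9x the usual sum).

```python
def derivative(points, unit_s=300, non_negative=True):
    """points: list of (epoch_seconds, value), ordered by time.

    Returns a list of (epoch_seconds, rate), one entry per emitted point.
    """
    out = []
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        rate = (v1 - v0) / ((t1 - t0) / unit_s)
        if non_negative and rate < 0:
            continue  # assumption: a negative result is skipped, nothing is emitted
        out.append((t1, rate))
    return out


S = 14888  # the per-window sum I normally see

# Hypothetical sums at 22:05, 22:10, 22:15, 22:20, 22:25 (seconds after 22:00).
# The 9 * S window is an assumption chosen to match the observed 119104.
sums = [(300, S), (600, 9 * S), (900, S), (1200, S), (1500, S)]

result = derivative(sums)
print(result)  # [(600, 119104.0), (1200, 0.0), (1500, 0.0)]
```

Under these assumptions, a single inflated window would give 8*14888 at minute 10, and the drop back to the normal sum at minute 15 would be negative and therefore skipped, which would match both the CRITICAL and the missing alert. It does not explain why a window would be inflated in the first place.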
Any hints to why this may be happening?
Thanks in advance!