Stream window derivative very weird problem - help?!

voiprodrigo · May 3, 2024, 12:04am

Hi,

I’m having a very weird problem with my stream task.
TL;DR: the task should emit 12 INFO alerts per hour, one every 5 minutes, but it only alerts 11 times, and one of those alerts always carries a wrong value > 0 and therefore fires as CRITICAL.

Here’s the main code:

var field_lambda = lambda: "dropped"

var data = stream
    |from()
        .database(db)
        .retentionPolicy(rp)
        .measurement(measurement)
        .groupBy(groupBy)
        //.where(whereFilter)
    |window()
        .align()
        .period(5m)
        .every(5m)
    |eval(field_lambda)
        .as('value')
    |sum('value')
        .as('sum_value')
    |derivative('sum_value')
        .unit(5m) // match .every(5m)
        .nonNegative()
        .as('final_value')

I’ve skipped the alert node because it’s generic: I’m just checking whether final_value is > 0 (alert CRITICAL) or not (alert INFO). I always alert, but treat INFO as an “ok”.

The metric points are generated by a script that is called every minute by a cron job. The script takes 1-2 seconds to run and generates multiple series. All the series are pushed with the same timestamp, for example hh:mm:01 or hh:mm:02, with the seconds depending on how long the script takes to run. The series are always the same for each host where the script runs. The relevant field is “dropped”. This value is cumulative: it only increases, and may reset to 0 if the process from which these metrics are collected is restarted (hence the use of derivative.nonNegative()). So on each host the script will output something like:

stats_destinations,host=host123,destination=dstA dropped=123 timestampX
stats_destinations,host=host123,destination=dstB dropped=0 timestampX

The idea of this task is: for a given window of time, for any given host, if the sum of drops for all destinations of the host increases, then the value of “dropped” for at least one destination of the host has increased, and that should generate a critical alert.
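
The intent described above can be sketched outside Kapacitor. This is a minimal Python illustration rather than the task itself: the function names and the second sample are hypothetical, and treating any decrease in the total as a counter reset is an assumption.

```python
from collections import defaultdict

def host_totals(points):
    """Sum 'dropped' over all destinations of each host for one timestamp."""
    totals = defaultdict(int)
    for host, _destination, dropped in points:
        totals[host] += dropped
    return dict(totals)

def hosts_with_new_drops(previous, current):
    """Return hosts whose total increased; a decrease is treated as a reset."""
    return sorted(
        host for host, total in current.items()
        if host in previous and total > previous[host]
    )

# One sample per destination, as emitted by the collection script.
before = [("host123", "dstA", 123), ("host123", "dstB", 0)]
after = [("host123", "dstA", 123), ("host123", "dstB", 4)]  # hypothetical values

print(hosts_with_new_drops(host_totals(before), host_totals(after)))
```

Note that this compares one sample per destination at each step. Summing every point inside a multi-minute window is a different quantity, because it also depends on how many points the window contains.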

Again, the metrics are generated by a script that runs every minute, and takes 1-2 seconds to output.
I configured the task to create 5-minute windows and to emit every 5 minutes as well (so summing the points for the last 5 minutes).

Now, the problem! Consider the situation where the values of dropped for a given host do not change. Over an hour, the task should alert 12 times for each host. Instead, it alerts 11 times: once as CRITICAL, because “final_value” is somehow computed as 8x the current sum of the dropped values, and then it only alerts again 10 minutes later (skipping one alert). The remaining 10 alerts are INFO.

For example (notice how it goes from minute 10 to 20, skipping 15):

    "time": "2024-05-02T22:10:00Z",
                        119104
    "time": "2024-05-02T22:20:00Z",
                        0
    "time": "2024-05-02T22:25:00Z",
                        0
    "time": "2024-05-02T22:30:00Z",
                        0
    "time": "2024-05-02T22:35:00Z",
                        0
    "time": "2024-05-02T22:40:00Z",
                        0
    "time": "2024-05-02T22:45:00Z",
                        0
    "time": "2024-05-02T22:50:00Z",
                        0
    "time": "2024-05-02T22:55:00Z",
                        0
    "time": "2024-05-02T23:00:00Z",
                        0
    "time": "2024-05-02T23:05:00Z",
                        0

where 119104 is 8*14888, for some odd reason. And this repeats over and over.

I don’t understand why this is happening, because the sum value is always 14888, so the difference should always be 0, but every 11th time it decides it’s 8x that value?!
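
One pattern that would produce output of this shape is a single window whose sum is inflated, followed by a return to the normal sum. The sketch below is a Python model, not Kapacitor code. It assumes the derivative is (current - previous) / (elapsed / unit) and that nonNegative() drops negative results instead of clamping them to 0, which is how the Kapacitor documentation describes it. The inflated sum of 9 x 14888 is a hypothetical input chosen to reproduce the numbers above, not something measured.

```python
from datetime import datetime, timedelta

def non_negative_derivative(points, unit):
    """Return (time, rate) for consecutive points, dropping negative rates."""
    results = []
    for (t_prev, v_prev), (t_cur, v_cur) in zip(points, points[1:]):
        rate = (v_cur - v_prev) / ((t_cur - t_prev) / unit)
        if rate >= 0:  # model of nonNegative(): negative results are skipped
            results.append((t_cur, rate))
    return results

start = datetime(2024, 5, 2, 22, 0)
# Hypothetical window sums: flat at 14888, except one inflated window at 22:10.
sums = [14888, 14888, 9 * 14888, 14888, 14888, 14888]
points = [(start + timedelta(minutes=5 * i), s) for i, s in enumerate(sums)]

result = non_negative_derivative(points, unit=timedelta(minutes=5))
for t, rate in result:
    print(t.strftime("%H:%M"), rate)
```

In this model the 22:10 point carries 119104 and the 22:15 point is missing because its derivative is negative. That would match a spike followed by a skipped alert, but it does not explain why one window's sum would be inflated in the first place.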

Any hints as to why this may be happening?

Thanks in advance!

Anaisdg · May 3, 2024, 8:41pm

Hello @voiprodrigo,
I’m sorry, there really isn’t much support for Kapacitor beyond what other community members can offer.

Given this, the sudden spike to 8 times the expected value could be caused by several factors:

Is it possible that within the 5-minute window, the data fluctuates significantly, causing a spike that the derivative then magnifies?
There could be a potential issue with how the windowing is handled, causing it to accumulate or process data incorrectly at certain intervals.
Ensure that your data precision matches the requirements of your calculations. It’s possible that rounding errors or precision issues might cause unexpected results, especially when dealing with derivatives.

To troubleshoot:

Check the raw data within those problematic intervals to see if there are any anomalies or unexpected spikes.
Look into the configuration of the derivative calculation. Ensure that it’s correctly configured to calculate the rate of change over the 5-minute window.
Consider logging intermediate results or debugging statements within your Kapacitor script to understand how the values are being processed at each stage of the pipeline.
Verify the version of Kapacitor you’re using for any known issues or bugs related to derivative calculations or windowing.

By systematically examining each step of your Kapacitor task and the data it processes, you should be able to identify the root cause of the issue.
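
The first check in the list above, inspecting the raw data per interval, can be sketched as counting how many points fall into each aligned 5-minute window. This is a rough Python illustration with hypothetical timestamps. It assumes windows start on 5-minute boundaries and does not model how Kapacitor treats points that land exactly on a boundary.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(minutes=5)
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def window_start(ts, window=WINDOW):
    """Truncate a timezone-aware timestamp to the start of its window."""
    return ts - ((ts - EPOCH) % window)

def points_per_window(timestamps):
    """Count how many points fall into each aligned window."""
    return Counter(window_start(ts) for ts in timestamps)

# Hypothetical series: one point per minute for 15 minutes, at second 01.
start = datetime(2024, 5, 2, 22, 0, 1, tzinfo=timezone.utc)
timestamps = [start + timedelta(minutes=i) for i in range(15)]

counts = points_per_window(timestamps)
for window, count in sorted(counts.items()):
    print(window.isoformat(), count)
```

If the counts differ between windows, a sum over a cumulative counter changes even when the counter itself is flat.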

Alternatively, I’d look into using other ETL and task tools to replace Kapacitor, like Mage.ai or maybe Bytewax. Thanks!

voiprodrigo · May 18, 2024, 3:39pm

Thank you for the suggestions.
In the end I created a CQ to downsample last() over 5 minutes into another measurement, and pointed Kapacitor at this new measurement with a 5-minute window. It’s been fine like this, and it’s enough for what I need.
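
As a rough model of why this workaround behaves predictably, the sketch below keeps only the last value per series in each 5-minute bucket and then sums across destinations. It is Python, not the actual CQ or TICKscript, and all names are hypothetical. The values 123 and 0 are taken from the sample output earlier in the thread.

```python
from datetime import datetime, timedelta, timezone

BUCKET = timedelta(minutes=5)
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def downsample_last(points, bucket=BUCKET):
    """Keep the latest value per (series, bucket), similar to last() by time(5m)."""
    latest = {}
    for series, ts, value in sorted(points, key=lambda p: p[1]):
        bucket_start = ts - ((ts - EPOCH) % bucket)
        latest[(series, bucket_start)] = value  # later points overwrite earlier ones
    return latest

def totals_per_bucket(latest):
    """Sum the downsampled values of all series in each bucket."""
    totals = {}
    for (_series, bucket_start), value in latest.items():
        totals[bucket_start] = totals.get(bucket_start, 0) + value
    return dict(sorted(totals.items()))

# Two destinations with flat counters, one point per minute for 10 minutes.
start = datetime(2024, 5, 2, 22, 0, tzinfo=timezone.utc)
points = [
    (series, start + timedelta(minutes=i), value)
    for i in range(10)
    for series, value in (("dstA", 123), ("dstB", 0))
]

totals = totals_per_bucket(downsample_last(points))
for bucket_start, total in totals.items():
    print(bucket_start.isoformat(), total)
```

Each bucket then holds exactly one value per destination, so the per-host total no longer depends on how many raw points arrived in the window.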
I do need to explore long term alternatives though…

voiprodrigo · May 18, 2024, 3:44pm

Just to add, I’ve checked the metric points in InfluxDB: no gaps, no spikes in values, no duplicates, no missing points, and timestamps all aligned to second 0 of the minute (I enforce this in my metric collection script). So I can only imagine it’s something weird with derivative in a multi-point window.
I would gladly try non_negative_difference, but that function never arrived in Kapacitor.
