Testing Kapacitor Batch Queries

I’m attempting to create a Kapacitor alarm based on a lack of successful queries from a script I’m running that gathers data from an external source. I’m new to TICKscript and believe I must be doing something wrong.

Here’s my script:

             SELECT non_negative_derivative(mean("SuccessfulQueryCount"), 10m) as "SuccessRate"
             FROM "telegraf"."autogen"."data-importer"
             WHERE time > now() - 2h
             GROUP BY time(10m)
         .groupBy(time(10m), 'env')
         .crit(lambda: "SuccessRate" <= 10)
         .critReset(lambda: "SuccessRate" > 50)

I’ve recorded 148m of data that begins with many successful queries and ends with 2 hours of unsuccessful queries (I changed the hosts file to make the queries fail). I replayed this recording against my task but didn’t get the expected alerts. In fact, I get zeros across the board in the DOT output:

digraph test {
graph [throughput="0.00 batches/s"];

query1 [avg_exec_time_ns="0s" batches_queried="0" errors="0" points_queried="0" working_cardinality="0" ];
query1 -> alert2 [processed="0"];

alert2 [alerts_triggered="0" avg_exec_time_ns="0s" crits_triggered="0" errors="0" infos_triggered="0" oks_triggered="0" warns_triggered="0" working_cardinality="0" ];
}

At this point I’m trying to understand if the problem is with replaying the recording (my test process), or with the script itself. I have a series of questions:

  1. In the query, I find using WHERE and GROUP BY to be redundant with using .period and .groupBy. Should I drop them from the SELECT query and use only the properties?

  2. Am I chaining the ‘mean’ to the ‘query’ and ‘alert’ properly? I’m attempting to take the average of the resulting 12 values (120 minute query / 10 minute groups).

  3. Is there some method for testing TICKscript? Specifically, execution of the DAG nodes? I haven’t found anything when scouring the docs, but this is a pain to troubleshoot!
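
To make question 1 concrete, here is a sketch of the query with the time handling left entirely to the node properties. The .period(2h) and .every(10m) values are my assumptions for illustration, and I haven’t confirmed that this behaves the same as the version with WHERE and GROUP BY inside the SELECT.

```
batch
    |query('''
        SELECT non_negative_derivative(mean("SuccessfulQueryCount"), 10m) AS "SuccessRate"
        FROM "telegraf"."autogen"."data-importer"
    ''')
        // Assumed values: look back 2h, re-run the query every 10m.
        .period(2h)
        .every(10m)
        // Bucket into 10m windows per 'env' tag, replacing GROUP BY time(10m).
        .groupBy(time(10m), 'env')
```

If this is the right approach, the time range and grouping would live in one place instead of being stated twice.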

If you see any other issue with my query or my methodology please let me know.

I’ve let this alert sit for some time while continuously running the queries in a mode where they’re mostly successful (they’re hitting the intended server), and I now see that the alert has triggered 3 times. It really doesn’t seem like the ‘mean’ node is executing properly. Below is the output in /tmp/test from one of the alerts (removing some detail to scrub proprietary info). I notice a few things:

  1. 10 values are shown when I’d expect 12. Is this just an excerpt of the data returned from the query? Or does this mean that my query is indeed only returning 10 data points?

  2. The message indicates that the value seen, {{ index .Fields "SuccessRate" }}, is 0. This shows me that this SuccessRate reference isn’t referring to the mean: the mean of the 10 values shown works out to about 170.9, whereas 0 matches the first point in the batch. I believe I need to fix this to have the mean node write to a placeholder, and to use that in the reference within my message.

  3. I’m surprised to see 3 alerts, and each of them moving to CRITICAL from a previous level of CRITICAL. Isn’t the .stateChangesOnly() intended to prevent this?

    {
        "id": "eng Data Importer",
        "message": "eng Data Importer success rate (0) dropped below an average critical level of 10 for 2 hour period",
        "time": "2018-03-06T19:50:00Z",
        "duration": 4200000000000,
        "level": "CRITICAL",
        "data": {
            "series": [{
                "name": "data-importer",
                "tags": {
                    "env": "eng"
                },
                "columns": ["time", "SuccessRate"],
                "values": [
                    ["2018-03-06T19:50:00Z", 0],
                    ["2018-03-06T20:00:00Z", 0],
                    ["2018-03-06T20:10:00Z", 0],
                    ["2018-03-06T20:20:00Z", 0],
                    ["2018-03-06T20:30:00Z", 0],
                    ["2018-03-06T20:40:00Z", 81.025],
                    ["2018-03-06T20:50:00Z", 392.0583333333333],
                    ["2018-03-06T21:00:00Z", 426.49166666666673],
                    ["2018-03-06T21:10:00Z", 430.6833333333334],
                    ["2018-03-06T21:20:00Z", 378.78872549019593]
                ]
            }]
        },
        "previousLevel": "CRITICAL"
    }
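
For item 2 above, this is roughly what I think the chain needs to look like so that the alert and the message both refer to the mean rather than to an individual point. It is only a sketch: the field name MeanSuccessRate is a placeholder I made up, the .period(2h) and .every(10m) values are assumptions, and the message text is simplified from my real one.

```
batch
    |query('''
        SELECT non_negative_derivative(mean("SuccessfulQueryCount"), 10m) AS "SuccessRate"
        FROM "telegraf"."autogen"."data-importer"
    ''')
        // Assumed values, not taken from my running task.
        .period(2h)
        .every(10m)
        .groupBy(time(10m), 'env')
    // Collapse each batch into one point holding the mean of SuccessRate.
    |mean('SuccessRate')
        // Hypothetical field name for the result.
        .as('MeanSuccessRate')
    |alert()
        // Thresholds now test the mean, not the per-window values.
        .crit(lambda: "MeanSuccessRate" <= 10)
        .critReset(lambda: "MeanSuccessRate" > 50)
        .message('{{ index .Tags "env" }} Data Importer mean success rate: {{ index .Fields "MeanSuccessRate" }}')
        .stateChangesOnly()
        .log('/tmp/test')
```

With the values in the output above, I would expect the mean (about 170.9) to stay well clear of the critical threshold, so this alert should not have fired if the chain works the way I think it does.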