Alert on joined batch queries fails to fire if one query returns empty

I’ve spent the last several days trying to solve this issue, so I’m happy to make whatever changes are needed to fix it.

Every 4 hours I send a Slack status alert with sums for various metrics that we care about. I have 7 batch queries that I join together. All but one of the metrics have constant data. The last metric (restarts from app crashes) is usually null.

I’m seeing 2 strange behaviors:

  1. Whenever the restarts query returns empty, it prevents the alert from firing.
  2. Occasionally I can force the alert to fire by disabling and re-enabling the task. This sometimes causes multiple alerts to fire (almost as if they were queued up).

You can see below that all the queries are executed, but there are no batch points for restarts.

ts=2018-01-23T00:06:00.001Z lvl=debug msg="starting next batch query" service=kapacitor task_master=main task=prod_status_alert node=query1 query="SELECT sum(value) AS sum_value FROM primary.autogen.\"4xx-slingshot\" WHERE pool = 'p2pnodeweb' AND time >= '2018-01-22T20:06:00Z' AND time < '2018-01-23T00:06:00Z' GROUP BY pool, time(4h, 0s) fill(0) ORDER BY time DESC LIMIT 1"
ts=2018-01-23T00:06:00.001Z lvl=debug msg="starting next batch query" service=kapacitor task_master=main task=prod_status_alert node=query12 query="SELECT max(value) AS max_val FROM primary.autogen.\"fci-percent\" WHERE pool = 'p2pnodeweb' AND time >= '2018-01-22T20:06:00Z' AND time < '2018-01-23T00:06:00Z' GROUP BY pool, time(4h, 0s) fill(0) ORDER BY time DESC LIMIT 1"
ts=2018-01-23T00:06:00.002Z lvl=debug msg="starting next batch query" service=kapacitor task_master=main task=prod_status_alert node=query7 query="SELECT sum(value) AS sum_value FROM primary.autogen.\"total-requests-slingshot\" WHERE pool = 'p2pnodeweb' AND time >= '2018-01-22T20:06:00Z' AND time < '2018-01-23T00:06:00Z' GROUP BY pool, time(4h, 0s) fill(0) ORDER BY time DESC LIMIT 1"
ts=2018-01-23T00:06:00.002Z lvl=debug msg="starting next batch query" service=kapacitor task_master=main task=prod_status_alert node=query3 query="SELECT sum(value) AS sum_value FROM primary.autogen.\"2xx-slingshot\" WHERE pool = 'p2pnodeweb' AND time >= '2018-01-22T20:06:00Z' AND time < '2018-01-23T00:06:00Z' GROUP BY pool, time(4h, 0s) fill(0) ORDER BY time DESC LIMIT 1"
ts=2018-01-23T00:06:00.002Z lvl=debug msg="starting next batch query" service=kapacitor task_master=main task=prod_status_alert node=query5 query="SELECT sum(value) AS sum_value FROM primary.autogen.\"5xx-slingshot\" WHERE pool = 'p2pnodeweb' AND time >= '2018-01-22T20:06:00Z' AND time < '2018-01-23T00:06:00Z' GROUP BY pool, time(4h, 0s) fill(0) ORDER BY time DESC LIMIT 1"
ts=2018-01-23T00:06:00.001Z lvl=debug msg="starting next batch query" service=kapacitor task_master=main task=prod_status_alert node=query9 query="SELECT sum(value) AS sum_value FROM primary.autogen.restarts WHERE pool = 'p2pnodeweb' AND time >= '2018-01-22T23:56:00Z' AND time < '2018-01-23T00:06:00Z' GROUP BY pool, time(4h, 0s) fill(0) ORDER BY time DESC LIMIT 1"
ts=2018-01-23T00:06:00.002Z lvl=debug msg="starting next batch query" service=kapacitor task_master=main task=prod_status_alert node=query14 query="SELECT sum(value) AS sum_value FROM primary.autogen.\"client-side-errors\" WHERE pool = 'p2pnodeweb' AND time >= '2018-01-22T20:06:00Z' AND time < '2018-01-23T00:06:00Z' GROUP BY pool, time(4h, 0s) fill(0) ORDER BY time DESC LIMIT 1"
ts=2018-01-23T00:06:00.007Z lvl=error msg="begin batch" service=kapacitor task_master=main task=prod_status_alert node=log6 prefix=status-2 name=5xx-slingshot group=pool=p2pnodeweb tag_pool=p2pnodeweb time=2018-01-23T00:00:00Z
ts=2018-01-23T00:06:00.007Z lvl=error msg="batch point" service=kapacitor task_master=main task=prod_status_alert node=log6 prefix=status-2 name=5xx-slingshot group=pool=p2pnodeweb tag_pool=p2pnodeweb field_sum_value=0 time=2018-01-23T00:00:00Z
ts=2018-01-23T00:06:00.007Z lvl=error msg="end batch" service=kapacitor task_master=main task=prod_status_alert node=log6 prefix=status-2 name=5xx-slingshot group=pool=p2pnodeweb tag_pool=p2pnodeweb time=2018-01-23T00:00:00Z
ts=2018-01-23T00:06:00.007Z lvl=error msg="begin batch" service=kapacitor task_master=main task=prod_status_alert node=log2 prefix=status-0 name=4xx-slingshot group=pool=p2pnodeweb tag_pool=p2pnodeweb time=2018-01-23T00:00:00Z
ts=2018-01-23T00:06:00.007Z lvl=error msg="batch point" service=kapacitor task_master=main task=prod_status_alert node=log2 prefix=status-0 name=4xx-slingshot group=pool=p2pnodeweb tag_pool=p2pnodeweb field_sum_value=0 time=2018-01-23T00:00:00Z
ts=2018-01-23T00:06:00.007Z lvl=error msg="end batch" service=kapacitor task_master=main task=prod_status_alert node=log2 prefix=status-0 name=4xx-slingshot group=pool=p2pnodeweb tag_pool=p2pnodeweb time=2018-01-23T00:00:00Z
ts=2018-01-23T00:06:00.007Z lvl=error msg="begin batch" service=kapacitor task_master=main task=prod_status_alert node=log13 prefix=status-5 name=fci-percent group=pool=p2pnodeweb tag_pool=p2pnodeweb time=2018-01-23T00:00:00Z
ts=2018-01-23T00:06:00.007Z lvl=error msg="batch point" service=kapacitor task_master=main task=prod_status_alert node=log13 prefix=status-5 name=fci-percent group=pool=p2pnodeweb tag_pool=p2pnodeweb field_max_val=0 time=2018-01-23T00:00:00Z
ts=2018-01-23T00:06:00.007Z lvl=error msg="end batch" service=kapacitor task_master=main task=prod_status_alert node=log13 prefix=status-5 name=fci-percent group=pool=p2pnodeweb tag_pool=p2pnodeweb time=2018-01-23T00:00:00Z
ts=2018-01-23T00:06:00.009Z lvl=error msg="begin batch" service=kapacitor task_master=main task=prod_status_alert node=log8 prefix=status-3 name=total-requests-slingshot group=pool=p2pnodeweb tag_pool=p2pnodeweb time=2018-01-23T00:00:00Z
ts=2018-01-23T00:06:00.009Z lvl=error msg="begin batch" service=kapacitor task_master=main task=prod_status_alert node=log4 prefix=status-1 name=2xx-slingshot group=pool=p2pnodeweb tag_pool=p2pnodeweb time=2018-01-23T00:00:00Z
ts=2018-01-23T00:06:00.010Z lvl=error msg="batch point" service=kapacitor task_master=main task=prod_status_alert node=log4 prefix=status-1 name=2xx-slingshot group=pool=p2pnodeweb tag_pool=p2pnodeweb field_sum_value=0 time=2018-01-23T00:00:00Z
ts=2018-01-23T00:06:00.010Z lvl=error msg="end batch" service=kapacitor task_master=main task=prod_status_alert node=log4 prefix=status-1 name=2xx-slingshot group=pool=p2pnodeweb tag_pool=p2pnodeweb time=2018-01-23T00:00:00Z
ts=2018-01-23T00:06:00.010Z lvl=error msg="batch point" service=kapacitor task_master=main task=prod_status_alert node=log8 prefix=status-3 name=total-requests-slingshot group=pool=p2pnodeweb tag_pool=p2pnodeweb field_sum_value=0 time=2018-01-23T00:00:00Z
ts=2018-01-23T00:06:00.010Z lvl=error msg="end batch" service=kapacitor task_master=main task=prod_status_alert node=log8 prefix=status-3 name=total-requests-slingshot group=pool=p2pnodeweb tag_pool=p2pnodeweb time=2018-01-23T00:00:00Z
ts=2018-01-23T00:06:00.009Z lvl=error msg="begin batch" service=kapacitor task_master=main task=prod_status_alert node=log15 prefix=status-6 name=client-side-errors group=pool=p2pnodeweb tag_pool=p2pnodeweb time=2018-01-23T00:00:00Z
ts=2018-01-23T00:06:00.010Z lvl=error msg="batch point" service=kapacitor task_master=main task=prod_status_alert node=log15 prefix=status-6 name=client-side-errors group=pool=p2pnodeweb tag_pool=p2pnodeweb field_sum_value=0 time=2018-01-23T00:00:00Z
ts=2018-01-23T00:06:00.010Z lvl=error msg="end batch" service=kapacitor task_master=main task=prod_status_alert node=log15 prefix=status-6 name=client-side-errors group=pool=p2pnodeweb tag_pool=p2pnodeweb time=2018-01-23T00:00:00Z

Here are the relevant parts of my TICKscript:

var four_xx_req = batch
    |query('SELECT sum("value") AS "sum_value" FROM "primary"."autogen"."4xx-slingshot" WHERE "pool"=\'p2pnodeweb\' ORDER BY time DESC LIMIT 1')
        .period(TIME_PERIOD) // 4h
        // run these batches every day at 4pm UTC (8am San Jose time) (PST is -8 from UTC). So add 8 hrs to get the cron time.
        .cron(CRON) // every 4h (during testing every 10min)
        .groupBy('pool', time(TIME_PERIOD))
        .fill(0)
    |log()
        .prefix('status-0')
        .level(LOG_LEVEL)

// 2xx
var two_xx_req = batch
    |query('SELECT sum("value") AS "sum_value" FROM "primary"."autogen"."2xx-slingshot" WHERE "pool"=\'p2pnodeweb\' ORDER BY time DESC LIMIT 1')
        .period(TIME_PERIOD)
        .cron(CRON)
        .groupBy('pool', time(TIME_PERIOD))
        .fill(0)
    |log()
        .prefix('status-1')
        .level(LOG_LEVEL)

// 5xx
var five_xx_req = batch
    |query('SELECT sum("value") AS "sum_value" FROM "primary"."autogen"."5xx-slingshot" WHERE "pool"=\'p2pnodeweb\' ORDER BY time DESC LIMIT 1')
        .period(TIME_PERIOD)
        .cron(CRON)
        .groupBy('pool', time(TIME_PERIOD))
        .fill(0)
    |log()
        .prefix('status-2')
        .level(LOG_LEVEL)

// total_reqs
var total_reqs = batch
    |query('SELECT sum("value") AS "sum_value" FROM "primary"."autogen"."total-requests-slingshot" WHERE "pool"=\'p2pnodeweb\' ORDER BY time DESC LIMIT 1')
        .period(TIME_PERIOD)
        .cron(CRON)
        .groupBy('pool', time(TIME_PERIOD))
        .fill(0)
    |log()
        .prefix('status-3')
        .level(LOG_LEVEL)

// restarts
// FIXME: When we add this one it breaks the alert :|
var restart_count = batch
    |query('SELECT sum("value") AS "sum_value" FROM "primary"."autogen"."restarts" WHERE "pool"=\'p2pnodeweb\' ORDER BY time DESC LIMIT 1')
        .period(4h)
        .cron(CRON)
        .groupBy('pool', time(TIME_PERIOD))
        // report 0 when there is no data
        .fill(0)
    |log()
        .prefix('status-4')
        .level(LOG_LEVEL)

var fci_percent = batch
    |query('SELECT max("value") AS "max_val" FROM "primary"."autogen"."fci-percent" WHERE "pool"=\'p2pnodeweb\' ORDER BY time DESC LIMIT 1')
        .period(TIME_PERIOD)
        // .every(EVERY)
        .cron(CRON)
        .groupBy('pool', time(TIME_PERIOD))
        // report 0 when there is no data
        .fill(0)
    |log()
        .prefix('status-5')
        .level(LOG_LEVEL)

var client_side_errs = batch
    |query('SELECT sum("value") AS "sum_value" FROM "primary"."autogen"."client-side-errors" WHERE "pool"=\'p2pnodeweb\' ORDER BY time DESC LIMIT 1')
        .period(TIME_PERIOD)
        // .every(EVERY)
        .cron(CRON)
        .groupBy('pool', time(TIME_PERIOD))
        // report 0 when there is no data
        .fill(0)
    |log()
        .prefix('status-6')
        .level(LOG_LEVEL)

// TODO: ask them to create a cookbook
// if one of the streams has NO data, then the alert won't trigger...
var joined = total_reqs
    |join(two_xx_req, four_xx_req, five_xx_req, restart_count, fci_percent, client_side_errs)
        // Alias each batch query so we can reference it in the alert
        .as('total_r', '2xx', '4xx', '5xx', 'restarts', 'fci', 'client_errs')
        .tolerance(TIME_PERIOD)
        .fill(0)
    |log()
        .prefix('status-4-joined')
        .level(LOG_LEVEL)
    // calc percentages. Have to use 100.0 (float) since the values are floats.
    |eval(
        lambda: "2xx.sum_value" / "total_r.sum_value" * 100.0,
        lambda: "4xx.sum_value" / "total_r.sum_value" * 100.0,
        lambda: "5xx.sum_value" / "total_r.sum_value" * 100.0,
        lambda: "client_errs.sum_value" / "total_r.sum_value" * 100.0
    )
        .as('2xx_percent', '4xx_percent', '5xx_percent', 'client_err_percent')
        // you have to .keep() any fields you want to use after eval(), since eval() drops all other fields.
        .keep('2xx_percent', '4xx_percent', '5xx_percent', 'client_err_percent', 'total_r.sum_value', 'restarts.sum_value2', 'fci.max_val')

// TODO: try .info(lambda: TRUE)
var trigger = joined
    |alert()
        // always alert
        .info(lambda: 2 > 1)
        .id(idVar)
        .idTag(idTag)
        .message('[{{.Level}}] PROD Metrics Report for last 4 hours:
        *2xx / Total*: {{ index .Fields "2xx_percent" | printf "%0.2f "}}%
        *4xx / Total*: {{ index .Fields "4xx_percent" | printf "%0.2f "}}%
        *5xx / Total*: {{ index .Fields "5xx_percent" | printf "%0.2f "}}%
        *Restarts*: {{ index .Fields "restarts.sum_value2" }}
        *FCI % high*: {{ index .Fields "fci.max_val" }}%
        *Client errs / Total*: {{ index .Fields "client_err_percent" | printf "%0.2f" }}%
        *Total Reqs*: {{ index .Fields "total_r.sum_value" | printf "%0.2f" }}')
        .slack()
        .channel(SLACK_CHANNEL)

Any thoughts at all? Thank you.

Here’s what’s strange: the first 2 or 3 times the query ran, it found a result.

The batch points shown in the middle of the log below are present for the first 2-3 runs; after that there is no batch point section, and the alert stops firing altogether. You can see in my earlier comment that the restart batch points are missing completely, and thus no alert fires.

ts=2018-01-23T00:55:00.000Z lvl=debug msg="starting next batch query" service=kapacitor task_master=main task=prod_status_alert node=query9 query="SELECT sum(value) AS sum_value FROM primary.autogen.restarts WHERE pool = 'p2pnodeweb' AND time >= '2018-01-22T20:55:00Z' AND time < '2018-01-23T00:55:00Z' GROUP BY pool, time(4h, 0s) fill(0) ORDER BY time DESC LIMIT 1"
[...]
ts=2018-01-23T00:55:00.018Z lvl=error msg="begin batch" service=kapacitor task_master=main task=prod_status_alert node=log10 prefix=status-4 name=restarts group=pool=p2pnodeweb tag_pool=p2pnodeweb time=2018-01-23T00:00:00Z
ts=2018-01-23T00:55:00.018Z lvl=error msg="batch point" service=kapacitor task_master=main task=prod_status_alert node=log10 prefix=status-4 name=restarts group=pool=p2pnodeweb tag_pool=p2pnodeweb field_sum_value=0 time=2018-01-23T00:00:00Z
ts=2018-01-23T00:55:00.018Z lvl=error msg="end batch" service=kapacitor task_master=main task=prod_status_alert node=log10 prefix=status-4 name=restarts group=pool=p2pnodeweb tag_pool=p2pnodeweb time=2018-01-23T00:00:00Z
[...]
ts=2018-01-23T00:55:00.020Z lvl=debug msg="alert triggered" service=kapacitor task_master=main task=prod_status_alert node=alert24 level=INFO id=Prod_status:pool=p2pnodeweb event_message="[INFO] PROD Metrics Report for last 4 hours:\n        *2xx / Total*: 83.97 %\n        *4xx / Total*: 3.09 %\n        *5xx / Total*: 0.01 %\n        *Restarts*: 0\n        *FCI % high*: 5.8%\n        *Client errs / Total*: 0.43%\n        *Total Reqs*: 194543.00" data="&{total-requests-slingshot map[pool:p2pnodeweb] [time 2xx_percent 4xx_percent 5xx_percent client_err_percent fci.max_val restarts.sum_value total_r.sum_value] [[2018-01-23 00:00:00 +0000 UTC 83.96601265529986 3.092889489727207 0.008738428008203842 0.42921102275589457 5.8 0 194543]]}"

I think what’s happening is this: when the query returns several buckets, it zero-fills properly, even if (because of my sort by time and LIMIT 1) the number that gets alerted is 0. So the question becomes: how do I zero-fill a query that returns empty? I assumed using |default() would solve that, but it appears to have no effect.
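For reference, here is roughly the shape of the |default() attempt (a sketch; the placement and the 0.0 default value are my best reconstruction). As I understand it, default() only sets fields on points that actually pass through the node, so when the query returns no points at all there is nothing for it to modify and the join still stalls:

// Sketch of the |default() attempt on the restarts query (hypothetical placement).
var restart_count = batch
    |query('SELECT sum("value") AS "sum_value" FROM "primary"."autogen"."restarts" WHERE "pool"=\'p2pnodeweb\' ORDER BY time DESC LIMIT 1')
        .period(4h)
        .cron(CRON)
        .groupBy('pool', time(TIME_PERIOD))
        .fill(0)
    // default() fills the field on points that arrive, but it cannot create
    // a point when the query returns no rows, so nothing reaches the join.
    |default()
        .field('sum_value', 0.0)
    |log()
        .prefix('status-4')
        .level(LOG_LEVEL)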

First of all, sorry for the late response; this week was a travel week for me.

You have this exactly right. The join node needs to wait for data from each source in order to join the records. If one source never returns data, the join node thinks it can’t progress (since the data may still arrive at some time in the future), and so the results queue up.

How do I zero fill a query that returns as empty?

This is a bit tricky: in short, InfluxDB can’t fill empty results, since it doesn’t know which tags should exist for that time range if no data exists.

Kapacitor 1.4 has a new node, the barrier node, which forces a flush of the data. So one solution would be to query the raw data instead of the aggregate sum for your restart counts, then window the data on the Kapacitor side using a barrier node. This forces empty results to be produced correctly even if no data exists. And since there is usually very little data, the data transfer should still be quite small.

Something like this (untested but should be close):

var restart_count = batch
    |query('SELECT "value" FROM "primary"."autogen"."restarts" WHERE "pool"=\'p2pnodeweb\' ')
        .period(4h)
        .cron(CRON)
    |barrier()
        // If no data has arrived within the period plus a 2s grace period, emit a barrier,
        // indicating that no more data for that time period will arrive.
        .idle(TIME_PERIOD + 2s)
    // If the window node gets a barrier message it emits the current window, even if it's empty.
    |window()
        .every(TIME_PERIOD)
    // Sum will emit a 0 value for an empty window.
    |sum('value')
        .as('sum_value')
    |log()
        .prefix('status-4')
        .level(LOG_LEVEL)

Could you explain more about the barrier node’s idle and period? With .period, are we sure the query will always be run against the correct time range?

Hi,

If I try to use the same script structure (batch|query|barrier|window), this error arises:
invalid TICKscript: line 23 char 3: error calling func "window" on obj *pipeline.BarrierNode: cannot Window batch edge

What am I missing?

If I put period and every above, it seems to work!
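For anyone landing here later, here is how I read that: .period() and .every() go on the query node and the window node is dropped, since the query node already emits one batch per period. A minimal sketch under that assumption (untested; names and values follow the earlier examples):

var restart_count = batch
    |query('SELECT "value" FROM "primary"."autogen"."restarts" WHERE "pool"=\'p2pnodeweb\' ')
        // The query node itself windows the data into period-sized batches,
        // so no window node is needed (window cannot operate on a batch edge).
        .period(TIME_PERIOD)
        .every(TIME_PERIOD)
        .groupBy('pool')
    |barrier()
        // Still force a flush when the query returns nothing.
        .idle(TIME_PERIOD + 2s)
    |sum('value')
        .as('sum_value')
    |log()
        .prefix('status-4')
        .level(LOG_LEVEL)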