I’ve spent the last several days trying to track this down, so I’m happy to make whatever changes are needed.
Every 4 hours I send a Slack status alert with sums for various metrics that we care about. I have 7 batch queries that I join together. All but one of the metrics always have data; the last metric (restarts from app crashes) usually has no data at all.
I’m seeing 2 strange behaviors:
- Whenever the restarts query returns no data, the alert does not fire at all.
- Occasionally I can force the alert to fire by disabling and re-enabling the task. This sometimes causes multiple alerts to fire (almost as if they were queued up).
You can see below that all of the queries are executed, but there are no batch points for restarts.
ts=2018-01-23T00:06:00.001Z lvl=debug msg="starting next batch query" service=kapacitor task_master=main task=prod_status_alert node=query1 query="SELECT sum(value) AS sum_value FROM primary.autogen.\"4xx-slingshot\" WHERE pool = 'p2pnodeweb' AND time >= '2018-01-22T20:06:00Z' AND time < '2018-01-23T00:06:00Z' GROUP BY pool, time(4h, 0s) fill(0) ORDER BY time DESC LIMIT 1"
ts=2018-01-23T00:06:00.001Z lvl=debug msg="starting next batch query" service=kapacitor task_master=main task=prod_status_alert node=query12 query="SELECT max(value) AS max_val FROM primary.autogen.\"fci-percent\" WHERE pool = 'p2pnodeweb' AND time >= '2018-01-22T20:06:00Z' AND time < '2018-01-23T00:06:00Z' GROUP BY pool, time(4h, 0s) fill(0) ORDER BY time DESC LIMIT 1"
ts=2018-01-23T00:06:00.002Z lvl=debug msg="starting next batch query" service=kapacitor task_master=main task=prod_status_alert node=query7 query="SELECT sum(value) AS sum_value FROM primary.autogen.\"total-requests-slingshot\" WHERE pool = 'p2pnodeweb' AND time >= '2018-01-22T20:06:00Z' AND time < '2018-01-23T00:06:00Z' GROUP BY pool, time(4h, 0s) fill(0) ORDER BY time DESC LIMIT 1"
ts=2018-01-23T00:06:00.002Z lvl=debug msg="starting next batch query" service=kapacitor task_master=main task=prod_status_alert node=query3 query="SELECT sum(value) AS sum_value FROM primary.autogen.\"2xx-slingshot\" WHERE pool = 'p2pnodeweb' AND time >= '2018-01-22T20:06:00Z' AND time < '2018-01-23T00:06:00Z' GROUP BY pool, time(4h, 0s) fill(0) ORDER BY time DESC LIMIT 1"
ts=2018-01-23T00:06:00.002Z lvl=debug msg="starting next batch query" service=kapacitor task_master=main task=prod_status_alert node=query5 query="SELECT sum(value) AS sum_value FROM primary.autogen.\"5xx-slingshot\" WHERE pool = 'p2pnodeweb' AND time >= '2018-01-22T20:06:00Z' AND time < '2018-01-23T00:06:00Z' GROUP BY pool, time(4h, 0s) fill(0) ORDER BY time DESC LIMIT 1"
ts=2018-01-23T00:06:00.001Z lvl=debug msg="starting next batch query" service=kapacitor task_master=main task=prod_status_alert node=query9 query="SELECT sum(value) AS sum_value FROM primary.autogen.restarts WHERE pool = 'p2pnodeweb' AND time >= '2018-01-22T23:56:00Z' AND time < '2018-01-23T00:06:00Z' GROUP BY pool, time(4h, 0s) fill(0) ORDER BY time DESC LIMIT 1"
ts=2018-01-23T00:06:00.002Z lvl=debug msg="starting next batch query" service=kapacitor task_master=main task=prod_status_alert node=query14 query="SELECT sum(value) AS sum_value FROM primary.autogen.\"client-side-errors\" WHERE pool = 'p2pnodeweb' AND time >= '2018-01-22T20:06:00Z' AND time < '2018-01-23T00:06:00Z' GROUP BY pool, time(4h, 0s) fill(0) ORDER BY time DESC LIMIT 1"
ts=2018-01-23T00:06:00.007Z lvl=error msg="begin batch" service=kapacitor task_master=main task=prod_status_alert node=log6 prefix=status-2 name=5xx-slingshot group=pool=p2pnodeweb tag_pool=p2pnodeweb time=2018-01-23T00:00:00Z
ts=2018-01-23T00:06:00.007Z lvl=error msg="batch point" service=kapacitor task_master=main task=prod_status_alert node=log6 prefix=status-2 name=5xx-slingshot group=pool=p2pnodeweb tag_pool=p2pnodeweb field_sum_value=0 time=2018-01-23T00:00:00Z
ts=2018-01-23T00:06:00.007Z lvl=error msg="end batch" service=kapacitor task_master=main task=prod_status_alert node=log6 prefix=status-2 name=5xx-slingshot group=pool=p2pnodeweb tag_pool=p2pnodeweb time=2018-01-23T00:00:00Z
ts=2018-01-23T00:06:00.007Z lvl=error msg="begin batch" service=kapacitor task_master=main task=prod_status_alert node=log2 prefix=status-0 name=4xx-slingshot group=pool=p2pnodeweb tag_pool=p2pnodeweb time=2018-01-23T00:00:00Z
ts=2018-01-23T00:06:00.007Z lvl=error msg="batch point" service=kapacitor task_master=main task=prod_status_alert node=log2 prefix=status-0 name=4xx-slingshot group=pool=p2pnodeweb tag_pool=p2pnodeweb field_sum_value=0 time=2018-01-23T00:00:00Z
ts=2018-01-23T00:06:00.007Z lvl=error msg="end batch" service=kapacitor task_master=main task=prod_status_alert node=log2 prefix=status-0 name=4xx-slingshot group=pool=p2pnodeweb tag_pool=p2pnodeweb time=2018-01-23T00:00:00Z
ts=2018-01-23T00:06:00.007Z lvl=error msg="begin batch" service=kapacitor task_master=main task=prod_status_alert node=log13 prefix=status-5 name=fci-percent group=pool=p2pnodeweb tag_pool=p2pnodeweb time=2018-01-23T00:00:00Z
ts=2018-01-23T00:06:00.007Z lvl=error msg="batch point" service=kapacitor task_master=main task=prod_status_alert node=log13 prefix=status-5 name=fci-percent group=pool=p2pnodeweb tag_pool=p2pnodeweb field_max_val=0 time=2018-01-23T00:00:00Z
ts=2018-01-23T00:06:00.007Z lvl=error msg="end batch" service=kapacitor task_master=main task=prod_status_alert node=log13 prefix=status-5 name=fci-percent group=pool=p2pnodeweb tag_pool=p2pnodeweb time=2018-01-23T00:00:00Z
ts=2018-01-23T00:06:00.009Z lvl=error msg="begin batch" service=kapacitor task_master=main task=prod_status_alert node=log8 prefix=status-3 name=total-requests-slingshot group=pool=p2pnodeweb tag_pool=p2pnodeweb time=2018-01-23T00:00:00Z
ts=2018-01-23T00:06:00.009Z lvl=error msg="begin batch" service=kapacitor task_master=main task=prod_status_alert node=log4 prefix=status-1 name=2xx-slingshot group=pool=p2pnodeweb tag_pool=p2pnodeweb time=2018-01-23T00:00:00Z
ts=2018-01-23T00:06:00.010Z lvl=error msg="batch point" service=kapacitor task_master=main task=prod_status_alert node=log4 prefix=status-1 name=2xx-slingshot group=pool=p2pnodeweb tag_pool=p2pnodeweb field_sum_value=0 time=2018-01-23T00:00:00Z
ts=2018-01-23T00:06:00.010Z lvl=error msg="end batch" service=kapacitor task_master=main task=prod_status_alert node=log4 prefix=status-1 name=2xx-slingshot group=pool=p2pnodeweb tag_pool=p2pnodeweb time=2018-01-23T00:00:00Z
ts=2018-01-23T00:06:00.010Z lvl=error msg="batch point" service=kapacitor task_master=main task=prod_status_alert node=log8 prefix=status-3 name=total-requests-slingshot group=pool=p2pnodeweb tag_pool=p2pnodeweb field_sum_value=0 time=2018-01-23T00:00:00Z
ts=2018-01-23T00:06:00.010Z lvl=error msg="end batch" service=kapacitor task_master=main task=prod_status_alert node=log8 prefix=status-3 name=total-requests-slingshot group=pool=p2pnodeweb tag_pool=p2pnodeweb time=2018-01-23T00:00:00Z
ts=2018-01-23T00:06:00.009Z lvl=error msg="begin batch" service=kapacitor task_master=main task=prod_status_alert node=log15 prefix=status-6 name=client-side-errors group=pool=p2pnodeweb tag_pool=p2pnodeweb time=2018-01-23T00:00:00Z
ts=2018-01-23T00:06:00.010Z lvl=error msg="batch point" service=kapacitor task_master=main task=prod_status_alert node=log15 prefix=status-6 name=client-side-errors group=pool=p2pnodeweb tag_pool=p2pnodeweb field_sum_value=0 time=2018-01-23T00:00:00Z
ts=2018-01-23T00:06:00.010Z lvl=error msg="end batch" service=kapacitor task_master=main task=prod_status_alert node=log15 prefix=status-6 name=client-side-errors group=pool=p2pnodeweb tag_pool=p2pnodeweb time=2018-01-23T00:00:00Z
Here are the relevant parts of my TICKscript:
var four_xx_req = batch
    |query('SELECT sum("value") AS "sum_value" FROM "primary"."autogen"."4xx-slingshot" WHERE "pool"=\'p2pnodeweb\' ORDER BY time DESC LIMIT 1')
        .period(TIME_PERIOD) // 4h
        // CRON is in UTC (PST is UTC-8), so add 8 hrs to the local time when setting it.
        .cron(CRON) // every 4h (every 10min during testing)
        .groupBy('pool', time(TIME_PERIOD))
        .fill(0)
    |log()
        .prefix('status-0')
        .level(LOG_LEVEL)
// 2xx
var two_xx_req = batch
    |query('SELECT sum("value") AS "sum_value" FROM "primary"."autogen"."2xx-slingshot" WHERE "pool"=\'p2pnodeweb\' ORDER BY time DESC LIMIT 1')
        .period(TIME_PERIOD)
        .cron(CRON)
        .groupBy('pool', time(TIME_PERIOD))
        .fill(0)
    |log()
        .prefix('status-1')
        .level(LOG_LEVEL)
// 5xx
var five_xx_req = batch
    |query('SELECT sum("value") AS "sum_value" FROM "primary"."autogen"."5xx-slingshot" WHERE "pool"=\'p2pnodeweb\' ORDER BY time DESC LIMIT 1')
        .period(TIME_PERIOD)
        .cron(CRON)
        .groupBy('pool', time(TIME_PERIOD))
        .fill(0)
    |log()
        .prefix('status-2')
        .level(LOG_LEVEL)
// total_reqs
var total_reqs = batch
    |query('SELECT sum("value") AS "sum_value" FROM "primary"."autogen"."total-requests-slingshot" WHERE "pool"=\'p2pnodeweb\' ORDER BY time DESC LIMIT 1')
        .period(TIME_PERIOD)
        .cron(CRON)
        .groupBy('pool', time(TIME_PERIOD))
        .fill(0)
    |log()
        .prefix('status-3')
        .level(LOG_LEVEL)
// restarts
// FIXME: When we add this one it breaks the alert :|
var restart_count = batch
    |query('SELECT sum("value") AS "sum_value" FROM "primary"."autogen"."restarts" WHERE "pool"=\'p2pnodeweb\' ORDER BY time DESC LIMIT 1')
        .period(4h)
        .cron(CRON)
        .groupBy('pool', time(TIME_PERIOD))
        // report 0 when there is no data
        .fill(0)
    |log()
        .prefix('status-4')
        .level(LOG_LEVEL)
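// NOTE (my assumption): since "restarts" often has no points at all in the query window,
// InfluxDB can return an empty result here even with fill(0), so this node can emit no
// batch for an interval. That seems to be what holds up the join below.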
var fci_percent = batch
    |query('SELECT max("value") AS "max_val" FROM "primary"."autogen"."fci-percent" WHERE "pool"=\'p2pnodeweb\' ORDER BY time DESC LIMIT 1')
        .period(TIME_PERIOD)
        // .every(EVERY)
        .cron(CRON)
        .groupBy('pool', time(TIME_PERIOD))
        // report 0 when there is no data
        .fill(0)
    |log()
        .prefix('status-5')
        .level(LOG_LEVEL)
var client_side_errs = batch
    |query('SELECT sum("value") AS "sum_value" FROM "primary"."autogen"."client-side-errors" WHERE "pool"=\'p2pnodeweb\' ORDER BY time DESC LIMIT 1')
        .period(TIME_PERIOD)
        // .every(EVERY)
        .cron(CRON)
        .groupBy('pool', time(TIME_PERIOD))
        // report 0 when there is no data
        .fill(0)
    |log()
        .prefix('status-6')
        .level(LOG_LEVEL)
// TODO: ask them to create a cookbook
// if one of the streams has NO data, then the alert won't trigger...
var joined = total_reqs
    |join(two_xx_req, four_xx_req, five_xx_req, restart_count, fci_percent, client_side_errs)
        // Save the batch queries under these prefixes so we can use them in our alert
        .as('total_r', '2xx', '4xx', '5xx', 'restarts', 'fci', 'client_errs')
        .tolerance(TIME_PERIOD)
        .fill(0)
    |log()
        .prefix('status-4-joined')
        .level(LOG_LEVEL)
    // calc percentages. Have to use 100.0 (float) since the values are floats.
    |eval(
        lambda: "2xx.sum_value" / "total_r.sum_value" * 100.0,
        lambda: "4xx.sum_value" / "total_r.sum_value" * 100.0,
        lambda: "5xx.sum_value" / "total_r.sum_value" * 100.0,
        lambda: "client_errs.sum_value" / "total_r.sum_value" * 100.0
    )
        .as('2xx_percent', '4xx_percent', '5xx_percent', 'client_err_percent')
        // you have to .keep() any fields you want to use after eval(), since eval() drops all other fields by default.
        .keep('2xx_percent', '4xx_percent', '5xx_percent', 'client_err_percent', 'total_r.sum_value', 'restarts.sum_value2', 'fci.max_val')
// TODO: try .info(lambda: TRUE)
var trigger = joined
    |alert()
        // always alert
        .info(lambda: 2 > 1)
        .id(idVar)
        .idTag(idTag)
        .message('[{{.Level}}] PROD Metrics Report for last 4 hours:
*2xx / Total*: {{ index .Fields "2xx_percent" | printf "%0.2f "}}%
*4xx / Total*: {{ index .Fields "4xx_percent" | printf "%0.2f "}}%
*5xx / Total*: {{ index .Fields "5xx_percent" | printf "%0.2f "}}%
*Restarts*: {{ index .Fields "restarts.sum_value2" }}
*FCI % high*: {{ index .Fields "fci.max_val" }}%
*Client errs / Total*: {{ index .Fields "client_err_percent" | printf "%0.2f" }}%
*Total Reqs*: {{ index .Fields "total_r.sum_value" | printf "%0.2f" }}')
        .slack()
            .channel(SLACK_CHANNEL)
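In case it helps with reproducing this, here is a stripped-down sketch of the same pattern with just two queries. This is only a sketch: it reuses the TIME_PERIOD, CRON, LOG_LEVEL, and SLACK_CHANNEL vars from above, and "empty-measurement" is a hypothetical measurement name standing in for restarts, i.e. one with no points in the window.
var always_has_data = batch
    |query('SELECT sum("value") AS "sum_value" FROM "primary"."autogen"."total-requests-slingshot" WHERE "pool"=\'p2pnodeweb\'')
        .period(TIME_PERIOD)
        .cron(CRON)
        .groupBy('pool', time(TIME_PERIOD))
        .fill(0)
var usually_empty = batch
    |query('SELECT sum("value") AS "sum_value" FROM "primary"."autogen"."empty-measurement" WHERE "pool"=\'p2pnodeweb\'')
        .period(TIME_PERIOD)
        .cron(CRON)
        .groupBy('pool', time(TIME_PERIOD))
        .fill(0)
always_has_data
    |join(usually_empty)
        .as('total', 'empty')
        .tolerance(TIME_PERIOD)
        .fill(0)
    |log()
        .prefix('repro-joined')
        .level(LOG_LEVEL)
    |alert()
        // always alert, same trick as in the real task
        .info(lambda: 2 > 1)
        .message('repro: total={{ index .Fields "total.sum_value" }} empty={{ index .Fields "empty.sum_value" }}')
        .slack()
            .channel(SLACK_CHANNEL)
My expectation (if I understand the behavior correctly) is that the repro-joined log lines and the Slack message never show up for any interval where the empty side returns no batch, which matches what I see with restarts.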
Any thoughts at all? Thank you.