Kapacitor false alarms on new EC2 instances

jpriebe · August 14, 2017, 8:00pm

We use netdata to feed metrics into InfluxDB and alert via Kapacitor.

Whenever we spin up a new EC2 instance (for example, when an autoscaling group scales up), our TICKscripts trigger alerts like “CPU idle is low”. CPU idle really is low while an EC2 instance starts up, so the alerts aren’t wrong, but they aren’t helpful either, and we’d like to eliminate them.

One option would be to delay the startup of netdata for some period (e.g. 5 minutes). That would make the alerts go away. But I would also lose visibility into these machines during a critical transition.

Kapacitor is not aware of the uptime of the various machines it is monitoring. All it knows is that the cpu_idle values tagged with host=“ip-172.32.4.29” are lower than the alert threshold.

Is there an elegant solution to this problem?

var threshold_warn = 13
var threshold_warn_reset = 20
var threshold_crit = 7
var threshold_crit_reset = 13

var metric_identifier = 'cpu_idle'
var metric_description = 'Idle CPU'
var metric_sense = '<'
var period = 120s
var every = 10s
var slack_handler = '/etc/kapacitor/scripts/slack.php'


var data = stream
  |from()
    .measurement('netdata.system.cpu.idle')
    .where(lambda: !strContains("instanceclass", 'wowza-'))
    .groupBy('host')
  |window()
    .period(period)
    .every(every)
  |mean('value')
    .as('stat')

data
  |alert()
    .id('{{ index .Tags "host"}}/' + string(metric_identifier))
    .stateChangesOnly()
    .message('{{ .Level }},{{ index .Tags "host"}},'
        + string(metric_identifier) + ','
        + string(metric_description) + ','
        + string(metric_sense) + ','
        + '{{ index .Fields "stat" }}' + ','
        + string(period) + ','
        + '{{ if eq .Level "CRITICAL" }}' + string(threshold_crit)
        + '{{ else }}' + string(threshold_warn)
        + '{{ end }}')
    .warn(lambda:      "stat" <= threshold_warn)
    .warnReset(lambda: "stat" >= threshold_warn_reset)
    .crit(lambda:      "stat" <= threshold_crit)
    .critReset(lambda: "stat" >= threshold_crit_reset)
    .exec(slack_handler)
    .log('/var/log/kapacitor/kapacitor.txt')

data
  |alert()
    .id('{{ index .Tags "host"}}/' + string(metric_identifier))
    .message('{{ .Level }} {{ .ID }}: {{ index .Fields "stat" }}')
    .stateChangesOnly()
    .crit(lambda:      "stat" <= threshold_crit)
    .critReset(lambda: "stat" >= threshold_crit_reset)
    .victorOps()
    .routingKey('urgent')
    .log('/var/log/kapacitor/kapacitor.txt')
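
One direction that might work is to give Kapacitor the uptime explicitly: join the CPU stream with an uptime stream for the same host, and only let the alert fire once the host has been up for a while. Below is a minimal, untested sketch of that idea. The measurement name 'netdata.system.uptime.uptime', the assumption that its value is in seconds, and the min_uptime variable are all guesses for illustration, not something confirmed against our setup. The thresholds and window settings are copied from the script above so the sketch stands on its own.

```
// Sketch only: suppress CPU alerts until a host has been up for min_uptime seconds.
// ASSUMPTION: netdata's uptime chart arrives in InfluxDB as
// 'netdata.system.uptime.uptime' with the value in seconds; check your database.
var threshold_warn = 13
var threshold_crit = 7
var period = 120s
var every = 10s
// Hypothetical warm-up window, in seconds.
var min_uptime = 300.0

var cpu = stream
  |from()
    .measurement('netdata.system.cpu.idle')
    .groupBy('host')
  |window()
    .period(period)
    .every(every)
  |mean('value')
    .as('stat')

var uptime = stream
  |from()
    .measurement('netdata.system.uptime.uptime')
    .groupBy('host')
  |window()
    .period(period)
    .every(every)
  |last('value')
    .as('seconds')

// Join the two per-host streams; joined fields are prefixed
// with the names given in .as(), e.g. "cpu.stat".
cpu
  |join(uptime)
    .as('cpu', 'uptime')
    .tolerance(every)
  |alert()
    .id('{{ index .Tags "host"}}/cpu_idle')
    .stateChangesOnly()
    .warn(lambda: "uptime.seconds" >= min_uptime AND "cpu.stat" <= threshold_warn)
    .crit(lambda: "uptime.seconds" >= min_uptime AND "cpu.stat" <= threshold_crit)
    .log('/var/log/kapacitor/kapacitor.txt')
```

With this shape netdata keeps reporting from boot, so visibility during startup is preserved and only the alert conditions are gated. The reset thresholds and handlers from the original script are left out here for brevity and would need to be carried over.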
