Hello
I have a two-machine dovecot cluster in failover mode and I’d like to generate alerts based on three different situations:
- dovecot is running on exactly one of the machines - cluster is up
- dovecot is down on both machines - cluster is down
- dovecot is running on both machines - cluster is in split-brain state
I’m using Telegraf’s procstat plugin with the following configuration, emitting data points every 15s:
[agent]
...
interval = "15s"
flush_interval = "15s"
...
[[inputs.procstat]]
systemd_unit = "dovecot"
[inputs.procstat.tags]
failover = "yes"
cluster = "imapcluster"
Given those tags, I’ve managed to write the following TICKscript:
dbrp "telegraf"."monitor"
var message = '({{ .Level }}) {{ index .Tags "cluster" }}: {{ index .Tags "systemd_unit" }} {{ if eq (index .Fields "count") 0 }}is down{{ else if eq (index .Fields "count") 1 }}is up{{ else }}split brain ({{ index .Fields "count" }} instances running){{ end }}'
var data = stream
    |from()
        .database('telegraf')
        .retentionPolicy('monitor')
        .measurement('procstat_lookup')
        .where(lambda: "failover" == 'yes')
        .groupBy(['cluster', 'systemd_unit'])
    |window()
        .every(30s)
        .period(15s)
        .align()
    |sum('running')
        .as('count')
    |stateDuration(lambda: "count" == 0)
        .unit(1m)
        .as('down_minutes')
data
    |alert()
        .crit(lambda: "down_minutes" > 1 OR "count" == 2)
        .stateChangesOnly()
        .message(message)
        .telegram()
This mostly works, as long as the window contains exactly one data point for each of the cluster hosts. However, sometimes the window contains two data points for each host, i.e. four data points in total. I think this happens due to small variations in the Telegraf measurement times, which cause two consecutive measurements to fall within a single 15s window.
When this happens, my script wrongly generates a split brain alert because my data looks like this:
cluster | host | running | time |
---|---|---|---|
imapcluster | host1 | 1 | t1 |
imapcluster | host2 | 0 | t1 |
imapcluster | host1 | 1 | t2 |
imapcluster | host2 | 0 | t2 |
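Thus sum('running') will result in 2, because host1 is counted once at t1 and again at t2, generating a split brain alert even though dovecot is running on only one host.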
I’ve tried to find a way to compute the sum over distinct hosts, but I couldn’t figure out how. I’ve also tried to ensure that the window always contains a single point from each server, but I’m not sure that’s possible. The best solution I’ve found so far is to use |top(2, 'running', 'host') to ignore any extra data in the window, but I dislike that because it always returns the oldest two points, and I’d like to consider the most recent ones.
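To make the behaviour I’m after concrete, here is a minimal Python sketch of the aggregation logic (not TICKscript, and not something Kapacitor runs). The function names `naive_sum` and `sum_latest_per_host` and the integer timestamps are made up for illustration; the points mirror the table above.

```python
from typing import Dict, List, Tuple

# One point per tuple: (host, running, timestamp).
# Timestamps 1 and 2 stand in for t1 and t2 from the table above.
Point = Tuple[str, int, int]

window: List[Point] = [
    ("host1", 1, 1),
    ("host2", 0, 1),
    ("host1", 1, 2),
    ("host2", 0, 2),
]


def naive_sum(points: List[Point]) -> int:
    """What sum('running') does today: adds every point in the window."""
    return sum(running for _, running, _ in points)


def sum_latest_per_host(points: List[Point]) -> int:
    """Keep only the most recent point of each host, then sum 'running'."""
    latest: Dict[str, Point] = {}
    for point in points:
        host, _, ts = point
        # Replace the stored point when this one is at least as recent.
        if host not in latest or ts >= latest[host][2]:
            latest[host] = point
    return sum(running for _, running, _ in latest.values())


print(naive_sum(window))            # 2 -> false split brain
print(sum_latest_per_host(window))  # 1 -> cluster is up
```

In other words, I’m looking for the TICKscript equivalent of `sum_latest_per_host`: one value per host (the newest one in the window), summed per cluster.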
Is there a better way to achieve this?
Thanks.