We use Kapacitor to process heartbeat alerts for the applications we host for our customers. Whenever the last heartbeat received is more than a specified timeout in the past, we generate an alert for that application. What we notice is that our pipeline processing the heartbeats (before they reach Kapacitor) has a variable processing delay. This causes frequent false positives for our customers.
We have now increased the timeout, but that reduces the value of the alert in case of true positives. Also, the delays can fluctuate wildly, sometimes up to 15 minutes.
Another solution is to turn off the heartbeat alert as soon as we detect a processing spike. However, this is less than ideal, since no heartbeat alerts would fire at all while the spike lasts.
Ideally, we would like to make the timeout dynamic, based on the running average of the processing delay of the heartbeats received for all applications. I managed to calculate this delay and store it back into InfluxDB, but I don’t know how to attach it to the series of last heartbeats. One is a stream and the other is a batch, so I cannot seem to join the two.
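To make the intended logic concrete, here is a minimal Python sketch of the dynamic timeout idea. It is not Kapacitor or TICKscript code, and the class name, method names, and window size are all hypothetical; it only illustrates adding the running average of the processing delay to a fixed base timeout.

```python
from collections import deque
from statistics import mean


class DynamicHeartbeatTimeout:
    """Hypothetical helper: base timeout plus the running average delay."""

    def __init__(self, base_timeout_s, window=20):
        # base_timeout_s: the fixed timeout used when there is no delay.
        # window: number of recent delay samples to average (an assumption).
        self.base_timeout_s = base_timeout_s
        self.delays = deque(maxlen=window)

    def record_delay(self, delay_s):
        # Store the processing delay observed for one heartbeat
        # (across all applications).
        self.delays.append(delay_s)

    def effective_timeout(self):
        # With no samples yet, fall back to the base timeout.
        avg_delay = mean(self.delays) if self.delays else 0.0
        return self.base_timeout_s + avg_delay

    def is_alerting(self, last_heartbeat_ts, now_ts):
        # Alert only if the heartbeat is older than the dynamic timeout.
        return (now_ts - last_heartbeat_ts) > self.effective_timeout()
```

With a base timeout of 60 seconds and an average delay of 150 seconds, for example, an application would only alert once its last heartbeat is more than 210 seconds old. When the delay drops again, the timeout shrinks back towards the base value.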
Any idea on how to do this? Perhaps there are better suggestions to achieve what I want?