Kapacitor: AlertNode ID OpsGenie Alias Issues

#1

I am having issues with Kapacitor and OpsGenie alerts: tasks are firing and adding notes to the wrong OpsGenie alert. Egon - “Never cross the streams”. This has to do with the way the OpsGenie API is being used, specifically the use of the “alias” field when opening a new alert.

We are using the old-style inline alert handlers rather than alert topics, and we have cases where the alert node ID is the same for two separate, distinct alerts: we are not setting the .ID property explicitly, and we are doing a group by in the batch query. This means that if an alert is already open with an alias of, say, “foobar”, and another alert node that is part of the same group by fires, we fail to open a new alert on the OpsGenie side because the alias already exists. If the second node then recovers, a note indicating recovery is added to the first alert in OpsGenie, which is incorrect.
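To make the failure mode concrete, here is a minimal toy model of alias-keyed alert tracking. It is only a sketch: `AliasStore`, `task_a`, and `task_b` are hypothetical names, and nothing here is the real OpsGenie API or the Kapacitor handler code. It just assumes that an open alert is looked up by its alias alone.

```python
# Toy in-memory model of alias-based alert tracking. All names are
# hypothetical; this is not the OpsGenie API or the Kapacitor handler.


class AliasStore:
    def __init__(self):
        # alias -> {"source": task that opened it, "notes": [...]}
        self.open_alerts = {}

    def open(self, alias, source):
        """Open an alert; return False if the alias is already in use."""
        if alias in self.open_alerts:
            # Same alias is treated as the same alert, so nothing new opens.
            return False
        self.open_alerts[alias] = {"source": source, "notes": []}
        return True

    def add_note(self, alias, note):
        """Attach a note to whichever alert currently owns the alias."""
        self.open_alerts[alias]["notes"].append(note)


store = AliasStore()

# Two distinct alert nodes that happen to produce the same ID, "foobar".
first_opened = store.open("foobar", source="task_a")
second_opened = store.open("foobar", source="task_b")

# task_b recovers, but its recovery note lands on the alert task_a opened.
store.add_note("foobar", "task_b recovered")

print(first_opened, second_opened)
print(store.open_alerts["foobar"])
```

In this model the second open is silently dropped and the recovery note ends up on the first task's alert, which is the crossed-streams behaviour described above.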

It gets even more complicated because we have similar tasks defined for “staging” and “production”, differing only in a where clause on “deployment”. They run on separate Kapacitor instances for isolation purposes, but they end up with the same “ID” and thus conflict at the OpsGenie level.

This can all be solved by properly setting the ID in the alert node to something unique. But this is a whole bunch of work, as we have many many tasks that would need to be updated to explicitly set the ID.

I was thinking that I could modify the OpsGenie plugin to not use the “alias” field and instead store the returned AlertID (the UUID from OpsGenie) as state in the alert node. On recovery, the alert node would add notes using the AlertID, not the alias. This would work for “inline” OpsGenie handlers, but not for alert topics. As I write this, I am leaning toward this not being something I should do, and toward just setting the ID in each task. I also believe that the Kapacitor alert handlers were designed as a fire-and-forget action, not to pull state back into the alert node from an external service.

Questions:

  • Is there a way to have an ID that is unique without explicitly setting it? It would be nice if Kapacitor defaulted the ID of the alert node to “<kapacitor_node_id>/<task_id>/<alert_node_number>”. That way the ID would be unique.
  • Are there any design guidelines or best practices for writing a service handler? I suspect that solving my issue by storing the alert ID returned by OpsGenie in the alert node handler state is a no-no.

Thanks
Kristopher

#2

Well, after writing this and reading it again, I have decided to simply set the alert ID. It is not that much work and is the right thing to do.

#3

@octalthorpe Thanks for the good discussion on the issue. The purpose of an alert's ID is to determine its uniqueness, so your conclusion to use unique alert IDs is the best solution, as the entire Kapacitor ecosystem operates on that assumption. By default the alert ID is unique based on the group-by tags, but as you are finding, it's possible for those tags to collide across different tasks. An easy fix would be to set the ID to {{ .TaskName }}:{{ .Group }}. That would be the default, but we could not change it for backward-compatibility reasons. As for staging vs production, if you have a tag for that, just group by it and that should ensure uniqueness.
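As a rough illustration of why a task-scoped ID avoids the collision, the sketch below builds IDs the two ways. It assumes the default ID has the shape "name:group", where the name is the measurement and the group is a comma-separated list of tag=value pairs. The helper names, task names, and tag values are all made up for the example, and this is plain Python string formatting, not Kapacitor's template engine.

```python
# Sketch of how ID composition affects uniqueness. Helper names, task names
# and tag values are hypothetical; the "name:group" shape is an assumption.


def group_string(tags):
    """Render group-by tags as 'k1=v1,k2=v2', sorted by tag key."""
    return ",".join(f"{key}={value}" for key, value in sorted(tags.items()))


def alert_id(prefix, tags):
    """Join a prefix (measurement or task name) with the rendered group."""
    return f"{prefix}:{group_string(tags)}"


# Two different tasks over the same measurement, both grouped by host.
# Keyed on the measurement name, their IDs collide.
by_measurement_a = alert_id("cpu", {"host": "web01"})
by_measurement_b = alert_id("cpu", {"host": "web01"})

# Keyed on the task name instead, the IDs are distinct.
by_task_a = alert_id("cpu_high_task", {"host": "web01"})
by_task_b = alert_id("cpu_idle_task", {"host": "web01"})

# Same task name on two instances: grouping by a deployment tag keeps
# staging and production apart.
staging = alert_id("cpu_high_task", {"host": "web01", "deployment": "staging"})
production = alert_id("cpu_high_task", {"host": "web01", "deployment": "production"})

print(by_measurement_a == by_measurement_b)
print(by_task_a, by_task_b)
print(staging, production)
```

The same reasoning carries over to the alias: since it follows the alert ID, distinct IDs give distinct aliases on the OpsGenie side.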

#4

Great, thanks for the follow up and confirmation. Things are working as expected now.

Cheers

#5

We have a similar issue with the alert ID and OpsGenie. OpsGenie deduplicates alerts with the same alias, which is set from the alert ID, so if a critical comes after a warning we don’t receive any notification about it. Is it possible to override the OpsGenie alias and still have alerts auto-closed?