I am having issues with Kapacitor and OpsGenie alerts: tasks are firing and adding notes to the wrong OpsGenie alert. Egon - “Never cross the streams”. This has to do with the way the OpsGenie API is being used, specifically the use of the “alias” field when opening a new alert.
We are using the old-style inline alert handlers rather than alert topics, and we have cases where the alert node ID is the same for two separate, distinct alerts, i.e. we are not setting the .ID property explicitly and we are doing a groupBy in the batch query. This means that if an alert is already open with an alias of, say, “foobar”, and another alert node that is part of the same groupBy fires, we fail to open a new alert on the OpsGenie side because the alias already exists. If the second node then recovers, a note indicating recovery is added to the first alert in OpsGenie, which is incorrect.
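To make the collision concrete, here is a toy model in Go. It is only a sketch: `fakeOpsGenie`, `Open` and `AddNote` are names I made up for illustration, and the only behaviour modelled is "one open alert per alias", which is my reading of what we observe, not the real OpsGenie API.

```go
package main

import "fmt"

// alert is a toy stand-in for an OpsGenie alert; not the real API type.
type alert struct {
	alias string
	notes []string
}

// fakeOpsGenie models only alias handling: one open alert per alias.
type fakeOpsGenie struct {
	open map[string]*alert
}

func newFakeOpsGenie() *fakeOpsGenie {
	return &fakeOpsGenie{open: make(map[string]*alert)}
}

// Open creates an alert unless one with the same alias is already open.
// It reports whether a new alert was created.
func (o *fakeOpsGenie) Open(alias string) bool {
	if _, exists := o.open[alias]; exists {
		return false
	}
	o.open[alias] = &alert{alias: alias}
	return true
}

// AddNote attaches a note to whichever open alert owns the alias.
func (o *fakeOpsGenie) AddNote(alias, note string) {
	if a, ok := o.open[alias]; ok {
		a.notes = append(a.notes, note)
	}
}

func main() {
	og := newFakeOpsGenie()
	fmt.Println(og.Open("foobar"))       // first alert node fires
	fmt.Println(og.Open("foobar"))       // second node, same ID: no new alert
	og.AddNote("foobar", "recovered")    // second node recovers
	fmt.Println(og.open["foobar"].notes) // the note sits on the first alert
}
```

The second `Open` creates nothing, so the recovery note from the second node has nowhere to go except the alert opened by the first.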
It gets even more complicated because we have similar tasks defined for “staging” and “production”, each with a where clause on the “deployment”. They run on separate Kapacitor instances for isolation purposes, but they end up with the same “ID” and thus conflict at the OpsGenie level.
This can all be solved by setting the ID in the alert node to something unique. But that is a lot of work, as we have many tasks that would need to be updated to set the ID explicitly.
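As a rough illustration of what "something unique" could look like, the sketch below renders two ID templates with Go's `text/template`, on the assumption that the alert ID is produced from a template of that kind. The `idData` struct and its fields (`Deployment`, `TaskName`, `Name`, `Group`) are hypothetical names for this example, not Kapacitor's internal types, so check which template fields your Kapacitor version actually exposes.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// idData is a hypothetical struct; its field names are assumptions
// chosen for this sketch, not Kapacitor's internal type.
type idData struct {
	Deployment string // e.g. "staging" or "production"
	TaskName   string
	Name       string // measurement name
	Group      string // e.g. "host=web01"
}

// renderID expands an ID template against the given data.
func renderID(tmpl string, d idData) (string, error) {
	t, err := template.New("id").Parse(tmpl)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, d); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	// A narrow template ignores task and deployment; a wide one includes them.
	const narrow = "{{ .Name }}:{{ .Group }}"
	const wide = "{{ .Deployment }}/{{ .TaskName }}/{{ .Name }}:{{ .Group }}"

	a := idData{"staging", "cpu_high", "cpu", "host=web01"}
	b := idData{"production", "cpu_high", "cpu", "host=web01"}

	for _, tmpl := range []string{narrow, wide} {
		ida, _ := renderID(tmpl, a)
		idb, _ := renderID(tmpl, b)
		fmt.Printf("%q vs %q collide=%v\n", ida, idb, ida == idb)
	}
}
```

With the narrow template both deployments render the same ID; adding the deployment and task name to the template is enough to separate them in this example.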
I was thinking that I could modify the OpsGenie plugin to not use the “alias” field and instead store the returned AlertID (the UUID from OpsGenie) as state in the alert node. On recovery, the alert node would add notes using the AlertID rather than the alias. This would work for “inline” OpsGenie handlers, but not for alert topics. As I write this, I am leaning towards this not being something I should do, and towards just setting the ID in each task instead. I also believe that the Kapacitor alert handlers were designed to be fire-and-forget actions, not to pull state back into the alert node from an external service.
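For reference, this is roughly the shape of what I had in mind, as a self-contained Go sketch. Everything here is hypothetical: the `client` interface, `statefulHandler`, and `fakeClient` are invented for illustration and do not correspond to the OpsGenie SDK or to Kapacitor's handler interfaces.

```go
package main

import "fmt"

// client is a hypothetical interface for this sketch; it is not the
// OpsGenie SDK or Kapacitor's service API.
type client interface {
	CreateAlert(message string) (alertID string, err error)
	AddNote(alertID, note string) error
}

// statefulHandler remembers the alert ID returned on creation, keyed by
// the Kapacitor event ID, so recovery notes never rely on an alias.
type statefulHandler struct {
	c      client
	alerts map[string]string // event ID -> alert ID returned by the service
}

// Handle opens an alert on a non-OK level and adds a note on recovery.
func (h *statefulHandler) Handle(eventID, level, message string) error {
	if level == "OK" {
		id, ok := h.alerts[eventID]
		if !ok {
			return nil // this handler never opened an alert for the event
		}
		delete(h.alerts, eventID)
		return h.c.AddNote(id, "recovered: "+message)
	}
	if _, ok := h.alerts[eventID]; ok {
		return nil // already open, nothing to do
	}
	id, err := h.c.CreateAlert(message)
	if err != nil {
		return err
	}
	h.alerts[eventID] = id
	return nil
}

// fakeClient hands out sequential IDs and records notes per alert.
type fakeClient struct {
	next  int
	notes map[string][]string
}

func (f *fakeClient) CreateAlert(string) (string, error) {
	f.next++
	return fmt.Sprintf("alert-%d", f.next), nil
}

func (f *fakeClient) AddNote(id, note string) error {
	f.notes[id] = append(f.notes[id], note)
	return nil
}

func main() {
	fc := &fakeClient{notes: map[string][]string{}}
	// Two inline handlers, one per alert node, sharing the same event ID.
	h1 := &statefulHandler{c: fc, alerts: map[string]string{}}
	h2 := &statefulHandler{c: fc, alerts: map[string]string{}}
	h1.Handle("foobar", "CRITICAL", "node 1 down")
	h2.Handle("foobar", "CRITICAL", "node 2 down")
	h2.Handle("foobar", "OK", "node 2 back")
	fmt.Println(fc.notes) // the note is recorded against the second alert only
}
```

The catch is visible in the sketch: the mapping lives inside each handler instance, so it only helps when a handler is tied to exactly one alert node, and in this form it would not survive a restart.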
Questions:
- Is there a way to have an ID that is unique without explicitly setting it? It would be nice if Kapacitor defaulted the ID of the alert node to “<kapacitor_node_id>/<task_id>/<alert_node_number>”. That way the ID would be unique.
- Are there any design guidelines or best practices for writing a service handler? I suspect that solving my issue by storing the alert ID returned by OpsGenie in the alert node handler state is a no-no.
Thanks
Kristopher