Telegraf redundancy/high availability

It didn’t even occur to me to use the influxdb plugins for this. In essence they do the same thing, but I agree that they are the better choice since you can use a shared API key.

How would we go about moving configurations in and out of scope? Is using the exec plugin a good solution?

I assume the aggregator needs to know the other instances, defined as a constant; otherwise it will never know if one specific instance is missing. I suggest using a JSON array inside a string constant.

[[aggregators.starlark]]
  period = "1s"
  [aggregators.starlark.constants]
    name = "collector_1"
    nodes = '["collector_1","collector_2","data_sink"]'
    timeout = "10s"
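
To illustrate how the script could consume these constants, here is a rough sketch in plain Python (not Starlark, though the logic should carry over). The helper names `parse_nodes`, `parse_timeout` and `peers` are made up for this example, and the duration parser only handles simple values such as "10s" or "500ms".

```python
import json
import re

# Constants as they would arrive from the config: plain strings.
NAME = "collector_1"
NODES = '["collector_1","collector_2","data_sink"]'
TIMEOUT = "10s"

# Seconds per unit for the simple durations this sketch supports.
_UNITS = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0}


def parse_nodes(raw):
    """Decode the JSON array held inside the string constant."""
    nodes = json.loads(raw)
    if not isinstance(nodes, list) or not all(isinstance(n, str) for n in nodes):
        raise ValueError("nodes must be a JSON array of strings")
    return nodes


def parse_timeout(raw):
    """Convert a simple duration such as "10s" or "500ms" to seconds."""
    match = re.fullmatch(r"(\d+)(ms|s|m|h)", raw)
    if match is None:
        raise ValueError("unsupported duration: " + raw)
    return int(match.group(1)) * _UNITS[match.group(2)]


def peers(name, nodes):
    """Every configured node except the local one."""
    return [n for n in nodes if n != name]
```

With the values above, `peers(NAME, parse_nodes(NODES))` gives the two instances the local node should expect heartbeats from.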

To start off, the aggregator must always send out a default metric. Does it need to include an incremental counter? Once it receives signals from other instances, it can start to populate the state object. After that, it can add what it knows to the heartbeat metric to inform the other nodes.
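
As a rough sketch of that flow in plain Python (not Starlark): the functions `new_state`, `receive` and `heartbeat`, the metric layout and the `node` tag are all assumptions made for illustration. The counter question is left open here; the sketch relies on timestamps only.

```python
import json


def new_state(name, nodes):
    """Start with every configured node known, but only the local one online."""
    return {
        "nodes": {
            n: {"local": n == name, "online": n == name,
                "lastseen": 0, "activenodes": []}
            for n in nodes
        }
    }


def receive(state, sender, activenodes, now):
    """Record a heartbeat from another instance; unknown senders are ignored."""
    entry = state["nodes"].get(sender)
    if entry is None:
        return False
    entry["online"] = True
    entry["lastseen"] = now
    entry["activenodes"] = list(activenodes)
    return True


def heartbeat(state, name, now):
    """Build the outgoing heartbeat, including the nodes this instance sees."""
    local = state["nodes"][name]
    local["online"] = True
    local["lastseen"] = now
    active = [n for n, e in state["nodes"].items() if e["online"]]
    local["activenodes"] = active
    # The list is JSON-encoded so it fits into a single string field.
    return {
        "name": "heartbeat",
        "tags": {"node": name},
        "fields": {"activenodes": json.dumps(active)},
        "time": now,
    }
```

Before any peer has been heard from, `heartbeat` lists only the local node, which matches the "default metric" case.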

Suggestion for the state object:
{
  "nodes": {
    "collector_1": {
      "local": true,
      "online": true,
      "lastseen": 1758801751,
      "activenodes": ["collector_1","collector_2","data_sink"]
    },
    "collector_2": {
      "local": false,
      "online": true,
      "lastseen": 1758801751,
      "activenodes": ["collector_1","collector_2","data_sink"]
    },
    "data_sink": {
      "local": false,
      "online": true,
      "lastseen": 1758801751,
      "activenodes": ["collector_1","collector_2","data_sink"]
    }
  }
}
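
With a state object shaped like this, spotting a missing instance comes down to comparing `lastseen` with the timeout. A minimal sketch in plain Python, assuming `lastseen` is a Unix timestamp in seconds and the timeout has already been converted to seconds; `refresh_online` is a made-up name:

```python
def refresh_online(state, now, timeout):
    """Update the online flags and return the nodes that have timed out."""
    missing = []
    for node, entry in state["nodes"].items():
        if entry["local"]:
            # The local instance is online by definition.
            entry["online"] = True
            entry["lastseen"] = now
            continue
        entry["online"] = (now - entry["lastseen"]) <= timeout
        if not entry["online"]:
            missing.append(node)
    return missing


# Example state in the suggested shape.
example = {
    "nodes": {
        "collector_1": {"local": True, "online": True, "lastseen": 1758801751,
                        "activenodes": ["collector_1", "collector_2", "data_sink"]},
        "collector_2": {"local": False, "online": True, "lastseen": 1758801751,
                        "activenodes": ["collector_1", "collector_2", "data_sink"]},
        "data_sink": {"local": False, "online": True, "lastseen": 1758801751,
                      "activenodes": ["collector_1", "collector_2", "data_sink"]},
    }
}
```

Calling `refresh_online(example, now, 10)` once per period would give the script the list of instances whose configurations may need to be taken over.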