Telegraf redundancy/high availability

JeroenVH · September 25, 2025, 12:10pm

It didn’t even occur to me to use the influxdb plugins for this. In essence they do the same thing, but I agree it is much better since you can use a shared API key.

How would we go about moving configurations in and out of scope? Is using the exec plugin a good solution?

I assume the aggregator needs to know the other instances, defined as a constant, otherwise it will never know if one specific instance is missing. I suggest using a JSON array inside a string constant.

period = "1s"
[aggregators.starlark.constants]
  name = "collector_1"
  nodes = '["collector_1","collector_2","data_sink"]'
  timeout = "10s"
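To make the constants idea concrete, here is a rough sketch in plain Python (not the actual Starlark script) of how the script could turn them into a peer list and a timeout. The `CONSTANTS` dict and the helpers `parse_seconds` and `load_peers` are made-up names for illustration, and the duration parser only handles simple values like "10s" or "2m". In the Starlark script the decoding would go through whatever JSON support the plugin exposes.

```python
import json

# Hypothetical stand-in for the constants block in the config above.
CONSTANTS = {
    "name": "collector_1",
    "nodes": '["collector_1","collector_2","data_sink"]',
    "timeout": "10s",
}


def parse_seconds(value):
    """Parse a simple duration such as "10s" or "2m" into whole seconds."""
    units = {"s": 1, "m": 60, "h": 3600}
    return int(value[:-1]) * units[value[-1]]


def load_peers(constants):
    """Return (own name, list of peer names, timeout in seconds)."""
    nodes = json.loads(constants["nodes"])
    own = constants["name"]
    if own not in nodes:
        raise ValueError("local node is missing from the nodes list")
    # Everything in the list except ourselves is a peer we expect to hear from.
    peers = [n for n in nodes if n != own]
    return own, peers, parse_seconds(constants["timeout"])
```

Checking that the local name is in the list catches a copy-paste mistake early when the same config is rolled out to every node with only `name` changed.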

To start off, the aggregator must always send out a default metric. Does it need to include an incremental counter? Once it receives signals from other instances it can start to populate the state object. After this it can add the information it knows to the heartbeat metric to inform the other nodes.
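As a rough illustration of that heartbeat, here is a plain Python sketch that builds it as a dict. The layout (`heartbeat`, the `node` tag, the `seq` and `activenodes` fields) is only my assumption, not an existing Telegraf format. I included the counter as `seq` on the assumption that it lets a receiver spot a restart or a skipped beat, and the active nodes are joined into one string because, as far as I know, metric fields only hold scalar values.

```python
import time


def build_heartbeat(name, sequence, known_online, now=None):
    """Build a heartbeat as a plain dict (hypothetical layout, not a Telegraf metric)."""
    now = int(time.time()) if now is None else now
    # Always report ourselves, even before we have heard from any peer.
    active = sorted(set(known_online) | {name})
    return {
        "name": "heartbeat",
        "tags": {"node": name},
        "fields": {
            "seq": sequence,                   # incremental counter
            "activenodes": ",".join(active),   # list flattened to one string
        },
        "timestamp": now,
    }
```

With an empty `known_online` this yields the "default" metric that only mentions the local node, so the first beat can go out before any peer has been seen.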

Suggestion for state

{
  "nodes": {
    "collector_1": {
      "local": true,
      "online": true,
      "lastseen": 1758801751,
      "activenodes": ["collector_1","collector_2","data_sink"]
    },
    "collector_2": {
      "local": false,
      "online": true,
      "lastseen": 1758801751,
      "activenodes": ["collector_1","collector_2","data_sink"]
    },
    "data_sink": {
      "local": false,
      "online": true,
      "lastseen": 1758801751,
      "activenodes": ["collector_1","collector_2","data_sink"]
    }
  }
}
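To show how this state could be maintained, here is a minimal Python sketch, assuming heartbeats arrive with a sender name and that sender's list of active nodes. The function names `new_state`, `record_heartbeat` and `expire` are made up for illustration, and the real thing would have to be ported to the Starlark aggregator.

```python
def new_state(name, nodes, now):
    """Initial state: only the local node is known to be online."""
    return {
        "nodes": {
            n: {
                "local": n == name,
                "online": n == name,
                "lastseen": now if n == name else 0,
                "activenodes": [name] if n == name else [],
            }
            for n in nodes
        }
    }


def record_heartbeat(state, sender, activenodes, now):
    """Update the entry for a node we just received a heartbeat from."""
    node = state["nodes"].get(sender)
    if node is None:
        return  # sender is not in the configured node list, ignore it
    node["online"] = True
    node["lastseen"] = now
    node["activenodes"] = list(activenodes)


def expire(state, timeout, now):
    """Mark silent remote nodes offline and return the names still online."""
    for node in state["nodes"].values():
        if not node["local"] and now - node["lastseen"] > timeout:
            node["online"] = False
            node["activenodes"] = []
    return sorted(n for n, v in state["nodes"].items() if v["online"])
```

Calling `expire` once per period gives the list to put in the next heartbeat. The local node never expires here, since it is by definition running if the script is.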
