Telegraf redundancy/high availability

I have a setup that consists of 2 SCADA servers and 1 database server. On the SCADA servers I would like to run 2 telegraf instances with the disk buffer strategy for when the database is disconnected or is being rebooted.

The suggestions I have read were to just run the telegraf instances at the same time and let the precision of the database handle the duplicate data. However, my problem lies with the inputs. Some of my modbus inputs can only respond to a single master, so I need a way to determine which instance should poll the endpoint.

What I have made so far:

  • Using the exec output with a windows batch file, I am able to move input configuration files in and out of the telegraf scope. With the file monitor, telegraf automatically restarts.
  • Configure an inputs.mock to generate a heartbeat metric.
  • Configure outputs.http to transmit the heartbeat to the other instances.
  • Configure the inputs.http_listener_v2 to receive heartbeats.
Example config
[[inputs.mock]]
  alias = "reduncancy heartbeat"
  metric_name = "telegraf_1"
  interval = "5s"
  log_level = "info"
  [inputs.mock.tags]
    heartbeat = "true"
    internal_block_influxdb = "true"
  [[inputs.mock.step]]
    name = "value"
    start = 0.0
    step = 1.0

[[inputs.http_listener_v2]]
  alias = "reduncancy http listener input"
  service_address = "tcp://127.0.0.1:8087"
  paths = ["/reduncancy"]
  data_format = "influx"
  log_level = "debug"
  [inputs.http_listener_v2.tags]
    internal_reduncancy_received = "true"
    internal_block_influxdb = "true"

[[outputs.http]]
  alias = "reduncancy http output telegraf_1"
  flush_interval = "1s"
  metric_buffer_limit = 1
  url = "http://127.0.0.1:8087/reduncancy"
  method = "POST"
  data_format = "influx"
  use_batch_format = true
  log_level = "info"
  [outputs.http.tagpass]
    heartbeat = ["*"]
  [outputs.http.tagdrop]
    internal_reduncancy_received = ["*"]

[[outputs.http]]
  alias = "reduncancy http output telegraf_2"
  flush_interval = "1s"
  metric_buffer_limit = 1
  url = "http://127.0.0.1:8088/reduncancy"
  method = "POST"
  data_format = "influx"
  use_batch_format = true
  log_level = "info"
  [outputs.http.tagpass]
    heartbeat = ["*"]
  [outputs.http.tagdrop]
    internal_reduncancy_received = ["*"]

[[outputs.http]]
  alias = "reduncancy http output telegraf_3"
  flush_interval = "1s"
  metric_buffer_limit = 1
  url = "http://127.0.0.1:8089/reduncancy"
  method = "POST"
  data_format = "influx"
  use_batch_format = true
  log_level = "info"
  [outputs.http.tagpass]
    heartbeat = ["*"]
  [outputs.http.tagdrop]
    internal_reduncancy_received = ["*"]

[[processors.starlark]]
  alias = "reduncancy"
  script = "test.star"
  log_level = "debug"
  [processors.starlark.constants]
    configured_partners = '["telegraf_1","telegraf_2","telegraf_3"]'
    timeout = "10s"
  [processors.starlark.tagpass]
    internal_reduncancy_received = ["true"]

This all works great. With this setup you can configure multiple telegraf instances to send data to each other (e.g. to link 3 instances, each one gets 3 outputs.http plugins: 2 pointing to the remote instances and 1 to itself, to keep the configs looking the same).

The next step would be to write a starlark processor that keeps track of the last time a heartbeat was received and creates actions to send to the exec output. This is where I need some new ideas.

  • Can I solve this with only 2 instances?
  • Do I need a third instance to do some “voting” strategy?
  • When using multiple instances (2 or even more), how could I implement some load balancing (inputs have a primary instance, but if it is down a backup instance starts these plugins)?
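
To make the heartbeat bookkeeping concrete, here is a rough sketch in plain Python. It only uses functions and dicts so the logic could be ported to a Starlark script, where the dict would live in the processor's state object. The function names and the integer timestamps are assumptions for illustration, not anything Telegraf provides.

```python
# Sketch only: the names below are hypothetical, not part of Telegraf.
TIMEOUT_S = 10  # mirrors the "10s" timeout constant in the example config


def record_heartbeat(last_seen, node, timestamp_s):
    """Remember when a heartbeat from `node` last arrived."""
    last_seen[node] = timestamp_s


def alive_nodes(last_seen, configured, now_s, timeout_s=TIMEOUT_S):
    """Return the configured nodes whose last heartbeat is within the timeout."""
    return [
        node for node in configured
        if node in last_seen and now_s - last_seen[node] <= timeout_s
    ]


configured = ["telegraf_1", "telegraf_2", "telegraf_3"]
last_seen = {}
record_heartbeat(last_seen, "telegraf_1", 100)
record_heartbeat(last_seen, "telegraf_2", 95)

print(alive_nodes(last_seen, configured, now_s=104))  # telegraf_3 was never seen
print(alive_nodes(last_seen, configured, now_s=108))  # telegraf_2 has timed out
```

The list of alive nodes is what any later decision (voting, moving configs) would be based on.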

I know I am creating a complicated setup, but this is more of a “can we solve this” project than a “should we” one :slight_smile: . The neat part is that all redundancy config can reside in a single file and does not pollute the regular configuration.

Interesting scenario. However, it is not quite clear to me why you want to run 2 telegraf instances on the same machine (or did I misunderstand?). Usually, when running 2 telegraf instances for HA, they are on separate machines (ideally in separate DCs) to have a fallback when one of them has issues.

There is no need anymore to move/change config files in order to select certain inputs since the addition of Plugin labels and selectors.

I like the idea of the heartbeat with inputs.mock. I would personally use one of the metrics from inputs.internal instead, as I’m already using that for other analytics (like which input is run on which system).

It can indeed be done with only 2, but higher accuracy can be achieved with 3+ instances.
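
One way to read the difference between 2 and 3+ instances, assuming a simple majority rule (that rule is my assumption for this sketch, and `has_quorum` is a hypothetical helper):

```python
def has_quorum(alive, configured):
    """True when strictly more than half of the configured nodes are alive."""
    return len(set(alive) & set(configured)) * 2 > len(configured)


# Two nodes: an instance that loses its peer sees 1 of 2, which is not a
# majority, so it cannot tell "peer crashed" from "my own link is down".
print(has_quorum(["telegraf_1"], ["telegraf_1", "telegraf_2"]))

# Three nodes: an instance that still hears one other instance sees 2 of 3.
print(has_quorum(["telegraf_1", "telegraf_3"],
                 ["telegraf_1", "telegraf_2", "telegraf_3"]))
```

With only 2 instances you would have to accept that a lost link between them can lead to both polling the same endpoint.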

I’m using config management tools (like Ansible) that automatically generate the config, with all inputs spread over the available telegraf instances. (I still have to look into the aforementioned selectors.) So this isn’t real load balancing, but it makes it possible to spread all endpoints over the telegraf agents you have available.

With 2 instances I meant 1 on each SCADA server, my apologies for the confusion. Also, my example config uses localhost 3 times, but that is just for development.

I looked into the selectors, but this is not changeable on the fly when running telegraf as a service I presume. You would need to manipulate the telegraf arguments in the service in a way for this to work. It is a neat option for development though. It seems like a method for distributing configs between multiple instances that access the same config files (maybe on the same server).

Does Ansible monitor the running configs and change them when instances become unavailable?

@JeroenVH you are dealing with an interesting problem. :slight_smile: Let me understand the setup and your target a bit better…

From what I read you do have 2 hosts for collecting data from network devices such as Modbus TCP/IP servers (aka slave) and one separate database host acting as a data-sink. You now want to run one or more Telegraf service(s) (on the collection hosts) for gathering the data, buffer it and send it to the database. I also understood that only a single Telegraf instance can be active as some of the devices do not support parallel access.

This is all fine, but I’m not yet sure what your exact target is. You said you want to account for (longer) database downtime using disk-based buffering. What time-scales are you envisioning here? I’m asking because, depending on the number of collected metrics, often memory-based buffering is sufficient and more lightweight.

The original post reads like you want to also account for cases where one of the collection hosts goes down. Is this understanding correct? If so, must the service be migrated automatically to the other side?

Hi @srebhan

You understand correctly: 2 collection hosts and 1 data-sink host. The main objective is to cover downtime of any of the three servers (installing updates, for example). When the database server is down, the collection hosts will buffer the output. When one of the collection hosts is down, the remaining collection host has to collect all the metrics. The 2 collection hosts are the same servers on which our process control system runs (also a hot standby configuration). I could get a signal from that system to decide on the state of the telegraf instances, but I don’t want to entangle these programs.

Because of the limitations of the endpoints (maximum number of connections or don’t cope well with multiple masters) I want each input plugin to be active on only one host. A nice to have could also be that plugins are distributed among the collection hosts (predefined distribution and a fallback scenario). If it were only MQTT or OPCUA it would not really be a problem to run parallel.

I did some research into orchestrating the telegraf instances and came across “Microsoft failover cluster”. This might be a possibility, but then you start using features that are only available on Windows. If we can make multiple telegraf instances communicate with the http plugins, we can do this cross-platform.

Based on my proposal, how could a starlark processor be structured to get the available telegraf instances to agree on which config files to run on which host? They all receive messages from each other and have to cast a vote on who they think is available. This can all be stored in the state object to carry data from one metric message to the next, and the mock input will keep the processor updated at a regular interval. Using a periodic aggregator might also be possible, but then you need to create your own watchdog counter, something mock can also solve natively.

In my setup I can run 2 collection instances and 1 monitoring instance (on the data-sink host) to reach a quorum. But to make the code robust we should consider that there might be more than 3 instances (and also mix some operating systems, maybe someone wants to distribute their odds in case a cloud security software pushes a wrong update :smiley: ).

Normal operation distribution
  • Telegraf 1 (collection host)
    • HA-Config.conf
    • 1-EnergyGateway1.conf
    • 1-EnergyGateway2.conf
  • Telegraf 2 (collection host)
    • HA-Config.conf
    • 2-FlowmeterGateway1.conf
    • 2-FlowmeterGateway2.conf
  • Telegraf 3 (data-sink host)
    • HA-Config.conf
Interrupted operation (loss of telegraf 2 instance)
  • Telegraf 1 (collection host)
    • HA-Config.conf
    • 1-EnergyGateway1.conf
    • 1-EnergyGateway2.conf
    • 2-FlowmeterGateway1.conf
    • 2-FlowmeterGateway2.conf
  • Telegraf 3 (data-sink host)
    • HA-Config.conf
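
A rough sketch of how the two distributions above could be computed from one table. The `PREFERENCES` table (each input config with its hosts in order of preference) and the function name are hypothetical; HA-Config.conf is left out because it stays on every host.

```python
# Hypothetical table: each input config lists its hosts in order of preference.
PREFERENCES = {
    "1-EnergyGateway1.conf": ["telegraf_1", "telegraf_2"],
    "1-EnergyGateway2.conf": ["telegraf_1", "telegraf_2"],
    "2-FlowmeterGateway1.conf": ["telegraf_2", "telegraf_1"],
    "2-FlowmeterGateway2.conf": ["telegraf_2", "telegraf_1"],
}


def assign_configs(preferences, alive):
    """Give each config to the first preferred host that is alive."""
    assignment = {host: [] for host in alive}
    for config in sorted(preferences):
        for host in preferences[config]:
            if host in assignment:
                assignment[host].append(config)
                break
    return assignment


# Normal operation, then loss of the telegraf 2 instance.
print(assign_configs(PREFERENCES, ["telegraf_1", "telegraf_2", "telegraf_3"]))
print(assign_configs(PREFERENCES, ["telegraf_1", "telegraf_3"]))
```

Because every instance evaluates the same table against the same list of alive nodes, they reach the same answer without exchanging the assignment itself, as long as they agree on who is alive.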

You basically want 2 active/passive systems in one:
PlantUML diagram

This diagram shows both collectors having both configs/services, but only 1 is running on each. The following diagram shows the state when collector 1 disappears:
PlantUML diagram

Then the EnergyGateway service gets started, so all data is still collected. Something similar happens when Collector 2 stops.

Yes, that is mostly correct. Nice drawing :slight_smile:

Some remarks on the diagram to make it 100% correct:

  • The telegraf instance on the data-sink does not really handle any data (the arrow should not go through it). I would just put it there to add to the voting. Or should it do some additional buffering?
  • The storage is located in the data-sink square.
  • The collectors write directly to influxdb and the HA communication is http over an available port.

I picked InfluxDB to communicate as that’s telegraf’s native format. Why specifically HTTP? And which format do you want to transport over it?

I updated the diagram:
PlantUML diagram

Having the collector telegraf also send all data to the other collector would improve availability in case one of the collectors has connectivity issues to the data sink. The sending of data becomes redundant.

The data sent to and from the data sink telegraf is indeed only the heartbeat metric.

Ah, yes the data format for the heartbeats and data output is influxDB. I was confused because you mention tcp ports within the collector itself.

The link between the collectors is also a dotted heartbeat connection.

I found that inputs.http_listener_v2 and outputs.http were the easiest way to make telegraf instances communicate and also handle the endpoint being down, hence the HTTP.

To be fair, inputs.influxdb_v2_listener and outputs.influxdb_v2 are also simple, if not simpler.

The dashed connection was to indicate optional-ness.

Now it is just a matter of figuring out the aggregators.starlark script on the 2 collectors and the optional data sink.

It didn’t even occur to me to use the influxdb plugins for this. In essence they are doing the same thing. But I agree that it is much better since you can use a shared API key.

How would we go about moving configurations in and out of scope? Is using the exec plugin a good solution?
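
To make the question concrete, this is roughly what a script called from the exec output would have to do, sketched in Python rather than a batch file. The two directory names and the `sync_configs` helper are hypothetical; the sketch assumes telegraf only loads files from the active directory.

```python
import os
import shutil
import tempfile


def sync_configs(wanted, store_dir, active_dir):
    """Move files so that active_dir holds exactly the wanted configs."""
    # Park configs that this host should no longer run.
    for name in sorted(os.listdir(active_dir)):
        if name not in wanted:
            shutil.move(os.path.join(active_dir, name),
                        os.path.join(store_dir, name))
    # Activate wanted configs that are still parked.
    for name in sorted(wanted):
        source = os.path.join(store_dir, name)
        if os.path.exists(source):
            shutil.move(source, os.path.join(active_dir, name))


# Demo in throwaway directories (both paths are hypothetical).
root = tempfile.mkdtemp()
store_dir = os.path.join(root, "standby")
active_dir = os.path.join(root, "telegraf.d")
os.makedirs(store_dir)
os.makedirs(active_dir)
for name in ["1-EnergyGateway1.conf", "2-FlowmeterGateway1.conf"]:
    open(os.path.join(store_dir, name), "w").close()

sync_configs({"1-EnergyGateway1.conf"}, store_dir, active_dir)
print(sorted(os.listdir(active_dir)))
sync_configs({"2-FlowmeterGateway1.conf"}, store_dir, active_dir)
print(sorted(os.listdir(active_dir)))
```

Calling it again with the same set changes nothing, which matters when the heartbeat triggers it every few seconds.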

I assume that the aggregator needs to know the other instances, defined as a constant, otherwise it will never know if one specific instance is missing. I suggest using a JSON array inside a string constant.

[[aggregators.starlark]]
  period = "1s"
  [aggregators.starlark.constants]
    name = "collector_1"
    nodes = '["collector_1","collector_2","data_sink"]'
    timeout = "10s"

To start off, the aggregator must always send out a default metric. Does it need to include an incremental counter? Once it receives signals from other instances, it can start to populate the state object. After this it can add the information it knows to the heartbeat metric to inform the other nodes.

Suggestion for state
{
  "nodes": {
    "collector_1": {
      "local": true,
      "online": true,
      "lastseen": 1758801751,
      "activenodes": ["collector_1","collector_2","data_sink"]
    },
    "collector_2": {
      "local": false,
      "online": true,
      "lastseen": 1758801751,
      "activenodes": ["collector_1","collector_2","data_sink"]
    },
    "data_sink": {
      "local": false,
      "online": true,
      "lastseen": 1758801751,
      "activenodes": ["collector_1","collector_2","data_sink"]
    }
  }
}
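
A rough Python sketch of how that state could be filled and read, using only functions and dicts so it stays close to what a Starlark script allows. All function names are hypothetical, and it assumes the sender's activenodes list has already been decoded from the heartbeat metric.

```python
def new_state(local_name, nodes):
    """Build the initial state: only the local node is known to be online."""
    return {"nodes": {
        name: {"local": name == local_name, "online": name == local_name,
               "lastseen": 0, "activenodes": []}
        for name in nodes
    }}


def apply_heartbeat(state, sender, timestamp_s, sender_activenodes):
    """Record a heartbeat and the sender's own view of who is active."""
    entry = state["nodes"][sender]
    entry["lastseen"] = timestamp_s
    entry["activenodes"] = list(sender_activenodes)


def refresh_online(state, now_s, timeout_s):
    """Mark remote nodes online only if their heartbeat is recent enough."""
    for entry in state["nodes"].values():
        if not entry["local"]:
            entry["online"] = now_s - entry["lastseen"] <= timeout_s
    return [name for name, entry in state["nodes"].items() if entry["online"]]


def votes_for(state, candidate):
    """Count online nodes that list `candidate` in their activenodes."""
    return sum(1 for entry in state["nodes"].values()
               if entry["online"] and candidate in entry["activenodes"])


state = new_state("collector_1", ["collector_1", "collector_2", "data_sink"])
apply_heartbeat(state, "data_sink", 1758801751, ["collector_1", "data_sink"])
online = refresh_online(state, now_s=1758801755, timeout_s=10)
state["nodes"]["collector_1"]["activenodes"] = online  # the local node's own vote

print(online)
print(votes_for(state, "collector_1"), votes_for(state, "collector_2"))
```

Here collector_2 never sent a heartbeat, so it gets no votes, while collector_1 is confirmed by itself and by data_sink.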