High Availability Telegraf

Hello everyone,
Our team built a solution that stores time series data from heating controllers in InfluxDB Cloud.
We currently handle data from thousands of controllers, and we expect to ingest data from several thousand more.

Our setup works as follows: a collector process receives a message with the data from each controller roughly every 10 minutes (each controller reports on its own schedule; the messages are not aligned, nor do they need to be). For each message, multiple fields are written to InfluxDB Cloud, tagged with the device's unique identifier, using the client libraries (Node.js). In reality, the collector does not write directly to InfluxDB but to a single Telegraf instance running in the same Kubernetes cluster. That Telegraf instance buffers the writes and also dual-writes to a Kapacitor instance running in the same Kubernetes cluster, which we use for some streaming data pre-processing.
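To make the shape of the data concrete, here is a rough sketch of how one field set from a controller message could be rendered as InfluxDB line protocol. It is written in Python for brevity and is not our actual collector (which uses the Node.js client library); the measurement name `heating`, the tag key `device_id`, and the field names are made up for the example.

```python
def escape_tag(value: str) -> str:
    # Tag keys and values must escape commas, equals signs and spaces.
    for ch in (",", "=", " "):
        value = value.replace(ch, "\\" + ch)
    return value


def to_line(measurement: str, device_id: str, fields: dict, ts_ns: int) -> str:
    # Assumes simple measurement and field names that need no escaping,
    # and numeric (float) field values only.
    tag_set = f"device_id={escape_tag(device_id)}"
    field_set = ",".join(f"{key}={float(val)}" for key, val in sorted(fields.items()))
    # The timestamp (nanoseconds) is set here, by the writer, not by Telegraf.
    return f"{measurement},{tag_set} {field_set} {ts_ns}"


# Hypothetical controller reading.
line = to_line(
    "heating",
    "ctrl 0042",
    {"supply_temp": 61.5, "return_temp": 48.0},
    1700000000000000000,
)
print(line)
```

The point that matters for the rest of this thread is that the timestamp is attached by the writer before the point ever reaches Telegraf.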

To sum up, a program writes to Telegraf using the client libraries, and Telegraf buffers some of the writes and dual-writes to InfluxDB Cloud and a local Kapacitor.

This Telegraf instance is a single point of failure in our data ingestion.
If it were unavailable, no data would be written to InfluxDB (or Kapacitor, for that matter, but that is not our biggest worry).
We have not had any problems with that single Telegraf instance, and its metrics show it is healthy, but we would like to ensure that data still reaches InfluxDB if it fails.

Does anyone have suggestions on how we could achieve some form of high availability for Telegraf in our particular setup?

Thanks in advance

Hey @danielgonnet, thanks for sharing this with us. What you could do is run multiple Telegraf instances, as long as the data is timestamped correctly. This will of course lead to multiple identical writes to the InfluxDB instances, but if you can live with this…
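The reason correct timestamps matter: InfluxDB identifies a point by its measurement, tag set and timestamp, so a second write of the same point lands on the existing one instead of creating a new one. Here is a toy Python model of that behaviour, only to illustrate the idea. It is not how InfluxDB is implemented, and `TinyStore` and the example names are invented for this sketch.

```python
from typing import Dict, Tuple

# A point's identity: (measurement, sorted tag pairs, timestamp in ns).
PointKey = Tuple[str, Tuple[Tuple[str, str], ...], int]


class TinyStore:
    """Toy stand-in for a bucket; keeps one field set per point identity."""

    def __init__(self) -> None:
        self.points: Dict[PointKey, Dict[str, float]] = {}

    def write(self, measurement: str, tags: Dict[str, str],
              fields: Dict[str, float], ts_ns: int) -> None:
        key = (measurement, tuple(sorted(tags.items())), ts_ns)
        # Same identity: fields are merged, newer values overwrite older ones.
        self.points.setdefault(key, {}).update(fields)


store = TinyStore()
tags = {"device_id": "ctrl-0042"}   # hypothetical device
fields = {"supply_temp": 61.5}
ts = 1700000000000000000            # set once, by the original writer

# Two Telegraf replicas forward the same point: still one stored point.
store.write("heating", tags, fields, ts)
store.write("heating", tags, fields, ts)
count_same_ts = len(store.points)

# If each replica stamped the point itself, the timestamps would differ
# and the "duplicate" would become a second, distinct point.
store.write("heating", tags, fields, ts + 1)
count_other_ts = len(store.points)
```

So the duplicates are harmless only if the timestamp comes from the original writer and every replica forwards it unchanged.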

Hi,

Having multiple Telegraf instances write the same information to all of our targets may work for InfluxDB (it multiplies write operations, but does no harm from the data perspective), but for Kapacitor it would result in multiple alerts being fired unless something is done on the Kapacitor side or on the alert sink side.

But your suggestion gave us an idea: remove dual-writing from Telegraf and have one set of fanned-out instances writing the same information to InfluxDB, and another set of load-balanced instances writing to Kapacitor, so that the duplication is removed before it reaches Kapacitor.

Does anyone think this is a very bad idea?

Maybe a small image would help here. :wink:

Sure.

This is the current solution: a data writer that writes via the client libraries to a single Telegraf that dual-writes to InfluxDB Cloud and to a Kapacitor hosted in the same Kubernetes cluster.
[Image: no_ha.current]

This is a proposal based on what you suggested, in which we merely replicate the single Telegraf and fan out the traffic to each replica. This is problematic because of the possible duplicate alerts coming out of Kapacitor.
[Image: multiple_dual_writers.proposal]

If, instead of fanning out the data-writer traffic to every dual-writing Telegraf, we load-balanced that traffic (each Telegraf would get a different slice of the data), we would not have potential duplication of alerts from Kapacitor (in the best-case scenario where all Telegrafs are working as they should), but we could lose data if one of the Telegrafs dies without flushing its buffer.

The other proposal I have in mind, after thinking about what you suggested, is to remove the dual-writing capability from the replicated Telegrafs. Instead, we would have one set of Telegrafs that write to InfluxDB Cloud, with traffic from the data writer fanned out to them (all of them write the same points), and another set of load-balanced Telegrafs that write to Kapacitor (each point is written by exactly one of them), thus avoiding duplicate alerts.
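The routing pattern for this last proposal could be sketched roughly as follows. This is Python for illustration only. In practice the routing would more likely live in the cluster's load balancing than in the writer, the URLs are placeholders, and hashing on the device identifier is just one possible way to pick a single Kapacitor-facing Telegraf.

```python
import zlib
from typing import List

# Placeholder endpoints, not real services.
INFLUX_TELEGRAFS = ["http://telegraf-influx-0:8086", "http://telegraf-influx-1:8086"]
KAPACITOR_TELEGRAFS = ["http://telegraf-kap-0:8086", "http://telegraf-kap-1:8086"]


def route(device_id: str, influx_side: List[str], kapacitor_side: List[str]) -> List[str]:
    """Return every endpoint that should receive a point from this device."""
    # Fan-out: every InfluxDB-facing Telegraf gets the same point.
    targets = list(influx_side)
    # Load balancing: exactly one Kapacitor-facing Telegraf gets the point.
    # crc32 is stable across runs, so a device always maps to the same one.
    index = zlib.crc32(device_id.encode("utf-8")) % len(kapacitor_side)
    targets.append(kapacitor_side[index])
    return targets


targets = route("ctrl-0042", INFLUX_TELEGRAFS, KAPACITOR_TELEGRAFS)
print(targets)
```

Note that the Kapacitor path in this sketch still has the data-loss risk described above: if the one Telegraf chosen for a device dies with unflushed points, Kapacitor never sees them.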

I am not sure whether the “higher availability” makes up for the complexity of the infrastructure needed to route traffic with the right pattern.

Hope this clarifies a little more.

I see. Currently, there is no HA version of Telegraf in the sense that you can run multiple instances with only one doing the writes and have another take over if it fails. Supporting this would also require a bigger change to Telegraf’s architecture, so do not expect it to happen anytime soon. Maybe using two Telegrafs with Edge Data Replication could help, but I think you would then just create another single point of failure.
Let me know if you come up with a solution, as this is definitely an interesting problem.