Telegraf Outputs.Health Example?

ShakataGaNai · June 27, 2019, 10:50pm

As of Telegraf 1.11.0 there is a shiny new outputs.health plugin which provides an HTTP endpoint for health checks. This is awesome because my Telegraf runs in K8s, so setting up a proper health check against it is trivial.

However, there is not much in the way of clear configuration examples, specifically around what qualifies as “unhealthy”. The reason I’m looking into this is that I recently found my Telegraf containers had gotten into a bad state where they were running but unable to post data to InfluxDB (the error was “client.timeout exceeded while awaiting headers”). It didn’t make sense, so I killed the containers and they came back just fine.

The only example of “unhealthy” is from the docs:

  ## One or more check sub-tables should be defined, it is also recommended to
  ## use metric filtering to limit the metrics that flow into this output.
  ##
  ## When using the default buffer sizes, this example will fail when the
  ## metric buffer is half full.
  ##
  ## namepass = ["internal_write"]
  ## tagpass = { output = ["influxdb"] }
  ##
  ## [[outputs.health.compares]]
  ##   field = "buffer_size"
  ##   lt = 5000.0
  ##
  ## [[outputs.health.contains]]
  ##   field = "buffer_size"

And honestly that doesn’t make much sense to me. So I’d love to hear from others that may have gotten the jump on this new feature.

daniel · July 1, 2019, 11:18pm

Healthy is defined as ALL checks being true, and unhealthy is any check not being true. Right now we have only the compares and contains checks, but we might add more and ultimately we will integrate Flux here as well. The plugin operates on any metric data, and only metric data, that Telegraf produces.

Like other output plugins it operates independently and as a sibling of other outputs. This means it doesn’t have direct access to the success/failure of the InfluxDB output, which is why in this example it uses the metrics from the internal input plugin to sense the state of the other output. The example configuration here doesn’t fail as soon as Telegraf is unable to write, but only after the metric buffer fills to 50%.
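To make that concrete, here is a minimal sketch of the docs example with the comment markers removed and the internal input enabled. It assumes the default metric_buffer_limit of 10000 (which is why 5000 means half full), that the output being watched is tagged output = "influxdb" (other outputs, such as the v2 one, will carry a different tag value), and that port 8080 is free; adjust all three for your setup.

```toml
# Emit Telegraf's own runtime metrics, including the per-output
# internal_write measurement that carries buffer_size.
[[inputs.internal]]

[[outputs.health]]
  # Address the health endpoint listens on (assumed port).
  service_address = "http://:8080"

  # Metric filters must appear before the check sub-tables, otherwise
  # TOML would treat them as keys of the last sub-table.
  # Only the InfluxDB output's write stats reach the checks.
  namepass = ["internal_write"]
  tagpass = { output = ["influxdb"] }

  # True while fewer than 5000 metrics are buffered for that output.
  [[outputs.health.compares]]
    field = "buffer_size"
    lt = 5000.0

  # True only if a metric with a buffer_size field was actually seen,
  # so the endpoint does not pass merely because no data arrived.
  [[outputs.health.contains]]
    field = "buffer_size"
```

With this in place the endpoint should return 200 while both checks hold and 503 once either fails, so a K8s liveness or readiness probe can point an HTTP GET at it. Because the check only trips at half a buffer, expect a delay between the first failed write and the probe going red.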

I’d say this plugin is a bit experimental in its design; we don’t have any prior plugins that behave the same way. I’ll be interested to hear how well this plugin works or doesn’t work for everyone.

P.S. The timeout error sounds a bit like https://github.com/influxdata/telegraf/issues/5905

ShakataGaNai · July 2, 2019, 1:12am

Interesting. The linked issue does look like exactly the same behavior as I saw.
