Telegraf Outputs.Health Example?

As of Telegraf 1.11.0 there is a new outputs.health plugin that exposes an HTTP endpoint for health checks. This is awesome because my Telegraf runs in Kubernetes (K8s), so setting up a proper health check against it is trivial.

However, there are few clear examples of the configuration, specifically around what qualifies as “unhealthy”. I’m looking into this because I recently found my Telegraf containers in a bad state: they were running but unable to post data to InfluxDB (the error said “Client.Timeout exceeded while awaiting headers”). It didn’t make sense, so I killed the containers and they came back just fine.

The only example of an “unhealthy” condition is from the docs:

  ## One or more check sub-tables should be defined, it is also recommended to
  ## use metric filtering to limit the metrics that flow into this output.
  ##
  ## When using the default buffer sizes, this example will fail when the
  ## metric buffer is half full.
  ##
  ## namepass = ["internal_write"]
  ## tagpass = { output = ["influxdb"] }
  ##
  ## [[outputs.health.compares]]
  ##   field = "buffer_size"
  ##   lt = 5000.0
  ##
  ## [[outputs.health.contains]]
  ##   field = "buffer_size" 

And honestly that doesn’t make much sense to me. So I’d love to hear from others that may have gotten the jump on this new feature.

Healthy is defined as ALL checks being true, and unhealthy as any check not being true. Right now we have only the compares and contains checks, but we might add more, and ultimately we will integrate Flux here as well. The plugin operates on any metric data, and only metric data, that Telegraf produces.

Like other output plugins it operates independently and as a sibling of other outputs. This means it doesn’t have direct access to the success/failure of the InfluxDB output, which is why in this example it uses the metrics from the internal input plugin to sense the state of the other output. The example configuration here doesn’t fail as soon as Telegraf is unable to write, but only after the metric buffer fills to 50%.
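For anyone who wants to try this, here is a minimal sketch of the docs example uncommented and wired together with the internal input. The InfluxDB URL, the database name, and the listen port are placeholders I made up for illustration, and the 5000 threshold assumes the default metric_buffer_limit of 10000, so adjust it if you have changed the buffer size.

```toml
# Emit Telegraf's own runtime metrics, including the internal_write
# measurement that reports per-output buffer statistics.
[[inputs.internal]]

# The output whose state we want to sense.
# The URL and database name are placeholders, not values from this thread.
[[outputs.influxdb]]
  urls = ["http://influxdb.example:8086"]
  database = "telegraf"

[[outputs.health]]
  # Address the health endpoint listens on (placeholder port).
  service_address = "http://:8080"

  # Only let the influxdb output's write statistics into this plugin,
  # so the checks below are not evaluated against unrelated metrics.
  namepass = ["internal_write"]
  tagpass = { output = ["influxdb"] }

  # Passes only while buffer_size stays below 5000 on the metrics seen,
  # i.e. the buffer is less than half full with the default limit of 10000.
  [[outputs.health.compares]]
    field = "buffer_size"
    lt = 5000.0

  # Passes only if at least one metric carried a buffer_size field, which
  # guards against the compares check having had nothing to evaluate.
  [[outputs.health.contains]]
    field = "buffer_size"
```

With something like this in place, a Kubernetes liveness probe can point an HTTP GET at the service_address port. As far as I understand the plugin README, the endpoint answers 200 while all checks pass and 503 otherwise.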

I’d say this plugin is a bit experimental in its design; we don’t have any prior plugins that behave the same way. I’ll be interested to hear how well this plugin works or doesn’t work for everyone.

P.S. The timeout error sounds a bit like https://github.com/influxdata/telegraf/issues/5905

Interesting. The linked issue does look like exactly the behavior I saw.