As of Telegraf 1.11.0 there is a shiny new outputs.health plugin which provides an HTTP endpoint for health checks. This is awesome because my Telegraf runs in K8s, so setting up a proper health check against that endpoint is trivial.
However, there aren't many clear examples of the configuration, specifically around what qualifies as “unhealthy”. The reason I'm looking into this is that I recently found my Telegraf containers had gotten into a bad state where they were running but unable to post data to InfluxDB (the error said “client.timeout exceeded while awaiting headers”). It didn't make sense, so I killed the containers and they came back just fine.
The only example of an “unhealthy” condition is the one from the docs:
    ## One or more check sub-tables should be defined, it is also recommended to
    ## use metric filtering to limit the metrics that flow into this output.
    ##
    ## When using the default buffer sizes, this example will fail when the
    ## metric buffer is half full.
    ##
    ## namepass = ["internal_write"]
    ## tagpass = { output = ["influxdb"] }
    ##
    ## [[outputs.health.compares]]
    ##   field = "buffer_size"
    ##   lt = 5000.0
    ##
    ## [[outputs.health.contains]]
    ##   field = "buffer_size"
And honestly that doesn't make much sense to me. So I'd love to hear from others who may have gotten a jump on this new feature.
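For reference, here is roughly what I think the docs example looks like once it is uncommented. Treat it as a sketch, not a tested config. The [[inputs.internal]] block and the port in service_address are my own assumptions (the internal_write metrics have to come from somewhere), and the comments reflect my reading of the plugin README: the endpoint should answer 200 while every check passes and 503 otherwise.

```toml
# Assumption: the internal input is what emits internal_write metrics.
[[inputs.internal]]

[[outputs.health]]
  # Assumption: listen on all interfaces, port 8080.
  service_address = "http://:8080"

  # Only let the write stats for the influxdb output reach this plugin.
  namepass = ["internal_write"]
  tagpass = { output = ["influxdb"] }

  # Must hold on every metric that has the field. With the default
  # metric_buffer_limit of 10000, this fails once the buffer is half full.
  [[outputs.health.compares]]
    field = "buffer_size"
    lt = 5000.0

  # Passes only if at least one metric carries the field, so the check
  # also fails when no internal_write metrics arrive at all.
  [[outputs.health.contains]]
    field = "buffer_size"
```

If that reading is right, the case I hit (writes to InfluxDB timing out) should show up as buffer_size climbing past 5000, at which point the endpoint goes unhealthy and a K8s liveness probe pointed at it could restart the container.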