I am doing some load testing on our InfluxDB and Telegraf setup. Things seem to be going well. We have a two-layer setup: one layer reads from UDP and sends to Kafka, and another layer reads from Kafka and writes to Influx.
Yesterday afternoon I noticed that I had started getting some non-zero readings for “gather_errors” on the layer that reads from Kafka. I’m running the logs in debug mode and don’t see anything, but the internal agent metrics show a steady background value of 317 for gather errors.
Any thoughts? Is there a way I can get more insight into what is going on here?
There should be a log message each time this value increments; I don’t know of a way it could increment without an error being written to the log. The value is not reset until Telegraf restarts, so maybe the error happened far in the past?
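Since the value is cumulative and only resets on restart, one way to tell whether errors are still happening is to look at the change between consecutive samples rather than the raw number. Here is a minimal Python sketch of that idea. It assumes you have already exported the gather_errors readings as (timestamp, value) pairs; the function name and the sample data are hypothetical, not part of Telegraf.

```python
from typing import Iterable, List, Tuple

# (unix timestamp, cumulative gather_errors value); format is an assumption
Sample = Tuple[float, int]


def counter_increments(samples: Iterable[Sample]) -> List[Tuple[float, int]]:
    """Return (timestamp, increase) for every sample after the first.

    A drop in the value is treated as a counter reset (for example a
    Telegraf restart), so the new value itself is counted as the increase.
    """
    ordered = sorted(samples)
    increments = []
    for (_, prev), (ts, cur) in zip(ordered, ordered[1:]):
        increase = cur - prev if cur >= prev else cur
        increments.append((ts, increase))
    return increments


# Hypothetical readings: the counter sits at 317 the whole time.
flat = [(0, 317), (10, 317), (20, 317)]
print(counter_increments(flat))  # every increase is 0
```

If every increase is 0, the errors were counted some time ago and nothing new is being logged because nothing new is failing. Non-zero increases would give you the time windows to check in the log file.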
What I’m observing is a gather_errors value of 317 on all hosts. I am not seeing anything in the log file even though I have debug = true.
We have quite a sophisticated setup with good monitoring, and the data seems to be getting through to Influx, but not understanding what these errors are bothers me.