From our environment we occasionally have a very short window in which writes to an InfluxDB instance in the cloud fail. However, Telegraf does not seem to recover / retry after the initial failure:
2017-11-22T12:01:29Z E! Error writing to output [influxdb]: Could not write to any InfluxDB server in cluster
2017-11-22T12:03:09Z E! Error: statsd message queue full. We have dropped 1 messages so far. You may want to increase allowed_pending_messages in the config …
… and we slowly see the dropped-message count rise into the millions, until we restart the Telegraf instance, after which writes immediately succeed again. Ideally, I'm looking for a way for the instance to recover on its own (and send whatever data was not lost in the queue overflow), but failing that, I'd like a way to detect this unhealthy state and restart the instance.

Because I'm running this in Docker, with logging going straight from stdout & stderr to a remote location via GELF, tailing a log file and restarting based on that is quite complicated. Is there some API call I can make against the running Telegraf instance to detect this (or other) issues? What are people using to determine the health of not only the inputs, but also the outputs of a running Telegraf process? For reference, the relevant parts of my config and the workaround I'm considering are sketched below.
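The relevant parts of my config look roughly like this (host name and values are illustrative, not my real settings):

```toml
# Agent-level buffering: metrics that fail to flush stay in this buffer
# and are retried on the next flush interval, up to metric_buffer_limit.
[agent]
  interval = "10s"
  flush_interval = "10s"
  metric_batch_size = 1000
  metric_buffer_limit = 10000

# statsd input: allowed_pending_messages is the queue the error above
# refers to; raising it only delays the overflow while the output is stuck.
[[inputs.statsd]]
  service_address = ":8125"
  allowed_pending_messages = 10000

# InfluxDB output writing to the cloud instance (placeholder URL).
[[outputs.influxdb]]
  urls = ["https://influxdb.example.com:8086"]
  database = "telegraf"
  timeout = "5s"
```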
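The only workaround I've come up with so far is to have Telegraf report on itself via the internal input plugin and expose those metrics on a local prometheus_client output, so a Docker healthcheck (or anything else) could scrape http://localhost:9273/metrics and watch the internal_write buffer / dropped-metric counters without depending on the failing InfluxDB output. A rough sketch of what I mean (plugin options as I understand them; I may well be misusing them):

```toml
# Telegraf's own counters (internal_agent, internal_write, ...), which
# include per-output buffer sizes and dropped-metric counts.
[[inputs.internal]]
  collect_memstats = false

# Expose only the internal_* measurements on a local HTTP endpoint so a
# healthcheck can scrape them even while the InfluxDB output is failing.
[[outputs.prometheus_client]]
  listen = ":9273"
  namepass = ["internal_*"]
```

Is something along these lines the intended way to monitor output health, or is there a better / built-in mechanism I'm missing?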