Telegraf recover from - or detect - temporary failure

telegraf
#1

Currently, in our environment we sometimes have a very short window of trouble writing to an InfluxDB instance in the cloud. However, Telegraf does not seem to recover from / retry this after the initial failure.

2017-11-22T12:01:29Z E! Error writing to output [influxdb]: Could not write to any InfluxDB server in cluster
2017-11-22T12:03:09Z E! Error: statsd message queue full. We have dropped 1 messages so far. You may want to increase allowed_pending_messages in the config …

… and we slowly see this rise to millions and millions, until we restart the Telegraf instance, after which everything instantly works again. Ideally I’m looking for a way for the instance to recover on its own (and send any data not yet lost in the overflow of the queue), but failing that, I’d like a way to detect the failing health of the instance and restart it.

Because I’m running this in Docker with logging going straight from stdout & stderr to a remote location via GELF, tailing a log file and restarting based on that is quite complicated. Is there some API call I can make to the running Telegraf instance to determine this (or other) issue? What are people using to determine not only the health of the inputs, but also of the outputs of a running Telegraf process?
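For completeness: the queue that the error message refers to is sized by `allowed_pending_messages` on the statsd input. A minimal sketch of raising it (the value shown is just an example, and raising it only buys time during an outage; it does not make Telegraf retry failed writes):

```toml
[[inputs.statsd]]
  service_address = ":8125"
  # Default is 10000; messages beyond this are dropped while the
  # output is unable to flush.
  allowed_pending_messages = 100000
```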

[solved] Telegraf should reconnect after influxdb-timeouts
#2

You can use the internal plugin to keep an eye on Telegraf, but you will need a deadman’s switch to deal with a complete loss of metrics; I believe you can use Kapacitor for that.
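A minimal sketch of enabling it (`collect_memstats` is optional):

```toml
# Expose Telegraf's own metrics (per-plugin gather stats, write stats,
# dropped metrics) as a regular input.
[[inputs.internal]]
  # Also collect Go runtime memory stats
  collect_memstats = true
```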

Telegraf ought to reconnect, though. Is there a log line that starts with E! InfluxDB Output Error:?

#3

A more complete error list follows:

2017-11-22T11:59:21Z E! InfluxDB Output Error: Post https://<servername>.influxcloud.net:8086/write?db=<dbname>: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2017-11-22T11:59:21Z E! Error writing to output [influxdb]: Could not write to any InfluxDB server in cluster
2017-11-22T11:59:25Z E! InfluxDB Output Error: Response Error: Status Code [503], expected [204], [<nil>]
2017-11-22T11:59:26Z E! InfluxDB Output Error: Response Error: Status Code [503], expected [204], [<nil>]
2017-11-22T11:59:26Z E! Error writing to output [influxdb]: Could not write to any InfluxDB server in cluster
2017-11-22T11:59:26Z E! InfluxDB Output Error: Response Error: Status Code [503], expected [204], [<nil>]
2017-11-22T11:59:29Z E! InfluxDB Output Error: Response Error: Status Code [503], expected [204], [<nil>]
2017-11-22T11:59:29Z E! InfluxDB Output Error: Response Error: Status Code [503], expected [204], [<nil>]
2017-11-22T11:59:29Z E! InfluxDB Output Error: Response Error: Status Code [503], expected [204], [<nil>]
2017-11-22T11:59:29Z E! Error writing to output [influxdb]: Could not write to any InfluxDB server in cluster
2017-11-22T11:59:40Z E! Error writing to output [influxdb]: Could not write to any InfluxDB server in cluster

… etc. However, due to the nature of our setup / auto-discovery of endpoints, this (same) endpoint can be configured up to 3 times in total; could that be related to the failure to recover?

Note that the intermittent errors are most likely on our side, and that I have seen Telegraf recover from similar situations, just not always. A heartbeat-like scenario (send a metric with the current autogenerated node name, and kill (or a better alternative?) the node if the heartbeat has not arrived for more than X minutes) is possible, but a bit elaborate to set up. A method where the internal metrics alone could be written to a file or socket for local inspection would probably be preferred, but I do not see an option to limit an output to only the metrics gathered by internal; is there such an option?

#4

BTW, I will probably try a SIGHUP on failure to see if that works, although https://github.com/influxdata/telegraf/issues/2679 states there will be dropped data; if the SIGHUP comes early enough to not drop too much data, I’m fine with it. Luckily I’m using this just for statistics, where dropping data for a short amount of time is perfectly acceptable, just not for as long as we’re sometimes experiencing :slight_smile:
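A sketch of what I have in mind (the container name telegraf is an assumption; adjust for your setup):

```shell
# Send SIGHUP to Telegraf inside the container to trigger a reload
docker kill --signal=HUP telegraf

# Or, when Telegraf runs directly on the host:
kill -HUP "$(pidof telegraf)"
```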

#5

Does the InfluxDB Output Error message repeat even after the InfluxDB server is up again? Another interesting test: the next time this occurs, can you exec into the Docker container and try to contact the InfluxDB server directly with a tool like curl?
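For example, a sketch of such a check (the container name is a placeholder, and the hostname is the elided endpoint from your logs; /ping is InfluxDB’s health endpoint and returns 204 when the server is reachable):

```shell
# From inside the Telegraf container: check basic reachability of InfluxDB.
docker exec -it <telegraf-container> \
  curl -sS -o /dev/null -w "%{http_code}\n" \
  "https://<servername>.influxcloud.net:8086/ping"
# A printed 204 means the server is up and reachable from the container.
```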

A method where the the internal metrics only could be written fo file or socket to inspect locally would probably be preferred, but I do not see an option to limit the output to this to only metrics gathered with internal, is there such an option?

The way to do this is to add a file output with a namepass for internal*; you can also exclude these metrics from your InfluxDB output if desired:

[[outputs.file]]
  files = ["/tmp/telegraf.out"]
  namepass = ["internal*"]

[[outputs.influxdb]]
  namedrop = ["internal*"]
#6
  • The actual error message about output does not repeat when the server is back up again, but the “statsd message queue full. We have dropped X messages” error does keep repeating & incrementing (and no new data is being sent).
  • Next time this happens I’ll be sure to do that; however, I seem to remember having tried this before and succeeding in contacting the server. My memory is fuzzy though, so next time it happens (which is unpredictable…) I can give a more definitive answer there, and I’ll also try a SIGHUP at that point, and if that fails a restart, in that order.

And for some reason I managed to miss the quite clear namepass in the documentation, which is on me :slight_smile:
I’ll fiddle with it somewhat more, and will get back with the extra data once it happens again.

#7

Of course, this still has not happened again… Still waiting for the next opportunity to reproduce it, but luckily it’s rare :smile: