I have several Docker engine nodes with Telegraf running natively, plus one InfluxDB container in Docker, which is configured as the output for Telegraf.
The problem: when the InfluxDB container is unavailable for a short time, Telegraf does not try to reconnect. Telegraf's own logging also stopped at that moment. As a result, the metrics and logs from Telegraf for the last 10 days are missing.
Does this happen no matter how long InfluxDB is unavailable? This time we had a storage problem for some hours, and after fixing it (InfluxDB was reachable again via HTTP) the Telegraf instances running on Ubuntu did not reconnect.
Or, put the other way round: how long do I have to wait for a reconnection? I cannot find a value for that in the Telegraf config.
Yes, it will retry forever. Currently, Telegraf attempts to write either after metric_batch_size new metrics have been received or after flush_interval, whichever comes first. These are the same rules used for all writes, and it will reconnect during the write.
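For reference, those settings live in the [agent] section of telegraf.conf. A minimal sketch, with illustrative values and an assumed InfluxDB hostname (adjust to your setup):

```toml
# telegraf.conf — agent-level write/buffer settings (values are examples, not recommendations)
[agent]
  ## Flush the output buffer at least this often.
  flush_interval = "10s"
  ## Send a write as soon as this many metrics are buffered.
  metric_batch_size = 1000
  ## Metrics kept in memory while the output is unreachable;
  ## the oldest metrics are dropped once this limit is exceeded.
  metric_buffer_limit = 10000

[[outputs.influxdb]]
  ## "influxdb" is an assumed hostname for the InfluxDB container.
  urls = ["http://influxdb:8086"]
  database = "telegraf"
  timeout = "5s"
```

Note that metric_buffer_limit bounds how much data survives an outage: metrics beyond that limit are dropped, which is one reason a long outage can leave gaps even though Telegraf keeps retrying.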
I see the same behavior. However, I noticed that the Docker container's uptime matches the time the logging stopped, as if Docker had restarted the container when the error log was generated. Yet I do not see Telegraf loading its inputs/outputs the way it does when I restart it manually.
I am using the Telegraf ping input to monitor local-device and internet availability in a high-latency/packet-loss environment, with InfluxDB hosted centrally in a datacenter, so this happens quite often.
I tried enabling the debug logs but did not see anything except that Telegraf could not write to its outputs.influxdb.