[solved] Telegraf should reconnect after influxdb-timeouts

telegraf
influxdb

#1

Hey there,

I have several Docker Engine nodes running Telegraf natively and one InfluxDB container in Docker, which is configured as the output for Telegraf.

The problem is: when the InfluxDB container is unavailable for a short time, Telegraf does not try to reconnect. Telegraf's logging also stopped at that moment. As a result, Telegraf's metrics and logs for the last 10 days are missing.

The only solution is to restart Telegraf, which works fine.
This seems to be similar to "Telegraf recover from - or detect - temporary failure"

and is happening with

telegraf --version
Telegraf v1.4.5 (git: release-1.4 8385206e6851a212e04b355e3bf0b95421ed0e69)

Is there a way to make Telegraf reconnect after an InfluxDB timeout?

//edited: crosslink https://github.com/influxdata/telegraf/issues/2679#issuecomment-354802213


#2

Telegraf should retry automatically. If you can provide reproduction instructions, please open an issue.


#3

Thanks for that. I cannot reproduce it with

# telegraf --version
Telegraf v1.5.1 (git: release-1.5 0605af7c)

anymore.


#4

Telegraf should retry automatically

Is this true no matter how long InfluxDB is unavailable? This time we had a storage problem for some hours, and after it was fixed (InfluxDB was reachable via HTTP again) the Telegraf instances running on Ubuntu did not reconnect.

Or, the other way round: how long do I have to wait for a reconnection? I cannot find a setting for that in the Telegraf config.


#5

Yes, it will retry forever. Currently, Telegraf attempts a write either after metric_batch_size new metrics have been received or after flush_interval has elapsed, whichever comes first. The same rules apply to all writes, and Telegraf reconnects as part of the write attempt.
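
For reference, here is a minimal sketch of where those settings live in telegraf.conf. The values, the URL and the database name are illustrative assumptions, not recommendations. Based on the rules above, the wait after InfluxDB comes back should be roughly bounded by flush_interval rather than by a separate reconnect setting.

```toml
# Illustrative values only; the URL and database name are placeholders.
[agent]
  # How often inputs are collected.
  interval = "10s"
  # A write is attempted once this many new metrics have been received ...
  metric_batch_size = 1000
  # ... or once this interval has elapsed, whichever comes first.
  flush_interval = "10s"
  # Metrics that could not be written are held in memory up to this limit;
  # anything beyond it is dropped, so long outages can still lose data.
  metric_buffer_limit = 10000

[[outputs.influxdb]]
  urls = ["http://influxdb.example:8086"]
  database = "telegraf"
  # HTTP timeout for a single write attempt.
  timeout = "5s"
```

With settings like these, a retry would be expected about every 10 seconds during an outage. Raising metric_buffer_limit trades memory for a longer outage window that can be bridged without gaps.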