[solved] Telegraf should reconnect after influxdb-timeouts

Hey there,

I have several Docker Engine nodes with Telegraf running natively, and one InfluxDB container in Docker that is configured as the output for Telegraf.

The problem is: when the InfluxDB container is unavailable for a short time, Telegraf does not try to reconnect. Telegraf's logging also stopped at that moment. As a result, the metrics and logs from Telegraf for the last 10 days are missing.

The only fix is to restart Telegraf, which works fine.
This seems similar to the thread "Telegraf recover from - or detect - temporary failure"

and it happens with

telegraf --version
Telegraf v1.4.5 (git: release-1.4 8385206e6851a212e04b355e3bf0b95421ed0e69)

Is there a way to get Telegraf to reconnect after an InfluxDB timeout?

Edit: cross-link to "telegraf dropped/purged/truncated its output buffer on SIGHUP", Issue #2679, influxdata/telegraf on GitHub

Telegraf should retry automatically. If you can provide reproduction instructions, please open an issue.

Thanks for that. I cannot reproduce it with

# telegraf --version
Telegraf v1.5.1 (git: release-1.5 0605af7c)

anymore.

Telegraf should retry automatically

Is this true no matter how long InfluxDB is unavailable? This time we had a storage problem for some hours, and after fixing it (InfluxDB was reachable via HTTP again) the Telegraf instances running on Ubuntu did not reconnect.

Or, to put it the other way round: how long do I have to wait for a reconnection? I cannot find a setting for that in the Telegraf config.

Yes, it will retry forever. Currently, Telegraf attempts to write either after metric_batch_size new metrics have been received or after flush_interval has elapsed, whichever comes first. These are the same rules that are used for all writes, and it will reconnect during the write.
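To make the "whichever comes first" rule concrete, here is a minimal Python sketch of it. This is a toy model and not Telegraf's actual code: the function names, the one-second tick, the metric rate, and the values 1000 and 10 seconds are all assumptions chosen for illustration. It only shows that, under the rule described above, write attempts keep happening on schedule during an outage and the first attempt after the outage succeeds.

```python
from typing import List, Tuple


def should_attempt_write(new_metrics: int,
                         seconds_since_last_attempt: float,
                         metric_batch_size: int = 1000,   # assumed value
                         flush_interval: float = 10.0     # assumed value, seconds
                         ) -> bool:
    """Return True when either trigger fires, whichever comes first."""
    return (new_metrics >= metric_batch_size
            or seconds_since_last_attempt >= flush_interval)


def simulate_outage(outage_seconds: int,
                    metrics_per_second: int = 5,
                    total_seconds: int = 60) -> List[Tuple[int, bool]]:
    """Tick once per second and record (time, succeeded) for every write attempt.

    The output is treated as unreachable for the first `outage_seconds` seconds.
    A failed attempt does not stop later attempts; buffering of the failed
    metrics themselves is not modeled here.
    """
    attempts = []
    new_metrics = 0      # metrics received since the last write attempt
    last_attempt = 0     # time of the last write attempt
    for now in range(1, total_seconds + 1):
        new_metrics += metrics_per_second
        if should_attempt_write(new_metrics, now - last_attempt):
            succeeded = now > outage_seconds
            attempts.append((now, succeeded))
            last_attempt = now
            new_metrics = 0
    return attempts


if __name__ == "__main__":
    for when, ok in simulate_outage(outage_seconds=25):
        print(f"t={when:>2}s write {'succeeded' if ok else 'failed'}")
```

With these assumed numbers the time trigger always fires first, so attempts happen every 10 seconds: the ones at 10 s and 20 s fail, and the one at 30 s is the first to succeed. In this model the longest wait after the output comes back is one flush_interval, which is why there is no separate reconnect timeout to configure.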

I see the same behavior. However, I noticed that the Docker container's uptime matches the time the logging stopped, as if Docker had restarted the container when the error was logged. Yet I do not see Telegraf loading its inputs/outputs the way it does when I restart it manually.
I am using the Telegraf ping input to monitor local devices and internet availability in a high-latency, high-packet-loss environment, where InfluxDB is hosted centrally in a datacenter, so this happens quite often.
I tried enabling the debug logs but did not see anything except that Telegraf could not write to its outputs.influxdb.

I have the problem again:

# kubectl -n my-subdomain-production logs -f minio-3 telegraf

2019-05-16T09:03:55Z E! [agent] Error writing to output [influxdb]: could not write any address
2019-05-16T09:04:05Z E! [outputs.influxdb] when writing to [https://my-subdomain.influxdb.masked-company.com]: Post https://my-subdomain.influxdb.masked-company.com/write?db=telegraf: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2019-05-16T09:04:05Z E! [agent] Error writing to output [influxdb]: could not write any address
2019-05-16T09:04:15Z E! [outputs.influxdb] when writing to [https://my-subdomain.influxdb.masked-company.com]: Post https://my-subdomain.influxdb.masked-company.com/write?db=telegraf: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2019-05-16T09:04:15Z E! [agent] Error writing to output [influxdb]: could not write any address
2019-05-16T09:04:25Z E! [outputs.influxdb] when writing to [https://my-subdomain.influxdb.masked-company.com]: Post https://my-subdomain.influxdb.masked-company.com/write?db=telegraf: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2019-05-16T09:04:25Z E! [agent] Error writing to output [influxdb]: could not write any address
^C

# kubectl -n my-subdomain-production exec -ti minio-3 -c telegraf bash
root@minio-3:/# ps faux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root        18  0.0  0.0  18244  3144 pts/0    Ss   09:05   0:00 bash
root        24  0.0  0.0  36640  2860 pts/0    R+   09:05   0:00  \_ ps faux
root         1  0.0  0.4 457088 52284 ?        Ssl  May15   0:45 telegraf
root@minio-3:/# telegraf --config /etc/telegraf/telegraf.conf --test
2019-05-16T09:05:34Z I! Starting Telegraf 1.10.0
root@minio-3:/# curl -v https://my-subdomain.influxdb.masked-company.com/ping
*   Trying <masked-IP>...
* TCP_NODELAY set
* Connected to my-subdomain.influxdb.masked-company.com (<masked-IP>) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* Cipher selection: ALL:!EXPORT:!EXPORT40:!EXPORT56:!aNULL:!LOW:!RC4:@STRENGTH
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs
* TLSv1.2 (OUT), TLS header, Certificate Status (22):
* TLSv1.2 (OUT), TLS handshake, Client hello (1):
* TLSv1.2 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Client hello (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS change cipher, Client hello (1):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN, server accepted to use h2
* Server certificate:
*  subject: CN=my-subdomain.influxdb.masked-company.com
*  start date: Apr  3 18:43:56 2019 GMT
*  expire date: Jul  2 18:43:56 2019 GMT
*  subjectAltName: host "my-subdomain.influxdb.masked-company.com" matched cert's "my-subdomain.influxdb.masked-company.com"
*  issuer: C=US; O=Let's Encrypt; CN=Let's Encrypt Authority X3
*  SSL certificate verify ok.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x56082749ae80)
> GET /ping HTTP/1.1
> Host: my-subdomain.influxdb.masked-company.com
> User-Agent: curl/7.52.1
> Accept: */*
> 
* Connection state changed (MAX_CONCURRENT_STREAMS updated)!
< HTTP/2 204 
< content-type: application/json
< date: Thu, 16 May 2019 09:05:54 GMT
< request-id: cf3806b6-77b9-11e9-b79b-7eb3d86ef42e
< x-influxdb-build: OSS
< x-influxdb-version: 1.7.6
< x-request-id: cf3806b6-77b9-11e9-b79b-7eb3d86ef42e
< 
* Curl_http_done: called premature == 0
* Connection #0 to host my-subdomain.influxdb.masked-company.com left intact

After killing PID 1 (telegraf) in the container, the reconnect works as expected.
Any ideas on how to debug or fix this?

https://github.com/influxdata/telegraf/issues/5905