Influx 1.8.3
Telegraf 1.18
This is a cross-post from another thread, but I believe it is worth posting under a different title and with some additional details.
I’ve been running with an updated telegraf.conf file for about a week. Performance has improved, but I’ve had a couple of incidents where things gummed up and data was lost for several hours at a time.
First, here’s the agent portion of my telegraf.conf file:
[agent]
  hostname = "host"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 250000
  collection_jitter = "10s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = ""
  interval = "15s"
  omit_hostname = false
  debug = true
  logfile = "/var/log/telegraf/telegraf.log"
I am using influxdb_listener to ingest the data, and here’s the log around the point where ingest stopped:
2021-04-02T04:09:11Z D! [outputs.influxdb] Wrote batch of 31 metrics in 6.549749ms
2021-04-02T04:09:11Z D! [outputs.influxdb] Buffer fullness: 0 / 250000 metrics
2021-04-02T04:09:16Z D! [outputs.influxdb] Buffer fullness: 0 / 250000 metrics
2021-04-02T04:09:22Z D! [outputs.influxdb] Wrote batch of 28 metrics in 7.716756ms
2021-04-02T04:09:22Z D! [outputs.influxdb] Buffer fullness: 0 / 250000 metrics
2021-04-02T04:09:28Z D! [outputs.influxdb] Buffer fullness: 0 / 250000 metrics
2021-04-02T04:09:35Z D! [outputs.influxdb] Wrote batch of 30 metrics in 6.455603ms
2021-04-02T04:09:35Z D! [outputs.influxdb] Buffer fullness: 0 / 250000 metrics
2021-04-02T04:09:41Z D! [outputs.influxdb] Buffer fullness: 0 / 250000 metrics
2021-04-02T04:09:48Z D! [outputs.influxdb] Wrote batch of 26 metrics in 6.167207ms
2021-04-02T04:09:48Z D! [outputs.influxdb] Buffer fullness: 1 / 250000 metrics
2021-04-02T04:09:49Z D! [inputs.influxdb_listener] Error parsing the request body: read tcp 172.XX.XX.XX:8186->166.XX.XX.XX:11098: i/o timeout
2021-04-02T04:09:51Z D! [outputs.influxdb] Buffer fullness: 0 / 250000 metrics
2021-04-02T04:30:20Z W! [agent] ["outputs.influxdb"] did not complete within its flush interval
The ingest gap lasted almost four hours, until the server rebooted itself on a system-overload trigger. The parsing error and i/o timeout seem to have started the whole issue, so I’m looking for a way to respond to that event and quickly restart the ingest process.
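For what it’s worth, here is a minimal sketch of the kind of watchdog I’m considering while I wait for a better answer: tail the Telegraf log (the path from my [agent] config above) and restart the service when the flush-stall warning appears. The service name `telegraf` and the use of systemd are assumptions about my setup, not something from the logs.

```shell
#!/bin/sh
# Hypothetical watchdog sketch (assumes systemd and the logfile path
# configured in telegraf.conf). Run periodically, e.g. from cron.
LOG=/var/log/telegraf/telegraf.log
PATTERN='did not complete within its flush interval'

# If the stall warning appears in the recent log lines, restart Telegraf.
if tail -n 50 "$LOG" | grep -q "$PATTERN"; then
    systemctl restart telegraf
fi
```

This is obviously a blunt instrument; I’d prefer to understand and fix the underlying timeout rather than restart around it.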
The line from the log excerpt above containing “Error parsing the request body…” has two IP addresses. The address 172.XX.XX.XX is the server running InfluxDB and Telegraf, with influxdb_listener ingesting on port 8186. The other address on that line, 166.XX.XX.XX, is one I do not recognize; it appears to belong to an AT&T mobile device. We use AT&T SIM cards in the IoT devices that post data to this listener, so I wonder if this is a bad data packet from one of those devices. Basically, I do not understand this part:
read tcp 172.XX.XX.XX:8186->166.XX.XX.XX:11098: i/o timeout
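My reading of the plugin docs is that influxdb_listener closes a request if the client doesn’t finish sending the body within its read timeout, which would explain the “i/o timeout” when a flaky cellular connection stalls mid-request. The plugin has `read_timeout` and `write_timeout` options (default "10s"), so one thing I may try is raising them. This is a sketch of the input section under that assumption; the `service_address` value matches the port 8186 seen in the logs:

```toml
[[inputs.influxdb_listener]]
  service_address = ":8186"
  # Allow slow cellular clients more time to finish sending the
  # request body before the listener aborts with an i/o timeout.
  read_timeout = "30s"
  write_timeout = "30s"
```

I’d still like to understand why one timed-out request from one device seems to have stalled ingestion for everyone, rather than just failing that single request.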