influxdb_listener: Error parsing the request body

InfluxDB 1.8.3
Telegraf 1.18

This is a cross-post from another thread but I believe it’s worth posting with a different header and some other details.

I’ve been running with an updated telegraf.conf file for about a week and performance has improved, but I’ve had a couple of incidents where ingestion has gummed up and data has been lost for several hours at a time.

First, here’s the agent portion of my telegraf.conf file:

    [agent]
      hostname = "host"
      round_interval = true
      metric_batch_size = 1000
      metric_buffer_limit = 250000
      collection_jitter = "10s"
      flush_interval = "10s"
      flush_jitter = "0s"
      precision = ""
      interval = "15s"
      omit_hostname = false
      debug = true
      logfile = "/var/log/telegraf/telegraf.log"

I am using influxdb_listener to ingest the data, and here’s the log from when the ingest stopped:

2021-04-02T04:09:11Z D! [outputs.influxdb] Wrote batch of 31 metrics in 6.549749ms
2021-04-02T04:09:11Z D! [outputs.influxdb] Buffer fullness: 0 / 250000 metrics
2021-04-02T04:09:16Z D! [outputs.influxdb] Buffer fullness: 0 / 250000 metrics
2021-04-02T04:09:22Z D! [outputs.influxdb] Wrote batch of 28 metrics in 7.716756ms
2021-04-02T04:09:22Z D! [outputs.influxdb] Buffer fullness: 0 / 250000 metrics
2021-04-02T04:09:28Z D! [outputs.influxdb] Buffer fullness: 0 / 250000 metrics
2021-04-02T04:09:35Z D! [outputs.influxdb] Wrote batch of 30 metrics in 6.455603ms
2021-04-02T04:09:35Z D! [outputs.influxdb] Buffer fullness: 0 / 250000 metrics
2021-04-02T04:09:41Z D! [outputs.influxdb] Buffer fullness: 0 / 250000 metrics
2021-04-02T04:09:48Z D! [outputs.influxdb] Wrote batch of 26 metrics in 6.167207ms
2021-04-02T04:09:48Z D! [outputs.influxdb] Buffer fullness: 1 / 250000 metrics
2021-04-02T04:09:49Z D! [inputs.influxdb_listener] Error parsing the request body: read tcp 172.XX.XX.XX:8186->166.XX.XX.XX:11098: i/o timeout
2021-04-02T04:09:51Z D! [outputs.influxdb] Buffer fullness: 0 / 250000 metrics
2021-04-02T04:30:20Z W! [agent] ["outputs.influxdb"] did not complete within its flush interval

The ingest gap lasted almost four (4) hours, until the server rebooted itself on a system overload trigger. The parsing error and i/o timeout seemed to start the whole issue, so I’m looking for a way to respond to that event and quickly restart the ingest process.

The line from the log file excerpt above containing “Error parsing the request body…” contains two IP addresses. The address 172.XX.XX.XX is the address of the server running InfluxDB and Telegraf, with the influxdb_listener ingesting on port 8186. That same line has another IP address, 166.XX.XX.XX, that I do not recognize; it seems to belong to an AT&T mobile device. We are using AT&T SIM cards in our IoT devices that post data to be ingested, so I wonder if this is a bad data packet from one of those devices. Basically, I do not understand this part:

read tcp 172.XX.XX.XX:8186->166.XX.XX.XX:11098: i/o timeout

Hello @dean.nederveld,
Have you gotten an answer where you cross posted?
Or should I share with the Telegraf team?
Thank you.

Hello @Anaisdg,

No, I haven’t received a reply and would greatly appreciate you sharing this with the Telegraf team.
Regards…

Hello Dean,

It looks like the request timed out while influxdb_listener was reading the metrics, possibly due to a network interruption. This will result in the client retrying the message.
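
For reference, the read deadline in that message comes from the listener’s read_timeout setting. Here’s a minimal sketch of what that input section typically looks like; the service_address just matches the port 8186 in your log, and the timeout values are what I believe the defaults are, not something taken from your actual config:

    [[inputs.influxdb_listener]]
      ## Port 8186 matches the listener address seen in the log excerpt
      service_address = ":8186"
      ## How long the listener waits while reading a request body before
      ## failing with an i/o timeout like the one above
      read_timeout = "10s"
      write_timeout = "10s"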

I’ve reviewed the source code around this and I don’t see anything in Telegraf that would cause it to stop taking requests after receiving an error. Is it possible there was some other network or hardware issue at the same time?

Any idea what caused the system overload trigger to fire?

It’s very weird that the buffer fullness and flush messages would also stop. That would seem to point to an issue with system resources, maybe a lack of CPU allocation or death via swap.
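
If you want to rule that out, one thing you could try (just a suggestion, not something from your config) is Telegraf’s internal input plugin, which reports the agent’s own memory usage and buffer metrics so you can watch for resource pressure building up before an incident:

    [[inputs.internal]]
      ## Also gather Go runtime memory stats for the Telegraf process,
      ## useful for spotting memory pressure or swap death
      collect_memstats = true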

Let me know what you think.

Cheers,
Steven

The issue ended up being with the _internal database. Since disabling it, I haven’t had any downtime or overload issues.
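
For anyone who runs into the same thing, this is roughly the stanza to change in influxdb.conf to turn off the _internal statistics (a sketch for InfluxDB 1.8; check your own config before applying):

    [monitor]
      ## Stop InfluxDB from recording its own runtime statistics
      ## in the _internal database
      store-enabled = false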