Influx OSS locks up

We have been using InfluxDB OSS for about 18 months to ingest data from 15-20 IoT devices. The InfluxDB (v1.8.3) setup runs on an AWS EC2 Ubuntu Linux instance. The IoT devices post data to InfluxDB at 30-second intervals, but the devices are not synchronized, so the posts can arrive one after another or even almost simultaneously.

Most often the data is ingested without issue, but sometimes InfluxDB seems to lock up or freeze, and data stops being ingested. When this happens the IoT devices are unable to connect to the Linux instance over the HTTP connection used to post the data. Eventually the instance starts accepting data again, though sometimes it has to be forcibly restarted.

The situation described above seems to happen randomly, sometimes for brief 5-10 minute periods, and sometimes for a few hours.

What can I research to see what might be the problem (logs?) and what can be done to optimize influx for ingesting data to minimize or eliminate the freezing issue?

I don’t have much to offer for proving this is actually the problem you’re facing, but it’s worth adding flush jitter, and in some cases collection jitter, as a best practice. They exist for exactly the reason you suspect: all the agents piling up and writing to the server at the same time.

Full detail:

Summary copied:

  • flush_jitter : Default flush jitter for all outputs. This jitters the flush interval by a random amount. This is primarily to avoid large write spikes for users running a large number of telegraf instances. ie, a jitter of 5s and interval 10s means flushes will happen every 10-15s.

  • collection_jitter : Collection jitter is used to jitter the collection by a random interval. Each plugin will sleep for a random time within jitter before collecting. This can be used to avoid many plugins querying things like sysfs at the same time, which can have a measurable effect on the system.
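As a sketch, both settings go in the `[agent]` section of telegraf.conf; the values below are illustrative, not recommendations:

```toml
# [agent] section of telegraf.conf -- illustrative values only
[agent]
  interval = "30s"          # how often inputs collect
  flush_interval = "10s"    # write buffered metrics to outputs every 10s...
  flush_jitter = "5s"       # ...plus a random 0-5s, so flushes land every 10-15s
  collection_jitter = "5s"  # each input sleeps a random 0-5s before collecting
```

With a fleet of agents, the random offsets spread the writes out instead of having every agent flush on the same tick.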

Let us know if one or both of these settings improves things.


Thanks for the suggestion. It prompted me to review my telegraf.conf file, and I discovered I was still using the default Telegraf configuration. I believe this means the IoT devices are not routing through Telegraf at all. Each device sends data via an HTTP POST, and I do not have a Telegraf input plugin configured to handle that traffic, so I believe the devices are all writing directly to the database.

I plan to research the influxdb_listener input plugin and see if using it improves performance. I will also add an output plugin to write to the influxdb database.

It’s a bit of an adjustment, but it will be worth it in the end

Am I correct in believing that without the influxdb_listener input plugin in the telegraf.conf file, the IoT devices are bypassing Telegraf altogether and writing directly to the InfluxDB database?

If so, then adding/changing the jitter settings in telegraf.conf will have no impact unless the input plugin is added, correct?

I’ve never configured a setup like yours, but yes, in principle you’d set up a Telegraf receiver that the IoT devices point at, and Telegraf then pushes the data on to InfluxDB. I don’t know if or how the jitter parameters would be used in this type of setup.

Just in case it is something else, I’d suggest verifying that it really is all the agents writing at the same time, e.g. using a packet capture on the InfluxDB server for TCP port 8086, and confirming that:

  1. All the devices are writing at the same time, and
  2. The lockup on the DB happens at the same time too.

I.e., fixing #1 is worth doing, but #1 might not be what is causing #2.

So, it’s been a while since I’ve been able to get back to this, but this past week I finally got my Telegraf configuration set so that packets come in on port 8186 via the influxdb_listener input plugin and route to the database via the influxdb output plugin. One hangup: pasting the configuration into the telegraf.conf file caused a TOML syntax error I couldn’t pin down; typing the configuration in by hand made all the difference.
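For anyone following along, the working setup I describe looks roughly like this (the URL and database name are placeholders, not my actual values):

```toml
# Telegraf receives line-protocol writes on 8186 and forwards
# them to InfluxDB on 8086. URL and database are placeholders.
[[inputs.influxdb_listener]]
  service_address = ":8186"

[[outputs.influxdb]]
  urls = ["http://localhost:8086"]
  database = "iot_data"   # hypothetical database name
```

The IoT devices now POST to port 8186 instead of 8086, so Telegraf can batch and buffer the writes before they hit the database.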

Since running the new configuration I have not had any lockups, and I have been able to increase the ingest frequency from each IoT device from one POST every 30 seconds to one every 5 seconds.

Now I am attempting to have inputs from one input plugin (or set of plugins) route to one database via the influxdb output plugin, and have the influxdb_listener plugin route to a different database. I’ll probably start a new thread for help on that.
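One way this kind of routing can be sketched (the tag value and database names here are made up for illustration) is to tag metrics at the input and filter on that tag at each output with tagpass/tagdrop:

```toml
# Tag everything from the listener so it can be routed separately.
[[inputs.influxdb_listener]]
  service_address = ":8186"
  [inputs.influxdb_listener.tags]
    source = "iot"

# Output 1: only metrics carrying source=iot (hypothetical database name).
[[outputs.influxdb]]
  urls = ["http://localhost:8086"]
  database = "iot_data"
  [outputs.influxdb.tagpass]
    source = ["iot"]

# Output 2: everything else, e.g. the other input plugins.
[[outputs.influxdb]]
  urls = ["http://localhost:8086"]
  database = "system_metrics"
  [outputs.influxdb.tagdrop]
    source = ["iot"]
```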

Influx 1.8.3
Telegraf 1.18

So I’ve been running with an updated telegraf.conf file for about a week. Performance has improved, but I’ve had a couple of incidents where things gummed up and data was lost for several hours at a time.

First, here’s the agent portion of my telegraf.conf file:

      hostname = "host"
      round_interval = true
      metric_batch_size = 1000
      metric_buffer_limit = 250000
      collection_jitter = "10s"
      flush_interval = "10s"
      flush_jitter = "0s"
      precision = ""
      interval = "15s"
      omit_hostname = false
      debug = true
      logfile = "/var/log/telegraf/telegraf.log"

I am using influxdb_listener to ingest the data and here’s when the ingest stopped:

2021-04-02T04:09:11Z D! [outputs.influxdb] Wrote batch of 31 metrics in 6.549749ms
2021-04-02T04:09:11Z D! [outputs.influxdb] Buffer fullness: 0 / 250000 metrics
2021-04-02T04:09:16Z D! [outputs.influxdb] Buffer fullness: 0 / 250000 metrics
2021-04-02T04:09:22Z D! [outputs.influxdb] Wrote batch of 28 metrics in 7.716756ms
2021-04-02T04:09:22Z D! [outputs.influxdb] Buffer fullness: 0 / 250000 metrics
2021-04-02T04:09:28Z D! [outputs.influxdb] Buffer fullness: 0 / 250000 metrics
2021-04-02T04:09:35Z D! [outputs.influxdb] Wrote batch of 30 metrics in 6.455603ms
2021-04-02T04:09:35Z D! [outputs.influxdb] Buffer fullness: 0 / 250000 metrics
2021-04-02T04:09:41Z D! [outputs.influxdb] Buffer fullness: 0 / 250000 metrics
2021-04-02T04:09:48Z D! [outputs.influxdb] Wrote batch of 26 metrics in 6.167207ms
2021-04-02T04:09:48Z D! [outputs.influxdb] Buffer fullness: 1 / 250000 metrics
2021-04-02T04:09:49Z D! [inputs.influxdb_listener] Error parsing the request body: read tcp [[IP ADDR 1]]:8186->[[IP ADDR 2]]:11098: i/o timeout
2021-04-02T04:09:51Z D! [outputs.influxdb] Buffer fullness: 0 / 250000 metrics
2021-04-02T04:30:20Z W! [agent] ["outputs.influxdb"] did not complete within its flush interval

The ingest gap lasted almost four hours, until the server rebooted itself on a system-overload trigger.

The parsing error and i/o timeout seemed to kick off the whole issue, so I’m looking for a way to respond to that event and quickly restart the ingest process.
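One partial measure, assuming a systemd-based Ubuntu install (the drop-in path below is the standard systemd location, not something specific to this setup): an override can restart Telegraf automatically if the process exits. Note this won’t catch a process that hangs while staying alive, so it’s a sketch of one piece of the answer, not the whole fix.

```ini
# /etc/systemd/system/telegraf.service.d/override.conf
# Apply with: systemctl daemon-reload && systemctl restart telegraf
[Service]
Restart=on-failure
RestartSec=10
```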