InfluxDB crashing for no visible reason

Hello everyone! I’m having trouble with InfluxDB and couldn’t find anything about it online.

My system configuration is as follows: an industrial PC from Schneider Electric, https://www.se.com/ru/ru/product/HMIBSCEA53D1L01/iiot-edge-box-8гб-emmc-linux/ . Node-RED, InfluxDB, and Grafana are deployed on it with the help of balena-engine, and all three containers use the same volume. Data is collected from a field device using Node-RED, passed to InfluxDB for storage, and visualized with Grafana.

Everything seems to work well: I can write data to and read data from the database. But after a certain period of time, either when Grafana requests data or simply on its own, without any requests, InfluxDB stops responding to write requests from Node-RED. At the same time, the CPU load of my PC approaches 100%, with 60-80% of the resources allocated to I/O. All containers stop working properly and I can only restart the PC to restore operation. I looked at the InfluxDB container logs, and these errors appear at the moment of the problem:
"JSON encoding error" log_id-02JOM001000 error="stan: connection closed", or another time, when InfluxDB crashed after a query, I saw: "stan: connection lost due to PING failure"

This is my first experience with InfluxDB and I can’t fix this problem on my own. I hope someone can help me with it.

Daniil

Hi @DaniilB, welcome to the community!
Great to hear you’re using InfluxDB in an industrial setting. That was my previous background. Could you let me know what version of InfluxDB you are deploying? Also, could you send me a sample of the data you’re sending to InfluxDB?

On another note, I would advise using different volumes for each container unless there is a need for files to be passed between containers. This helps to improve the durability of your deployment.
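For example, here is a minimal sketch of separate volumes (assuming balena-engine accepts the standard Docker CLI syntax; the volume names, container names, ports, and image tags are only illustrative):

# Create a dedicated named volume per service
balena-engine volume create influxdb-data
balena-engine volume create grafana-data
balena-engine volume create nodered-data

# Mount each volume only into its own container
balena-engine run -d --name influxdb -p 8086:8086 -v influxdb-data:/var/lib/influxdb2 influxdb:2.1
balena-engine run -d --name grafana -p 3000:3000 -v grafana-data:/var/lib/grafana grafana/grafana
balena-engine run -d --name nodered -p 1880:1880 -v nodered-data:/data nodered/node-red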

Looking forward to hearing back from you.

Thanks,
Jay

Thank you @Jay_Clifford for fast answer!

I’m deploying InfluxDB 2.1.1, the latest image from Docker Hub, I believe.
Sample of data:
[{"measurement":"Tm172","fields":{"Скорость":15,"Температура":749,"Ток":45,"Влажность":209,"Напряжение":836,"Мощность":418,"ОЕЕ":30},"timestamp":"2022-01-28T11:21:10.571Z"}]
(the field keys are Russian for speed, temperature, current, humidity, voltage, power, and OEE)
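If it helps, this is roughly the equivalent of one such write expressed as a direct call to the v2 write API in line protocol (just a sketch; the org, bucket, and token are placeholders):

# Single-point write via the HTTP API, millisecond precision
# 1643368870571 ms corresponds to the 2022-01-28T11:21:10.571Z timestamp above
curl -X POST "http://localhost:8086/api/v2/write?org=my-org&bucket=StatMonitor&precision=ms" \
  -H "Authorization: Token MY_TOKEN" \
  --data-binary 'Tm172 Скорость=15,Температура=749,Ток=45,Влажность=209,Напряжение=836,Мощность=418,ОЕЕ=30 1643368870571'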

There was a need for a single volume before, for sharing files between containers, but that is no longer needed, so I will try using different volumes.

Thanks,
Daniil

A little update: I recreated the three containers with different volumes. My system crashed again.

This time, as far as I can tell from the logs, my data source stopped working and InfluxDB ran fine for one day. But after that day, according to the container logs, I started getting errors:
"Recorder handler error" log_id=0ZK9gZiW000 error=timeout
"JSON encoding error" log_id=0ZK9gZiW000 error="stan: connection closed"
"STREAM : [Client:nats-subscriber-4222] Failed sending to subid=1, subject=promTarget, seq=10223, err=nats: outbound buffer limit exceeded" log_id=0ZK9gZiW000 service=nats nats_level=error

I don’t understand why these errors appear: my data source is down, so no data is being passed to the database; it is just idle, waiting for new data. In my view, no errors should appear and InfluxDB should stay in a working state. But instead, the InfluxDB container becomes unavailable.

Hi @DaniilB,
Do you see this message occurring when accessing the Grafana dashboards? Are the Grafana dashboards set to auto-refresh? I’m just trying to gauge whether this might be putting stress on your system, since I note you are running Node-RED, InfluxDB, and Grafana on the IoT Edge Box.

Are you using an SD card to expand the storage, or just the onboard eMMC? I will also flag the error with the internal team so they can take a look :slight_smile:

Thank you for your answer, @Jay_Clifford!

Do you see this message occurring when accessing the Grafana dashboards?

No, they just appear after some random amount of running time; I haven’t been able to identify a pattern yet.
Previously, I could reproduce the error by sending a heavy query from Grafana, but I have since changed my queries so they are quick and lightweight, like this:

from(bucket: "StatMonitor")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => 
    r._measurement == "Tm172" and
    r._field =~ /${parameter_name:pipe}/)
  |> aggregateWindow(every: 5s, fn: mean, createEmpty: false)

Are the Grafana dashboards set to auto-refresh?

No, manual refresh

As I note you are running: node-red, InfluxDB and Grafana on the IoT Edge box

Node-RED, InfluxDB, and Grafana containers running under balena-engine on the IoT Edge Box

Are you using an SD card to expand the storage or just using the onboard eMMC

Onboard eMMC

For the record, the system has now been running for about three hours and everything is just fine: CPU load 25% max, no errors in the logs. I’m trying to catch the moment the error occurs / reproduce the error.

The ‘stan’ error makes me think something in NATS is acting up. NATS is only used for Prometheus scrapers in OSS. Do you have any of those? We’ve seen issues, for example, if you delete a bucket that a scraper is configured to write to.

See also Delete an InfluxDB scraper | InfluxDB OSS 2.1 Documentation
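If you prefer the command line, you can also list and delete scraper targets through the API, something like this (a sketch; the host and token are placeholders):

# List scraper targets and note their IDs
curl -s "http://localhost:8086/api/v2/scrapers" -H "Authorization: Token MY_TOKEN"

# Delete a scraper target by its ID
curl -X DELETE "http://localhost:8086/api/v2/scrapers/SCRAPER_ID" -H "Authorization: Token MY_TOKEN"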

@Samuel_Arnold thank you for your answer!
There actually is one, but I never configured it; I guess it was added automatically when I created the bucket? I’ve deleted it, we’ll see…

I have an update. I deleted the scraper and everything seemed to work fine. Then I submitted a query from the InfluxDB UI, and after that I saw CPU resources being allocated to I/O again. Two new processes appeared, kswapd0 and mmcqd0, and the containers stopped responding, just like before. After the reboot, I checked the logs and saw that immediately after / during the occurrence of the problem, a "JSON encoding error" entry appeared in the InfluxDB logs; however, data continued to be written to the database for some time. My question is this: is that error the cause of the memory problems, or is it a consequence of the memory problems that have arisen?

kswapd sounds like memory issues - the fact that you are swapping indicates something is probably overloaded.
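A minimal sketch of what you could run on the box while trying to reproduce it, to see whether memory or I/O is the bottleneck (assuming these standard Linux utilities are present on the device):

# Overall memory and swap usage
free -m

# Memory, swap, and block I/O activity, sampled every 5 seconds
vmstat 5

# Per-container CPU and memory usage (balena-engine mirrors the Docker CLI)
balena-engine stats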