Hi,
After migrating to InfluxDB 2 we are experiencing recurring crashes. Every few weeks the database crashes and becomes unresponsive: every request returns a 500 internal server error, and it stays in that state until we perform a manual restart, which always restores normal operation. Debug logs don't provide any additional useful information. The crashes consistently occur on Mondays at ~2:10 a.m. UTC, a time of relatively low system load.
We are running InfluxDB 2.7.9 on ARM-based instances. We have two separate environments, both experiencing similar problems, but the more heavily used one hits the problem more frequently. The instances don't run out of resources: CPU and RAM stay around ~20%, and disk usage is at most 70%, which should be enough for at least a few more months assuming the write rate doesn't change. We have analyzed our metrics but couldn't find any clear trends. During a crash InfluxDB uses barely any resources, and in the minutes before it happens usage looks normal.
We are using both InfluxQL and Flux, but mostly InfluxQL. Every hour a task runs to calculate averages over the last few hours. There is only one application writing to InfluxDB, but it runs multiple instances. More than one application reads from the database. The write rate is stable.
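For illustration, the hourly task is roughly this kind of Flux downsampling job; the bucket and measurement names below are placeholders, not our real ones:

```flux
// Sketch of an hourly averaging task; names are placeholders.
option task = {name: "hourly-averages", every: 1h}

from(bucket: "raw")
    |> range(start: -3h)
    |> filter(fn: (r) => r._measurement == "device_metrics")
    |> aggregateWindow(every: 1h, fn: mean, createEmpty: false)
    |> to(bucket: "downsampled")
```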
We collected a pprof profile during an incident, but it didn't point to any specific component. We write to the database in batches, but the batches aren't grouped by series; multiple measurements from one device may end up in different batches.
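For reference, the profiles can be pulled from influxd's pprof endpoints along these lines (the host and the CPU sampling duration here are just examples):

```bash
# Bundle CPU (sampled for 30s here), heap, goroutine, etc. into one archive.
curl -o profiles.tar.gz "http://localhost:8086/debug/pprof/all?cpu=30s"

# Or grab an individual profile, e.g. a heap snapshot:
curl -o heap.pb.gz "http://localhost:8086/debug/pprof/heap"
```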
Do you have any suggestions on what could cause such a problem?
Hello @Jakub,
First, I'd like to ask what motivated the migration to InfluxDB 2 rather than 3, or rather than staying on 1?
V2 is quite different from V1 or V3, almost a different product.
The number of shards could be causing memory pressure during compaction. Since your crashes happen at a specific time, they might coincide with a scheduled compaction or retention policy enforcement. Do you have scheduled tasks? Those could be contributing as well. It could also be a memory leak, I suppose, or several compactions coinciding.
InfluxDB 1 had stability issues, especially memory leaks. InfluxDB 3 wasn't released yet when we migrated over a year ago. InfluxDB 2 worked fine for a few months, but since December it's been crashing on Mondays every few weeks.
Currently we don't have any data retention set, so shard compaction seems much more probable. We have a task scheduled to run every hour, but it consistently finishes in under a minute, and the database crashes ~10 minutes after it runs. I can try tweaking the shard compaction config to limit concurrency and see if that helps. What's confusing is that if shard compaction were causing the crashes, I would expect to see resource usage spike, but it stays stable until the database crashes. On Mondays without a crash, resource usage stays low and well below any limits.
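For the compaction tuning, I have in mind something along these lines in the influxd config file (the values are examples rather than tuned recommendations, and each key can also be set via the corresponding INFLUXD_* environment variable):

```yaml
# Limit how many shard compactions can run at once (0 = half of available cores).
storage-max-concurrent-compactions: 1

# Cap the burst throughput of compaction writes to disk, in bytes (default shown).
storage-compact-throughput-burst: 50331648

# How long a shard must receive no writes before a full compaction is triggered.
storage-compact-full-write-cold-duration: 4h0m0s
```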