Migrated data from CentOS 7 to Alma 9, InfluxDB crashes and will not restart after a week or so

I migrated servers. The InfluxDB 2 version was the same on both at the time, and I am now on 2.6.1-1 on both systems.

I copied everything from these paths to the new system:
/etc/influxdb/*
/var/lib/influxdb/*

Everything started and worked well for about a week. Then it crashed, and it will not accept new data or allow queries. If I take the data back to the old CentOS 7 server, it starts successfully there. I have also moved the data from the CentOS system back to the Alma 9 server after doing this, and things worked again.

This also happens if I don’t move the data, start with an empty DB, and just let the system run for about a week. By “a week” I mean that I “fix” it on Monday and it dies on Sunday.

Any tips for troubleshooting this further? I enabled debug logging, which showed that the max open files ulimit was being hit, so I made that unlimited, hoping it would fix things. It has not.

Thank you,
Matt

How did you correct the ulimit?

I have this in my ~/.bash_profile to make the limit huge:
ulimit -n 10000

What are the errors you’re getting?

I adjusted it in /etc/security/limits.conf:

* soft nofile 65535
* hard nofile 65535

ulimit -a for the influxdb user:
real-time non-blocking time (microseconds, -R) unlimited
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 126186
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 65535
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 126186
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
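
One thing that may be worth checking: limits.conf is applied through PAM at login, so a service started by systemd does not necessarily pick it up, and ulimit -a in a shell may not reflect what the running influxd actually has. A minimal sketch in Python, assuming a Linux /proc filesystem, that reads the limit in effect for a running process (the helper names are made up for illustration):

```python
import re

def parse_max_open_files(limits_text):
    """Return (soft, hard) from the 'Max open files' row of /proc/<pid>/limits."""
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Columns are separated by runs of spaces: name, soft, hard, units.
            fields = re.split(r"\s{2,}", line.strip())
            return fields[1], fields[2]
    return None

def effective_nofile(pid):
    """Read the open-files limit that applies to the running process `pid`."""
    with open(f"/proc/{pid}/limits") as fh:
        return parse_max_open_files(fh.read())
```

Calling effective_nofile with the PID stored in influxd.pid would show the values the daemon is really running with. If they differ from the 65535 above, the service may be getting its limit from the systemd unit (the LimitNOFILE= directive) rather than from limits.conf.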

I cannot upload the full log as an attachment, but it is basically just repeating this:

2023-03-20T13:27:32.159410Z debug urm find {log_id: 0ggo3mK0000, store: new, took: 0.043ms}
2023-03-20T13:27:32.159459Z debug find permission for user {log_id: 0ggo3mK0000, store: new, took: 0.094ms}
2023-03-20T13:27:32.735149Z debug urm find {log_id: 0ggo3mK0000, store: new, took: 0.033ms}
2023-03-20T13:27:32.735319Z debug find permission for user {log_id: 0ggo3mK0000, store: new, took: 0.206ms}
2023-03-20T13:27:33.286301Z debug urm find {log_id: 0ggo3mK0000, store: new, took: 0.036ms}
2023-03-20T13:27:33.286430Z debug find permission for user {log_id: 0ggo3mK0000, store: new, took: 0.175ms}
2023-03-20T13:27:33.298214Z debug urm find {log_id: 0ggo3mK0000, store: new, took: 0.034ms}
2023-03-20T13:27:33.298376Z debug find permission for user {log_id: 0ggo3mK0000, store: new, took: 0.198ms}
2023-03-20T13:27:33.336490Z debug urm find {log_id: 0ggo3mK0000, store: new, took: 0.036ms}
2023-03-20T13:27:33.336628Z debug find permission for user {log_id: 0ggo3mK0000, store: new, took: 0.176ms}
2023-03-20T13:27:33.842236Z debug urm find {log_id: 0ggo3mK0000, store: new, took: 0.047ms}
2023-03-20T13:27:33.842406Z debug find permission for user {log_id: 0ggo3mK0000, store: new, took: 0.221ms}
2023-03-20T13:27:33.854436Z debug urm find {log_id: 0ggo3mK0000, store: new, took: 0.037ms}
2023-03-20T13:27:33.854584Z debug find permission for user {log_id: 0ggo3mK0000, store: new, took: 0.188ms}
2023-03-20T13:27:34.284432Z debug urm find {log_id: 0ggo3mK0000, store: new, took: 0.036ms}
2023-03-20T13:27:34.284517Z debug find permission for user {log_id: 0ggo3mK0000, store: new, took: 0.123ms}
2023-03-20T13:27:34.769970Z debug urm find {log_id: 0ggo3mK0000, store: new, took: 0.024ms}
2023-03-20T13:27:34.770129Z debug find permission for user {log_id: 0ggo3mK0000, store: new, took: 0.185ms}
2023-03-20T13:27:34.780038Z debug urm find {log_id: 0ggo3mK0000, store: new, took: 0.023ms}
2023-03-20T13:27:34.780187Z debug find permission for user {log_id: 0ggo3mK0000, store: new, took: 0.175ms}
2023-03-20T13:27:35.152877Z debug urm find {log_id: 0ggo3mK0000, store: new, took: 0.031ms}
2023-03-20T13:27:35.152931Z debug find permission for user {log_id: 0ggo3mK0000, store: new, took: 0.086ms}
2023-03-20T13:27:35.161613Z debug urm find {log_id: 0ggo3mK0000, store: new, took: 0.049ms}
2023-03-20T13:27:35.161659Z debug find permission for user {log_id: 0ggo3mK0000, store: new, took: 0.097ms}
2023-03-20T13:27:35.611592Z debug urm find {log_id: 0ggo3mK0000, store: new, took: 0.044ms}
2023-03-20T13:27:35.611644Z debug find permission for user {log_id: 0ggo3mK0000, store: new, took: 0.100ms}
2023-03-20T13:27:35.620467Z debug urm find {log_id: 0ggo3mK0000, store: new, took: 0.021ms}
2023-03-20T13:27:35.620511Z debug find permission for user {log_id: 0ggo3mK0000, store: new, took: 0.068ms}
2023-03-20T13:27:35.628992Z debug urm find {log_id: 0ggo3mK0000, store: new, took: 0.020ms}
2023-03-20T13:27:35.629029Z debug find permission for user {log_id: 0ggo3mK0000, store: new, took: 0.059ms}
2023-03-20T13:27:35.934655Z debug urm find {log_id: 0ggo3mK0000, store: new, took: 0.033ms}
2023-03-20T13:27:35.934770Z debug find permission for user {log_id: 0ggo3mK0000, store: new, took: 0.152ms}
2023-03-20T13:27:35.943481Z debug urm find {log_id: 0ggo3mK0000, store: new, took: 0.023ms}
2023-03-20T13:27:35.943582Z debug find permission for user {log_id: 0ggo3mK0000, store: new, took: 0.128ms}
2023-03-20T13:27:36.427358Z debug urm find {log_id: 0ggo3mK0000, store: new, took: 0.026ms}
2023-03-20T13:27:36.427401Z debug find permission for user {log_id: 0ggo3mK0000, store: new, took: 0.078ms}
2023-03-20T13:27:36.436024Z debug urm find {log_id: 0ggo3mK0000, store: new, took: 0.041ms}
2023-03-20T13:27:36.436071Z debug find permission for user {log_id: 0ggo3mK0000, store: new, took: 0.090ms}
2023-03-20T13:27:37.038328Z debug urm find {log_id: 0ggo3mK0000, store: new, took: 0.026ms}
2023-03-20T13:27:37.039201Z debug find permission for user {log_id: 0ggo3mK0000, store: new, took: 0.902ms}
2023-03-20T13:27:37.538971Z debug urm find {log_id: 0ggo3mK0000, store: new, took: 491.061ms}
2023-03-20T13:27:37.539036Z debug find permission for user {log_id: 0ggo3mK0000, store: new, took: 491.130ms}
2023-03-20T13:27:37.547697Z debug urm find {log_id: 0ggo3mK0000, store: new, took: 0.033ms}
2023-03-20T13:27:37.547743Z debug find permission for user {log_id: 0ggo3mK0000, store: new, took: 0.082ms}
2023-03-20T13:27:37.556452Z debug urm find {log_id: 0ggo3mK0000, store: new, took: 0.033ms}
2023-03-20T13:27:37.556487Z debug find permission for user {log_id: 0ggo3mK0000, store: new, took: 0.070ms}
2023-03-20T13:27:38.058359Z debug urm find {log_id: 0ggo3mK0000, store: new, took: 0.033ms}
2023-03-20T13:27:38.058423Z debug find permission for user {log_id: 0ggo3mK0000, store: new, took: 0.099ms}
2023-03-20T13:27:38.066745Z debug urm find {log_id: 0ggo3mK0000, store: new, took: 0.041ms}
2023-03-20T13:27:38.066797Z debug find permission for user {log_id: 0ggo3mK0000, store: new, took: 0.095ms}
2023-03-20T13:27:38.669666Z debug urm find {log_id: 0ggo3mK0000, store: new, took: 0.034ms}
2023-03-20T13:27:38.669716Z debug find permission for user {log_id: 0ggo3mK0000, store: new, took: 0.087ms}
2023-03-20T13:27:38.678330Z debug urm find {log_id: 0ggo3mK0000, store: new, took: 0.021ms}
2023-03-20T13:27:38.678371Z debug find permission for user {log_id: 0ggo3mK0000, store: new, took: 0.065ms}
2023-03-20T13:27:38.988955Z debug urm find {log_id: 0ggo3mK0000, store: new, took: 0.026ms}
2023-03-20T13:27:38.989010Z debug find permission for user {log_id: 0ggo3mK0000, store: new, took: 0.084ms}
2023-03-20T13:27:38.997738Z debug urm find {log_id: 0ggo3mK0000, store: new, took: 0.031ms}
2023-03-20T13:27:38.997777Z debug find permission for user {log_id: 0ggo3mK0000, store: new, took: 0.072ms}
2023-03-20T13:27:39.006178Z debug urm find {log_id: 0ggo3mK0000, store: new, took: 0.021ms}
2023-03-20T13:27:39.006219Z debug find permission for user {log_id: 0ggo3mK0000, store: new, took: 0.064ms}
2023-03-20T13:27:39.437555Z debug urm find {log_id: 0ggo3mK0000, store: new, took: 0.022ms}
2023-03-20T13:27:39.437607Z debug find permission for user {log_id: 0ggo3mK0000, store: new, took: 0.079ms}
2023-03-20T13:27:39.445874Z debug urm find {log_id: 0ggo3mK0000, store: new, took: 0.025ms}
2023-03-20T13:27:39.445917Z debug find permission for user {log_id: 0ggo3mK0000, store: new, took: 0.070ms}
2023-03-20T13:27:39.680443Z debug urm find {log_id: 0ggo3mK0000, store: new, took: 0.038ms}
2023-03-20T13:27:39.680515Z debug find permission for user {log_id: 0ggo3mK0000, store: new, took: 0.114ms}
2023-03-20T13:27:40.519510Z debug urm find {log_id: 0ggo3mK0000, store: new, took: 0.035ms}
2023-03-20T13:27:40.519569Z debug find permission for user {log_id: 0ggo3mK0000, store: new, took: 0.097ms}
2023-03-20T13:27:41.485524Z debug urm find {log_id: 0ggo3mK0000, store: new, took: 0.033ms}
2023-03-20T13:27:41.485684Z debug find permission for user {log_id: 0ggo3mK0000, store: new, took: 0.197ms}
2023-03-20T13:27:41.869769Z debug urm find {log_id: 0ggo3mK0000, store: new, took: 0.028ms}
2023-03-20T13:27:41.869841Z debug find permission for user {log_id: 0ggo3mK0000, store: new, took: 0.103ms}
2023-03-20T13:27:41.881666Z debug urm find {log_id: 0ggo3mK0000, store: new, took: 0.056ms}
2023-03-20T13:27:41.881723Z debug find permission for user {log_id: 0ggo3mK0000, store: new, took: 0.117ms}

I can get a specific part of the output if you’d like, if you can point me to what to look for.

Thank you for your help.

Matt

For more info, it initially presented similarly to this thread: Influxdb2 does not start - error: "InfluxDB API at http://localhost:8086/ready unavailable after 219 attempts..." - #4 by h5py

Hello @magordon,
A coworker took a look and replied:

My initial reaction is that maybe the bolt database that contains user and shard info didn’t get copied or has incorrect permissions (or both). That would line up with the system dying after about a week, because that’s when a new shard group would need to be created, which requires writing to the bolt database.

So maybe try updating your permissions? Or recreating the shard? I’m not sure.

Hello @Anaisdg

Permissions look fine, and the bolt file was copied along with the data originally:


-rw-r-----. 1 influxdb influxdb   122880 Mar  6 08:30 influxd.sqlite
drwxr-xr-x. 6 influxdb influxdb       65 Mar 20 09:22 engine
-rw-r-----. 1 influxdb influxdb        7 Mar 20 09:38 influxd.pid
-rw-------. 1 influxdb influxdb 19988480 Mar 21 17:56 influxd.bolt

ID  Name              Retention  Shard group duration  Organization ID  Schema Type
1   _monitoring       168h0m0s   24h0m0s               b3               implicit
2   _tasks            72h0m0s    24h0m0s               b3               implicit
3   telegraf          2160h0m0s  168h0m0s              b3               implicit
4   telegraf_archive  infinite   168h0m0s              b3               implicit
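
To compare the shard group durations in this listing with the roughly weekly failure, the Go-style duration strings can be converted to days. This is only a small Python sketch; parse_go_duration is a hypothetical helper that handles just the h/m/s form shown here, and the values are copied from the bucket list above:

```python
import re
from datetime import timedelta

def parse_go_duration(text):
    """Parse strings such as '168h0m0s' into a timedelta; 'infinite' gives None."""
    if text == "infinite":
        return None
    match = re.fullmatch(r"(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?", text)
    if not text or not match:
        raise ValueError(f"unsupported duration: {text!r}")
    hours, minutes, seconds = (int(group or 0) for group in match.groups())
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)

# Shard group durations as they appear in the bucket list.
shard_group_durations = {
    "_monitoring": "24h0m0s",
    "_tasks": "24h0m0s",
    "telegraf": "168h0m0s",
    "telegraf_archive": "168h0m0s",
}

for name, raw in shard_group_durations.items():
    print(f"{name}: {parse_go_duration(raw).days} day(s)")
```

168h0m0s works out to 7 days for both telegraf buckets, which is at least consistent with the idea that the failure coincides with a new shard group being created, though it does not prove it.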

Thank you,
Matt