Dear influxdata community,
We have a Docker container that has been running both InfluxDB 2.2 and Grafana as entrypoints,
and both worked without issue for quite a while.
After rebooting the server hosting them, the only problem is that the influxdb process is stuck in a restart loop:
2023-10-17 01:15:09,813 INFO spawned: 'influxdb' with pid 9778
2023-10-17 01:15:10,850 INFO success: influxdb entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-10-17 01:15:32,888 INFO exited: influxdb (exit status 1; not expected)
2023-10-17 01:15:33,890 INFO spawned: 'influxdb' with pid 9785
2023-10-17 01:15:34,914 INFO success: influxdb entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-10-17 01:15:53,569 INFO exited: influxdb (exit status 1; not expected)
2023-10-17 01:15:54,571 INFO spawned: 'influxdb' with pid 9792
2023-10-17 01:15:55,597 INFO success: influxdb entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-10-17 01:16:18,180 INFO exited: influxdb (exit status 1; not expected)
2023-10-17 01:16:18,181 INFO spawned: 'influxdb' with pid 9799
2023-10-17 01:16:19,192 INFO success: influxdb entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
I’m looking for steps to troubleshoot this. Is there a checklist that could help?
Note that this is part of a Terraform deployment,
so help locating the standard error logs or other possible checks is appreciated.
Troubleshooting a stuck InfluxDB process can be a bit challenging, but there are several steps you can take to diagnose and resolve the issue. Since you mentioned that this is part of a Terraform deployment, you may also want to check the Terraform configuration for any issues. Here’s a checklist to help you troubleshoot:
- Check InfluxDB Logs:
- Look for InfluxDB logs to get more details about why it’s failing. The logs you provided show that InfluxDB is exiting with an exit status of 1, which indicates an error.
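- InfluxDB 2.x writes its logs to stdout/stderr by default rather than to a file under /var/log/influxdb/ (a location used by some 1.x installations). Since the lines you posted look like supervisord output, check the stdout_logfile and stderr_logfile settings of the influxdb program in your supervisord configuration, or look at the container output with docker logs.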
- Configuration File Check:
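- Ensure that your InfluxDB configuration is correct. InfluxDB 2.x reads a config.yaml, config.json, or config.toml file (plus INFLUXD_* environment variables and influxd flags) rather than the 1.x influxdb.conf. Check for any syntax errors or incorrect settings.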
- Resource Usage:
- Check the server’s resource usage (CPU, memory, disk space) to ensure there are no resource constraints causing InfluxDB to fail.
- Database Corruption:
- InfluxDB data files can become corrupted. In 2.x, the influxd inspect subcommands (for example influxd inspect verify-tsm) can check data integrity. Run influxd inspect --help to see which subcommands and options your version provides.
- Dependency Check:
- Ensure that all dependencies for InfluxDB are installed and up to date. InfluxDB may rely on specific libraries or system tools.
- Port Availability:
- Ensure that the port InfluxDB is configured to listen on is available and not being used by another process.
- Check Terraform Configuration:
- Review your Terraform configuration files to ensure there are no errors or misconfigurations that could be affecting the server or InfluxDB deployment.
- Check for Third-party Services:
- If your InfluxDB deployment relies on other services or databases, make sure those services are also running and accessible.
- Update InfluxDB:
- If you are not using the latest version of InfluxDB, consider updating to the latest stable version as it may include bug fixes and improvements.
- Check for Custom Scripts:
- If you have any custom scripts or automation that interact with InfluxDB, review them for any issues.
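- Repair Corrupted Data:
- As far as I know, InfluxDB 2.x has no general-purpose influxd repair command. If verification points to a damaged index, stop influxd and consider rebuilding the TSI index from the TSM data with influxd inspect build-tsi. Check influxd inspect --help for the options your version supports.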
- Check for Disk Space Issues:
- Ensure that the disk where InfluxDB stores its data has enough free space. A lack of disk space can cause issues.
- Review System Logs:
- Check system logs (/var/log/messages, or equivalent) for any system-level errors or issues that might be affecting InfluxDB.
- Firewall and Security Rules:
- Review firewall rules and security settings to make sure they are not blocking InfluxDB traffic.
- Reprovision with Terraform:
- If you suspect that the Terraform deployment might be causing the issue, try reprovisioning the server and InfluxDB using Terraform. Ensure that Terraform is correctly configuring the server and InfluxDB.
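As a starting point for the log and disk items above, here is a minimal shell sketch. It assumes the restart messages come from supervisord and that you can read its log file inside the container. The function names are made up for illustration and are not part of InfluxDB or supervisord.

```shell
#!/bin/sh
# Hypothetical helpers; names and paths are illustrative assumptions.

# Count how often supervisord logged an unexpected exit for one program.
# Usage: count_unexpected_exits <program> <supervisord-logfile>
count_unexpected_exits() {
    grep -c "INFO exited: $1 (exit status [0-9]*; not expected)" "$2"
}

# Print the used-space percentage (without the % sign) of the filesystem
# that holds the given path.
# Usage: disk_used_pct <path>
disk_used_pct() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}
```

For example, count_unexpected_exits influxdb /var/log/supervisor/supervisord.log (the usual Debian/Ubuntu package location, which may differ in your image) shows how often the loop has repeated, and disk_used_pct /var/lib/influxdb2 checks the data volume. To see the actual error behind exit status 1, supervisorctl tail influxdb stderr prints the end of the program's stderr log, provided stderr logging is enabled for that program.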
We’ve looked more closely at the custom config for this project and found a clue as to the effect on influxdb of rebooting our server:
replicationq: too many open files
ts=2023-11-23T13:35:35.597634Z lvl=error msg="Failed to open shard" log_id=0lg67IWl000 service=storage-engine service=store op_name=tsdb_open db_shard_id=90 error="[shard 90] open /var/lib/influxdb2/engine/data/944142f2f3a4127d/autogen/90/index/0/MANIFEST: too many open files"
ts=2023-11-23T13:35:35.598118Z lvl=error msg="Failed to open shard" log_id=0lg67IWl000 service=storage-engine service=store op_name=tsdb_open db_shard_id=94 error="[shard 94] open /var/lib/influxdb2/engine/data/944142f2f3a4127d/autogen/94/index/0/MANIFEST: too many open files"
ts=2023-11-23T13:35:35.637878Z lvl=info msg="Open store (end)" log_id=0lg67IWl000 service=storage-engine service=store op_name=tsdb_open op_event=end op_elapsed=13373.426ms
ts=2023-11-23T13:35:35.638079Z lvl=info msg="Starting retention policy enforcement service" log_id=0lg67IWl000 service=retention check_interval=30m
ts=2023-11-23T13:35:35.638209Z lvl=info msg="Starting precreation service" log_id=0lg67IWl000 service=shard-precreation check_interval=10m advance_period=30m
ts=2023-11-23T13:35:35.639040Z lvl=error msg="Failed to open replications service" log_id=0lg67IWl000 error="open /var/lib/influxdb2/engine/replicationq: too many open files"
Error: open /var/lib/influxdb2/engine/replicationq: too many open files
See 'influxd -h' for help
ts=2023-11-23T13:35:38.676986Z lvl=info msg="Welcome to InfluxDB" log_id=0lg68Ii0000 version=v2.2.0 commit=a2f8538837 build_date=2022-04-06T17:36:40Z
Could you help us navigate this aspect of InfluxDB? Is this a limit that is hit when restarting the InfluxDB service with a large dataset?
For reference, I’ve found this doc concerning v2 install on Mac which mentions this error:
Do you know of equivalent steps for a Linux distro using Docker? (Ubuntu 18.04)
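On Linux, "too many open files" means the process reached its open-file limit (RLIMIT_NOFILE). Since the log shows the failures while shards are being opened at startup, a dataset with many shards is likely to need a higher limit. Here is a minimal shell sketch for checking this inside the container. The helper names are hypothetical, and the process name influxd and the value 65536 are only illustrative assumptions.

```shell
#!/bin/sh
# Hypothetical helpers for inspecting file-descriptor limits on Linux.

# Print the soft "Max open files" limit from a /proc/<pid>/limits-style file.
# Usage: soft_nofile_limit /proc/<pid>/limits
soft_nofile_limit() {
    awk '/^Max open files/ { print $4 }' "$1"
}

# Count the file descriptors a process currently holds.
# Usage: open_fd_count <pid>
open_fd_count() {
    ls "/proc/$1/fd" | wc -l
}

# Example, run inside the container while influxd is up
# (the pid changes on every restart of the loop):
#   pid=$(pgrep -o influxd)
#   soft_nofile_limit "/proc/$pid/limits"
#   open_fd_count "$pid"
#
# Raising the limit when the container is started by hand
# (65536 is an illustrative value, not a recommendation):
#   docker run --ulimit nofile=65536:65536 ... <image>
```

If the open descriptor count approaches the soft limit shortly before the exit, the limit is the likely cause. Besides the docker run flag, supervisord's minfds setting may also raise the limit for the processes it starts. If the container is created through Terraform's Docker provider (an assumption about your setup), its docker_container resource has a ulimit block for the same purpose.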