Hi
We have an InfluxDB 2.0 OSS installation with about 2000 Telegraf agents connected.
It runs on a RHEL 9 server with 12 GB memory and 4 CPUs. Swap was 2 GB, now increased to 8 GB, but it is still running out of memory and swap.
It’s a default installation, no custom config.
Retention is 7 days on the buckets.
The server has been running fine for 10 months, no issues at all. But for the last 7 days the influxd process has stopped every night around 01:20. This is in Norway, so UTC+1.
The influxd service is then restarted automatically and works fine again, except for one time when the whole server hung and we had to force a reboot.
We haven't made any changes to the Linux server or the InfluxDB setup lately.
The only thing new on the database side is that we have used the Starlark plugin to rename some fields.
E.g. we have renamed the field "used" to "used_mb" for cosmetic and informational reasons.
We have rolled this change out to about 200 servers so far.
So one server can now have two fields, used and used_mb, but only used_mb receives metrics after the name change.
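For reference, the rename is done with a Telegraf Starlark processor along these lines (a minimal sketch of our approach; the "disk" measurement filter is an illustrative assumption, not the exact production config):

[[processors.starlark]]
  # Example: only touch the disk measurement (namepass here is an assumption)
  namepass = ["disk"]
  source = '''
def apply(metric):
    # Move the value from "used" to "used_mb"; "used" stops being written
    if "used" in metric.fields:
        metric.fields["used_mb"] = metric.fields.pop("used")
    return metric
'''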
So the question is: can this have some impact on the built-in "cleaning, retention, sync" jobs that run regularly, and specifically, is there a job that only runs once every 24 hours?
It feels like it must be a specific InfluxDB job that causes this, since it happens at the same time every day…
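For context, the periodic storage jobs I can find in the InfluxDB 2.x configuration reference are controlled by these intervals (defaults as far as I know; we have not overridden them, so the exact values are my assumption):

storage-retention-check-interval = "30m0s"
storage-compact-full-write-cold-duration = "4h0m0s"

Neither is a 24-hour interval, which makes the fixed 01:20 timing even stranger.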
I have also enabled some debug settings in the influxd config today:
flux-log-enabled = true
log-level = "warn"
storage-max-index-log-file-size = 2097152
So I will see tomorrow if the debug output shows any new info.
Here is some of the output from the messages log from when the process was killed. It is also attached as a file to the case.
I need some help here to pinpoint the root cause.
Any help will be appreciated!
Feb 1 01:20:02 p0okularapp01 systemd[382046]: Starting D-Bus User Message Bus Socket...
Feb 1 01:20:02 p0okularapp01 systemd[382046]: Starting Create User's Volatile Files and Directories...
Feb 1 01:20:02 p0okularapp01 systemd[382046]: Finished Create User's Volatile Files and Directories.
Feb 1 01:20:02 p0okularapp01 systemd[382046]: Listening on D-Bus User Message Bus Socket.
Feb 1 01:20:02 p0okularapp01 systemd[382046]: Reached target Sockets.
Feb 1 01:20:02 p0okularapp01 systemd[382046]: Reached target Basic System.
Feb 1 01:20:02 p0okularapp01 systemd[382046]: Reached target Main User Target.
Feb 1 01:20:02 p0okularapp01 systemd[382046]: Startup finished in 104ms.
Feb 1 01:20:02 p0okularapp01 systemd[1]: Started User Manager for UID 0.
Feb 1 01:20:02 p0okularapp01 systemd[1]: Started Session 17745 of User root.
Feb 1 01:20:02 p0okularapp01 systemd[1]: Started Session 17746 of User root.
Feb 1 01:20:27 p0okularapp01 kernel: HangDetector invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
Feb 1 01:20:28 p0okularapp01 kernel: CPU: 0 PID: 1035 Comm: HangDetector Kdump: loaded Not tainted 5.14.0-503.19.1.el9_5.x86_64 #1
Feb 1 01:20:28 p0okularapp01 kernel: Hardware name: VMware, Inc. VMware20,1/440BX Desktop Reference Platform, BIOS VMW201.00V.21805430.B64.2305221830 05/22/2023
Feb 1 01:20:28 p0okularapp01 kernel: Call Trace:
Feb 1 01:20:28 p0okularapp01 kernel: <TASK>
Feb 1 01:20:28 p0okularapp01 kernel: dump_stack_lvl+0x34/0x48
Feb 1 01:20:28 p0okularapp01 kernel: dump_header+0x4a/0x213
Feb 1 01:20:28 p0okularapp01 kernel: oom_kill_process.cold+0xb/0x10
Feb 1 01:20:28 p0okularapp01 kernel: out_of_memory+0xed/0x2e0
Feb 1 01:20:28 p0okularapp01 kernel: __alloc_pages_slowpath.constprop.0+0x6bc/0x970
Regards
Kjetil Klevengen