Hi all, I've been trying to troubleshoot an out-of-memory crash in InfluxDB that happens maybe once every 45-60 days, but I haven't been able to find anything concrete. It typically happens overnight, when no users would be running dashboard queries or the like - just Kapacitor running stream tasks.
We run daily incremental backups, with a full backup occurring on the weekend. This crash happened about 3 hours after the incremental finished, so it seems unrelated; the incremental takes about 2 minutes to complete. (The backup commands are sketched just below.)
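For reference, the backup commands are essentially the following (the paths and the -since timestamp are illustrative, not our exact values):

# nightly incremental: only copies shards with data newer than -since
influxd backup -portable -db telegraf -since 2020-12-15T00:00:00Z /backups/incr/

# weekend full
influxd backup -portable -db telegraf /backups/full/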
According to the OSS hardware sizing docs, I believe I should be in good shape against the 2 CPU / 4 GB memory recommendation (see attached image), so I didn't want to start blindly changing config settings. I could certainly just throw more memory at it, but I'd like to understand a bit more about what's going on first, in case there's something obvious I'm missing.
Any suggestions would be greatly appreciated! Let me know if I can provide any other diagnostic info!
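One thing I could capture next time: InfluxDB exposes Go runtime stats at /debug/vars, so something along these lines (assuming the default port 8086 and jq installed) would log heap usage in the run-up to a crash:

# poll heap stats once a minute
while true; do
  echo -n "$(date -u +%FT%TZ) "
  curl -s http://localhost:8086/debug/vars | jq -c '.memstats | {HeapAlloc, HeapInuse, Sys}'
  sleep 60
done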
4 GB Mem
Running on AWS gp2 storage at 1200 IOPS
Converted to the TSI (tsi1) index
Only config changes in influxdb.conf are the storage paths and the switch to TSI (snippet just after this list)
64.24k writes/min = ~1071 writes per second
~94 GB DB size
Only sending Telegraf data at the moment, on a 400-day retention policy
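For completeness, the influxdb.conf changes amount to roughly this (directory paths are placeholders for our actual mounts):

[data]
  dir = "/influxdb/data"        # moved off the default /var/lib/influxdb/data
  wal-dir = "/influxdb/wal"     # likewise for the WAL
  index-version = "tsi1"        # switched from the default "inmem" index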
2020-12-16T08:53:42 through :45 - Level 1, 2, 3, and full tsm1 compactions occur.
2020-12-16T08:53:46 through :54:11 - Telegraf writes to the DB succeed with 204 status codes.
2020-12-16T08:54:11 - First instance of a 500 timeout on DB write:
"POST /write?consistency=any&db=telegraf HTTP/1.1" 500 20 "-" "Telegraf/1.13.2" 43d8c9c2-3f7c-11eb-a797-0ab3985957e0 10915754
ts=2020-12-16T08:54:22.088914Z lvl=error msg=" - \"timeout\"" log_id=0QNSgWXG000 service=httpd
… repeated stretches where ~20 Telegraf writes complete, but mostly dominated by 500 timeouts (Telegraf output settings are at the end of this post for reference) …
2020-12-16T09:08:39 - Last log written before out of memory error:
ts=2020-12-16T09:08:39.203868Z lvl=error msg=" - \"timeout\"" log_id=0QNSgWXG000 service=httpd

fatal error: runtime: out of memory

runtime stack:
runtime.throw(0x166acb8, 0x16)
        /usr/local/go/src/runtime/panic.go:774 +0x72
runtime.sysMap(0xc1d8000000, 0x4000000, 0x35aec78)
        /usr/local/go/src/runtime/mem_linux.go:169 +0xc5
runtime.(*mheap).sysAlloc(0x3595c80, 0x2000, 0x2000, 0x7fffbefb8c60)
        /usr/local/go/src/runtime/malloc.go:701 +0x1cd
runtime.(*mheap).grow(0x3595c80, 0x1, 0xffffffff)
        /usr/local/go/src/runtime/mheap.go:1255 +0xa3
runtime.(*mheap).allocSpanLocked(0x3595c80, 0x1, 0x35aec88, 0xc00004b820)
        /usr/local/go/src/runtime/mheap.go:1170 +0x266
runtime.(*mheap).alloc_m(0x3595c80, 0x1, 0x7f4788ea0011, 0x45d0fa)
        /usr/local/go/src/runtime/mheap.go:1022 +0xc2
runtime.(*mheap).alloc.func1()
        /usr/local/go/src/runtime/mheap.go:1093 +0x4c
runtime.systemstack(0x0)
        /usr/local/go/src/runtime/asm_amd64.s:370 +0x66
runtime.mstart()
        /usr/local/go/src/runtime/proc.go:1146
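For context on the write path, the Telegraf side uses the standard influxdb output; its defaults look like this (shown for reference, since the client-side timeout here is separate from the server-side "timeout" errors above):

[[outputs.influxdb]]
  urls = ["http://localhost:8086"]  # placeholder; we point at the DB host
  database = "telegraf"
  timeout = "5s"                    # client-side write timeout

And since only the paths and index were changed, these memory-related influxdb.conf settings are still at their defaults (values as documented for 1.x), in case one of them jumps out as relevant:

[data]
  cache-max-memory-size = "1g"        # writes are rejected once a shard's cache hits this
  cache-snapshot-memory-size = "25m"  # cache size that triggers a snapshot to a new TSM file
  max-concurrent-compactions = 0      # 0 = use up to 50% of cores at runtime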