[InfluxDB 1.8] Out of memory every 45-60 days

Hi all, I’ve been trying to troubleshoot an out-of-memory crash in InfluxDB that happens roughly once every 45-60 days, but I haven’t been able to find anything concrete. It typically happens overnight, so no users would be running dashboard queries or the like - just Kapacitor running stream queries.

We run daily incremental backups, with a full backup occurring on the weekend. This crash happened about 3 hours after the incremental finished, so it seems unrelated. The incremental takes about 2 minutes to complete.

According to the OSS hardware sizing docs, I believe I’m in good shape against the 2 CPU / 4 GB memory requirements (see attached image), so I didn’t really want to start messing with random config settings. I could certainly just throw more memory at it, but I’d like to understand a bit more about what’s going on first, in case there’s something obvious I’m missing.

Any suggestions would be greatly appreciated! Let me know if I can provide any other diagnostics info!

System stats:
2 vCPU
4 GB Mem
Running on AWS gp2 storage at 1200 IOPS
Converted to the TSI (tsi1) index
Only config changes in influxdb.conf are to change storage paths & change to TSI.
64.24k writes/min ≈ 1071 writes per second
20,128 series
~94 GB DB size
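For completeness, the two non-default settings in influxdb.conf would look roughly like this (the paths shown are illustrative, not my actual ones; everything else is left at its default):

```toml
[meta]
  dir = "/data/influxdb/meta"      # illustrative storage path

[data]
  dir = "/data/influxdb/data"      # illustrative storage path
  wal-dir = "/data/influxdb/wal"   # illustrative storage path
  index-version = "tsi1"           # disk-based TSI index instead of the default in-memory index
```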

Only sending Telegraf data at the moment, into a database with a 400-day retention policy
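As a back-of-envelope sanity check on those numbers (this assumes the 64.24k writes/min figure holds steady and that the database has already aged past the full 400-day retention window - both assumptions on my part):

```python
# Rough storage math from the stats above. The input figures come from the
# post; "the DB already spans the full retention window" is an assumption.
writes_per_min = 64_240
retention_days = 400
db_size_bytes = 94e9  # ~94 GB

points_per_day = writes_per_min * 60 * 24
total_points = points_per_day * retention_days
bytes_per_point = db_size_bytes / total_points

print(f"{total_points:.3e} points retained")     # ~3.7e10
print(f"{bytes_per_point:.2f} bytes per point")  # ~2.54
```

A couple of bytes per point is plausible for TSM compression, so the 94 GB on disk looks consistent with the write rate rather than a sign of runaway series growth.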

chronograf-1.8.5-1.x86_64
kapacitor-1.5.4-1.x86_64
influxdb-1.8.2-1.x86_64

Timeline:
2020-12-16T08:53:42 through :45 - Level 1, 2, 3, and full tsm1 compactions occur.
2020-12-16T08:53:46 through 54:11 - Telegraf 204 success codes on write to DB.
2020-12-16T08:54:11 - First instance of a 500 timeout on DB write:
"POST /write?consistency=any&db=telegraf HTTP/1.1" 500 20 "-" "Telegraf/1.13.2" 43d8c9c2-3f7c-11eb-a797-0ab3985957e0 10915754
ts=2020-12-16T08:54:22.088914Z lvl=error msg="[500] - \"timeout\"" log_id=0QNSgWXG000 service=httpd
… bunch of groupings where ~20 Telegraf writes will complete, but mostly dominated by 500 code timeouts …
2020-12-16T09:08:39 - Last log written before out of memory error:

ts=2020-12-16T09:08:39.203868Z lvl=error msg="[500] - \"timeout\"" log_id=0QNSgWXG000 service=httpd
fatal error: runtime: out of memory

runtime stack:
runtime.throw(0x166acb8, 0x16)
        /usr/local/go/src/runtime/panic.go:774 +0x72
runtime.sysMap(0xc1d8000000, 0x4000000, 0x35aec78)
        /usr/local/go/src/runtime/mem_linux.go:169 +0xc5
runtime.(*mheap).sysAlloc(0x3595c80, 0x2000, 0x2000, 0x7fffbefb8c60)
        /usr/local/go/src/runtime/malloc.go:701 +0x1cd
runtime.(*mheap).grow(0x3595c80, 0x1, 0xffffffff)
        /usr/local/go/src/runtime/mheap.go:1255 +0xa3
runtime.(*mheap).allocSpanLocked(0x3595c80, 0x1, 0x35aec88, 0xc00004b820)
        /usr/local/go/src/runtime/mheap.go:1170 +0x266
runtime.(*mheap).alloc_m(0x3595c80, 0x1, 0x7f4788ea0011, 0x45d0fa)
        /usr/local/go/src/runtime/mheap.go:1022 +0xc2
runtime.(*mheap).alloc.func1()
        /usr/local/go/src/runtime/mheap.go:1093 +0x4c
runtime.systemstack(0x0)
        /usr/local/go/src/runtime/asm_amd64.s:370 +0x66
runtime.mstart()
        /usr/local/go/src/runtime/proc.go:1146
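One detail worth pulling out of the trace: the second argument to runtime.sysMap is the size of the mapping the Go runtime asked the kernel for. Decoding it shows the final, failing request was an ordinary 64 MiB heap arena (the standard arena size on linux/amd64), not one giant allocation - i.e. the process was already at the memory ceiling when a routine heap grow pushed it over:

```python
# Decode the failing mapping size from the stack trace:
#   runtime.sysMap(0xc1d8000000, 0x4000000, 0x35aec78)
request_bytes = 0x4000000
print(request_bytes // (1024 * 1024), "MiB")  # 64 MiB
```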

Hi Kevin,

My 2 cents on this: make sure InfluxDB is the only thing installed on this server, since InfluxDB has a habit of reserving all the memory for itself even when it isn’t using it. Install Kapacitor, Chronograf, and Telegraf on another server - at least then you’ll know their memory usage isn’t affected by Influx, and they won’t run out of memory because of it.
Regards
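Following up on the memory point: since the fatal error is the Go runtime failing to grow its heap, another low-risk experiment (my suggestion, not something from the sizing docs) is lowering the Go garbage collector’s GOGC target below its default of 100, which trades some extra CPU for a smaller steady-state heap. With a systemd-managed install that can be done with a drop-in:

```ini
# /etc/systemd/system/influxdb.service.d/gogc.conf
# Create via: systemctl edit influxdb
# GOGC=50 makes the Go GC run twice as often as the default of 100,
# shrinking heap overhead at the cost of extra CPU. The value 50 is
# illustrative - tune it while watching memory and CPU.
[Service]
Environment=GOGC=50
```

Then `systemctl daemon-reload` and restart influxdb for it to take effect.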