OOM issues with Influx

#1

Hello

We’ve had Influx running in production for a few months and recently started having issues with OOM-killer shutting down the influx service.

The server is decently powered: dual quad-core CPUs with 128GB of RAM and lots of disk. We have about 800 measurements with roughly 800K series. Is there any documentation that would point me to the areas I should check to determine the health of our installation?

Not sure where to start to troubleshoot this.

thanks
Garry

#2

@gcyre do you have a graph of memory utilization on the box? Also are there any other processes running on that host?

#3

@jackzampolin

Here are 2 graphs I’m looking at

#4

The only services running on this server are InfluxDB, Telegraf and Kapacitor, but we haven’t implemented any TICKscripts yet.

#5

I’ve been digging into the issue more and created some graphs based on the Influx internal metrics. From what I can see, there hasn’t really been any pressure on memory or CPU. I’m beginning to think there isn’t an ongoing problem and that what we noticed was an isolated incident.

Is there a way to limit the information being logged to influxd.log? It’s being filled with [httpd] messages; all I really want to see is any errors that are happening.

thanks
Garry

#6

@gcyre you can always use some grep-fu to look for 500’s: journalctl -u influxdb | grep " 500 "
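
To catch every 5xx response rather than only 500, a slightly stricter filter might help. This is just a rough sketch: filter_5xx is a made-up helper name, and it assumes the [httpd] lines follow the usual access-log layout where the status code comes right after the quoted request, so check it against a few of your own log lines first.

```shell
# Hypothetical helper: keep only [httpd] lines whose HTTP status is 5xx.
# Assumed line layout: [httpd] ... "METHOD /path HTTP/1.1" STATUS BYTES ...
filter_5xx() {
  # The closing quote of the request, a space, then a 3-digit code starting with 5.
  grep -E '\[httpd\].*" 5[0-9]{2} '
}

# Usage, with the same unit name as the command above:
#   journalctl -u influxdb | filter_5xx
```

Anchoring on the closing quote keeps it from matching a stray " 500 " elsewhere in the line, such as a response size.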