InfluxDB high CPU usage

We’ve been running InfluxDB in production for about 6-8 months. While dealing with some memory issues, I noticed that CPU usage has been rising steadily this week, to the point where performance on our Grafana graphs is being impacted. According to htop, CPU usage is always over 80% and the load average is constantly over 7.

The server is a Dell with dual quad-core CPUs, 128GB RAM, and 3.2TB of disk. There are 9 databases: the 2 largest have ~200K series each, and the rest are small, with fewer than 5K series.

The server itself is dedicated to InfluxDB but does have the Telegraf and Kapacitor services installed (no TICKscripts running).

I read somewhere that high CPU usage could be caused by garbage collection, but I don’t know which metrics to look at.

Any help would be appreciated.

thanks
Garry

Can you pull a CPU profile from the instance with the following command and post the results here?

for i in block goroutine heap profile; do curl -sL -o influxdb-debug-$i.txt "http://localhost:8086/debug/pprof/$i?debug=1"; done && for i in $(seq 6); do curl -sL -o influxdb-debug-vars-${i}0s.txt "http://localhost:8086/debug/vars"; sleep 10; done && tar -cvzf influxdb-debug.$(hostname).$(date +%s).tar.gz influxdb-debug-*.txt && rm influxdb-debug-*.txt

Thanks for the reply. I’m getting an error when uploading the file: Unauthorized file ext.

thanks
Garry

@gcyre Sorry about that! I’ve added .txt, tar.gz and gz to the allowed extensions. Can you try again?

Thanks @jackzampolin, now I’m getting a file size limit error. I’ve broken the file up into 2 files:

influxdb-debug.pc-influxdb-004.1495244369.tar.gz (303.4 KB)

The second file cannot be uploaded; I’m still getting the file size error.

@jackzampolin did you have an opportunity to look at these?

thanks

@jackzampolin do you have any suggestions?

@gcyre I just noticed that I had a message in our private conversation that I hadn’t sent you. Can you repost the Google Drive link here? I have to get one of our core engineers to come take a peek at this.

Sorry for the slow response here. This fell off of my radar.

@jackzampolin no worries, I appreciate your time looking into this. Here’s the link:

https://drive.google.com/file/d/0Bw0v620SnsMUNWNXaHlBRUUtRUE

thanks
Garry

@jackzampolin have you had a chance to look at this?

thanks
Garry

Sorry about the back and forth but we can’t read a profile without knowing the software version that you are running.

Offhand, I can think of a few reasons for seeing the server slow down.

  1. You may be low on physical memory and the process swaps or can’t maintain an efficient cache of hot data. You can see this by looking at available memory at the process / OS level.
  2. You may be running queries that scale with the number of series being read; as you read more series, the queries grow slower. You could run the queries manually and time them to understand their baseline performance.
  3. You are writing data in a way that causes adverse compaction behavior (for example backfilling with new data or overwriting (updating) existing data that has been compacted).
  4. You are sending data inefficiently. For example, you write a lot of old points or write only a few points per batch.
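
For the first item in the list above, one rough way to check whether the host is short on memory or swapping is to read /proc/meminfo. This is only a sketch: it assumes a Linux host whose kernel exposes MemAvailable (3.14 or later), and the function names meminfo_kb and swap_used_kb are made up for illustration.

```shell
# Print one field from /proc/meminfo-style input, in kB.
# The helper names here are hypothetical, not part of any InfluxDB tooling.
meminfo_kb() {
    # $1 = field name (e.g. MemAvailable), $2 = file (defaults to /proc/meminfo)
    awk -v key="$1:" '$1 == key { print $2 }' "${2:-/proc/meminfo}"
}

# Print how much swap is currently in use, in kB (SwapTotal minus SwapFree).
swap_used_kb() {
    local file="${1:-/proc/meminfo}"
    echo $(( $(meminfo_kb SwapTotal "$file") - $(meminfo_kb SwapFree "$file") ))
}

# Example usage:
#   meminfo_kb MemAvailable
#   swap_used_kb
```

A swap_used_kb value that keeps growing while MemAvailable shrinks would point toward the first cause. For the process-level view, `ps -o rss= -C influxd` prints the resident memory of the daemon, assuming it runs under its usual name influxd.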

Also of note is that InfluxDB will do file I/O per database, per retention policy. If you have multiple hot databases, this can be expensive.
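
For the fourth item in the list above (writing only a few points per batch), here is a minimal sketch of batching writes: split a file of line protocol into fixed-size chunks and POST each chunk to the /write endpoint. The file name points.txt, the database name mydb, and the function names are hypothetical, and the default of 5000 lines per batch is only a starting point to tune, not a measured optimum for this server.

```shell
# Lines of line protocol per batch. 5000 is an assumed starting point; tune it.
BATCH_SIZE="${BATCH_SIZE:-5000}"

# Split a line-protocol file (one point per line) into batch files.
split_batches() {
    # $1 = input file, $2 = output directory; creates batch_aa, batch_ab, ...
    local input="$1" outdir="$2"
    mkdir -p "$outdir"
    split -l "$BATCH_SIZE" "$input" "$outdir/batch_"
}

# POST every batch file to the InfluxDB 1.x /write endpoint, stopping on error.
write_batches() {
    # $1 = directory containing the batches, $2 = database name
    local dir="$1" db="$2" f
    for f in "$dir"/batch_*; do
        curl -sS -XPOST "http://localhost:8086/write?db=${db}" --data-binary @"$f" || return 1
    done
}

# Example usage (hypothetical file and database names):
#   split_batches points.txt ./batches
#   write_batches ./batches mydb
```

Sending one request per batch rather than one per point cuts down on HTTP overhead and gives the storage engine larger writes to work with.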

Filling in this bug report template (from the InfluxDB issues on GitHub) would help us figure out what’s happening:

Bug report

System info: [Include InfluxDB version, operating system name, and other relevant details]

Steps to reproduce:

  1. [First Step]
  2. [Second Step]
  3. [and so on…]

Expected behavior: [What you expected to happen]

Actual behavior: [What actually happened]

Additional info: [Include gist of relevant config, logs, etc.]

Also, if this is a performance or locking issue, the following commands are useful for creating debug information for the team.

curl -o profiles.tar.gz "http://localhost:8086/debug/pprof/all?cpu=true"

curl -o vars.txt "http://localhost:8086/debug/vars"
iostat -xd 1 30 > iostat.txt

Please note that it will take at least 30 seconds for the first cURL command above to return a response.
This is because it will run a CPU profile as part of its information gathering, which takes 30 seconds to collect.
Ideally you should run these commands when you’re experiencing problems, so we can capture the state of the system at that time.

If you’re concerned about running a CPU profile (which only has a small, temporary impact on performance), then you can set ?cpu=false or omit ?cpu=true altogether.

Please run those if possible and link them from a gist or simply attach them as a comment to the issue.

influxdb-debug.1497383714.tar.gz (2.7 MB)

iostat.txt (7.2 KB)
vars.txt (801.1 KB)

hi @ryan

Thanks for the response. Here’s the relevant information. I received an error (Unknown profile: all) when running curl -o profiles.tar.gz "http://localhost:8086/debug/pprof/all?cpu=true", so I ran the command that Jack asked for earlier in the thread instead.

System Information:
InfluxDB version: InfluxDB v1.1.1 (git: master e47cf1f2e83a02443d7115c54f838be8ee959644)
OS version: Linux 3.19.0-68-generic #76~14.04.1-Ubuntu
CPU: 2 X Intel Xeon E5-2609 (4 cores per CPU)
RAM: 128GB
Disks: 8x 500GB SAS

Databases: 10, including _internal
Series:
8 databases with 3-5K series each
1 database with ~250K series
1 database with ~500K series

Actual behaviour
Average CPU usage increased from 30% to 70%. The increase happened on May 18, and usage has stayed steady at 70% since. The InfluxDB service has been restarted a couple of times.

Thanks.

InfluxDB v1.1.1 (git: master e47cf1f2e83a02443d7115c54f838be8ee959644) is a pretty old version.

There are a few memory leak and performance fixes in 1.1.2 and 1.1.4, and substantial performance improvements on the 1.2.x branch as well.

Does anything prevent you from upgrading?

Thanks Ryan,

I’ll look at upgrading to 1.2.4

Garry

hi @ryan

I upgraded InfluxDB to 1.2.4 over the weekend. CPU usage has dropped slightly and has been stable. Are there any stats I can look at to confirm that the system is healthy?

thanks
Garry
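
On the question of which stats to watch, and the earlier one about garbage collection: the /debug/vars output includes the Go runtime memstats, so two snapshots taken some seconds apart can give a rough idea of how much time goes to GC pauses. A minimal sketch, assuming the snapshots were saved to files (the names vars-1.txt and vars-2.txt and the helpers memstat and gc_pause_pct are hypothetical) and that fields such as PauseTotalNs, NumGC and GCCPUFraction appear in the output of your version:

```shell
# Print the first numeric value of a memstats field from a saved /debug/vars file.
# Uses grep rather than a JSON parser, so it assumes "Field":number formatting.
memstat() {
    # $1 = field name (e.g. PauseTotalNs), $2 = saved /debug/vars file
    grep -o "\"$1\":[[:space:]]*[-0-9.eE+]*" "$2" | head -n 1 | sed 's/.*:[[:space:]]*//'
}

# Percentage of wall-clock time spent in GC pauses between two snapshots.
gc_pause_pct() {
    # $1 = earlier snapshot, $2 = later snapshot, $3 = seconds between them
    local before after
    before=$(memstat PauseTotalNs "$1")
    after=$(memstat PauseTotalNs "$2")
    awk -v b="$before" -v a="$after" -v s="$3" \
        'BEGIN { printf "%.2f\n", (a - b) / (s * 1e9) * 100 }'
}

# Example usage (hypothetical file names, snapshots taken 60 seconds apart):
#   curl -s -o vars-1.txt "http://localhost:8086/debug/vars"; sleep 60
#   curl -s -o vars-2.txt "http://localhost:8086/debug/vars"
#   gc_pause_pct vars-1.txt vars-2.txt 60
#   memstat GCCPUFraction vars-2.txt
```

PauseTotalNs only counts stop-the-world pauses, while GCCPUFraction is the runtime’s own estimate of the share of CPU spent in GC since the process started. Neither is a complete health check, and a real JSON parser would be more robust than grep if one is available.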

I’m running InfluxDB v1.3.7 on an ARM embedded computer (BeagleBone). InfluxDB is the process with the highest CPU usage. Can you please look at the debug log to see what is wrong with my installation? influxdb-debug.phen.1509444957.tar.gz