Strange query issues when database reaches ~140G

#1

InfluxDB version 1.2.0

I am witnessing severe query degradation when my database gets to a certain size (~140G as measured in the influxdb/data directory).

My database is quite simple. Series cardinality is around 5900. Retention is 7 days only.

What I am seeing is the following:

  • Select queries in some time-ranges respond within 2 seconds or less.
  • Select queries in other time-ranges either take many seconds to complete or never complete.
  • For the queries that take many seconds to complete, I can improve response times significantly by reducing the result set to return a single field (as opposed to all tags and fields).
  • trace-logging and query-log are both enabled, but no interesting messages appear except slow query notifications.

This behavior makes no sense to me; if anyone has any insights, I would be grateful to hear them.

#2

Based on the behavior you’re describing, my guess would be that you’re running into a similar issue to 7447, whose fixes are slated for the 1.3.0 release. Are you backfilling data (i.e. rewriting historical points)? If so, you’re probably backfilling into the time ranges where you’re seeing stalled queries.

If you can attach the logs as requested in the issue template, that will help us identify whether it’s the same issue. However, if it’s the same issue, the HTTP requests likely won’t return. In that case, run with environment variables GODEBUG=gctrace=1 and GOTRACEBACK=crash and send a SIGQUIT (or press ctrl-backslash) while the server is stalled, and attach the logs here.

#3

Mark, thank you for this information. I will try and collect the logs as instructed.

One further question: our application that fills InfluxDB reads different files, each of which contains time series data. By its nature, the application is bound to backfill often. For instance, it may process one file covering a 30-minute period of one day, and then process another file containing data from a day earlier. Is it, in general, not good for Influx to be constantly adding data out of time order?
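(For concreteness, and purely as an illustration with made-up measurement, tag, and database names: in line-protocol terms, backfilling just means a write whose explicit nanosecond timestamp is earlier than points already stored.)

```shell
# Hypothetical example: two line-protocol points, written in this order. The
# second write is a "backfill" because its nanosecond timestamp is a day
# earlier than the first. All names here are made up for illustration.
TS_MAR27=$(date -u -d '2017-03-27 12:00' +%s)000000000
TS_MAR26=$(date -u -d '2017-03-26 12:00' +%s)000000000

echo "cpu,host=a value=0.51 $TS_MAR27"   # written first
echo "cpu,host=a value=0.47 $TS_MAR26"   # written second: out of time order
# Each line would be POSTed to the /write endpoint, e.g.
#   curl -XPOST 'http://localhost:8086/write?db=mydb' --data-binary '<point>'
```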

#4

I have attached the logs, as instructed, to https://gist.github.com/billevans1963/e4c0ad09ce457c78dde6573eed303924

#5

The system is designed to best handle data arriving in time-ascending order. That said, backfilling in and of itself isn’t always an issue. It becomes problematic when a (typically historical) shard has gone through rounds of compactions and you then insert new points into the same time range: that effectively negates the compaction work, because the new data must be merged into files that have been highly read-optimized. The server has to do even more work when you overwrite points that are highly compacted, because the query engine needs to consult both the original point and the new points to de-duplicate (after a new snapshot and compaction, the query should be responsive again). The fixes for 7447 linked above include some optimizations for that case, so 1.3.X should perform better than 1.2.X in that situation.
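(A toy sketch, not InfluxDB internals: the de-duplication rule amounts to "last write wins per timestamp within a series", which, per the explanation above, is cheap on fresh snapshots but expensive across heavily compacted files.)

```shell
# Toy model of the de-duplication rule (not InfluxDB code): for input lines
# of "timestamp value", where later writes appear later in the input, the
# most recently written value wins for each timestamp.
DEDUP=$(printf '1490572800 0.40\n1490572800 0.47\n1490572900 0.50\n' |
  awk '{ last[$1] = $2 } END { for (t in last) print t, last[t] }' |
  sort -n)
echo "$DEDUP"
```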

Sorry to ask for the logs again, but I didn’t specify that we need to see the logs while the server is hanging on a long-running query. If the server is stalled in the same way as issue 7447 while executing a query, HTTP endpoints may be unresponsive and you may need to send a SIGQUIT.

#6

Mark,
Thank you for your explanation. It makes sense.

I have recreated the hanging-query issue and taken new logs. I overwrote the previous ones, so the new ones are also at https://gist.github.com/billevans1963/e4c0ad09ce457c78dde6573eed303924.

Thanks again for your help in this matter.

#7

Mark,
I was wondering if you had been able to discover anything in the logs that I sent you. Is my issue the same as issue 7447 or is it something else?

Thanks,
Bill

#8

Hi Bill,

Sorry for my delayed response. It isn’t clear to me from those logs whether it’s the same issue.

Are you able to build from the master branch to see if the upcoming changes in 1.3.0 fix the slow query issue you’re experiencing?

#9

Mark,
Yes, I should be able to do that. Is there a wiki page I can go to for instructions?

#10

@bill,

To build InfluxDB from master, first make sure you have a working Go installation and your $GOPATH set up. Then pull down the repo:

# Pull down the influxdb repo
$ git clone git@github.com:influxdata/influxdb.git $GOPATH/src/github.com/influxdata/influxdb
$ cd $GOPATH/src/github.com/influxdata/influxdb

# Install all the tools required to build (gdm)
$ make tools

# Pull down dependencies
$ go get -u -t -f -v ./...

# Peg dependencies to the proper versions
$ gdm restore

# Build InfluxDB binaries (influxd, influx, influx_stress, influx_inspect, influx_tsm)
$ go build ./cmd/...

#11

@jackzampolin

When I install the tools, I get an error. Is this a problem?

make tools
go get github.com/remyoudompheng/go-misc/deadcode
package go/types: unrecognized import path "go/types"
make: *** [tools] Error 1

*** FORGET THIS *** - I was using an old version of go. :slight_smile:

#12

@jackzampolin
I have followed your instructions and all the commands ran without error. However, I am now not sure how to install it.
I am running on Debian and would like to replace the stable 1.2 release I have installed with the one I have compiled. What would be the best way to do this?

Thanks again for all your help.

#13

@bill Glad that’s working for you! You need to replace the old influxd binary with the one you just compiled. First, I would advise checking the CHANGELOG to see whether you need to make any configuration changes (/etc/influxdb/influxdb.conf). Then replace the influxd binary that was installed by your package; it normally lives at /usr/bin/influxd. Finally, restart the process:

$ sudo mv /path/to/compiled/influxd $(which influxd)
$ sudo systemctl restart influxdb
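(Not from the thread, just a cautious variant: keeping the packaged binary around makes rollback a one-line operation. The sketch below uses placeholder files rather than the real /usr/bin/influxd.)

```shell
# Sketch of a safer swap: back up the packaged binary before overwriting it,
# so a rollback is one `mv` away. Placeholder files stand in for the real
# /usr/bin/influxd and the freshly built binary.
mkdir -p demo
echo "old-binary" > demo/influxd            # stands in for /usr/bin/influxd
echo "new-binary" > demo/influxd.compiled   # stands in for the new build

cp demo/influxd demo/influxd.1.2.bak        # keep the 1.2 binary for rollback
mv demo/influxd.compiled demo/influxd       # swap in the new build
cat demo/influxd
```
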

#14

@jackzampolin
One slight problem: I can’t find the ‘influxd’ executable that the ‘go build’ command was supposed to build.

#15

@bill have you checked in $GOPATH/bin/influxd?

#16

@jackzampolin
I don’t have an influxd binary there:

ls -al $GOPATH/bin
total 39848
drwxr-xr-x 2 billevans billevans 4096 Mar 27 07:49 .
drwxr-xr-x 5 billevans billevans 4096 Mar 27 07:48 ..
-rwxr-xr-x 1 billevans billevans 5021202 Mar 27 07:49 aligncheck
-rwxr-xr-x 1 billevans billevans 4991000 Mar 27 07:48 deadcode
-rwxr-xr-x 1 billevans billevans 5087196 Mar 27 07:49 errcheck
-rwxr-xr-x 1 billevans billevans 7993307 Mar 27 07:49 gdm
-rwxr-xr-x 1 billevans billevans 2445668 Mar 27 07:48 gocyclo
-rwxr-xr-x 1 billevans billevans 5184276 Mar 27 07:49 golint
-rwxr-xr-x 1 billevans billevans 5033905 Mar 27 07:49 structcheck
-rwxr-xr-x 1 billevans billevans 5029739 Mar 27 07:49 varcheck

#17

@bill Try to build just the influxd binary explicitly:

$ go build -o ./influxd ./cmd/influxd

It should then appear in your current directory.

#18

Success! I shall now see if it fixes the problem. :slight_smile:

#19

@jackzampolin
@mark

As indicated, I managed to compile from the master branch. I then replaced the 1.2 version of ‘influxd’ with the newly compiled version and re-ran my test. Unfortunately, I get very similar behavior to before:

  • My database has 5 days’ worth of data: 2017 March 21, 22, 23, 24 & 27.
  • It occupies 151 GB (according to ‘du’).
  • 1000 rows of data can be retrieved within 2-3 seconds, but only from days 21, 22, 23 and 27.
  • If I try to select all tags & fields for any time period on March 24, the query never returns, and memory/CPU usage on the influx machine gradually increases.
  • If I try a “select count(field)” for the same time period, I can (after about 10 seconds) get a count back. However, it works for only some fields, not all.
  • Furthermore, a select on only one field for the same time period can work, but again only after several seconds.
#21

Something here sounds off to me. It’s 5,900 series, but only 4 days of data, and the size on disk is ~140 GB? How many fields are you writing? How many values/sec are in each series (or what’s your sampling interval)? If you have 140 GB of data in a four-day period, I’m guessing your queries could be churning through a ton of data. Do you have the specific queries you’re running?