I am seeing severe query degradation once my database reaches a certain size (~140 GB, as measured in the influxdb/data directory).
My database is quite simple. Series cardinality is around 5900. Retention is 7 days only.
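(For reference, the cardinality and retention figures above come from checks roughly like the ones below; 'mydb' stands in for the real database name.)

# Approximate series count (the CLI output includes a header line, so this is a rough figure)
$ influx -database 'mydb' -execute 'SHOW SERIES' | wc -l
# Retention policies configured on the database
$ influx -execute 'SHOW RETENTION POLICIES ON "mydb"'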
What I am seeing is the following:
Select queries in some time ranges respond within 2 seconds or less.
Select queries in other time ranges either take many seconds to complete or never complete.
For the queries that take many seconds to complete, I can improve response times significantly by reducing the result set to return a single field (as opposed to all tags and fields).
Trace logging and the query log are both enabled, but no interesting messages appear apart from slow-query notifications.
This behavior makes no sense to me; if anyone has any insights, I would be grateful to hear them.
Based on the behavior you're describing, my guess would be that you're running into an issue similar to 7447, whose fixes are slated for the 1.3.0 release. Are you backfilling data (i.e. rewriting historical points)? If so, you're probably backfilling into the time ranges where you're seeing stalled queries.
If you can attach the logs as requested in the issue template, that will help us identify whether it's the same issue. However, if it's the same issue, the HTTP requests likely won't return. In that case, run with the environment variables GODEBUG=gctrace=1 and GOTRACEBACK=crash and send a SIGQUIT (or press ctrl-backslash) while the server is stalled, and attach the logs here.
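Roughly, assuming a systemd-managed install on Debian (the service name and unit layout may differ on your system), that can look like this:

# Add the debug variables to the service environment, then restart
$ sudo systemctl edit influxdb
#   [Service]
#   Environment="GODEBUG=gctrace=1"
#   Environment="GOTRACEBACK=crash"
$ sudo systemctl restart influxdb

# While the server is stalled on a query, dump the goroutine stacks
$ sudo kill -QUIT "$(pgrep influxd)"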
Mark, thank you for this information. I will try and collect the logs as instructed.
One further question: our application that fills InfluxDB reads different files, each of which contains time series data. By its nature, the application is bound to backfill often. For instance, it may process one file that covers one 30-minute period of a day, and then process another file that contains data from a day earlier. Is it, in general, bad for InfluxDB to be constantly adding data out of time order?
The system is designed to handle data best when it arrives in time-ascending order. That said, backfilling in and of itself isn't always an issue. It becomes problematic when you have a (typically historical) shard that has gone through rounds of compactions and you then insert new points into the same time range: that effectively negates the compaction work, because the new data has to be inserted into files that have been highly read-optimized. The server has to do even more work when you overwrite points that are highly compacted, because the query engine needs to consult both the original points and the new points to de-duplicate (although after a new snapshot and compaction, the query should be responsive again). The fixes for 7447 linked above include some optimizations for that case, so 1.3.x should perform better than 1.2.x in that situation.
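As a concrete illustration, a backfilled write is simply a write with an explicit timestamp in the past, which lands in an older shard rather than the current one. A hypothetical example over the HTTP API ('mydb', the measurement, and the values are placeholders, not from this thread):

# The explicit epoch-seconds timestamp (here 2017-03-24T00:00:00Z) places the point
# in an older, possibly already-compacted shard
$ curl -i -XPOST 'http://localhost:8086/write?db=mydb&precision=s' \
    --data-binary 'cpu_load,host=server01 value=0.64 1490313600'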
Sorry to ask for the logs again, but I didn't specify that we need to see the logs while the server is hanging on a long-running query. If the server is stalled in the same way as in issue 7447 while executing a query, the HTTP endpoints may be unresponsive and you may need to send a SIGQUIT.
Mark,
I was wondering if you had been able to discover anything in the logs that I sent you. Is my issue the same as issue 7447 or is it something else?
# Pull down the influxdb repo
$ git clone git@github.com:influxdata/influxdb.git $GOPATH/src/github.com/influxdata/influxdb
$ cd $GOPATH/src/github.com/influxdata/influxdb
# Install all the tools required to build (gdm)
$ make tools
# Pull down dependencies
$ go get -u -t -f -v ./...
# Peg dependencies to the proper versions
$ gdm restore
# Build InfluxDB binaries (influxd, influx, influx_stress, influx_inspect, influx_tsm)
$ go build ./cmd/...
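# (Optional extra step, an assumption on my part rather than part of the official
# instructions:) install the binaries into $GOPATH/bin and confirm what was built
$ go install ./cmd/...
$ $GOPATH/bin/influxd version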
@jackzampolin
I have followed your instructions and all the commands ran without error. However, I am now not sure how to install the result.
I am running on Debian and would like to replace the stable 1.2 release I currently have installed with the version I have compiled. What would be the best way to do this?
@bill Glad that's working for you! You need to replace the old influxd binary with the one you just compiled. First, I would advise checking the CHANGELOG to see whether you need to make any configuration changes (/etc/influxdb/influxdb.conf). Once that is done, you can replace the influxd binary that was installed by your package; it normally lives at /usr/bin/influxd. Once that is complete, you will need to restart the process:
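Assuming the stock Debian packaging, something along these lines (the service name and init system may differ on your install):

$ sudo systemctl restart influxdb
# or, on sysvinit systems:
$ sudo service influxdb restart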
As indicated, I managed to compile from the master branch. After that I replaced the 1.2 version of influxd with the newly compiled version and re-ran my test. Unfortunately, I get behavior very similar to before:
My database has 5 days' worth of data: March 21, 22, 23, 24 & 27, 2017.
It occupies 151 GB (according to du).
1000 rows of data can be retrieved within 2-3 seconds, but only from days 21, 22, 23, and 27.
If I try to select all tags & fields for any time period on March 24, it never returns, and memory/CPU usage on the InfluxDB machine gradually increases.
If I try doing a "select count(field)" for the same time period, I can (after about 10 seconds) get a count back. However, it does not work for all fields, only some.
Furthermore, if I do a select on only one field for the same time period it can work, but again only after several seconds.
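For illustration, the query shapes are roughly the following; the database, measurement, and field names here are placeholders rather than my real schema:

# All tags and fields for a window on March 24 (never returns)
$ influx -database 'mydb' -execute "SELECT * FROM my_measurement WHERE time >= '2017-03-24T00:00:00Z' AND time < '2017-03-24T00:30:00Z'"
# Count of a single field over the same window (returns after ~10 s, but only for some fields)
$ influx -database 'mydb' -execute "SELECT COUNT(my_field) FROM my_measurement WHERE time >= '2017-03-24T00:00:00Z' AND time < '2017-03-24T00:30:00Z'"
# A single field over the same window (works, but takes several seconds)
$ influx -database 'mydb' -execute "SELECT my_field FROM my_measurement WHERE time >= '2017-03-24T00:00:00Z' AND time < '2017-03-24T00:30:00Z'"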
Something here sounds off to me. It's 5,900 series, but only 4 days of data and the size on disk is 140 GB? How many fields are you writing? How many values/sec are in each series (or what's your sampling interval)? If you have 140 GB of data in a four-day period, I'm guessing your query could be churning through a ton of data. Do you have the specific queries you're running?