Debug extremely slow queries


I have set up InfluxDB on two machines, both of roughly the same capacity (8 cores, 32GB+ RAM, SSD), but I get extremely different query times. The same query takes 13 seconds on one machine and 16 minutes on the other, and I am trying to figure out why. I ingest the same data on both. It is fewer than 10 time series, but with 25 points per second. Shard duration is 9000 weeks (so only one shard). Since I just want to import everything and then query afterwards, I have the following settings:
engine = "tsm1"
cache-snapshot-write-cold-duration = "1m0s"
compact-full-write-cold-duration = "1m"

Once compaction has finished, both databases are 14GB. I ingested the data the same way on both machines: a single import process pushing in the sensors chronologically per series (so each series comes in order, but one series at a time).

The query is a "SELECT count(Value) FROM … WHERE time >= 1286253145320ms AND time <= 1291527145000ms AND Tag3=Value5 AND Tag1=Value1". The count is 130 934 871 on both instances, while the count for the whole series (no time bounds) is 564 584 233.
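To put those numbers in perspective, here is a rough back-of-the-envelope sketch in Python. It only uses the figures quoted in this post; the function name `summarize` is made up for illustration, and it assumes the 25 points per second applies to a single series, which is an assumption rather than something confirmed above.

```python
from datetime import datetime, timezone

# Figures quoted in the post.
START_MS = 1286253145320
END_MS = 1291527145000
POINTS_IN_RANGE = 130_934_871
RATE_HZ = 25  # assumption: 25 points per second for one series


def summarize(start_ms, end_ms, points, fast_s=13.0, slow_s=16 * 60.0):
    """Derive the time span and scan throughput implied by the query timings."""
    span_s = (end_ms - start_ms) / 1000.0
    return {
        "start_utc": datetime.fromtimestamp(start_ms / 1000, tz=timezone.utc),
        "end_utc": datetime.fromtimestamp(end_ms / 1000, tz=timezone.utc),
        "span_days": span_s / 86400,
        # Points one series would produce over the span at RATE_HZ.
        "expected_points_one_series": span_s * RATE_HZ,
        "fast_points_per_s": points / fast_s,
        "slow_points_per_s": points / slow_s,
        "slowdown": slow_s / fast_s,
    }


if __name__ == "__main__":
    for name, value in summarize(START_MS, END_MS, POINTS_IN_RANGE).items():
        print(f"{name}: {value}")
```

Under that assumption the queried window is about 61 days and the count matches a single 25 Hz series closely, so the fast machine scans roughly 10 million points per second while the slow one manages roughly 136 thousand, a factor of about 74.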

Other queries show similar performance differences between the two machines.

Using "EXPLAIN ANALYZE" I see nothing that stands out between the instances. The only difference is that on the slow instance I get "float_blocks_size_bytes: 696102433" while on the fast one I get "float_blocks_size_bytes: 700494097", but this seems like an insignificant difference.

On both instances there is basically no disk IO while the query runs; everything seems to be cached in RAM.

The TSM files are not identical on both instances, but there are equally many of them (41), and they are all pretty much the same size (±2MB). I wondered whether blocks might be overlapping on the slow instance, but I wrote a Python program to check (using influx_inspect dumptsm), and all seems good.
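For anyone wanting to repeat that check, here is a minimal sketch of the idea, not the actual script. It assumes the block index entries have already been parsed out of the influx_inspect dumptsm output into (series key, min time, max time) tuples; that parsing step, the tuple layout, and the name `find_overlaps` are hypothetical.

```python
from collections import defaultdict


def find_overlaps(blocks):
    """Return (key, earlier_block, later_block) for blocks of the same
    series key whose inclusive [min_time, max_time] ranges overlap.

    `blocks` is an iterable of (series_key, min_time, max_time) tuples,
    a hypothetical layout produced by parsing the dumptsm output.
    """
    by_key = defaultdict(list)
    for key, tmin, tmax in blocks:
        by_key[key].append((tmin, tmax))

    overlaps = []
    for key, ranges in by_key.items():
        ranges.sort()
        # Track the block reaching furthest in time so far, so that a
        # long block swallowing several later ones is also detected.
        furthest = ranges[0]
        for cur in ranges[1:]:
            if cur[0] <= furthest[1]:
                overlaps.append((key, furthest, cur))
            if cur[1] > furthest[1]:
                furthest = cur
    return overlaps


if __name__ == "__main__":
    sample = [("a", 0, 9), ("a", 10, 19), ("a", 15, 25), ("b", 0, 5)]
    print(find_overlaps(sample))
```

An empty result means no two blocks of the same series cover the same timestamps, which is the "all seems good" case described above.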

I am now completely out of ideas. 13 seconds is kind of slow, 16 minutes is unusable, and until I understand why it performs as it does, it is hard to proceed with adopting InfluxDB on a larger scale. Does anyone have any idea?

Edit: I filed a bug report: #9476