Query optimization

We are using InfluxDB 1.5.1 on Linux (hardware: Azure Standard_B8ms with 8 vCPUs and 32 GB of RAM) with a database storing about 1 billion points in roughly 100,000 series. Our points have a couple of tags (f_l0, f_l1, …, t_old_data in the query below) and an additional timestamp (f_changed) stored as a field. We need to perform a query of the form

SELECT * FROM tsm_data WHERE f_l0 = '{298C6346-A3A9-4E0B-8DFD-B67C79823E88}' AND f_l1 = '2' AND f_l2 = '{41ADF721-C378-4EED-9137-382FECDFD0A1}' AND f_l3 = '' AND ((f_changed >= 1521737293555000000 AND f_changed <= 1521784770080999900) AND (time >= -9223372036854775800 AND time <= 9223372036854775800)) AND t_old_data = 'True'

i.e., we need to select data based on both the timestamp and the f_changed field. Queries of this form take about 2-3 seconds even with empty results and no write load, which is too long for our use case. Is there something we can do in terms of schema optimization, or is this simply expected behavior for this type of query and database size? We haven't seen much of a change when replacing the SELECT * with a more restricted SELECT, but I guess this would pose an opportunity for optimization?

Hi there,

Using the EXPLAIN command may help you understand what is slowing down your query.

As per the documentation:
The query engine contains a way to estimate the approximate cost of reading an iterator using EXPLAIN.
To explain the potential costs of a query, use a select statement and prefix it with EXPLAIN. For example:

EXPLAIN SELECT count(value) FROM cpu

This will output a single column with human-readable text.

The output of explain will check the potential costs of reading an iterator from disk. Explain will only output what iterators get created from the engine. It does not capture any other information within the query engine like how many points will actually be processed. It just gives a rough overview of what would happen if you read every point from a cursor. While the output can be useful to determine what might be happening, it is not authoritative and the metrics returned can easily be misinterpreted.

For each iterator that is created, the expression used to create the iterator will be printed along with any auxiliary fields that will be associated with the points returned. The output also includes:

  • The number of shards
  • The number of non-unique series
  • The number of files that will be scanned (the same file will show up multiple times for different series)
  • The largest potential number of blocks that may need to be read to exhaust the cursor within the time range
  • The largest potential size of the blocks that may need to be read to exhaust the cursor within the time range

Additionally, there is the command EXPLAIN ANALYZE, which runs the query and counts the number of blocks and points decoded, along with some additional timings.
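For example, you could analyze your original query by prefixing the exact statement from your post:

```sql
EXPLAIN ANALYZE SELECT * FROM tsm_data WHERE f_l0 = '{298C6346-A3A9-4E0B-8DFD-B67C79823E88}' AND f_l1 = '2' AND f_l2 = '{41ADF721-C378-4EED-9137-382FECDFD0A1}' AND f_l3 = '' AND ((f_changed >= 1521737293555000000 AND f_changed <= 1521784770080999900) AND (time >= -9223372036854775800 AND time <= 9223372036854775800)) AND t_old_data = 'True'
```

Since EXPLAIN ANALYZE actually executes the query, expect it to take about as long as the query itself (a few seconds in your case).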

I hope this may provide some insight.