Reduce large query time

Our query is actually running via the influxdb python library but I’ve reproduced the same query time via the CLI client:

influx --host 10.0.5.183 --precision RFC3339 -execute "select * from turbine_ops.permanent.turbine_interval where ((turbine_id = 'NKWF-T15' or turbine_id = 'NKWF-T41' or turbine_id = 'NKWF-T23' or turbine_id = 'NKWF-T19' or turbine_id = 'NKWF-T51' or turbine_id = 'NKWF-T14' or turbine_id = 'NKWF-T42' or turbine_id = 'NKWF-T26' or turbine_id = 'NKWF-T39' or turbine_id = 'NKWF-T49' or turbine_id = 'NKWF-T38') and time >= '2019-05-01')">/dev/null

Influx is running on a r5.large EC2 instance with EBS a general purpose SSD (gp2) volume, the CLI is on an EC2 in the same subnet, and the above query takes about 20 seconds, returning 747120 rows, each having 1 tag (turbine_id) and 5 fields (all decimal values). Does this seem normal?

We need to reduce the 20s significantly. Via htop on the influx host I see no significant change in RAM usage, a brief CPU spike that lasts ~1s at the start of query, and then no subsequent CPU activity.

Shard duration is set to 1 year. Please let me know any optimizations we may be able to make.

show series exact cardinality on turbine_ops
name: turbine_interval
count
-----
11

I tried scaling the influxdb host to r5.8xlarge and the query time did not change at all.

explain select * from turbine_ops.permanent.turbine_interval where ((turbine_ = 'NKWF-T15' or turbine_id = 'NKWF-T41' or turbine_id = 'NKWF-T23' or turbine_id = 'NKWF-T19' or turbine_id = 'NKWF-T51' or turbine_id = 'NKWF-T14' or turbine_id = 'NKWF-T42' or turbine_id = 'NKWF-T26' or turbine_id = 'NKWF-T39' or turbine_id = 'NKWF-T49' or turbine_id = 'NKWF-T38') and time >= '2019-05-01')

QUERY PLAN

EXPRESSION:
AUXILIARY FIELDS: active_power::float, “duration”::integer, rotor_rpm::float, turbine_id::tag, wind_speed::float, yaw_direction::float
NUMBER OF SHARDS: 1
NUMBER OF SERIES: 10
CACHED VALUES: 0
NUMBER OF FILES: 150
NUMBER OF BLOCKS: 3515
SIZE OF BLOCKS: 12403470

explain analyze select * from turbine_ops.permanent.turbine_interval where ((turbine_ = 'NKWF-T15' or turbine_id = 'NKWF-T41' or turbine_id = 'NKWF-T23' or turbine_id = 'NKWF-T19' or turbine_id = 'NKWF-T51' or turbine_id = 'NKWF-T14' or turbine_id = 'NKWF-T42' or turbine_id = 'NKWF-T26' or turbine_id = 'NKWF-T39' or turbine_id = 'NKWF-T49' or turbine_id = 'NKWF-T38') and time >= '2019-05-01')

EXPLAIN ANALYZE

.
└── select
├── execution_time: 1.442047426s
├── planning_time: 2.105094ms
├── total_time: 1.44415252s
└── build_cursor
├── labels
│ └── statement: SELECT active_power::float, “duration”::integer, rotor_rpm::float, turbine_id::tag, wind_speed::float, yaw_direction::float FROM turbine_ops.permanent.turbine_interval WHERE turbine_ = ‘NKWF-T15’ OR turbine_id::tag = ‘NKWF-T41’ OR turbine_id::tag = ‘NKWF-T23’ OR turbine_id::tag = ‘NKWF-T19’ OR turbine_id::tag = ‘NKWF-T51’ OR turbine_id::tag = ‘NKWF-T14’ OR turbine_id::tag = ‘NKWF-T42’ OR turbine_id::tag = ‘NKWF-T26’ OR turbine_id::tag = ‘NKWF-T39’ OR turbine_id::tag = ‘NKWF-T49’ OR turbine_id::tag = ‘NKWF-T38’
└── iterator_scanner
├── labels
│ └── auxiliary_fields: active_power::float, “duration”::integer, rotor_rpm::float, turbine_id::tag, wind_speed::float, yaw_direction::float
└── create_iterator
├── labels
│ ├── cond: turbine_ = ‘NKWF-T15’ OR turbine_id::tag = ‘NKWF-T41’ OR turbine_id::tag = ‘NKWF-T23’ OR turbine_id::tag = ‘NKWF-T19’ OR turbine_id::tag = ‘NKWF-T51’ OR turbine_id::tag = ‘NKWF-T14’ OR turbine_id::tag = ‘NKWF-T42’ OR turbine_id::tag = ‘NKWF-T26’ OR turbine_id::tag = ‘NKWF-T39’ OR turbine_id::tag = ‘NKWF-T49’ OR turbine_id::tag = ‘NKWF-T38’
│ ├── measurement: turbine_interval
│ └── shard_id: 1584
├── cursors_ref: 0
├── cursors_aux: 50
├── cursors_cond: 0
├── float_blocks_decoded: 2812
├── float_blocks_size_bytes: 12382380
├── integer_blocks_decoded: 703
├── integer_blocks_size_bytes: 21090
├── unsigned_blocks_decoded: 0
├── unsigned_blocks_size_bytes: 0
├── string_blocks_decoded: 0
├── string_blocks_size_bytes: 0
├── boolean_blocks_decoded: 0
├── boolean_blocks_size_bytes: 0
└── planning_time: 1.624627ms

My suspicions were confirmed that influx itself was not the culprit here, when I curled the HTTP API directly and got a ~3s response. I’m not sure why the CLI or python DataFrameClient are adding so much overhead but I got to a Pandas dataframe in 3.78s using this:

import urllib
import pandas as pd
from io import BytesIO

data = {}
data['db']='turbine_ops'
data['precision']='s'
data['q']="select * from turbine_ops.permanent.turbine_interval where ((turbine_id = 'NKWF-T15' or turbine_id = 'NKWF-T41' or turbine_id = 'NKWF-T23' or turbine_id = 'NKWF-T19' or turbine_id = 'NKWF-T51' or turbine_id = 'NKWF-T14' or turbine_id = 'NKWF-T42' or turbine_id = 'NKWF-T26' or turbine_id = 'NKWF-T39' or turbine_id = 'NKWF-T49' or turbine_id = 'NKWF-T38') and time >= '2019-05-01')"
url_values=urllib.parse.urlencode(data)
url="http://10.0.5.183:8086/query?" + url_values
request = urllib.request.Request(url, headers={'Accept':'application/csv'})
response = urllib.request.urlopen(request)
response_bytestr = response.read()
df = pd.read_csv(BytesIO(response_bytestr), sep=",")

This is a good start, I’d still like to get it lower if possible.

This was super helpful.
I just started using Influxdb-python and was seeing long response times as well. Based on your supplied code with urllib, I am able to get right to a Pandas dataframe in a reasonable amount of time. Thanks!

You should be able to squeeze a bit more out of influx by using a regex list instead of a long list of "OR"s

I believe under the hood this influx will do a single check instead of the 10+ checks it needs to do for your "OR"ed example.

Something like:
where turbine_id =~ /NKWF-T15|NKWF-T41|NKWF-T23|add|more|here/ and time >= ‘2019-05-01’

Can you summarize the data at all? Returning 750,000 rows will talk some time to return/serialize/convert(json/csv).

I you don’t need all the raw data and can summarize (sum, count, min, max…) and then group by turbine_id. Then you will get back a single row for each tag. That will be crazy fast.

Thanks,

I think we need all the rows, but maybe this is actually doable inside influx (right now we do it in Pandas)

Each turbine ID should have rows at the same set of timestamps. I need the subset of timestamps for which turbine_rpm (field) is greater than some integer (call it p) for all those turbine IDs at the same timestamp. This corresponds to the “Turbines ON” condition. I also need all the timestamps for which turbine_rpm is less than p for all turbine IDs at like timestamps (“Turbines OFF” condition). Lastly I need all the timestamps where there’s a mix of greater than / less than p condition.

That’s a bit of a simplification, the condition is actually a bit more complex (two numerical comparisons ANDed together).

Then for the groups of Turbine ON, OFF, and MIXED timestamps I still need all the fields attached to those datapoints belonging to one of the turbine_IDs.

But in this example that’s 1/11th the payload size so presumably also a fraction of the query time.