Reduce large query time

davegravy · August 29, 2019, 6:43pm

Our query is actually running via the influxdb python library but I’ve reproduced the same query time via the CLI client:

influx --host 10.0.5.183 --precision RFC3339 -execute "select * from turbine_ops.permanent.turbine_interval where ((turbine_id = 'NKWF-T15' or turbine_id = 'NKWF-T41' or turbine_id = 'NKWF-T23' or turbine_id = 'NKWF-T19' or turbine_id = 'NKWF-T51' or turbine_id = 'NKWF-T14' or turbine_id = 'NKWF-T42' or turbine_id = 'NKWF-T26' or turbine_id = 'NKWF-T39' or turbine_id = 'NKWF-T49' or turbine_id = 'NKWF-T38') and time >= '2019-05-01')">/dev/null

Influx is running on a r5.large EC2 instance with EBS a general purpose SSD (gp2) volume, the CLI is on an EC2 in the same subnet, and the above query takes about 20 seconds, returning 747120 rows, each having 1 tag (turbine_id) and 5 fields (all decimal values). Does this seem normal?

We need to reduce the 20s significantly. Via htop on the influx host I see no significant change in RAM usage, a brief CPU spike that lasts ~1s at the start of query, and then no subsequent CPU activity.

Shard duration is set to 1 year. Please let me know any optimizations we may be able to make.

davegravy · August 29, 2019, 7:25pm

show series exact cardinality on turbine_ops
name: turbine_interval
count
-----
11

I tried scaling the influxdb host to r5.8xlarge and the query time did not change at all.

davegravy · August 29, 2019, 9:23pm

explain select * from turbine_ops.permanent.turbine_interval where ((turbine_ = 'NKWF-T15' or turbine_id = 'NKWF-T41' or turbine_id = 'NKWF-T23' or turbine_id = 'NKWF-T19' or turbine_id = 'NKWF-T51' or turbine_id = 'NKWF-T14' or turbine_id = 'NKWF-T42' or turbine_id = 'NKWF-T26' or turbine_id = 'NKWF-T39' or turbine_id = 'NKWF-T49' or turbine_id = 'NKWF-T38') and time >= '2019-05-01')

QUERY PLAN

EXPRESSION:
AUXILIARY FIELDS: active_power::float, “duration”::integer, rotor_rpm::float, turbine_id::tag, wind_speed::float, yaw_direction::float
NUMBER OF SHARDS: 1
NUMBER OF SERIES: 10
CACHED VALUES: 0
NUMBER OF FILES: 150
NUMBER OF BLOCKS: 3515
SIZE OF BLOCKS: 12403470

explain analyze select * from turbine_ops.permanent.turbine_interval where ((turbine_ = 'NKWF-T15' or turbine_id = 'NKWF-T41' or turbine_id = 'NKWF-T23' or turbine_id = 'NKWF-T19' or turbine_id = 'NKWF-T51' or turbine_id = 'NKWF-T14' or turbine_id = 'NKWF-T42' or turbine_id = 'NKWF-T26' or turbine_id = 'NKWF-T39' or turbine_id = 'NKWF-T49' or turbine_id = 'NKWF-T38') and time >= '2019-05-01')

EXPLAIN ANALYZE

.
└── select
├── execution_time: 1.442047426s
├── planning_time: 2.105094ms
├── total_time: 1.44415252s
└── build_cursor
├── labels
│ └── statement: SELECT active_power::float, “duration”::integer, rotor_rpm::float, turbine_id::tag, wind_speed::float, yaw_direction::float FROM turbine_ops.permanent.turbine_interval WHERE turbine_ = ‘NKWF-T15’ OR turbine_id::tag = ‘NKWF-T41’ OR turbine_id::tag = ‘NKWF-T23’ OR turbine_id::tag = ‘NKWF-T19’ OR turbine_id::tag = ‘NKWF-T51’ OR turbine_id::tag = ‘NKWF-T14’ OR turbine_id::tag = ‘NKWF-T42’ OR turbine_id::tag = ‘NKWF-T26’ OR turbine_id::tag = ‘NKWF-T39’ OR turbine_id::tag = ‘NKWF-T49’ OR turbine_id::tag = ‘NKWF-T38’
└── iterator_scanner
├── labels
│ └── auxiliary_fields: active_power::float, “duration”::integer, rotor_rpm::float, turbine_id::tag, wind_speed::float, yaw_direction::float
└── create_iterator
├── labels
│ ├── cond: turbine_ = ‘NKWF-T15’ OR turbine_id::tag = ‘NKWF-T41’ OR turbine_id::tag = ‘NKWF-T23’ OR turbine_id::tag = ‘NKWF-T19’ OR turbine_id::tag = ‘NKWF-T51’ OR turbine_id::tag = ‘NKWF-T14’ OR turbine_id::tag = ‘NKWF-T42’ OR turbine_id::tag = ‘NKWF-T26’ OR turbine_id::tag = ‘NKWF-T39’ OR turbine_id::tag = ‘NKWF-T49’ OR turbine_id::tag = ‘NKWF-T38’
│ ├── measurement: turbine_interval
│ └── shard_id: 1584
├── cursors_ref: 0
├── cursors_aux: 50
├── cursors_cond: 0
├── float_blocks_decoded: 2812
├── float_blocks_size_bytes: 12382380
├── integer_blocks_decoded: 703
├── integer_blocks_size_bytes: 21090
├── unsigned_blocks_decoded: 0
├── unsigned_blocks_size_bytes: 0
├── string_blocks_decoded: 0
├── string_blocks_size_bytes: 0
├── boolean_blocks_decoded: 0
├── boolean_blocks_size_bytes: 0
└── planning_time: 1.624627ms

davegravy · August 30, 2019, 2:57am

My suspicions were confirmed that influx itself was not the culprit here, when I curled the HTTP API directly and got a ~3s response. I’m not sure why the CLI or python DataFrameClient are adding so much overhead but I got to a Pandas dataframe in 3.78s using this:

import urllib
import pandas as pd
from io import BytesIO

data = {}
data['db']='turbine_ops'
data['precision']='s'
data['q']="select * from turbine_ops.permanent.turbine_interval where ((turbine_id = 'NKWF-T15' or turbine_id = 'NKWF-T41' or turbine_id = 'NKWF-T23' or turbine_id = 'NKWF-T19' or turbine_id = 'NKWF-T51' or turbine_id = 'NKWF-T14' or turbine_id = 'NKWF-T42' or turbine_id = 'NKWF-T26' or turbine_id = 'NKWF-T39' or turbine_id = 'NKWF-T49' or turbine_id = 'NKWF-T38') and time >= '2019-05-01')"
url_values=urllib.parse.urlencode(data)
url="http://10.0.5.183:8086/query?" + url_values
request = urllib.request.Request(url, headers={'Accept':'application/csv'})
response = urllib.request.urlopen(request)
response_bytestr = response.read()
df = pd.read_csv(BytesIO(response_bytestr), sep=",")

This is a good start, I’d still like to get it lower if possible.

Joe_Wheeldon · August 30, 2019, 8:41pm

This was super helpful.
I just started using Influxdb-python and was seeing long response times as well. Based on your supplied code with urllib, I am able to get right to a Pandas dataframe in a reasonable amount of time. Thanks!

Larry_Walker · September 4, 2019, 12:06am

You should be able to squeeze a bit more out of influx by using a regex list instead of a long list of "OR"s

I believe under the hood this influx will do a single check instead of the 10+ checks it needs to do for your "OR"ed example.

Larry_Walker · September 4, 2019, 12:23am

Can you summarize the data at all? Returning 750,000 rows will talk some time to return/serialize/convert(json/csv).

I you don’t need all the raw data and can summarize (sum, count, min, max…) and then group by turbine_id. Then you will get back a single row for each tag. That will be crazy fast.

davegravy · September 4, 2019, 12:54am

Thanks,

I think we need all the rows, but maybe this is actually doable inside influx (right now we do it in Pandas)

Each turbine ID should have rows at the same set of timestamps. I need the subset of timestamps for which turbine_rpm (field) is greater than some integer (call it p) for all those turbine IDs at the same timestamp. This corresponds to the “Turbines ON” condition. I also need all the timestamps for which turbine_rpm is less than p for all turbine IDs at like timestamps (“Turbines OFF” condition). Lastly I need all the timestamps where there’s a mix of greater than / less than p condition.

That’s a bit of a simplification, the condition is actually a bit more complex (two numerical comparisons ANDed together).

Then for the groups of Turbine ON, OFF, and MIXED timestamps I still need all the fields attached to those datapoints belonging to one of the turbine_IDs.

But in this example that’s 1/11th the payload size so presumably also a fraction of the query time.

Topic		Replies	Views
Slow Query Times influxdb	3	6885	April 4, 2017
InfluxDB: Response time changes for the same query influxdb , time , query	0	1155	July 5, 2018
Inefficient queries when moving from INFLUXQL to FLUX flux , tasks	2	33	January 29, 2025
InfluxDB Query Time Spikes - Seeking Insights InfluxDB 2	0	203	September 4, 2023
Debug extremely slow querie	0	1269	February 21, 2018

Reduce large query time

QUERY PLAN

EXPLAIN ANALYZE

Related topics