Query times for "load all the data" queries

hey - using influx 1.2 on an Ubuntu 16.04 box w/ dual Xeons + 128 GB RAM.

I just ran a query (using the influxdb Python DataFrameClient) that pulled all the data from a series for the last 24 hours. It returned 1,307,746 rows x 22 columns (1 tag, 21 fields) in 1 min, 33 sec. The retention policy is infinite, so this should all come from one shard, if I understand it right. The db is under a load of 800-1000 writes/sec. There is a relatively high proportion of null values in the query result. This time is representative of other queries against the same measurement.
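For reference, the query is essentially the following (a rough sketch; the database and measurement names below are placeholders, not my real schema):

```python
from influxdb import DataFrameClient

# Placeholder names; the real database/measurement are different.
client = DataFrameClient(host='localhost', port=8086, database='ticks')

# Pull every tag and field for the last 24 hours.
result = client.query('SELECT * FROM trades WHERE time > now() - 24h')

# DataFrameClient returns a dict of DataFrames keyed by measurement name.
df = result['trades']
print(df.shape)  # roughly (1307746, 22) for the run described above
```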

My expectation was that queries with aggregation/filtering/etc. might take a while, but I was surprised that this “load all the data” query takes such a long time.

My question is: is this expected under the circumstances? Is there something I should be doing differently in terms of schema design? It seems like the write performance I’m getting is great, but read is pretty slow.

@jstrong Generally aggregates and filtering, especially WHERE filters on tags, are quite performant. A “load all the data” query isn’t. This is because all the data must be decompressed from the shards, then marshalled to JSON and sent over the network, which ends up being very expensive. Returning less data over the network is much quicker.

I would suggest adding more tags to differentiate the data further and make filtering easier.
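For example (a sketch only; the tag and measurement names here are made up), a WHERE filter on a tag keeps the result set small and lets the engine skip most of the series:

```python
from influxdb import DataFrameClient

client = DataFrameClient(host='localhost', port=8086, database='ticks')  # placeholder names

# Filtering on a tag ('symbol' here is hypothetical) returns far fewer rows
# than SELECT * over the whole measurement.
filtered = client.query(
    "SELECT * FROM trades "
    "WHERE symbol = 'btc_usd' AND time > now() - 24h"
)
```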

Unfortunately, in this case I’m not pulling all the data just to filter it downstream; I’m using all of it.

Is the bottleneck the JSON round trip, or pulling from the influx storage engine? In my case I’m running the db on my workstation, so there is no real network latency.

@jstrong The bottleneck is generally JSON marshalling. Pulling all of that data from the storage engine is also expensive. We are more optimized for returning aggregates and running filters over the data.
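For instance (a sketch with made-up field names), downsampling with an aggregate and GROUP BY time() returns a tiny fraction of the raw points:

```python
from influxdb import DataFrameClient

client = DataFrameClient(host='localhost', port=8086, database='ticks')  # placeholder names

# 1-minute means instead of raw points: roughly 1,440 rows per series per day
# rather than millions.
downsampled = client.query(
    "SELECT mean(price) AS price, sum(size) AS volume FROM trades "
    "WHERE time > now() - 24h GROUP BY time(1m)"
)
```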

What kind of resource usage are you seeing while the query is running?

hard to say on resource usage from the query, as the computer is being pushed fairly hard CPU-wise by other processes.

out of curiosity - as someone familiar with the codebase, on a scale of ludicrous to piece of cake, what would it take to get data from influx into Python (or Go) without the JSON serialization?

@jstrong We recommend that you run the process in an isolated environment. InfluxDB requires quite a bit of CPU, and it is not recommended to run it on a multitenant host. I think fixing your resource contention issues will improve query speed much more than JSON serialization changes would.

I hear what you’re saying on CPU - however, the machine is running other things very quickly; it’s not like the CPUs are maxed out all the time or something.

Do you know how Kapacitor compares for processing large swaths of a measurement? Like, if my TICKscript were basically to just read through 1-2m rows, would that take 1-2 minutes? Asking as I haven’t been able to play around w/ Kapacitor yet.

@jstrong influxd is a multithreaded process and will max out your CPU cores while that query is running. The contention introduced by the other processes could significantly reduce query performance. How many cores do your two Xeons have?

Kapacitor pulls data out via the API or through streaming subscriptions, so it would not be any faster.

two eight-core Xeons w/ hyperthreading, so 16/32 depending on how you’re counting. I have a chart of CPU usage by core up while I’m working and have not seen it max out while running any of these queries, fwiw. I’m working on moving the db to another server but ran into a different problem, which I’m about to post about in a separate thread.


Bumping this issue one more time as the problem has grown more acute since I originally posted. At this point, retrieving data from influx is the single biggest bottleneck in the “backtesting” system I’m building that replays the program’s logic from historical data to simulate real-time processing.

To give you a sense of the scale: a query that retrieves 20m rows from the last 28 days takes about 9 min, 52 sec to return (using the influx Python client). I have a table in postgres with 23m rows, and “select * from [table_name]” took 61 seconds.
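Rough sketch of how I timed the two, in case it matters (measurement, table, and connection details are placeholders):

```python
import time

from influxdb import DataFrameClient
import psycopg2

# InfluxDB side: ~20m rows over 28 days took ~9 min 52 sec.
influx = DataFrameClient(host='localhost', port=8086, database='ticks')
start = time.time()
result = influx.query('SELECT * FROM trades WHERE time > now() - 28d')
print('influx: %.1f sec' % (time.time() - start))

# Postgres side: select * over a ~23m row table took ~61 sec.
conn = psycopg2.connect('dbname=ticks')
cur = conn.cursor()
start = time.time()
cur.execute('SELECT * FROM trades_archive')
rows = cur.fetchall()
print('postgres: %.1f sec' % (time.time() - start))
```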

I have to say, it’s disappointing to find there’s a gotcha like this waiting when you get to a larger scale. In quite a bit of research I never came across anything saying that pulling out several weeks of your data is going to be extremely costly. I understand that influx is fast for aggregation, but there are many things that are too complicated to represent in a query.

To get around this, I’m looking at needing to save a second copy of all the data in daily JSON or CSV files to allow quick read access when needed. Obviously it would be preferable if I could just use the database to store the data.
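Something along these lines is what I have in mind (all names here are placeholders):

```python
import pandas as pd
from influxdb import DataFrameClient

client = DataFrameClient(host='localhost', port=8086, database='ticks')

# Query one day out of influx once, and dump it to a flat file...
q = ("SELECT * FROM trades "
     "WHERE time >= '2017-03-01T00:00:00Z' AND time < '2017-03-02T00:00:00Z'")
client.query(q)['trades'].to_csv('trades_2017-03-01.csv')

# ...then any later pass over the same day reads the file instead of influx.
df = pd.read_csv('trades_2017-03-01.csv', index_col=0, parse_dates=True)
```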

@jackzampolin, earlier in this thread you said JSON serialization was a bottleneck on the influx side. I find that hard to believe, as I can get in and out of JSON on 20m objects in far less time than it takes me to pull from influx. I just benchmarked a round trip from 20m Python objects to JSON and back (one at a time, no less) and it took 2 min, 13 sec. If JSON is the problem, it should be an easy fix.
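The benchmark was roughly this (the row contents below are simplified stand-ins for my actual objects):

```python
import json
import time

# Simplified stand-in rows; the real objects have 22 keys.
rows = [{'time': i, 'price': 1.0 + i * 1e-6, 'size': 1} for i in range(20 * 10**6)]

start = time.time()
decoded = [json.loads(json.dumps(r)) for r in rows]  # one object at a time
print('json round trip: %.1f sec' % (time.time() - start))
```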

also, fwiw, I now have influx running on its own server as you recommended (SSD, 64 GB RAM, 2x six-core Xeons), but this has done nothing to improve things.

Influx stores its data in a columnar format, whereas postgres stores things in a row-oriented format. A select * is going to be more efficient in row-oriented DBs, as all the columns of a row are contiguous and stored together. In a columnar format, all the data for a single column is stored together, so a select * needs to read from the locations of all the columns, which can be more expensive.
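One practical consequence, if it applies to your workload: selecting only the fields you actually need, rather than *, means only those columns’ blocks have to be read and decompressed. A sketch with made-up field names:

```python
from influxdb import DataFrameClient

client = DataFrameClient(host='localhost', port=8086, database='ticks')  # placeholder names

# Reads only the 'price' and 'size' column blocks instead of all 21 fields.
subset = client.query(
    "SELECT price, size FROM trades WHERE time > now() - 28d"
)
```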

It’s very possible there is some performance bug you are running into with your dataset. JSON marshaling of large datasets is not very efficient currently, but it’s hard to say if that is the issue in your case. Would you mind opening an issue and including profiles (instructions are in the template) captured while the query is running?
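If it helps while gathering those, something like the sketch below can grab the standard Go pprof profiles over HTTP while the query runs. This assumes pprof is enabled on the HTTP endpoint (it is by default); treat the exact URLs as assumptions and defer to the issue template instructions for your version.

```python
# Assumption: the standard Go pprof endpoints are reachable on the HTTP port.
# Defer to the issue template if these paths differ for your version.
import urllib.request

base = 'http://localhost:8086/debug/pprof'
profiles = [
    ('cpu.pb.gz', '/profile?seconds=30'),    # CPU profile while the query runs
    ('heap.pb.gz', '/heap'),                 # heap profile
    ('goroutine.txt', '/goroutine?debug=1'), # goroutine dump
]
for filename, path in profiles:
    with urllib.request.urlopen(base + path) as resp, open(filename, 'wb') as out:
        out.write(resp.read())
```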
