Querying against large dataset and large time series

manishv · April 26, 2017, 7:15pm

I’m trying to figure out whether InfluxDB is a good way to store the time series data we are generating, and whether it can run the queries we want. One common query pattern we have is to select all elements from a table where a particular field’s value is a member of a largish set of other items.

For example, one set of data that we might store is a record of all files accessed on a system. I want to be able to get all such records where the file is one of 10,000 or so files in another set. (e.g., all files on a system that have known security issues).

What is the best way to do this? There doesn’t seem to be a good query for it, since a GROUP BY on the time series data could produce an enormous list of entities (especially if the data has been logged for a long time). Running 10k+ queries with different WHERE clauses also seems less than ideal, since you’d really want to check each time series data point against the set of files (which can probably be memory resident) while that point is resident in memory.

Any good solutions to this problem?

jackzampolin · April 26, 2017, 9:08pm

@manishv This sounds like you want to join relational data and timeseries together. The best way I have seen this done in the past is to query the relational data and then construct your Influx query using one of our client libraries. Is this an option for you?
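As a rough illustration of that approach, here is a minimal sketch that turns a set of values (fetched however you like from the relational side) into one InfluxQL query string. The helper names `quote_literal` and `build_query`, the measurement `file_access`, and the key `path` are all made up for this example; the escaping assumes InfluxQL's single-quoted string literals with backslash escapes. The resulting string can then be handed to whichever client library you use.

```python
def quote_literal(value):
    """Single-quote a string for InfluxQL, escaping backslashes and quotes.

    Assumption: InfluxQL string literals are single-quoted and use
    backslash escapes.
    """
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"


def build_query(measurement, key, values):
    """Build one SELECT that matches any of the given values (hypothetical helper)."""
    # Sort so the generated query is deterministic for a given set.
    clauses = ['"{}" = {}'.format(key, quote_literal(v)) for v in sorted(values)]
    return 'SELECT * FROM "{}" WHERE {}'.format(measurement, " OR ".join(clauses))


# Example values standing in for the result of the relational query.
flagged = {"/etc/passwd", "/tmp/evil.sh"}
print(build_query("file_access", "path", flagged))
```

Only the matching points come back over the network, since the filtering happens on the server.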

manishv · April 26, 2017, 9:24pm

Yes, that is an option. Can the client libraries do things the QL cannot? In particular, I do not want to extract all the data from Influx and transmit it over the network, as over time that will get way too large. I can co-locate the client code on the Influx server, but I don’t know how that interacts with clustering, which we will ultimately be interested in. If the client library can ask the Influx server to run some code, and that scales as expected, that would be perfect.

manishv · April 26, 2017, 10:20pm

It looks like the following:

python where.py | influx --database mydb

works for about 10k where clauses but fails at 50k because the request headers are too large. Is there a way to get around that limitation?

where.py:

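# Build one equality clause per region value (9,999 clauses).
x = []
for i in range(1, 10000):
    x.append("region='" + str(i) + "'")

# Emit a single InfluxQL query that ORs every clause together.
print("select * from cpu where region='us_west' or " + ' or '.join(x))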

Manish
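One possible workaround, sketched here and not confirmed anywhere in this thread, is to split the value set into batches that each stay under the size that worked (about 10k clauses) and issue one query per batch. The names `chunks` and `batched_queries` and the default batch size are assumptions for illustration; the query text follows where.py above.

```python
def chunks(items, size):
    """Yield consecutive slices of `items`, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def batched_queries(regions, batch_size=10000):
    """Yield one query per batch of region values (hypothetical helper).

    batch_size is an assumption based on the ~10k clauses reported to work.
    """
    for batch in chunks(list(regions), batch_size):
        clauses = " or ".join("region='" + r + "'" for r in batch)
        yield "select * from cpu where " + clauses


# 50,000 values split into batches of 10,000 gives 5 separate queries.
regions = [str(i) for i in range(1, 50001)]
queries = list(batched_queries(regions))
print(len(queries))
```

Each query returns its own result set, so the caller has to merge the results. Whether 10k is the right batch size depends on how long the individual values are, so it is worth tuning against the actual limit.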
