Querying against large dataset and large time series

influxql
#1

I’m trying to figure out whether InfluxDB is a good way to store the time series data we are generating and whether it can run the queries we want. One common query pattern we have is to select all elements from a table where a particular field’s value is a member of a largish set of other items.

For example, one set of data that we might store is a record of all files accessed on a system. I want to be able to get all such records where the file is one of 10,000 or so files in another set (e.g., all files on a system that have known security issues).

What is the best way to do this? There doesn’t seem to be a good query for it, since a GROUP BY on the time series data would produce a possibly enormous list of entities (especially if the data has been logged for a long time). Running 10k+ queries with different WHERE clauses also seems less than ideal, since you’d really want to check each time series data point against the set of files (which can probably be memory resident) while that data point is resident in memory.

Any good solutions to this problem?

#2

@manishv This sounds like you want to join relational data and timeseries together. The best way I have seen this done in the past is to query the relational data and then construct your Influx query using one of our client libraries. Is this an option for you?
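A minimal sketch of that approach, under several assumptions: the relational side is stood in for by an in-memory SQLite table, and the names `vulnerable_files`, `file_access`, and `path` are hypothetical and do not come from this thread. It only builds the InfluxQL string; sending it is left to whichever client library is in use. The quoting helper assumes InfluxQL string literals are single-quoted with backslash escapes.

```python
import sqlite3


def quote_influx_string(value):
    """Single-quote a value for InfluxQL, escaping backslashes and quotes."""
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"


def build_query(paths, measurement="file_access", tag="path"):
    """Return one InfluxQL SELECT that matches any of the given tag values."""
    predicates = ['"{}" = {}'.format(tag, quote_influx_string(p)) for p in paths]
    return 'SELECT * FROM "{}" WHERE {}'.format(measurement, " OR ".join(predicates))


# Hypothetical relational store holding the set of interesting files.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vulnerable_files (path TEXT)")
conn.executemany(
    "INSERT INTO vulnerable_files VALUES (?)",
    [("/usr/bin/a",), ("/opt/b's",)],
)

# Step 1: read the set from the relational side.
paths = [row[0] for row in conn.execute("SELECT path FROM vulnerable_files ORDER BY path")]

# Step 2: turn it into a single time series query.
query = build_query(paths)
print(query)
```

Only rows that match the set come back over the network, so the full series is never transferred to the client.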

#3

Yes, that is an option. Can the client libraries do things the query language cannot? In particular, I do not want to extract all the data from Influx and transmit it over the network, as over time that will get way too large. I can co-locate the client code on the Influx server, but I don’t know how that interacts with clustering, which we will ultimately be interested in. If the client library can ask the Influx server to run some code, and that scales as expected, that would be perfect.

#4

It looks like the following command:

python where.py | influx --database mydb

works for about 10k WHERE clauses but fails at 50k because the request headers are too large. Is there a way to get around that limitation?

where.py:

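# Build one OR'd predicate per region value (range(1, 10000) yields 9,999 of them).
clauses = []
for i in range(1, 10000):
    clauses.append("region='" + str(i) + "'")

# Emit a single query that the influx CLI reads from stdin.
print("select * from cpu where region='us_west' or " + " or ".join(clauses))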

Manish
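
One possible workaround for the size limit, sketched here rather than confirmed by anyone in this thread, is to split the predicates into several smaller queries and merge the results on the client. The `max_chars` value of 8000 is an assumed budget, not a documented InfluxDB limit; the real ceiling depends on the server and anything sitting in front of it. `batched_queries` is a hypothetical helper name.

```python
def batched_queries(values, max_chars=8000, prefix="select * from cpu where "):
    """Yield queries whose total length stays at or under max_chars.

    A single clause longer than the budget is still emitted on its own.
    """
    batch, length = [], len(prefix)
    for value in values:
        clause = "region='" + str(value) + "'"
        # Account for the " or " separator when the batch is non-empty.
        extra = len(clause) + (len(" or ") if batch else 0)
        if batch and length + extra > max_chars:
            # Budget exceeded: flush the current batch and start a new one.
            yield prefix + " or ".join(batch)
            batch, length = [], len(prefix)
            extra = len(clause)
        batch.append(clause)
        length += extra
    if batch:
        yield prefix + " or ".join(batch)


queries = list(batched_queries(range(1, 50001)))
print(len(queries), "queries, longest is", max(len(q) for q in queries), "characters")
```

This trades one oversized request for a handful of moderate ones. It does not reduce the work the server does per predicate, so it addresses the request-size failure only.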