Large Filter Clause Performance

I have a list of 10,000+ series that I need to be able to get the last value for. Currently I am either querying them one at a time (not very performant) or batching them into groups of 10-100 and querying that way (performance is a little better). The communication between the API and Influx is the main thing slowing things down, so I'd like to minimize the number of calls as much as possible.

I've read that there are performance issues with the contains() function, so I don't think that is a great option. I've also read that I can create multiple streams and union() them together; however, that comes with the overhead of programmatically generating the Flux script for very dynamic series/measurements/tag keys/fields. It will also most likely still require batching because of the number of filters I'll need to pull this off in my specific case.
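
For context, here is roughly what those two approaches look like (just a sketch; the measurement names are placeholders from my setup):

// Option 1: contains(), which is reportedly slow for large sets
wanted = ["AirTemp", "PoolTemp", "Salt"]

from(bucket: "Database")
    |> range(start: 0)
    |> filter(fn: (r) => contains(value: r._measurement, set: wanted))
    |> last()

// Option 2: one stream per series, then union().
// The script has to be generated programmatically for each request.
a = from(bucket: "Database")
    |> range(start: 0)
    |> filter(fn: (r) => r._measurement == "AirTemp")
    |> last()

b = from(bucket: "Database")
    |> range(start: 0)
    |> filter(fn: (r) => r._measurement == "PoolTemp")
    |> last()

union(tables: [a, b])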

I'm wondering if there have been any improvements in the past month that I might be able to take advantage of here. I'm using the .NET InfluxDB client to interface with Influx. Is anyone else doing this, or does anyone know of a good way to do it? Is there a way to pass multiple Flux scripts at one time instead of only passing one at a time? That would at least reduce the number of calls needed from client to server, which would improve performance quite a bit.

@scott I've found this suggestion you made and am in the process of attempting it. However, as I've been working on it, I've realized just how much filtering I'll still need, and I'm starting to think this may not work for my specific case. I'll need at least one filter per series (if not more), and the last time I tried putting that many filters in a Flux script it did not work. Are you aware of any improvements here, or do you have any other suggestions for me to try? I'd love to test some new ideas and see if I can help simplify this area of filtering.

@ticchioned Do you have an example query of how you're currently querying the last values from each series? Do you need the last values from all series, or only from specific series?

@scott There are times when I get all of the series; however, that is rather simple and not really what I'm referring to here. What I'm struggling with is finding a good way to get specific series. Here is a rather small and simple example. Currently I return all of the fields, though I really only need two per series. And again, I'm batching these calls depending on how much data is being requested. The main thing bogging the calls down is the time it takes for the client and server to communicate, not the query itself. Each query is rather quick; I just need a lot of them, which results in a lot of round trips between the server and client.

from(bucket: "Database")
    |> range(start: 0)
    |> filter(fn: (r) =>
        (r._measurement == "LightIsOn") or
        (r._measurement == "AirTemp" and r.EngUnit == "Deg F") or
        (r._measurement == "PumpRPM") or
        (r._measurement == "AquaPurePct" and r.EngUnit == "%") or
        (r._measurement == "FilterPumpIsOn") or
        (r._measurement == "PoolTemp" and r.EngUnit == "Deg F") or
        (r._measurement == "IsChlorineBoosting") or
        (r._measurement == "Salt" and r.EngUnit == "PPM") or
        (r._measurement == "PhaseA.Energy") or
        (r._measurement == "PhaseA.Power")
    )
    |> last()

Ok, I think I understand. So you're trying to find a way around batching to minimize client-server interaction? What you could do here is use named yields (named results included in a single result set). You could include all "batches" in a single query, where each one uses yield() to name the tables specific to that "batch". For example:

// Batch 1
from(...) |> range(...) |> filter(...) |> yield(name: "batch1")

// Batch 2
from(...) |> range(...) |> filter(...) |> yield(name: "batch2")

// Batch 3
from(...) |> range(...) |> filter(...) |> yield(name: "batch3")

In the raw annotated CSV that gets returned from the query, there is a result column that identifies the name of the yield each row comes from. For example, the annotated CSV returned from the example above would look something like this:

#group,false,false,true,true,false,false,true,true
#datatype,string,long,dateTime:RFC3339,dateTime:RFC3339,dateTime:RFC3339,double,string,string
#default,mean,,,,,,,
,result,table,_start,_stop,_time,_value,_field,_measurement
,batch1,0,1970-01-01T00:00:00Z,2023-06-20T15:30:00Z,2023-01-01T00:52:00Z,15.43,mem,m
,batch2,1,1970-01-01T00:00:00Z,2023-06-20T15:30:00Z,2023-01-01T00:52:00Z,59.25,mem,m
,batch3,2,1970-01-01T00:00:00Z,2023-06-20T15:30:00Z,2023-01-01T00:52:00Z,52.62,mem,m

Once you have the result back client-side, you could use the yield/result name to further process the data.
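
Applied to your example query above, it could look something like this (just a sketch; how you split the measurements across yields is up to you):

// Each yield keeps its own filter, but everything comes back
// in a single round trip.
from(bucket: "Database")
    |> range(start: 0)
    |> filter(fn: (r) => r._measurement == "PoolTemp" and r.EngUnit == "Deg F")
    |> last()
    |> yield(name: "PoolTemp")

from(bucket: "Database")
    |> range(start: 0)
    |> filter(fn: (r) => r._measurement == "Salt" and r.EngUnit == "PPM")
    |> last()
    |> yield(name: "Salt")

The result column in the returned CSV then tells you which series group each row belongs to.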

This looks like it could work. I will give it a go and let you know the results. Out of curiosity, do you know how many batches Flux would be able to handle in one script?

As far as I know, there is no upper limit on the number of yields you can have in a query.

It looks like this has sped things up by about 30 seconds, which is a wonderful improvement for the time being. Thank you!