Flux Filter Order Differences

Hey Team,

Do you know if Flux Query 1 and Flux Query 2 will have any performance differences based on the differences in filter statements shown?

Flux Query 1:

from(bucket: "rand_bucket")	
|> range(start: -7d, stop: now())	
|> filter(fn: (r) => r._measurement == "rand_measurement" and r._field == "rand_field")	
|> filter(fn: (r) => r.id == "rand_id")

Flux Query 2:

from(bucket: "rand_bucket")	
|> range(start: -7d, stop: now())	
|> filter(fn: (r) => r._measurement == "rand_measurement")	
|> filter(fn: (r) => r._field == "rand_field")	
|> filter(fn: (r) => r.id == "rand_id")

cc: @grant1, @Anaisdg , @scott, @fercasjr tagging for attention please :pray:

Hey guys, I realise that I’ve actually asked this before and @Anaisdg mentioned that she didn’t expect any differences and @fercasjr mentioned that it depends on the schema. Link here: FLUX Optimisation: Pushdowns

Is there anything technical that’s worth knowing about using a filter that narrows the most at the beginning and thoughtfully applying filters thereafter?

@ajetsharwin It shouldn’t matter. The Flux query planner merges whatever filters it can into a single logical filter statement at query time. So there shouldn’t be any difference between the two queries you have. Where there could potentially be a difference is if you use the or operator in predicate expressions. That can force the query planner into using multiple query statements, but still, there shouldn’t be a huge performance difference.

Thanks for your message @scott .

I didn’t quite understand your point here about:

“Where there could potentially be a difference is if you use the or operator in predicate expressions. That can force the query planner into using multiple query statements, but still, there shouldn’t be a huge performance difference.”

What exactly do you mean here please?

As an example, say the query is


from(bucket: "rand_bucket")	
|> range(start: -7d, stop: now())	
|> filter(fn: (r) => r._measurement == "rand_measurement")	
|> filter(fn: (r) => r._field == "rand_field")	
|> filter(fn: (r) => r.id == "rand_id_1" or r.id == "rand_id_2" or r.id == "rand_id_3" or r.id == "rand_id_4")

how does your point relate to the 3rd filter statement which has multiple ‘or operators’?

based on my experience It depends on the amount of data available to query

query 2 gets all data points in that _measurement, and then gets from those results the ones that have the specified field.

if it happens to be way more _fields you will have all those _fields queried first, then you drop all the ones that doesn’t match.

and chances are that you have more _fields than _measurements.

by the other hand if you query first the _field then the _measurement you queried less datapoints first.

I am not sure if using “AND” and the criteria on a single filter takes that into account.

however, if you have only one _measurement on that bucket it makes no difference at all, and you can remove that filter entirely.

same applies for all other tags.

I have spent many many of hours trying to optimize many of my queries, and I have a HUGE data set, that’s why I know that stuff.