Best way to compose a where query that matches multiple tag values?

Hi there, i was wondering what the best way is to compose a WHERE clause that matches multiple conditions. I was in the impression that i could solve this using regex pattern but i seem to hit a wall. There is to many data returned in my query…

In my case i have several measurements that have an ‘location_id’ as a value.
When is create a query using a where clause like below i get data back that is not correct. Probably to my misunderstanding on how to use the regex pattern or even it is is possible…

My data is as follows

time                 cpu  location_id
----                 ---- -----------
2017-11-27T07:00:00Z 159  2
2017-11-27T15:00:00Z 154  27
2017-11-27T23:00:00Z 117  7
2017-11-28T07:00:00Z 160  7
2017-11-28T15:00:00Z 167  27
2017-11-28T23:00:00Z 170  27

When i execute a query i only want the locations back with the value of ‘7’.
But when i use a query like below the data from location_id 27 is also returned…

SELECT * FROM “measurement” WHERE location_id =~ /7|2|104|45/;

My goal is that i would like to indicate that the location_id should be in a list of values. Is this even possible with regex? Or should i use AND clauses?

(graaagh, says the zombie…)

The regex there needs to anchor to start and end of the value: location_id =~ /^(7|2|104|45)$/

Note that in the case of tags, you’re actually better off using OR clauses:
SELECT ... WHERE (location_id = 7 OR location_id = 2 OR ...) AND anything_else
As I’ve learned only recently, regex matching in the tag index is inefficient.

@hackery I’m curious about this; do you have any rough benchmarks (or anecdotal results) you can share?

We have quite a few cases where we want to match on multiple tags values, and sometimes use regex (tag_key =~ /^(A|B)$/) while in other cases have explicit (tag_key = 'A' OR tag_key = 'B') clauses. The former is obviously less verbose and arguably more readable, so would prefer to just standardize on that across the board - but if there’s a performance hit due to the regex then it’s probably not the right approach.

I haven’t done any benchmarking yet, but input from the community would be welcome!

(and yes, it would be great to just have an IN (....) operators, but that’s clearly not on the cards: Can we have IN filter? · Issue #2157 · influxdata/influxdb · GitHub)

I don’t have any benchmarks that I can share at the moment. It’s just something that came up in conversation with InfluxData support staff; it makes perfects sense in hindsight, but I can’t find any documentation that explicitly describes it.

The tag index is specifically there for performance purposes - every query that filters by tags does a lookup in the tags index to narrow down the series data; if you use regex terms in the WHERE clause, it has to scan the data rather than using the fast lookup in the tags index.

I’ve since found this change from 2016 though, where they explicitly made an optimization for just the case of some regex forms which resolve to exact matches. I’ve not looked at the code in detail, but if the regex is fully anchored and either a single string or alternates, I think it rewrites to a sequence of lookups, in which case it’s a win.

1 Like