Performance of contains() is very bad compared to equivalent alternatives. Same thing for regexp.compile()

alexitheodore · June 21, 2024, 6:57am

I have many queries that I use in Grafana where I need to match based on a list of measurement names. In order to do this, I use a query like this:

measurement_names = [
  "some_measurement_name"
, ...
]

from(bucket: "flywheel")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => contains(value: r._measurement, set: measurement_names))
  |> group(columns: ["_measurement"])
  |> aggregateWindow(every: 1h, fn: sum)

However, the performance is terrible. To demonstrate:

This takes many seconds to run and, if run over more than a couple of days of data, times out:

measurement_names = [
  "some_measurement_name" 
]

from(bucket: "flywheel")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => contains(value: r._measurement, set: measurement_names))
  |> group(columns: ["_measurement"])
  |> aggregateWindow(every: 1h, fn: sum)

Whereas this runs in milliseconds. The only difference is that it does not use contains():

from(bucket: "flywheel")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "some_measurement_name")
  |> group(columns: ["_measurement"])
  |> aggregateWindow(every: 1h, fn: sum)

Likewise, this query takes a long time:

import "regexp"

// regexp.compile() takes the bare pattern string, without the /.../ delimiters
measurement_names = "some_measurement_name"

from(bucket: "flywheel")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] =~ regexp.compile(v: measurement_names))
  |> group(columns: ["_measurement"])
  |> aggregateWindow(every: 1h, fn: sum)

And this (seemingly identical) query takes a short time. The only difference is that it does not use regexp.compile():

from(bucket: "flywheel")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] =~ /some_measurement_name/)
  |> group(columns: ["_measurement"])
  |> aggregateWindow(every: 1h, fn: sum)

I include the regexp.compile() approach because it could technically also be used to provide a list, or to match on multiple options. I am trying to use variables for the sake of code clarity and portability. It’s a much hackier solution, though, and not one I prefer, especially if it isn’t performant.
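For illustration, the multi-option regexp.compile() idea described above might look like the following. This is only a sketch: the measurement names are placeholders, and nothing in this thread establishes that compiling the pattern once outside the predicate is any faster than compiling it inside filter().

```flux
import "regexp"
import "strings"

// Placeholder names, not taken from a real bucket.
measurement_names = [
  "my_measurement_1"
, "my_measurement_2"
]

// Join the names into one anchored alternation, ^(a|b)$, and compile it once.
// Names containing regex metacharacters (such as ".") would need escaping
// first, for example with regexp.quoteMeta().
pattern = regexp.compile(v: "^(" + strings.joinStr(arr: measurement_names, v: "|") + ")$")

from(bucket: "flywheel")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] =~ pattern)
  |> group(columns: ["_measurement"])
  |> aggregateWindow(every: 1h, fn: sum)
```
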

Is there a performant way to do this that isn’t hacky? The first method, with an array of values, would be ideal if not for the terrible performance.

alexitheodore · July 3, 2024, 10:55pm

@Anaisdg would you happen to have any insight into this?

Ultimately, what I’d like is to be able to specify a list of measurements and for it to work at the same speed (in the same way?) as if I had named them “directly”.

I.e.

measurement_names = [
   "my_measurement.1"
,  "my_measurement.2"
,  "my_measurement.3"
]

from(bucket: "some_bucket")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => contains(value: r._measurement, set: measurement_names))
  |> aggregateWindow(every: v.windowPeriod, fn: mean, createEmpty: false)
  |> yield(name: "mean")

to execute the same as

from(bucket: "some_bucket")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "my_measurement.1"
    or r["_measurement"] == "my_measurement.2"
    or r["_measurement"] == "my_measurement.3"
    )
  |> aggregateWindow(every: v.windowPeriod, fn: mean, createEmpty: false)
  |> yield(name: "mean")

Is there any way?

Anaisdg · July 8, 2024, 8:13pm

@alexitheodore,
The first thing I notice is that the group() call may be redundant: by default, the data is already grouped by series, and the measurement name is part of each series’ group key.
Does the number of items in your list vary? You could reference the items directly.

  |> filter(fn: (r) => r["_measurement"] == measurement_names[0]
    or r["_measurement"] == measurement_names[1]
    or r["_measurement"] == measurement_names[2]
    )

Other than that, I’m not aware of an alternative. @scott, am I missing something?

scott · July 8, 2024, 8:43pm

No, this has been a long-standing problem in Flux. The poor performance of contains() is a known issue. There is a link in that issue to a Grafana thread that may help solve your issue, @alexitheodore.
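One option that stays within what this thread reports as fast is a regex literal with an anchored alternation. The sketch below reuses the measurement names from the earlier example. The thread only measured a single-name regex literal, so treating a multi-name alternation as equally fast is an assumption to verify against your own data.

```flux
// Anchored alternation over the three example measurements.
// Each "." is escaped so that it matches a literal dot.
from(bucket: "some_bucket")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] =~ /^(my_measurement\.1|my_measurement\.2|my_measurement\.3)$/)
  |> aggregateWindow(every: v.windowPeriod, fn: mean, createEmpty: false)
  |> yield(name: "mean")
```

The trade-off is that the list lives inside the literal rather than in a Flux array. In Grafana, a multi-value dashboard variable interpolated into the literal (a hypothetical `${measurements:regex}`, for instance) may recover some of the portability of a list, since Grafana substitutes the text before the query reaches Flux.
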

