Types library isType and isNumeric Performance

While the new types library is pretty helpful, I was wondering if it could be made any more performant using this flux code. Without the isNumeric() function, this query runs in about .03s but with it it takes 1.2s. That sadly wont scale well for my application, so for the time being I don’t think I can use it. @scott @Anaisdg

Here is my flux:

import "date"
import "types"

offset = duration(v: string(v: uint(v: 2023-07-06T04:00:00.0000000Z) - uint(v: date.truncate(t: 2023-07-06T04:00:00.0000000Z, unit: 86400s))) + "ns")

MinFunction = (measurement, field) =>
from(bucket: "LPGen2")
|> range(start: 2023-07-06T04:00:00.0000000Z, stop: 2023-07-07T04:00:00.0000000Z)
|> filter(fn: (r) => r._measurement == measurement)
|> filter(fn: (r) => r._field == field)
|> filter(fn: (r) => types.isNumeric(v: r._value))
|> window(every: 86400s, offset: offset, createEmpty: true)
|> min()

DigitalSetFunction = (measurement) =>
from(bucket: "LPGen2")
|> range(start: 0, stop: 2023-07-07T04:00:00.0000000Z)
|> filter(fn: (r) => r._measurement == measurement)
|> filter(fn: (r) => r._field == "DigitalSet")
|> last()

union(tables: [MinFunction(measurement: "Test", field: "Value"), DigitalSetFunction(measurement: "Test")]) |> yield(name: "Test,,Value")
union(tables: [MinFunction(measurement: "Test", field: "Value"), DigitalSetFunction(measurement: "Test")]) |> yield(name: "Test,Plant=A,Value")
union(tables: [MinFunction(measurement: "Test1", field: "Value"), DigitalSetFunction(measurement: "Test1")]) |> yield(name: "Test1,,Value")
union(tables: [MinFunction(measurement: "Test10", field: "Value"), DigitalSetFunction(measurement: "Test10")]) |> yield(name: "Test10,,Value")
union(tables: [MinFunction(measurement: "Test2", field: "Value"), DigitalSetFunction(measurement: "Test2")]) |> yield(name: "Test2,,Value")
union(tables: [MinFunction(measurement: "Test3", field: "Value"), DigitalSetFunction(measurement: "Test3")]) |> yield(name: "Test3,,Value")
union(tables: [MinFunction(measurement: "Test4", field: "Value"), DigitalSetFunction(measurement: "Test4")]) |> yield(name: "Test4,,Value")
union(tables: [MinFunction(measurement: "Test5", field: "Value"), DigitalSetFunction(measurement: "Test5")]) |> yield(name: "Test5,,Value")
union(tables: [MinFunction(measurement: "Test6", field: "Value"), DigitalSetFunction(measurement: "Test6")]) |> yield(name: "Test6,,Value")
union(tables: [MinFunction(measurement: "Test7", field: "Value"), DigitalSetFunction(measurement: "Test7")]) |> yield(name: "Test7,,Value")
union(tables: [MinFunction(measurement: "Test8", field: "Value"), DigitalSetFunction(measurement: "Test8")]) |> yield(name: "Test8,,Value")
union(tables: [MinFunction(measurement: "Test9", field: "Value"), DigitalSetFunction(measurement: "Test9")]) |> yield(name: "Test9,,Value")
union(tables: [MinFunction(measurement: "TestAlt", field: "Value"), DigitalSetFunction(measurement: "TestAlt")]) |> yield(name: "TestAlt,,Value")
union(tables: [MinFunction(measurement: "TestDigAlt", field: "Value"), DigitalSetFunction(measurement: "TestDigAlt")]) |> yield(name: "TestDigAlt,,Value")
union(tables: [MinFunction(measurement: "Test3", field: "Description"), DigitalSetFunction(measurement: "Test3")]) |> yield(name: "Test3,,Description")

Hello @ticchioned,
Unfortunately performance enhancements to Flux is not engineerings focus right now with the roll out of InfluxDB 3.0.
I’m sorry it’s not working for you as you expect but I encourage you to possibly try out 3.0 and take advantage of DataFusion and Apache Arrow Flight SQL (under the hood) for querying and transporting large datasets (including pandas dataframes) over network interfaces.

@ticchioned fyi: the reason your performance is low is because |> filter(fn: (r) => types.isNumeric(v: r._value)) is not a push-down query, meaning you are losing the aggregation at the backing database.

This means that the entire data set is being pulled into memory and then the aggregate is being run in memory.

I am assuming you are doing that filter because your call to min() is failing on string values.

Unfortunately, the only proper way of resolving this is to add a tag to the measurement at insertion time so that you can do a filter on that tag (filtering tags is a pushdown query).

Fyi, running basicstats at the telegraf level will correctly drop non-numeric values, so one option would be to run a very short aggregate at the telegraf level (something that windows every 15s the last value) and the filter by that tag.

I should add that this is a design problem. If the underlying database implementation doesn’t support running a minimum aggregate on strings, it will not won’t fix this issue to get a system that can move things faster: what you want to do is filter out values that aren’t min aggregateable.