Design question! Bad performance with many signals and last()

I’m reaching out because we’re experiencing serious performance issues with our current setup. Most problems occur within one tenant containing ~8,000 signals, with new values arriving every minute.

Schema:
org: environment
bucket: tenant
_measurement: tenant
_field:
_value: <float|bool|string>
signal:

InfluxDB instance:
2 CPU, 4GB Memory

We migrated from Influx 1.8, where this was not a problem at all. With InfluxDB 2.x, we hit massive performance issues.

Querying a single signal over 1h, via:
from(bucket: bucket)
|> range(start: <from>, stop: <to>)
|> filter(fn: (r) => r._measurement == <tenant> and r._field == <signal>)
takes more than 60 seconds.

We have already improved this by filtering on the signal tag instead:
from(bucket: bucket)
|> range(start: <from>, stop: <to>)
|> filter(fn: (r) => r._measurement == <tenant> and r.signal == <signal>)
takes around 1 second.

Additionally, we have special signal handling. For every given signal and timestamp, we expect the last known value, which we call the “predecessor”. This means that whenever we want to query a time range:
[from, to]
we also have to query the predecessor, i.e. the last value with time < from, which in InfluxDB 2.x results in the range:
[0, from]

Simplified, this is the resulting query, whose performance is devastating:
from(bucket: bucket)
|> range(start: 0, stop: <from>)
|> filter(fn: (r) => r._measurement == <tenant> and r.signal == <signal>)
|> last()

There is no other way to get this value: I have no idea whether the last value was written yesterday or a week ago. This is machine data; I have no control over it.

Finally, I have to query around 80 signals this way. I pack them into union batches of at most 20 signals each and send those union requests to my instance.
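For illustration, one such union batch of predecessor queries looks roughly like this (signal names are placeholders, and <tenant>/<from> stand for the actual values, as in the queries above):

```flux
sig1 = from(bucket: bucket)
    |> range(start: 0, stop: <from>)
    |> filter(fn: (r) => r._measurement == <tenant> and r.signal == "signal_1")
    |> last()

sig2 = from(bucket: bucket)
    |> range(start: 0, stop: <from>)
    |> filter(fn: (r) => r._measurement == <tenant> and r.signal == "signal_2")
    |> last()

// ...up to 20 such streams per request...

union(tables: [sig1, sig2])
```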

We have other accounts with ~1,500 signals that work fine without any issues; loading times for everything are ~1 second.

Questions:

  • How can we query these “predecessors” efficiently?
  • Is a design with 8,000 signals bad? How can we improve it?
  • None of this was a problem back in InfluxDB 1.8; the predecessor queries were pretty fast there.
  • Any other suggestions on how to handle this?
    • Reduce request size? Send more requests? But we already have downtimes on the instance; the queries seem to be far too heavy for it to handle.

Thanks for any help & for any hints.

Best regards
Philipp

Since 2.x also supports InfluxQL, you could try your 1.8 InfluxQL query and see whether the performance issue lies in the Flux query or in the underlying database. Let us know how InfluxQL performs.

Thanks for that hint. I’ll need to file a ticket with my DevOps team to enable InfluxQL queries. The old predecessor query was easy and fast:

SELECT LAST("<signal>") AS "<signal>" FROM <measurement> WHERE time < <start>ms

The query itself is simple. The question is whether the underlying database has direct access to that LAST value, or how the last function is implemented. If it again iterates over all the data, this query will also be super slow. I will test it, thanks!