How to efficiently query the latest data timestamp for pagination without knowing the data time range?

I’m implementing a web frontend with pagination for InfluxDB data, but I’m facing a performance challenge when dealing with buckets that have unknown data time ranges.

I’m using anchor-based pagination with the following approach:

string flux = $"from(bucket: \"{bucket}\") " +
    $"|> range(start: 2020-01-01T00:00:00Z, stop: {anchorTime:yyyy-MM-ddTHH:mm:ssZ}) " +
    $"{filterExpression} " +
    "|> pivot(rowKey: [\"_time\"], columnKey: [\"_field\"], valueColumn: \"_value\") " +
    "|> group() " +
    "|> sort(columns: [\"_time\"], desc: true) " +
    $"|> limit(n: {limit})";

For the first page, I use now() as the anchor time, and for subsequent pages, I use the timestamp of the last record from the previous page as the new anchor time.
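To make the anchor mechanics concrete, here is a rough Flux sketch (bucket name, filter, and the page-2 timestamp are made-up placeholders, not values from my actual system):

```flux
// Page 1 uses now() as the anchor:
from(bucket: "example-bucket")
  |> range(start: 2020-01-01T00:00:00Z, stop: now())
  |> filter(fn: (r) => r._measurement == "sensor")
  |> pivot(rowKey: ["_time"], columnKey: ["_field"], valueColumn: "_value")
  |> group()
  |> sort(columns: ["_time"], desc: true)
  |> limit(n: 100)

// Page 2 re-runs the same query, but with stop set to the _time of the
// last row returned by page 1, e.g.:
//   |> range(start: 2020-01-01T00:00:00Z, stop: 2024-05-01T12:00:00Z)
```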

The Problem

1. Performance Issue: Using a fixed start time like 2020-01-01T00:00:00Z causes the query to scan potentially years of data, making it very slow for large datasets.

2. Empty Results Problem: If I try to optimize by setting the start time closer to the anchor time (e.g., start: -30d), I risk getting empty results when the latest data in the bucket is older than 30 days.

3. Unknown Data Range: I’m dealing with various buckets where I don’t know when the latest data was written. Some buckets might have data from yesterday, others from months ago.

What I’ve Tried

1. Using relative time ranges like -30d, -365d, etc., but this fails when data is older than the range.

2. Progressive range expansion (try -1d, then -7d, then -30d, etc.), but this requires multiple queries.

3. Using the last() function, but it still requires a reasonable time range to be effective.
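For reference, the last()-based attempt in item 3 looked roughly like this (bucket name and filter are placeholders); note that it still depends on a start value wide enough to cover the data:

```flux
from(bucket: "example-bucket")
  |> range(start: -30d)   // returns nothing if the newest point is older than 30 days
  |> filter(fn: (r) => r._measurement == "sensor")
  |> last()               // newest point per series
  |> keep(columns: ["_time"])
```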

Questions

1. Is there a fast way to get the timestamp of the latest record in a bucket without specifying a time range? Something like a metadata query that doesn’t scan the actual data?

2. Am I approaching pagination wrong for InfluxDB? Should I be using a different strategy altogether?

3. Are there any InfluxDB-specific best practices for implementing pagination when the data time range is unknown?

Use Case

- Frontend web application displaying time-series data
- Multiple buckets with varying data freshness (some recent, some very old)
- Need to show the latest data first, then allow users to paginate backwards in time
- Performance is critical: the first page should load in under 500 ms

Environment

InfluxDB version: 2.x

Using Flux queries via the InfluxDB Client API

C# backend, JavaScript frontend

@shijianoo Looking at your query, one of the reasons its performance suffers is that you’re pivoting the data immediately after filtering it. pivot() breaks what we call the “pushdown chain.” Pushdowns are functions, or chains of functions, that can operate at the storage tier (much faster) rather than having to load the data into memory and operate on it there. pivot() can’t be pushed down, so everything returned by your filter expression has to be loaded into memory, and the rest of the query operates on it there (slowly). You can learn more about pushdowns in Optimize queries - Start queries with pushdowns.
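To illustrate where the chain breaks (bucket name and filter below are placeholder values):

```flux
from(bucket: "example-bucket")
  |> range(start: -7d)                              // pushed down to storage
  |> filter(fn: (r) => r._measurement == "sensor")  // pushed down to storage
  |> pivot(rowKey: ["_time"], columnKey: ["_field"], valueColumn: "_value")
  // pivot() cannot be pushed down, so everything from here on
  // runs in memory on the full filtered result set
  |> group()
  |> sort(columns: ["_time"], desc: true)
```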

In the case of your query, I’d actually limit the records immediately after filtering them. limit() applies to each table in a stream of tables, so the query should still behave much as before. So essentially you limit the number of rows per series (by default, data is grouped by measurement, tag set, and field key), pivot fields into columns, ungroup everything into a single table, and then ensure the data is sorted by time descending:

string flux = $"from(bucket: \"{bucket}\") " +
    $"|> range(start: 2020-01-01T00:00:00Z, stop: {anchorTime:yyyy-MM-ddTHH:mm:ssZ}) " +
    $"{filterExpression} " +
    $"|> limit(n: {limit}) " +
    "|> pivot(rowKey: [\"_time\"], columnKey: [\"_field\"], valueColumn: \"_value\") " +
    "|> group() " +
    "|> sort(columns: [\"_time\"], desc: true)";

However, applying limit() immediately after the filter expression doesn’t give me the correct results in my case. The reason is that limit() operates on rows in the order they come out of the storage engine, which is time-ascending by default. So after filtering and before sorting, limit(n: …) just takes the earliest rows, not the latest.

What I actually need is:

- starting from a given anchorTime,
- return the most recent N records in descending order,
- then pivot them into a wide table.

If I use your suggested query, I only get the oldest N records.

Right now, the only way I’ve found to get the correct data is something like this:

from(bucket: "{bucket}")
  |> range(start: 2000-01-01T00:00:00Z, stop: {anchorTime:yyyy-MM-ddTHH:mm:ssZ})
  {filterExpression}
  |> keep(columns: ["_time", "_field", "_value"])
  |> sort(columns: ["_time"], desc: true)
  |> limit(n: {limit})

This produces the correct latest rows, but the performance is still poor (my earlier version of this query would simply time out), because the sort has to load a very large dataset (hundreds of thousands of rows) before the limit is applied.

So my question is:

- Is there a pushdown-friendly way to apply limit() from the newest records instead of the oldest?
- Or is there another pattern to efficiently retrieve only the last N rows (relative to anchorTime) that still works when followed by a pivot()?

It doesn’t matter whether pivot() is executed or not. I just want to quickly get the latest 100 records between the minimum time and now (because I don’t know when the latest data in the bucket was written).

Thank you very much for your answer :grinning_face: