Very poor performance when executing a query with pivot()

We have 2.8G records (each record contains 7 fields). When we execute a query that uses pivot() to compose records, we see very poor performance. The query is as below:

from(bucket: "zillonare")
  |> range(start: 2022-02-15T01:31:00Z, stop: 2022-02-15T07:00:01Z)
  |> filter(fn: (r) => r["_measurement"] == "stock_min1")
  |> pivot(columnKey: ["_field"], rowKey: ["_time"], valueColumn: "_value")
  |> keep(columns: ["frame","open","high","low","close","volume","amount","factor","code","_time"])
  |> sort(columns: ["_time"], desc: false)

This takes almost 10 minutes; without pivot(), it returns in about 10 seconds.

Is this normal, or can the query be improved?

@aaron_yang I think there are a few things you can do to optimize this query.

  1. What’s the resolution of the data? How many points per minute, for example? If it’s at a 1s resolution, this could be querying a lot of data, depending on the fields in the stock_min1 measurement. If you don’t need the high-resolution data, you can downsample to per-minute or per-hour summaries and reduce the amount of data being returned (see the downsampling task sketch after the example query below).

  2. Right now, you’re pivoting EVERYTHING in the stock_min1 measurement, then using keep() to do two things: 1) keep the time column, and 2) drop all the field columns you don’t want. It would be more efficient to filter on your fields rather than querying everything, pivoting, and then dropping the fields you don’t want.

  3. If you don’t use keep(), you also shouldn’t need to re-sort the data. This will also save some time.

So something like this:

from(bucket: "zillonare")
  |> range(start: 2022-02-15T01:31:00Z, stop: 2022-02-15T07:00:01Z)
  |> filter(fn: (r) => r["_measurement"] == "stock_min1")
  |> filter(fn: (r) => r["_field"] == "frame" or r["_field"] == "open" or r["_field"] == "high" or r["_field"] == "low" or r["_field"] == "close" or r["_field"] == "volume" or r["_field"] == "amount")
  |> aggregateWindow(every: 1h, fn: last)
  |> pivot(columnKey: ["_field"], rowKey: ["_time"], valueColumn: "_value")
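
For the downsampling in point 1, a task along these lines can write per-hour summaries into a separate bucket. This is only a minimal sketch; the task name, schedule, and destination bucket "zillonare_downsampled" are placeholders you'd adapt to your setup:

option task = {name: "downsample_stock_min1", every: 1h}

from(bucket: "zillonare")
  |> range(start: -task.every)
  |> filter(fn: (r) => r["_measurement"] == "stock_min1")
  // take the last value of each field in every 1h window
  |> aggregateWindow(every: 1h, fn: last)
  // write the summarized points to a separate (hypothetical) bucket
  |> to(bucket: "zillonare_downsampled")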

The resolution is 1m for this measurement; however, we set the precision to ‘1s’ when saving data points. Is there any way to change the resolution now?

About sort(): I can’t find documentation that states how query results are ordered. Will they always be returned in ascending order by “_time”?

keep() is used because we don’t want “_start” and “_stop” to be returned. No matter which query we use, if we don’t use keep(), those two columns are always returned.

The timestamp precision doesn’t really matter; Flux treats everything as nanosecond precision. What matters here is how often you’re writing and exactly how many points you’re querying.

This could definitely be clearer in the documentation; right now it’s a little scattered. When querying data from InfluxDB, results are returned sorted by time in ascending order by default. I don’t know if this is always true, but functions that change the group key of input data no longer guarantee sort order, and output data may need to be re-sorted.
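
So if row order matters downstream, re-sorting explicitly after pivot() is the safe option. A minimal sketch, reusing your original query without keep():

from(bucket: "zillonare")
  |> range(start: 2022-02-15T01:31:00Z, stop: 2022-02-15T07:00:01Z)
  |> filter(fn: (r) => r["_measurement"] == "stock_min1")
  |> pivot(columnKey: ["_field"], rowKey: ["_time"], valueColumn: "_value")
  // pivot() changes the group key, so sort order is no longer guaranteed;
  // only re-sort when downstream code depends on ascending _time
  |> sort(columns: ["_time"], desc: false)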

I see. In this case, drop() would be less verbose than keep():

from(bucket: "zillonare")
  |> range(start: 2022-02-15T01:31:00Z, stop: 2022-02-15T07:00:01Z)
  |> filter(fn: (r) => r["_measurement"] == "stock_min1")
  |> aggregateWindow(every: 1h, fn: last)
  |> pivot(columnKey: ["_field"], rowKey: ["_time"], valueColumn: "_value")
  |> drop(columns: ["_start", "_stop"])
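
Putting the two suggestions together (filter only the fields you need, then drop the range columns, no re-sort), the query could look like this; adjust the field list to whatever you actually use, and add aggregateWindow() back in if you also want to downsample:

from(bucket: "zillonare")
  |> range(start: 2022-02-15T01:31:00Z, stop: 2022-02-15T07:00:01Z)
  |> filter(fn: (r) => r["_measurement"] == "stock_min1")
  // only pull the fields you actually need before pivoting
  |> filter(fn: (r) => r["_field"] == "frame" or r["_field"] == "open" or r["_field"] == "high" or r["_field"] == "low" or r["_field"] == "close" or r["_field"] == "volume" or r["_field"] == "amount")
  |> pivot(columnKey: ["_field"], rowKey: ["_time"], valueColumn: "_value")
  // drop the range bounds instead of listing every column to keep
  |> drop(columns: ["_start", "_stop"])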

Very helpful, many thanks!

By the way, when we write data to InfluxDB using the v2.x API, the only supported precisions are “s”, “us”, “ms”, and “ns”. So no need to bother with “h” and “m”; those only exist in the 1.x API, I guess?