Aggregate by count?

Hi,

I can't find a way to aggregate data in chunks of equal size; you can only do it by time. Is there an easy way to do it, other than reading the whole time series in Python and splitting it there (which takes forever!)?

For example, given the series [x0, x1, …, xn-1] and chunks of size 10, I would like to get [x0 … x9], [x10 … x19], …, [xn-10 … xn-1].

Maybe it does not make sense in InfluxDB, because that would mean indexing by count and not by time.

Hello @Nicolas_Carrara,
Welcome!!
You could create a dummy table with a time index and then group based off of that time index.
I would expect this to work, but it doesn't. I'm looking into it, but I'm placing it here in the meantime:

data
  |> map(fn: (r) => ({ r with _time_index: 10000000000 }))               // 10 s (in nanoseconds) per point
  |> cumulativeSum(columns: ["_time_index"])                             // running sum -> artificial timestamps 10 s apart
  |> map(fn: (r) => ({ r with _time_index: time(v: r._time_index) }))    // convert the integer to a time value
  |> window(every: 100s, timeColumn: "_time_index")                      // 100 s windows = 10 points per window
  |> yield(name: "grouped by count")

@Nicolas_Carrara
This will do it:

import "experimental"
pointsPerGroup = 10

data = from(bucket: "noaa")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "average_temperature")
  |> filter(fn: (r) => r["_field"] == "degrees")
  |> filter(fn: (r) => r["location"] == "coyote_creek")
  |> aggregateWindow(every: v.windowPeriod, fn: mean, createEmpty: false)


data
  |> map(fn: (r) => ({ r with i: 1 }))
  |> cumulativeSum(columns: ["i"]) 
  |> map(fn: (r) => ({ r with i: r.i / pointsPerGroup }))
  |> experimental.group(mode: "extend", columns: ["i"])

Thanks for the help!

The solution works, but the problem is that my browser crashes when I ask it to process too large a time interval. When querying from Python, the socket ends up timing out.

When doing the same with aggregation by time, it works like a charm.

Maybe the problem is that “cumulativeSum” is a sequential operation and cannot be processed in parallel?

I don't see why aggregating by count would be a big deal if we assume the series is stored as an array internally, but maybe it is not.

Note that I need to aggregate 100 million data points at once.

@Nicolas_Carrara,
Yes, most likely, and also because from |> range |> filter |> aggregateWindow is a pushdown pattern.
Flux is able to query data efficiently because some functions push down the data transformation workload to storage rather than performing the transformations in memory. Combinations of functions that do this work are called pushdown patterns.
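
As an illustration (the bucket, measurement, and field names below are placeholders, not from this thread), the first query stays entirely inside that pushdown pattern, while the second breaks it as soon as map() and cumulativeSum() appear, so every raw point has to be pulled into memory before the count-based grouping can happen:

// Pushdown pattern: filtering and aggregation happen in the storage engine
from(bucket: "example-bucket")
  |> range(start: -1h)
  |> filter(fn: (r) => r._measurement == "ticks" and r._field == "price")
  |> aggregateWindow(every: 1m, fn: mean, createEmpty: false)
  |> yield(name: "pushed down")

// Not a pushdown pattern: map() and cumulativeSum() run in the Flux engine in memory
from(bucket: "example-bucket")
  |> range(start: -1h)
  |> filter(fn: (r) => r._measurement == "ticks" and r._field == "price")
  |> map(fn: (r) => ({ r with i: 1 }))
  |> cumulativeSum(columns: ["i"])
  |> yield(name: "in memory")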

Do you have to perform an aggregation at that scale continuously? Or just once, after which you can do it incrementally/on fewer points?

Thanks

@Anaisdg Thanks for your reply.

Ideally, I would like to call it at will, with the calculations done on the spot, because the parameters can change each time. In less than 1 s for 100 million data points.

As a workaround, I do the split with pre-defined parameters (like the size of the interval), save it to InfluxDB in a new measurement (where the tags are the parameters), and then I can query the intervals instantly. It is not perfect, but it does the job for now. I know it won't scale, though.
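
A minimal sketch of that kind of workaround, reusing the count-grouping pattern from above and writing the labelled points to a new measurement with to(). The bucket names, the "ticks"/"price" series, and the "ticks_chunked" measurement are hypothetical:

import "experimental"

pointsPerGroup = 10

from(bucket: "source-bucket")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r._measurement == "ticks" and r._field == "price")
  |> map(fn: (r) => ({ r with i: 1 }))                      // number the points
  |> cumulativeSum(columns: ["i"])
  |> map(fn: (r) => ({ r with
        chunk: string(v: r.i / pointsPerGroup),             // chunk id as a string tag
        chunk_size: string(v: pointsPerGroup)               // the parameter, stored as a tag
      }))
  |> drop(columns: ["i"])
  |> experimental.group(mode: "extend", columns: ["chunk", "chunk_size"])
  |> set(key: "_measurement", value: "ticks_chunked")       // write under a new measurement
  |> to(bucket: "grouped-bucket")

Later queries can then filter on chunk_size and chunk directly instead of re-running cumulativeSum over the raw points.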


@Nicolas_Carrara,
Out of curiosity, what’s your use case? You’ve got me intrigued.

@Anaisdg I am using high-frequency market data for machine learning x)