Hello!
I get the following error message:
panic: arrow/array: index out of range
when I try to join data in a large dataset.
I have two measurements in the same bucket: the first is called cycles and the second is called series. For every point in cycles, I have a set of points in series, so that I can provide different levels of data resolution and identify the series points belonging to each cycle using the join function. Here is my Flux query:
import "join"

cycles = from(bucket: vBucket)
    |> range(start: vStart, stop: vStop)
    |> filter(fn: (r) => r["_measurement"] == "cycle")
    // the time as a string links one cycle to multiple series points
    |> map(fn: (r) => ({r with cycle: string(v: r._time)}))
    |> group()
    |> yield(name: "cycles")

series = from(bucket: vBucket)
    |> range(start: vStart, stop: vStop)
    |> filter(fn: (r) => r["_measurement"] == "series")
    |> group()
    |> yield(name: "series")

join.inner(
    left: cycles,
    right: series,
    on: (l, r) => l.cycle == r.cycle,
    as: (l, r) => ({r with cycle_result: l._value}),
)
    |> yield(name: "join")
Before calling group() on cycles and series, I drop all unnecessary columns. My problem is the following: when the time range is large enough, especially when series has 36k+ rows, the join function fails with the error message above. With fewer points, the join works fine.
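The column pruning mentioned above is not shown in the query. A rough sketch of that step for the cycles stream, assuming that only _time, _value and cycle are needed downstream (the column list is an assumption for illustration):

```flux
cycles = from(bucket: vBucket)
    |> range(start: vStart, stop: vStop)
    |> filter(fn: (r) => r["_measurement"] == "cycle")
    |> map(fn: (r) => ({r with cycle: string(v: r._time)}))
    // keep only the columns the join needs; this list is an assumption
    |> keep(columns: ["_time", "_value", "cycle"])
    |> group()
```

The series stream would be pruned the same way before its own group() call.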
I have experienced this error multiple times in the past in similar scenarios where I used the join function to build 1:N relations.
By calling group() without columns, I make sure both streams have equal group keys.
In cycles, I have a field called ‘result’ that I want to add to all cycle-related series points, so that I can filter points from series based on a value that originates from cycles.
In the schema design phase, my intention was to store the result field only in cycles, because it seemed needless to repeat the same value on every series point of a given cycle. My plan was to use the join function to filter the points.
When I ran some tests, I realized that join() is very time-consuming compared to the filter() function.
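For comparison, with redundant tagging the whole join would collapse into one extra filter. A minimal sketch, assuming each series point carried the cycle's result as a tag named result and that "OK" is one of its values (both are hypothetical and not part of my current schema):

```flux
from(bucket: vBucket)
    |> range(start: vStart, stop: vStop)
    |> filter(fn: (r) => r["_measurement"] == "series")
    // hypothetical tag "result" copied from the owning cycle at write time
    |> filter(fn: (r) => r["result"] == "OK")
    |> yield(name: "filtered_series")
```

This trades extra storage on every series point for a query that needs no join at all.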
My questions are the following:
- Why does the join function fail for large sets of data?
- Even if I can make the above join work for larger amounts of data, the query time will be significantly higher, which may also lead to poor performance.
- What is the recommended schema design practice for handling 1:N relations: join or redundant tagging? Should I forget my concerns about storage requirements just to gain performant queries?
Thank you very much for your reply!