Hi,
I hope somebody has some tips or guidance for my problem.
I have a database with about 40GiB of data in it (spanning about 1.5-2 years). Now I want to downsample it, since I only really need the full-granularity data for the last ~6 months or so.
I started with a ‘naive’ task that simply tried to downsample everything into a new bucket with something like this:
from(bucket: "source")
|> range(start: -24mo)
|> aggregateWindow(fn: mean, every: 5m, createEmpty: false)
|> to(bucket: "target", org: "my-org")
I quickly noticed I ran into two problems:
- I have some string data where ‘mean’ does not make sense
- InfluxDB consumed a lot of memory and was ultimately killed by the kernel's OOM killer
(the machine has 16GiB of memory)
For the first problem I have now split the task into multiple queries that use ‘last’ for string fields and ‘mean’ for all others.
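The numeric half then looks roughly like this (just a sketch, mirroring the non-numeric query further down and using the same bucket/org names):

import "types"

from(bucket: "source")
|> range(start: -24mo)
// only numeric fields, so 'mean' is well-defined
|> filter(fn: (r) => types.isNumeric(v: r._value))
|> aggregateWindow(fn: mean, every: 5m, createEmpty: false)
|> to(bucket: "target", org: "my-org")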
For the second problem I tried to split the task per measurement and then again into 4 time ranges (24-18mo, 18-12mo, 12-6mo and 6-0mo in the past).
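One of those chunks then looks roughly like this (again just a sketch; "some-measurement" stands in for one of my actual measurements):

import "types"

from(bucket: "source")
// only the 24mo-18mo chunk; the other tasks use 18mo/12mo, 12mo/6mo and 6mo/now
|> range(start: -24mo, stop: -18mo)
|> filter(fn: (r) => r._measurement == "some-measurement")
// numeric variant; the string variant uses 'not types.isNumeric' and fn: last
|> filter(fn: (r) => types.isNumeric(v: r._value))
|> aggregateWindow(fn: mean, every: 5m, createEmpty: false)
|> to(bucket: "target", org: "my-org")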
This generally worked, but I still have one query that consumes so much memory that the daemon gets OOM-killed:
import "types"
from(bucket: "source")
|> range(start: -6mo)
|> filter(fn: (r) => r._measurement == "some-measurement")
|> filter(fn: (r) => not types.isNumeric(v: r._value))
|> aggregateWindow(fn: last, every: 5m, createEmpty: false)
|> to(bucket: "target", org: "my-org")
I wouldn't have assumed that the daemon uses (much) more memory than the complete dataset it's trying to downsample, and this is just a single measurement and 1/3-1/4 of the time range. So even in the worst case the complete data could at most be 40GiB / 3 ≈ 13.3GiB, which would fit into memory completely…
(Although I would have assumed this is handled in a streaming fashion, which would hardly consume any memory at all besides the few data points per target window…)
Is there anything I can do to prevent InfluxDB from being OOM-killed, besides:
- giving the machine more memory
- splitting the task further into smaller pieces
Thanks!