Downsampling task out of memory

Hi,

I hope somebody has some tips or guidance for my problem.

I have a database with about 40 GiB of data in it (spanning about 1.5-2 years). Now I want to downsample it, since I am not really interested in the fully granular data once it is older than 6 months or so.

Now I created a ‘naive’ task that simply tried to downsample everything into a new bucket with something like this:

from(bucket: "source")
    |> range(start: -24mo)
    |> aggregateWindow(fn: mean, every: 5m, createEmpty: false)
    |> to(bucket: "target", org: "my-org")

I quickly noticed that I ran into two problems:

  1. I have some string data where ‘mean’ does not make sense
  2. InfluxDB consumed a lot of memory and was ultimately killed by the kernel’s OOM killer
    (the machine has 16 GiB of memory)

For the first, I have now split the task into multiple queries, using ‘last’ for string fields and ‘mean’ for all others.
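Roughly like this, as two separate queries (a sketch of the idea; the per-measurement and per-time-range split described below is left out):

import "types"

// query 1: 'mean' for all numeric fields
from(bucket: "source")
    |> range(start: -24mo)
    |> filter(fn: (r) => types.isNumeric(v: r._value))
    |> aggregateWindow(fn: mean, every: 5m, createEmpty: false)
    |> to(bucket: "target", org: "my-org")

// query 2 (run separately): 'last' for everything that is not numeric
from(bucket: "source")
    |> range(start: -24mo)
    |> filter(fn: (r) => not types.isNumeric(v: r._value))
    |> aggregateWindow(fn: last, every: 5m, createEmpty: false)
    |> to(bucket: "target", org: "my-org")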

For the second, I tried to split my task per measurement, and then again into 4 parts (24-18 months, 18-12 months, 12-6 months, and 6-0 months in the past); one such slice is sketched below.
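One slice then looks roughly like this (sketch only; the other parts differ just in the range() bounds and the measurement filter):

import "types"

from(bucket: "source")
    |> range(start: -24mo, stop: -18mo)  // the 24-18mo slice
    |> filter(fn: (r) => r._measurement == "some-measurement")
    |> filter(fn: (r) => types.isNumeric(v: r._value))
    |> aggregateWindow(fn: mean, every: 5m, createEmpty: false)
    |> to(bucket: "target", org: "my-org")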

This generally worked, but I still have a query that consumes so much memory that the daemon gets OOM killed:

import "types"
from(bucket: "source")
    |> range(start: -6mo)
    |> filter(fn: (r) => r._measurement == "some-measurement")
    |> filter(fn: (r) => not types.isNumeric(v: r._value))
    |> aggregateWindow(fn: last, every: 5m, createEmpty: false)
    |> to(bucket: "target", org: "my-org")

I wouldn’t have assumed that the daemon uses (much) more memory than the complete dataset it’s trying to downsample. This is just a single measurement and a third to a quarter of the time range, so even in the worst case the complete data could be at most 40 GiB / 3 ≈ 13.3 GiB, which would fit into memory completely…

(Although I would have assumed that it would be handled in a streaming fashion, which would consume hardly any memory at all besides the few data points per target window…)

Is there anything I can do to prevent InfluxDB from being OOM killed besides:

  • giving the machine more memory
  • splitting the task further into smaller pieces

Thanks

Hello @flumm,

  1. I recommend using the following example to deal with different types:
    types.isType() function | Flux 0.x Documentation
    Oh whoops, it looks like you’ve already seen this (I didn’t finish reading before replying, but I’ll leave it here for anyone else who comes across it; there’s a sketch of the combined query at the end of this post).

  2. Unfortunately, you’ll have to split the task further into smaller pieces. InfluxDB v2 consumes as much memory as it can. The engineering team is hoping to address these problems with InfluxDB Cloud powered by IOx. You can learn more about it here:
    Understanding InfluxDB IOx and the Commitment to Open Source | InfluxData

However, you’ll be expected to perform downsampling with a serverless compute solution instead of with tasks.
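For completeness, the pattern from that docs page applied to your queries would look roughly like this (a sketch using your bucket names, not tested against your schema):

import "types"

data = from(bucket: "source")
    |> range(start: -6mo)
    |> filter(fn: (r) => r._measurement == "some-measurement")

// numeric fields: mean per 5m window
numeric = data
    |> filter(fn: (r) => types.isNumeric(v: r._value))
    |> aggregateWindow(fn: mean, every: 5m, createEmpty: false)

// string fields: last value per 5m window
strings = data
    |> filter(fn: (r) => types.isType(v: r._value, type: "string"))
    |> aggregateWindow(fn: last, every: 5m, createEmpty: false)

union(tables: [numeric, strings])
    |> to(bucket: "target", org: "my-org")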

OK, maybe I’m confused or I did not express myself clearly: I’m talking about a self-hosted instance of InfluxDB OSS, so the InfluxDB Cloud features don’t really apply here…

I also noticed that if I leave out the type check, the query can actually finish and does not use nearly as much memory (maybe there’s a bug here?). But for me that is not really a solution, since I would like to take the ‘mean’ of numeric values and the ‘last’ of string values…

Again, it seems I don’t quite understand: since I’m using InfluxDB OSS, not Cloud, how could I do the downsampling differently in my case (without tasks)?

Thanks again

Hello @flumm,
Sorry for the confusion; I meant to imply that if you were willing to migrate to the cloud solution, these caveats would apply.

You can still try to complete further processing in OSS by breaking down your tasks or scaling vertically; one way to break them down is sketched below.
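For example, instead of backfilling whole months in one query, the recurring task can aggregate only the window since its last run, which keeps the working set small; the history is then backfilled once with many small, manually run queries. A sketch (task name and interval are placeholders):

option task = {name: "downsample-5m", every: 1h}

from(bucket: "source")
    |> range(start: -task.every)  // only re-read the last hour, so memory stays bounded
    |> aggregateWindow(fn: mean, every: 5m, createEmpty: false)
    |> to(bucket: "target", org: "my-org")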