InfluxDB backed by NFS PV is eating up memory

I have deployed InfluxDB 1.6.1-alpine (community edition) as a pod in a Kubernetes cluster with a memory limit of 8Gi, backed by NFS volumes as persistent volumes. I have tuned the following settings as recommended by one of the InfluxDB developers, but when the pod restarts, memory usage still climbs to 8Gi and the pod gets OOM-killed by Kubernetes (see the check after the config below). Is there anything else I am missing?

bind-address = ":8088"

[meta]
  dir = "/var/lib/influxdb/meta"
  retention-autocreate = true
  logging-enabled = true

[data]
  dir = "/var/lib/influxdb/data"
  index-version = "tsi1"
  wal-dir = "/var/lib/influxdb/wal"
  trace-logging-enabled = true
  query-log-enabled = true
  cache-max-memory-size = 1073741824
  cache-snapshot-memory-size = 26214400
  cache-snapshot-write-cold-duration = "10m0s"
  compact-full-write-cold-duration = "4h0m0s"
  

[coordinator]
  write-timeout = "10s"
  max-concurrent-queries = 0
  query-timeout = "0s"
  log-queries-after = "10s"
  max-select-point = 0
  max-select-series = 0
  max-select-buckets = 0

[retention]
  enabled = true
  check-interval = "30m0s"

[shard-precreation]
  enabled = true
  check-interval = "10m0s"
  advance-period = "30m0s"

[admin]
  enabled = true
  bind-address = ":8083"
  https-enabled = false
  https-certificate = "/etc/ssl/influxdb.pem"

[monitor]
  store-enabled = false
  store-database = "_internal"
  store-interval = "10s"

[http]
  enabled = true
  bind-address = ":8086"
  auth-enabled = false
  realm = ""
  log-enabled = true
  access-log-path = ""
  write-tracing = true
  pprof-enabled = true
  https-enabled = false
  https-certificate = ""
  https-private-key = ""
  max-row-limit = 0
  max-connection-limit = 0
  shared-secret = "beetlejuicebeetlejuicebeetlejuice"
  
  unix-socket-enabled = false
  bind-socket = ""

  max-concurrent-write-limit = 0
  max-enqueued-write-limit = 0
  enqueued-write-timeout = 0

[ifql]
  enabled = true
  log-enabled = true
  bind-address = ":8082"

[logging]
  format = "auto"
  level = "info"
  suppress-logo = false

[subscriber]
  enabled = true
  http-timeout = "30s"
  insecure-skip-verify = false
  ca-certs = ""
  write-concurrency = 40
  write-buffer-size = 1000


[[graphite]]
  enabled = false
  bind-address = ":2003"
  database = "graphite"
  retention-policy = ""
  protocol = "tcp"
  batch-size = 5000
  batch-pending = 10
  batch-timeout = "1s"
  consistency-level = "one"
  separator = "."
  udp-read-buffer = 0

[[collectd]]
  enabled = false
  bind-address = ":25826"
  database = "collectd"
  retention-policy = "autogen"
  batch-size = 5000
  batch-pending = 10
  batch-timeout = "10s"
  read-buffer = 0
  typesdb = "/usr/share/collectd/types.db"
  security-level = "none"
  auth-file = "/etc/collectd/auth_file"

[[opentsdb]]
  enabled = false
  bind-address = ":4242"
  database = "opentsdb"
  retention-policy = "autogen"
  consistency-level = "one"
  tls-enabled = false
  certificate = "/etc/ssl/influxdb.pem"
  batch-size = 1000
  batch-pending = 5
  batch-timeout = "1s"
  log-point-errors = true


[[udp]]
  enabled = false
  bind-address = ":8089"
  database = "udp"
  retention-policy = "autogen"
  batch-size = 5000
  batch-pending = 10
  read-buffer = 0
  batch-timeout = "1s"
  precision = "ns"

[continuous_queries]
  log-enabled = true
  enabled = true
  run-interval = "1s"
  query-stats-enabled = true
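
For reference, this is how I am confirming the OOM kill (the pod name influxdb-0 and namespace monitoring below are placeholders for my actual values):

# the last container state shows Reason: OOMKilled after a kill
kubectl describe pod influxdb-0 -n monitoring | grep -A 4 'Last State'

# a climbing restart count confirms the kill/restart loop
kubectl get pod influxdb-0 -n monitoring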

First, I recommend running (or upgrading to) the latest InfluxDB version, which is currently 1.7.6. The TSI index has had a number of bug fixes since version 1.6.1.

Memory issues are almost always workload-related. Is InfluxDB receiving any writes when it is started? Does it run out of memory even when there are no writes? Does InfluxDB start when the container memory limit is raised to 12Gi or 16Gi?
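
If you want to test a higher limit quickly, here is a sketch (it assumes InfluxDB runs as a StatefulSet named influxdb; adjust to your deployment):

# temporarily raise the container memory limit to check whether
# InfluxDB starts with more headroom
kubectl set resources statefulset influxdb --limits=memory=16Gi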

Generally, the only non-default config option you’ll need for your use case is index-version = "tsi1", which tells InfluxDB to use the TSI series index. InfluxDB has built-in default config values. I recommend only specifying non-default config values in the config file to avoid accidentally including stale config options (like the [admin] section). For example, this is a minimal valid InfluxDB config you could use:

[meta]
  dir = "/var/lib/influxdb/meta"

[data]
  dir = "/var/lib/influxdb/data"
  index-version = "tsi1"
  wal-dir = "/var/lib/influxdb/wal"
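
One way to deliver this file to the pod is a ConfigMap; a sketch, assuming you save the file as influxdb.conf and mount it at the stock path /etc/influxdb/influxdb.conf:

# create the ConfigMap from the minimal config file
kubectl create configmap influxdb-config --from-file=influxdb.conf

# inside the container, point influxd at the mounted file:
#   influxd -config /etc/influxdb/influxdb.conf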

Thanks @gunnar. As you recommended, I have upgraded to 1.7.6 (msg=“InfluxDB starting” log_id=0F1ypAqG000 version=1.7.6 branch=1.7 commit=01c8dd416270f424ab0c40f9291e269ac6921964).

I am using the same minimal configuration, with the resource limit set to 8Gi.

To answer your questions: yes, Prometheus is always running, and InfluxDB receives writes as soon as it starts.

Yes, InfluxDB starts successfully with 8Gi; there have been no issues with writes so far.

The problem occurs when Grafana queries InfluxDB for visualization: that is when the pod gets killed with an OOM error.

Attaching the logs, which show a clean startup, successful writes, and Grafana executing a query.

influxdb.txt (36.3 KB)

Hi @gkarthiks, I just recognized the username. 🙂 One important thing I forgot to mention is that the in-memory (inmem) index needs to be converted to a TSI index using the following command:

influx_inspect buildtsi -datadir /path/to/influxdb/data -waldir /path/to/influxdb/wal

Two things to note when running the conversion:

  1. Stop the InfluxDB process before running the conversion.
  2. Make sure the files generated by influx_inspect buildtsi have the correct file permissions. The easiest way to ensure this is to run the command as the same user that runs the InfluxDB process (typically influxdb); see the sketch after this list.
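
Putting both notes together, a sketch of the conversion, assuming the stock paths from your config and an influxdb service user (adjust names to your deployment):

# 1. stop InfluxDB first; in Kubernetes, scaling the workload to
#    zero replicas stops the process while keeping the NFS volume
kubectl scale statefulset influxdb --replicas=0

# 2. run the conversion as the influxdb user so the generated
#    index files get the correct ownership
sudo -u influxdb influx_inspect buildtsi \
  -datadir /var/lib/influxdb/data \
  -waldir /var/lib/influxdb/wal

# alternatively, fix ownership after the fact
chown -R influxdb:influxdb /var/lib/influxdb/data /var/lib/influxdb/wal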

After starting InfluxDB with the TSI index, index data will be brought into memory only as needed and released back to disk when it is no longer needed. To determine a new memory limit for the container, look at how much memory InfluxDB uses a few minutes after startup and then add 50% to 100% on top as a buffer.
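
To see the actual usage, a sketch (kubectl top requires the metrics-server; the pod name is a placeholder):

# check the pod's memory usage a few minutes after startup
kubectl top pod influxdb-0 -n monitoring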