CPU / Memory "explodes" with InfluxDB 1.8.1 -> unusable on Debian Stretch

Hello,

Since 1.7.x I have had very big problems with our InfluxDB. It had been running for years without any major issues, but since 1.7.1 everything is pretty bad.
I had 188GB of metrics from Icinga2 and Telegraf across a few hundred hosts. I switched the index to TSI1, but 32GB RAM wasn’t enough. The problem starts every hour: it eats all the CPUs and slowly the memory. It may have something to do with GC or the retention policy …
I have searched for weeks for a solution, but I’m lost now …
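(Side note on the TSI1 switch: as far as I understand, index-version = "tsi1" only applies to newly created shards; existing shards have to be rebuilt offline with influx_inspect while influxd is stopped. The paths below are simply the ones from my config, so treat this as a sketch:)

# stop influxd, then rebuild the TSI index from the existing TSM/WAL files
influx_inspect buildtsi -datadir /var/lib/influxdb/data -waldir /var/lib/influxdb/wal
# make sure the rebuilt index files are owned by the influxdb user before starting again
chown -R influxdb:influxdb /var/lib/influxdb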

  • System: Debian Stretch
  • VM: KVM
  • 36GB Memory
  • 6 vCPU cores
  • InfluxDB 1.8.1
reporting-disabled = true

[meta]
  enabled = true
  dir = "/var/lib/influxdb/meta"
  bind-address = "graph-01.example.com:8088"
  http-bind-address = "graph-01.example.com:8091"
  retention-autocreate = true
  election-timeout = "1s"
  heartbeat-timeout = "1s"
  leader-lease-timeout = "500ms"
  commit-timeout = "50ms"
  cluster-tracing = false

[data]
  enabled = true
  dir = "/var/lib/influxdb/data"
  wal-dir = "/var/lib/influxdb/wal"
  wal-logging-enabled = true
  wal-fsync-delay = "50ms"
  trace-logging-enabled = false
  query-log-enabled = false
  index-version = "tsi1"
  max-series-per-database = 1000000
  compact-throughput = "1m"
  compact-throughput-burst = "10m"
  cache-snapshot-memory-size = "25k"
  cache-snapshot-write-cold-duration = "1m"

[hinted-handoff]
  enabled = true
  dir = "/var/lib/influxdb/hh"
  max-size = 1073741824
  max-age = "168h"
  retry-rate-limit = 0
  retry-interval = "1s"
  retry-max-interval = "1m"
  purge-interval = "1h"

[coordinator]
  write-timeout = "10s"
  query-timeout = "0"
  log-queries-after = "0"
  max-select-point = 0
  max-select-series = 0
  max-select-buckets = 0

[retention]
  enabled = true
  check-interval = "30m"

[shard-precreation]
  enabled = true
  check-interval = "10m"
  advance-period = "30m"

[monitor]
  store-enabled = false
  store-database = "_internal"
  store-interval = "10s"

[admin]
  enabled = false
  bind-address = "127.0.0.0:8088"
  https-enabled = true

[http]
  enabled = true
  bind-address = "graph-01.example.com:8086"
  auth-enabled = true
  log-enabled = false
  write-tracing = false
  pprof-enabled = false
  https-enabled = true
  max-row-limit = 10000
  realm = "InfluxDB"

[subscriber]
  enabled = true
  http-timeout = "30s"

[[graphite]]
  enabled = false

[[collectd]]
  enabled = false

[[opentsdb]]
  enabled = false

[[udp]]
  enabled = false

[continuous_queries]
  enabled = true
  log-enabled = true

In the end I dropped both DBs and recreated them :frowning: (OK, I have a backup), so the data dropped from 188GB to ~8GB. I also moved from the iSCSI (10Gb/s) device to local storage (all SSD) and back … without success.

  • Icinga2 DB
> use icinga2
Using database icinga2
> SHOW RETENTION POLICIES
name       duration   shardGroupDuration replicaN default
----       --------   ------------------ -------- -------
rp_1_year  8760h0m0s  168h0m0s           1        true
rp_2_years 17520h0m0s 168h0m0s           1        false
rp_3_years 26208h0m0s 168h0m0s           1        false
  • Telegraf
> SHOW RETENTION POLICIES
name       duration   shardGroupDuration replicaN default
----       --------   ------------------ -------- -------
autogen    696h0m0s   168h0m0s           1        true
rp_1_years 8760h0m0s  168h0m0s           1        false
rp_5_years 43680h0m0s 168h0m0s           1        false
  • CQ
> SHOW CONTINUOUS QUERIES
name: icinga2
name            query
----            -----
cq_after_1_year CREATE CONTINUOUS QUERY cq_after_1_year ON icinga2 BEGIN SELECT mean(value) AS value, mean(crit) AS crit, mean(warn) AS warn INTO icinga2.rp_2_years.:MEASUREMENT FROM icinga2.rp_1_year./.*/ WHERE time < now() - 52w GROUP BY time(1h), * END
cq_after_2_year CREATE CONTINUOUS QUERY cq_after_2_year ON icinga2 BEGIN SELECT mean(value) AS value, mean(crit) AS crit, mean(warn) AS warn INTO icinga2.rp_3_years.:MEASUREMENT FROM icinga2.rp_2_years./.*/ WHERE time < now() - 104w GROUP BY time(1d), * END


name: telegraf
name             query
----             -----
cq_after_1_month CREATE CONTINUOUS QUERY cq_after_1_month ON telegraf BEGIN SELECT mean(*) INTO telegraf.rp_1_years.:MEASUREMENT FROM telegraf.autogen./.*/ GROUP BY time(1h), * END
cq_after_1_year  CREATE CONTINUOUS QUERY cq_after_1_year ON telegraf BEGIN SELECT mean(*) INTO telegraf.rp_5_years.:MEASUREMENT FROM telegraf.rp_1_years./.*/ GROUP BY time(1d), * END

I now restart InfluxDB every hour, otherwise it kills the VM (OOM / I/O). I also tried extending the memory to 48GB and the swap space to 80GB(!!), but after three hours … it wasn’t enough.

Maybe something is wrong in my configuration … as I said, I have tried a lot.

Any help would be great.

Update

Reading https://link.medium.com/LDy0ublsH8, the problem is “mean(*)”, which kills everything, if I understand it correctly. The problem is … we have so many Telegraf plugins … How can I avoid creating tens or hundreds of CQs?

Hi @linuxmail,

Have you narrowed it down to those CQs for certain? I.e., have you tried removing the CQs with mean(*) and confirmed that the problem goes away?
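If not, it might be worth dropping them temporarily and watching CPU/memory for a few hours; they can be recreated later from the CREATE CONTINUOUS QUERY statements you posted. For example (names taken from your SHOW CONTINUOUS QUERIES output):

> DROP CONTINUOUS QUERY cq_after_1_month ON telegraf
> DROP CONTINUOUS QUERY cq_after_1_year ON telegraf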

I read the linked blog post. The CQ is doing what I would expect: calculating the mean of all fields in all series. The GROUP BY * tells it to group each series separately. If 1 million series were written in the query’s time window, that GROUP BY breaks them out into 1 million buckets and computes the mean of each, which can be expensive but is probably what you want.
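As a rough illustration (the measurement, field, and tag values here are made up), for a single measurement and a single series that one CQ behaves roughly like:

SELECT mean(usage_idle) AS mean_usage_idle, mean(usage_user) AS mean_usage_user INTO telegraf.rp_1_years.cpu FROM telegraf.autogen.cpu WHERE host = 'web-01' AND time >= now() - 1h GROUP BY time(1h)

…repeated for every measurement and every distinct tag combination written in the window, which is where the fan-out comes from.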

hi,

yes, that helps, but the data is growing very fast: roughly 4GB in two hours. So I have two questions:
Is there a way to make it better? I mean: reduce / compact the metrics? My problem is only with the Telegraf DB. It would be OK to reduce the resolution from 10sec to maybe 1min after one day.
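Something along these lines is what I have in mind: keep the raw 10s data only for a day or two in a short default RP and downsample per measurement into the longer RP. The RP name, durations, and fields here are just examples, not what I actually run:

CREATE RETENTION POLICY "raw_2d" ON "telegraf" DURATION 48h REPLICATION 1 DEFAULT
CREATE CONTINUOUS QUERY "cq_cpu_1m" ON "telegraf" BEGIN SELECT mean(usage_idle) AS usage_idle, mean(usage_user) AS usage_user INTO telegraf.rp_1_years.cpu FROM telegraf.raw_2d.cpu GROUP BY time(1m), * END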

The second question is: is there a way to find the metrics with the most entries? I have, for example, disabled the Ceph metrics, which are quite a lot, but maybe there are others I could drop to release the pressure.
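(For this I found the cardinality statements in the docs and have been poking at them to see which measurements the series come from, but I’m not sure whether that is the right approach:)

> SHOW MEASUREMENT CARDINALITY ON telegraf
> SHOW SERIES EXACT CARDINALITY ON telegraf
> SHOW TAG VALUES EXACT CARDINALITY ON telegraf WITH KEY = "host"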

cu denny