Hello,
Since 1.7.x I have had very big problems with our InfluxDB. It had been running for years without any major issues, but since 1.7.1 everything has been pretty bad.
I had 188GB of metrics from Icinga2 and Telegraf with a few hundred hosts. I switched the index to TSI1, but 32GB of RAM wasn't enough. The problem starts every hour, eating all the CPUs and, slowly, the memory. It may have something to do with GC or the retention policy …
I have searched for weeks for a solution, but I'm lost now …
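In case series cardinality is the culprit (just a guess on my part), the per-database cardinality can be checked like this:
> SHOW SERIES CARDINALITY ON icinga2
> SHOW SERIES CARDINALITY ON telegraf
(SHOW SERIES EXACT CARDINALITY gives an exact count instead of an estimate, but is more expensive to run.)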
- System: Debian Stretch
- VM: KVM
- 36GB Memory
- 6 vCPU cores
- InfluxDB 1.8.1
My influxdb.conf:
reporting-disabled = true
[meta]
enabled = true
dir = "/var/lib/influxdb/meta"
bind-address = "graph-01.example.com:8088"
http-bind-address = "graph-01.example.com:8091"
retention-autocreate = true
election-timeout = "1s"
heartbeat-timeout = "1s"
leader-lease-timeout = "500ms"
commit-timeout = "50ms"
cluster-tracing = false
[data]
enabled = true
dir = "/var/lib/influxdb/data"
wal-dir = "/var/lib/influxdb/wal"
wal-logging-enabled = true
wal-fsync-delay = "50ms"
trace-logging-enabled = false
query-log-enabled = false
index-version = "tsi1"
max-series-per-database = 1000000
compact-throughput = "1m"
compact-throughput-burst = "10m"
cache-snapshot-memory-size = "25k"
cache-snapshot-write-cold-duration = "1m"
[hinted-handoff]
enabled = true
dir = "/var/lib/influxdb/hh"
max-size = 1073741824
max-age = "168h"
retry-rate-limit = 0
retry-interval = "1s"
retry-max-interval = "1m"
purge-interval = "1h"
[coordinator]
write-timeout = "10s"
query-timeout = "0"
log-queries-after = "0"
max-select-point = 0
max-select-series = 0
max-select-buckets = 0
[retention]
enabled = true
check-interval = "30m"
[shard-precreation]
enabled = true
check-interval = "10m"
advance-period = "30m"
[monitor]
store-enabled = false
store-database = "_internal"
store-interval = "10s"
[admin]
enabled = false
bind-address = "127.0.0.0:8088"
https-enabled = true
[http]
enabled = true
bind-address = "graph-01.example.com:8086"
auth-enabled = true
log-enabled = false
write-tracing = false
pprof-enabled = false
https-enabled = true
max-row-limit = 10000
realm = "InfluxDB"
[subscriber]
enabled = true
http-timeout = "30s"
[[graphite]]
enabled = false
[[collectd]]
enabled = false
[[opentsdb]]
enabled = false
[[udp]]
enabled = false
[continuous_queries]
enabled = true
log-enabled = true
In the end I dropped both DBs and recreated them (ok, I have a backup), so the data shrank from 188GB to ~8GB. I also moved from an iSCSI (10Gb/s) device to local storage (all SSD) and back … without success.
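The recreation itself was nothing special, roughly:
> DROP DATABASE icinga2
> CREATE DATABASE icinga2
> DROP DATABASE telegraf
> CREATE DATABASE telegraf
(with the retention policies and CQs below set up again afterwards).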
- Icinga2 DB
> use icinga2
Using database icinga2
> SHOW RETENTION POLICIES
name duration shardGroupDuration replicaN default
---- -------- ------------------ -------- -------
rp_1_year 8760h0m0s 168h0m0s 1 true
rp_2_years 17520h0m0s 168h0m0s 1 false
rp_3_years 26208h0m0s 168h0m0s 1 false
- Telegraf
> SHOW RETENTION POLICIES
name duration shardGroupDuration replicaN default
---- -------- ------------------ -------- -------
autogen 696h0m0s 168h0m0s 1 true
rp_1_years 8760h0m0s 168h0m0s 1 false
rp_5_years 43680h0m0s 168h0m0s 1 false
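(For completeness: these RPs were created with statements along the following lines; I'm reconstructing them from the output above, so the exact original wording may differ.)
> CREATE RETENTION POLICY "rp_1_year" ON "icinga2" DURATION 8760h REPLICATION 1 DEFAULT
> CREATE RETENTION POLICY "rp_2_years" ON "icinga2" DURATION 17520h REPLICATION 1
> CREATE RETENTION POLICY "rp_5_years" ON "telegraf" DURATION 43680h REPLICATION 1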
- CQ
> SHOW CONTINUOUS QUERIES
name: icinga2
name query
---- -----
cq_after_1_year CREATE CONTINUOUS QUERY cq_after_1_year ON icinga2 BEGIN SELECT mean(value) AS value, mean(crit) AS crit, mean(warn) AS warn INTO icinga2.rp_2_years.:MEASUREMENT FROM icinga2.rp_1_year./.*/ WHERE time < now() - 52w GROUP BY time(1h), * END
cq_after_2_year CREATE CONTINUOUS QUERY cq_after_2_year ON icinga2 BEGIN SELECT mean(value) AS value, mean(crit) AS crit, mean(warn) AS warn INTO icinga2.rp_3_years.:MEASUREMENT FROM icinga2.rp_2_years./.*/ WHERE time < now() - 104w GROUP BY time(1d), * END
name: telegraf
name query
---- -----
cq_after_1_month CREATE CONTINUOUS QUERY cq_after_1_month ON telegraf BEGIN SELECT mean(*) INTO telegraf.rp_1_years.:MEASUREMENT FROM telegraf.autogen./.*/ GROUP BY time(1h), * END
cq_after_1_year CREATE CONTINUOUS QUERY cq_after_1_year ON telegraf BEGIN SELECT mean(*) INTO telegraf.rp_5_years.:MEASUREMENT FROM telegraf.rp_1_years./.*/ GROUP BY time(1d), * END
I now restart InfluxDB every hour, otherwise it kills the VM (OOM / I/O). I also tried extending the memory to 48GB and the swap space to 80GB(!!), but after three hours … it wasn't enough.
Maybe something in my configuration is stupid … as I said, I have tried a lot.
Any help would be great.
Update
Reading https://link.medium.com/LDy0ublsH8: the problem is “mean(*)”, which kills everything, if I understand it correctly. The problem is … we have so many Telegraf plugins … How can I avoid creating tens or hundreds of CQs?
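If I understand the article correctly, the alternative is one CQ per measurement with the fields spelled out explicitly, something like this (the field names below are just an example for Telegraf's cpu measurement, not our full schema):
> CREATE CONTINUOUS QUERY cq_cpu_after_1_month ON telegraf BEGIN SELECT mean(usage_user) AS usage_user, mean(usage_system) AS usage_system, mean(usage_idle) AS usage_idle INTO telegraf.rp_1_years.cpu FROM telegraf.autogen.cpu GROUP BY time(1h), * END
… and one of those for every measurement of every plugin, which is exactly what I would like to avoid.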