Retention policy deletion check makes memory explode


I just started using InfluxDB (1.7) to collect cgroup measurements,
and I found that the retention policy deletion check has a very high memory and CPU cost.
The influx log looks like this:
ts=2018-09-12T00:08:01.600466Z lvl=info msg="Retention policy deletion check (start)" log_id=0AU6aqCG000 service=retention trace_id=0AUmYLG0000 op_name=retention_delete_check op_event=start

ts=2018-09-12T00:32:43.064925Z lvl=info msg="Retention policy deletion check (end)" log_id=0AU6aqCG000 service=retention trace_id=0AUmYLG0000 op_name=retention_delete_check op_event=end op_elapsed=1481464.468ms

ts=2018-09-12T01:08:01.600467Z lvl=info msg="Retention policy deletion check (start)" log_id=0AU6aqCG000 service=retention trace_id=0AUpz3l0000 op_name=retention_delete_check op_event=start

ts=2018-09-12T01:50:50.442853Z lvl=info msg="Retention policy deletion check (end)" log_id=0AU6aqCG000 service=retention trace_id=0AUpz3l0000 op_name=retention_delete_check op_event=end op_elapsed=2568842.397ms
ts=2018-09-12T01:50:50.442905Z lvl=info msg="Retention policy deletion check (start)" log_id=0AU6aqCG000 service=retention trace_id=0AUsQrIW000 op_name=retention_delete_check op_event=start
ts=2018-09-12T01:50:50.443269Z lvl=info msg="Retention policy deletion check (end)" log_id=0AU6aqCG000 service=retention trace_id=0AUsQrIW000 op_name=retention_delete_check op_event=end op_elapsed=0.370ms
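The op_elapsed values above are in milliseconds; converting them makes the cost clearer. A throwaway script (the log lines are trimmed copies of the ones above):

```python
import re

# Two "op_event=end" lines copied (trimmed) from the log above.
log = """\
ts=2018-09-12T00:32:43.064925Z op_name=retention_delete_check op_event=end op_elapsed=1481464.468ms
ts=2018-09-12T01:50:50.442853Z op_name=retention_delete_check op_event=end op_elapsed=2568842.397ms
"""

for line in log.splitlines():
    m = re.search(r"op_elapsed=([\d.]+)ms", line)
    if m:
        minutes = float(m.group(1)) / 1000 / 60
        print(f"check took {minutes:.1f} min")
# → check took 24.7 min
# → check took 42.8 min
```

So a single deletion check is running for 25–43 minutes, nearly back to back.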

When I watch the heap size in Chronograf, it looks like this:


CPU shows the same pattern:

Here is some more information:
The series count has reached 4 million.

Shard groups for telegraf (id, database, retention policy, start time, end time, expiry time):

205 telegraf one_hour 2018-09-11T20:00:00Z 2018-09-11T21:00:00Z 2018-09-12T03:00:00Z
207 telegraf one_hour 2018-09-11T21:00:00Z 2018-09-11T22:00:00Z 2018-09-12T04:00:00Z
209 telegraf one_hour 2018-09-11T22:00:00Z 2018-09-11T23:00:00Z 2018-09-12T05:00:00Z
211 telegraf one_hour 2018-09-11T23:00:00Z 2018-09-12T00:00:00Z 2018-09-12T06:00:00Z
213 telegraf one_hour 2018-09-12T00:00:00Z 2018-09-12T01:00:00Z 2018-09-12T07:00:00Z
215 telegraf one_hour 2018-09-12T01:00:00Z 2018-09-12T02:00:00Z 2018-09-12T08:00:00Z
217 telegraf one_hour 2018-09-12T02:00:00Z 2018-09-12T03:00:00Z 2018-09-12T09:00:00Z
219 telegraf one_hour 2018-09-12T03:00:00Z 2018-09-12T04:00:00Z 2018-09-12T10:00:00Z

The retention policies (name, duration, shardGroupDuration, replicaN, default):
one_hour 6h0m0s 1h0m0s 1 true
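The expiry times in the shard group listing follow directly from the one_hour policy: expiry = end time + the 6h retention duration. A quick sanity check in plain Python (nothing InfluxDB-specific):

```python
from datetime import datetime, timedelta

# Shard group 205 above: ends 2018-09-11T21:00:00Z; the one_hour policy keeps 6h.
end_time = datetime(2018, 9, 11, 21, 0, 0)
expiry = end_time + timedelta(hours=6)
print(expiry.isoformat() + "Z")  # → 2018-09-12T03:00:00Z, matching the listing
```

With a 6h duration and 1h shard groups, only about seven groups (six retained plus the one being written) should be live at a time, which roughly matches the eight listed.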

TSM files:
-rw-r--r-- 1 root root 390814115 Sep 12 10:36 000000008-000000002.tsm
-rw-r--r-- 1 root root 132929882 Sep 12 10:41 000000009-000000001.tsm
-rw-r--r-- 1 root root 132502656 Sep 12 10:46 000000010-000000001.tsm
-rw-r--r-- 1 root root 127318427 Sep 12 10:52 000000011-000000001.tsm
-rw-r--r-- 1 root root 391 Sep 12 10:00 fields.idx
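For scale, the TSM files listed above total well under 1 GB on disk, so the heap growth is far larger than the stored data. Simple arithmetic on the sizes above:

```python
# Byte sizes of the four .tsm files in the listing above.
sizes = [390814115, 132929882, 132502656, 127318427]
total = sum(sizes)
print(f"{total / 1e6:.0f} MB")  # → 784 MB
```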

InfluxDB config:
[meta]
dir = "/data/tdwadmin/tdwenv/influxdb/influxdb/var/lib/influxdb/meta"
retention-autocreate = true
[data]
dir = "/data/tdwadmin/tdwenv/influxdb/influxdb/var/lib/influxdb/data"
wal-dir = "/data/tdwadmin/tdwenv/influxdb/influxdb/var/lib/influxdb/wal"
wal-fsync-delay = "0"
index-version = "inmem"
trace-logging-enabled = true
query-log-enabled = false
cache-max-memory-size = "20g"
cache-snapshot-memory-size = "128m"
max-index-log-file-size = "16m"
max-series-per-database = 0
max-values-per-tag = 0
compact-full-write-cold-duration = "1h"
cache-snapshot-write-cold-duration = "10m"
max-concurrent-compactions = 4

[coordinator]
write-timeout = “10s”

[retention]
enable = true
check-interval = "30m"

[http]
bind-address = ":8086"
realm = "InfluxDB"
log-enabled = false
access-log-path = "influx.http.log"
max-body-size = 0

And yes, the server has 128 GB of RAM, 48 cores, and a 256 GB SSD.

More details in the GitHub issue: https://github.com/influxdata/influxdb/issues/10277