We have an InfluxDB VM whose swap is constantly full. Even after the VM is restarted, swap usage climbs back to 100% within 20 minutes.
free -h shows the following (swap is at 100% even though 59Gi of memory is available):
               total        used        free      shared  buff/cache   available
Mem:           123Gi        70Gi       567Mi       551Mi        52Gi        59Gi
Swap:            9Gi         9Gi          0B
If we look at the Pressure Stall Information (PSI), we see the following.
IO (most tasks appear to be stalled waiting for IO):
cat /proc/pressure/io
some avg10=84.83 avg60=78.83 avg300=78.96 total=70337558807
full avg10=84.38 avg60=78.05 avg300=78.08 total=69619870053
Memory:
cat /proc/pressure/memory
some avg10=32.65 avg60=32.74 avg300=31.25 total=35534063966
full avg10=32.25 avg60=32.34 avg300=30.87 total=35182532561
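To make these numbers easier to compare over time, each PSI line can be split into its fields. The following is only a minimal sketch: the function name `parse_psi_line` is hypothetical, and the sample is the "some" line from the IO output above rather than a live read of /proc/pressure/io.

```python
def parse_psi_line(line: str) -> tuple[str, dict[str, float]]:
    """Split one PSI line into its kind ('some' or 'full') and its fields."""
    kind, *fields = line.split()
    values = {}
    for field in fields:
        # Each field looks like "avg10=84.83" or "total=70337558807".
        key, _, raw = field.partition("=")
        values[key] = float(raw)
    return kind, values


# Sample copied from the /proc/pressure/io output above.
sample = "some avg10=84.83 avg60=78.83 avg300=78.96 total=70337558807"
kind, values = parse_psi_line(sample)
print(kind, values["avg10"], values["avg300"])
```

The avg10, avg60 and avg300 fields are percentages of wall-clock time over the last 10, 60 and 300 seconds in which at least one task ("some") or all non-idle tasks ("full") were stalled; total is the cumulative stall time in microseconds.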
The swappiness value (vm.swappiness) is 60. If we set it to 0, the VM becomes unresponsive after some time.
If we run atop, we see the following lines in red:
SWP | tot 10.0G | | free 0.0M | swcac 505.9M
DSK | nvme2n1 | busy 100% | read 33115 | write 527 | discrd 0 | KiB/r 19 | KiB/w 173 | | KiB/d 0 | MBr/s 63.3 | MBw/s 8.9 | avq 88.19 | avio 0.30 ms
This clearly shows that nvme2n1 is 100% busy, mostly with reads (33115 reads versus 527 writes in the interval).
If I log in to the influx CLI and run the following commands:
Series cardinality:
> show series cardinality
cardinality estimation
----------------------
252390866
Show Queries:
> show queries
qid query database duration status
--- ----- -------- -------- ------
265 SHOW QUERIES metrics 53µs running
So swap usage is high even when no queries are running.
In the InfluxDB logs, most of the writes are failing with a 500 (timeout):
metrics_user [24/Mar/2024:06:34:33 +0000] "POST /write?db=metrics&precision=n&consistency=one HTTP/1.1 " 500 20 "-" "okhttp/4.11.0" 9370eb83-e9a8-11ee-b07c-06e3b073e7a7 10494393
ts=2024-03-24T06:34:43.743687Z lvl=error msg="[500] - \"timeout\"" log_id=0o6lIAql000 service=httpd
ts=2024-03-24T06:35:57.088936Z lvl=info msg="Snapshot for path written" log_id=0o6lIAql000 engine=tsm1 trace_id=0o7n_W00000 op_name=tsm1_cache_snapshot path=/var/vcap/store/influxdb/data/metrics/metrics_default/2803 duration=36128.915ms
Running iotop makes it clear that the disk activity comes from influxd, and that it is almost entirely reads:
4272 be/3 root 0.00 B/s 94.47 K/s ?unavailable? [jbd2/nvme2n1p1-8]
36921 be/2 vcap 1169.95 K/s 0.00 B/s ?unavailable? influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
36927 be/2 vcap 323.37 K/s 0.00 B/s ?unavailable? influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
36928 be/2 vcap 2038.33 K/s 0.00 B/s ?unavailable? influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
36941 be/2 vcap 1936.59 K/s 0.00 B/s ?unavailable? influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
37020 be/2 vcap 385.14 K/s 0.00 B/s ?unavailable? influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
37031 be/2 vcap 2.29 M/s 0.00 B/s ?unavailable? influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
37132 be/2 vcap 1758.56 K/s 0.00 B/s ?unavailable? influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
37135 be/2 vcap 2.38 M/s 0.00 B/s ?unavailable? influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
37136 be/2 vcap 2.49 M/s 0.00 B/s ?unavailable? influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
37137 be/2 vcap 1140.88 K/s 0.00 B/s ?unavailable? influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
37157 be/2 vcap 2.28 M/s 0.00 B/s ?unavailable? influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
37159 be/2 vcap 911.98 K/s 0.00 B/s ?unavailable? influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
37162 be/2 vcap 1714.96 K/s 0.00 B/s ?unavailable? influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
37163 be/2 vcap 1762.19 K/s 0.00 B/s ?unavailable? influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
37168 be/2 vcap 897.45 K/s 0.00 B/s ?unavailable? influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
37169 be/2 vcap 1137.25 K/s 0.00 B/s ?unavailable? influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
37172 be/2 vcap 1758.56 K/s 0.00 B/s ?unavailable? influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
37173 be/2 vcap 1718.59 K/s 0.00 B/s ?unavailable? influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
37175 be/2 vcap 1068.21 K/s 0.00 B/s ?unavailable? influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
37180 be/2 vcap 962.85 K/s 0.00 B/s ?unavailable? influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
37181 be/2 vcap 1994.73 K/s 0.00 B/s ?unavailable? influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
37184 be/2 vcap 581.34 K/s 0.00 B/s ?unavailable? influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
37185 be/2 vcap 1565.99 K/s 0.00 B/s ?unavailable? influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
37189 be/2 vcap 639.48 K/s 0.00 B/s ?unavailable? influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
37201 be/2 vcap 817.51 K/s 0.00 B/s ?unavailable? influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
37202 be/2 vcap 468.71 K/s 0.00 B/s ?unavailable? influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
37204 be/2 vcap 2.26 M/s 0.00 B/s ?unavailable? influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
37213 be/2 vcap 904.71 K/s 0.00 B/s ?unavailable? influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
37216 be/2 vcap 1144.52 K/s 0.00 B/s ?unavailable? influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
38844 be/2 vcap 839.31 K/s 0.00 B/s ?unavailable? influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
38848 be/2 vcap 941.05 K/s 0.00 B/s ?unavailable? influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
38851 be/2 vcap 2.39 M/s 0.00 B/s ?unavailable? influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
38853 be/2 vcap 2.25 M/s 0.00 B/s ?unavailable? influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
38854 be/2 vcap 2.30 M/s 0.00 B/s ?unavailable? influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
38855 be/2 vcap 1758.56 K/s 0.00 B/s ?unavailable? influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
38859 be/2 vcap 109.00 K/s 0.00 B/s ?unavailable? influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
38861 be/2 vcap 2.00 M/s 0.00 B/s ?unavailable? influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
38862 be/2 vcap 2.02 M/s 0.00 B/s ?unavailable? influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
38863 be/2 vcap 334.27 K/s 0.00 B/s ?unavailable? influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
38864 be/2 vcap 2.02 M/s 0.00 B/s ?unavailable? influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
38865 be/2 vcap 2.42 M/s 0.00 B/s ?unavailable? influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
Running sar:
sar -d 10 6
Linux 6.2.0-39-generic (ac2f95dd-14d9-4eed-8e2f-060615e24dce) 03/24/2024 _x86_64_ (32 CPU)
06:45:57 AM DEV tps rkB/s wkB/s dkB/s areq-sz aqu-sz await %util
06:46:07 AM nvme1n1 0.30 12.80 1.60 0.00 48.00 0.00 1.33 0.12
06:46:07 AM nvme0n1 0.30 0.00 3.20 0.00 10.67 0.00 1.00 0.12
06:46:07 AM nvme2n1 3420.80 67438.40 3687.20 0.00 20.79 106.47 31.13 100.00
06:46:07 AM DEV tps rkB/s wkB/s dkB/s areq-sz aqu-sz await %util
06:46:17 AM nvme1n1 1.00 0.00 9.20 0.00 9.20 0.00 0.90 0.16
06:46:17 AM nvme0n1 0.90 16.00 9.60 0.00 28.44 0.00 0.67 0.20
06:46:17 AM nvme2n1 3404.80 68434.40 7868.00 0.00 22.41 102.23 30.03 100.00
06:46:17 AM DEV tps rkB/s wkB/s dkB/s areq-sz aqu-sz await %util
06:46:27 AM nvme1n1 9.70 26.40 20.40 0.00 4.82 0.02 1.69 1.24
06:46:27 AM nvme0n1 0.30 0.00 4.40 0.00 14.67 0.00 0.67 0.08
06:46:27 AM nvme2n1 3215.40 46037.20 12006.40 0.00 18.05 66.12 20.56 100.00
^C
Average: DEV tps rkB/s wkB/s dkB/s areq-sz aqu-sz await %util
Average: nvme1n1 3.67 13.07 10.40 0.00 6.40 0.01 1.61 0.51
Average: nvme0n1 0.50 5.33 5.73 0.00 22.13 0.00 0.73 0.13
Average: nvme2n1 3347.00 60636.67 7853.87 0.00 20.46 91.61 27.37 100.00
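As a sanity check on the sar averages for nvme2n1, the request size and queue depth can be recomputed from the other columns. This is only a sketch of the arithmetic, using values copied from the "Average:" row above; the variable names are mine.

```python
# Averages for nvme2n1 copied from the sar output above.
tps = 3347.00        # I/O requests per second
rkb_s = 60636.67     # kB read per second
wkb_s = 7853.87      # kB written per second
await_ms = 27.37     # average time per request, in milliseconds

# Average request size in kB: total throughput divided by request rate.
avg_request_kb = (rkb_s + wkb_s) / tps

# Little's law: requests in flight = request rate * time spent per request.
avg_queue_depth = tps * (await_ms / 1000.0)

# Share of the transferred data that is reads.
read_share = rkb_s / (rkb_s + wkb_s)

print(f"average request size: {avg_request_kb:.2f} kB")
print(f"estimated queue depth: {avg_queue_depth:.2f}")
print(f"read share of throughput: {read_share:.1%}")
```

The results (about 20.46 kB per request and about 91.61 requests in flight) agree with the areq-sz and aqu-sz columns, so the disk is serving roughly 3300 small requests per second, close to 90% of them by volume being reads, with a deep queue.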
Please note: nvme2n1 is NOT the swap disk. It is the disk where the InfluxDB data is stored.
Summary: low CPU usage, high disk IO, and only about half of memory in use (70Gi of 123Gi), yet swap is full.
- How do we debug the cause of such high disk activity?
- How do we identify what is causing the high swap usage even when memory is available?
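For the second question, one starting point might be to check which processes actually hold the swapped pages. The following is a minimal sketch, assuming a Linux /proc where each /proc/<pid>/status file carries a VmSwap line; the helper names `vm_swap_kb` and `swap_by_process` are hypothetical, and it has not been run on this VM.

```python
import os


def vm_swap_kb(status_text: str) -> int:
    """Return the VmSwap value (kB) from /proc/<pid>/status text, or 0 if absent."""
    for line in status_text.splitlines():
        if line.startswith("VmSwap:"):
            return int(line.split()[1])
    return 0  # kernel threads have no VmSwap line


def swap_by_process(proc_root: str = "/proc") -> list[tuple[int, int, str]]:
    """List (swap_kb, pid, name) for every readable process, largest first."""
    rows = []
    if not os.path.isdir(proc_root):
        return rows
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a PID directory
        path = os.path.join(proc_root, entry, "status")
        try:
            with open(path, errors="replace") as fh:
                text = fh.read()
        except OSError:
            continue  # process exited or is not readable
        # The first line of the status file is "Name:\t<command name>".
        name = text.split("\n", 1)[0].partition(":")[2].strip()
        rows.append((vm_swap_kb(text), int(entry), name))
    return sorted(rows, reverse=True)


if __name__ == "__main__":
    for swap_kb, pid, name in swap_by_process()[:10]:
        print(f"{swap_kb:>12} kB  pid={pid:<8} {name}")
```

Run as root so that every process is readable. If influxd accounts for nearly all of the 9Gi, the swap is holding its anonymous memory rather than anything belonging to other services.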
Environment details:
InfluxDB version: 1.8.10
CPU count: 32
Memory: 128 GB
Disk: 1 TB
AWS VM type: m6a.8xlarge (32 vCPU, 128 GB memory)