High swap usage on InfluxDB

We have an InfluxDB VM with constantly high swap usage. Even after the VM is restarted, swap usage climbs back to 100% within 20 minutes.

Checking memory usage with free -h (swap is at 100% despite 59 GiB of memory being available):

               total        used        free      shared  buff/cache   available
Mem:           123Gi        70Gi       567Mi       551Mi        52Gi        59Gi
Swap:            9Gi         9Gi          0B

Looking at PSI (pressure stall information), it seems most processes are waiting on I/O:

cat /proc/pressure/io
some avg10=84.83 avg60=78.83 avg300=78.96 total=70337558807
full avg10=84.38 avg60=78.05 avg300=78.08 total=69619870053

Memory:

cat /proc/pressure/memory
some avg10=32.65 avg60=32.74 avg300=31.25 total=35534063966
full avg10=32.25 avg60=32.34 avg300=30.87 total=35182532561

The swappiness value is 60. If we set it to 0, the VM becomes unresponsive after a while.
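For reference, swappiness can be inspected and adjusted at runtime; rather than 0 (which, as noted above, can stall the VM under memory pressure), a moderate value such as 10 is often tried first. This is a sketch; the drop-in file name is an assumption:

```shell
# Inspect the current value (60 is the usual distro default)
cat /proc/sys/vm/swappiness

# A moderate value keeps some swap headroom without the aggressive
# swapping seen at 60; 0 disables proactive swap and can stall the VM.
# (Mutating commands left commented; run them deliberately.)
# sudo sysctl vm.swappiness=10
# echo 'vm.swappiness = 10' | sudo tee /etc/sysctl.d/99-swappiness.conf
```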

If we run atop, we see the following in red:

SWP |  tot    10.0G |               |  free    0.0M |  swcac 505.9M

DSK |       nvme2n1 |  busy    100% |  read   33115 |  write    527 |  discrd     0 |  KiB/r     19 |  KiB/w    173  |               | KiB/d      0  | MBr/s   63.3  | MBw/s    8.9  | avq    88.19  | avio 0.30 ms

This clearly shows the disk is 100% busy (mostly reads, judging by the MBr/s column).
If I log in to the influx CLI and run the following commands:
Series Cardinality

> show series cardinality
cardinality estimation
----------------------
252390866

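A series cardinality of ~252 million is far beyond what InfluxDB 1.x typically handles comfortably: its index and TSM lookups scale with series count, which would match the read-heavy disk pattern above. The per-measurement and per-tag breakdowns below can help locate a runaway tag; the measurement and tag names ("cpu", "host") are placeholders:

```sql
-- Which measurement holds most of the series?
SHOW MEASUREMENT CARDINALITY ON metrics
-- Per-tag breakdown for a suspect measurement
-- ("cpu" and "host" are placeholder names):
SHOW TAG VALUES CARDINALITY ON metrics FROM cpu WITH KEY = host
```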
Show Queries:

> show queries
qid query        database duration status
--- -----        -------- -------- ------
265 SHOW QUERIES metrics  53µs     running

Swap usage stays high even when no queries are running.
Looking at the influx logs, most of the writes are failing with timeouts:

 metrics_user [24/Mar/2024:06:34:33 +0000] "POST /write?db=metrics&precision=n&consistency=one HTTP/1.1 " 500 20 "-" "okhttp/4.11.0" 9370eb83-e9a8-11ee-b07c-06e3b073e7a7 10494393
ts=2024-03-24T06:34:43.743687Z lvl=error msg="[500] - \"timeout\"" log_id=0o6lIAql000 service=httpd

ts=2024-03-24T06:35:57.088936Z lvl=info msg="Snapshot for path written" log_id=0o6lIAql000 engine=tsm1 trace_id=0o7n_W00000 op_name=tsm1_cache_snapshot path=/var/vcap/store/influxdb/data/metrics/metrics_default/2803 duration=36128.915ms
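The 36-second cache snapshot above suggests the storage engine is struggling to flush its write cache under I/O contention. These are the [data] knobs in the 1.8 influxdb.conf that are usually examined in this situation; the values shown are the 1.8 defaults, for illustration only, not recommendations:

```toml
[data]
  cache-max-memory-size = "1g"         # writes are rejected above this
  cache-snapshot-memory-size = "25m"   # cache size that triggers a snapshot
  cache-snapshot-write-cold-duration = "10m"
  max-series-per-database = 1000000    # 0 disables the limit
```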

Running iotop, it is clear that the disk activity comes from influxd:

 4272 be/3 root        0.00 B/s   94.47 K/s  ?unavailable?  [jbd2/nvme2n1p1-8]
  36921 be/2 vcap     1169.95 K/s    0.00 B/s  ?unavailable?  influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
  36927 be/2 vcap      323.37 K/s    0.00 B/s  ?unavailable?  influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
  36928 be/2 vcap     2038.33 K/s    0.00 B/s  ?unavailable?  influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
  36941 be/2 vcap     1936.59 K/s    0.00 B/s  ?unavailable?  influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
  37020 be/2 vcap      385.14 K/s    0.00 B/s  ?unavailable?  influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
  37031 be/2 vcap        2.29 M/s    0.00 B/s  ?unavailable?  influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
  37132 be/2 vcap     1758.56 K/s    0.00 B/s  ?unavailable?  influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
  37135 be/2 vcap        2.38 M/s    0.00 B/s  ?unavailable?  influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
  37136 be/2 vcap        2.49 M/s    0.00 B/s  ?unavailable?  influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
  37137 be/2 vcap     1140.88 K/s    0.00 B/s  ?unavailable?  influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
  37157 be/2 vcap        2.28 M/s    0.00 B/s  ?unavailable?  influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
  37159 be/2 vcap      911.98 K/s    0.00 B/s  ?unavailable?  influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
  37162 be/2 vcap     1714.96 K/s    0.00 B/s  ?unavailable?  influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
  37163 be/2 vcap     1762.19 K/s    0.00 B/s  ?unavailable?  influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
  37168 be/2 vcap      897.45 K/s    0.00 B/s  ?unavailable?  influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
  37169 be/2 vcap     1137.25 K/s    0.00 B/s  ?unavailable?  influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
  37172 be/2 vcap     1758.56 K/s    0.00 B/s  ?unavailable?  influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
  37173 be/2 vcap     1718.59 K/s    0.00 B/s  ?unavailable?  influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
  37175 be/2 vcap     1068.21 K/s    0.00 B/s  ?unavailable?  influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
  37180 be/2 vcap      962.85 K/s    0.00 B/s  ?unavailable?  influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
  37181 be/2 vcap     1994.73 K/s    0.00 B/s  ?unavailable?  influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
  37184 be/2 vcap      581.34 K/s    0.00 B/s  ?unavailable?  influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
  37185 be/2 vcap     1565.99 K/s    0.00 B/s  ?unavailable?  influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
  37189 be/2 vcap      639.48 K/s    0.00 B/s  ?unavailable?  influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
  37201 be/2 vcap      817.51 K/s    0.00 B/s  ?unavailable?  influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
  37202 be/2 vcap      468.71 K/s    0.00 B/s  ?unavailable?  influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
  37204 be/2 vcap        2.26 M/s    0.00 B/s  ?unavailable?  influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
  37213 be/2 vcap      904.71 K/s    0.00 B/s  ?unavailable?  influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
  37216 be/2 vcap     1144.52 K/s    0.00 B/s  ?unavailable?  influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
  38844 be/2 vcap      839.31 K/s    0.00 B/s  ?unavailable?  influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
  38848 be/2 vcap      941.05 K/s    0.00 B/s  ?unavailable?  influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
  38851 be/2 vcap        2.39 M/s    0.00 B/s  ?unavailable?  influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
  38853 be/2 vcap        2.25 M/s    0.00 B/s  ?unavailable?  influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
  38854 be/2 vcap        2.30 M/s    0.00 B/s  ?unavailable?  influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
  38855 be/2 vcap     1758.56 K/s    0.00 B/s  ?unavailable?  influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
  38859 be/2 vcap      109.00 K/s    0.00 B/s  ?unavailable?  influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
  38861 be/2 vcap        2.00 M/s    0.00 B/s  ?unavailable?  influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
  38862 be/2 vcap        2.02 M/s    0.00 B/s  ?unavailable?  influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
  38863 be/2 vcap      334.27 K/s    0.00 B/s  ?unavailable?  influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
  38864 be/2 vcap        2.02 M/s    0.00 B/s  ?unavailable?  influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
  38865 be/2 vcap        2.42 M/s    0.00 B/s  ?unavailable?  influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid

Running sar:

sar -d 10 6
Linux 6.2.0-39-generic (ac2f95dd-14d9-4eed-8e2f-060615e24dce)   03/24/2024      _x86_64_        (32 CPU)

06:45:57 AM       DEV       tps     rkB/s     wkB/s     dkB/s   areq-sz    aqu-sz     await     %util
06:46:07 AM   nvme1n1      0.30     12.80      1.60      0.00     48.00      0.00      1.33      0.12
06:46:07 AM   nvme0n1      0.30      0.00      3.20      0.00     10.67      0.00      1.00      0.12
06:46:07 AM   nvme2n1   3420.80  67438.40   3687.20      0.00     20.79    106.47     31.13    100.00

06:46:07 AM       DEV       tps     rkB/s     wkB/s     dkB/s   areq-sz    aqu-sz     await     %util
06:46:17 AM   nvme1n1      1.00      0.00      9.20      0.00      9.20      0.00      0.90      0.16
06:46:17 AM   nvme0n1      0.90     16.00      9.60      0.00     28.44      0.00      0.67      0.20
06:46:17 AM   nvme2n1   3404.80  68434.40   7868.00      0.00     22.41    102.23     30.03    100.00

06:46:17 AM       DEV       tps     rkB/s     wkB/s     dkB/s   areq-sz    aqu-sz     await     %util
06:46:27 AM   nvme1n1      9.70     26.40     20.40      0.00      4.82      0.02      1.69      1.24
06:46:27 AM   nvme0n1      0.30      0.00      4.40      0.00     14.67      0.00      0.67      0.08
06:46:27 AM   nvme2n1   3215.40  46037.20  12006.40      0.00     18.05     66.12     20.56    100.00
^C

Average:          DEV       tps     rkB/s     wkB/s     dkB/s   areq-sz    aqu-sz     await     %util
Average:      nvme1n1      3.67     13.07     10.40      0.00      6.40      0.01      1.61      0.51
Average:      nvme0n1      0.50      5.33      5.73      0.00     22.13      0.00      0.73      0.13
Average:      nvme2n1   3347.00  60636.67   7853.87      0.00     20.46     91.61     27.37    100.00

Please Note: nvme2n1 is NOT the swap disk. It is the disk where influxdb data is stored.
Summary: low CPU usage, high disk I/O, and only about half the memory consumed, yet swap usage is at 100%.
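One way to separate swap traffic from InfluxDB's data reads (given that nvme2n1 is not the swap disk) is to sample the kernel's cumulative swap counters directly; the delta between two samples gives the current swap-in/out rate without extra tooling:

```shell
# pswpin/pswpout are cumulative page counts swapped in/out since boot;
# the difference between two samples is the current swap rate.
grep -E '^pswp(in|out) ' /proc/vmstat
sleep 2
grep -E '^pswp(in|out) ' /proc/vmstat
```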

  1. How do we debug why the disk activity is so high?
  2. How do we identify what causes the high swap usage even when memory is available?
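For question 2, a rough sketch for finding which processes actually own the swapped pages, reading VmSwap from /proc so it works even without a swap-aware tool like smem installed:

```shell
# Print per-process swap usage, largest first (kB, command, pid).
for d in /proc/[0-9]*; do
  swap=$(awk '/^VmSwap:/ {print $2}' "$d/status" 2>/dev/null)
  if [ -n "$swap" ] && [ "$swap" -gt 0 ]; then
    printf '%s kB\t%s\t%s\n' "$swap" "$(cat "$d/comm")" "${d#/proc/}"
  fi
done | sort -rn | head -20
```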

Following are the details:
InfluxDB version: 1.8.10
CPU count: 32
Memory: 128 GB
Disk: 1 TB
AWS VM type: m6a.8xlarge (32 vCPU, 128 GB memory)

Hello @vipinvkmenon,
I don’t have a lot of knowledge on how to debug this, but the following resources could be helpful:

You could also create an issue on GitHub… but unfortunately we’re on version 3.x, so it’s unlikely our engineers have much bandwidth there. Outside of the docs, I would search for similar issues on the forums here and reach out to other users.

Hello @Anaisdg,

Thanks for the resources.
I had a similar problem to @vipinvkmenon’s, but with memory on the VM where Telegraf and Grafana are hosted. I say “had” because, at the time I’m writing this, the memory usage seems to be stable.
But 1 or 2 weeks ago, the RAM just kept growing and growing. I don’t really know how the problem was solved, but I rearranged my queries and rebooted my VM…
Now both are consuming around 6 GB of memory for the 30–40 devices monitored.

So I would tell @vipinvkmenon: try to improve your queries… :face_with_diagonal_mouth:

Regards

Thanks, and sorry I couldn’t be more helpful.

1 Like