I am running InfluxDB v2.1.1 on a Windows Server 2019 platform in a virtual environment. Everything worked great for a few weeks while I was testing with 10 Windows servers, then I started adding more. A few days after bringing the count up to 150 servers, the web UI stopped responding. The InfluxDB server has 6 CPUs and had 16 GB of RAM. I noticed it was running at 100% memory, so I increased the RAM to 48 GB and rebooted to make sure the application could use the added memory. When the server came back up, I still could not reach the web UI, and Grafana can't retrieve any data from the server. The only modifications I made to the .yml file were to bolt-path, engine-path, and http-bind-address. If I stop the service and run InfluxDB from PowerShell, I can see that the server is receiving data.
I have only been working with InfluxDB for a few months, so my knowledge is still very limited. Any help finding out why the web UI is not accessible is greatly appreciated.
I got the web UI to come back up for about 30 minutes after increasing the CPU count from 6 to 8 and the RAM from 48 GB to 64 GB. With InfluxDB stopped, the server runs at 2-4% CPU and 3% memory utilization. When I start InfluxDB, it takes less than 4 minutes to consume all the RAM, and then the CPU pegs at 100%. There has to be a way to limit InfluxDB's resource consumption so the web UI stays stable.
Below are the contents of the config.yml file.
assets-path: ""
bolt-path: E:\InfluxDB\influxd.bolt
e2e-testing: false
engine-path: E:\InfluxDB
feature-flags: {}
flux-log-enabled: false
http-bind-address: :8086
http-idle-timeout: 3m0s
http-read-header-timeout: 10s
http-read-timeout: 0s
http-write-timeout: 0s
influxql-max-select-buckets: 0
influxql-max-select-point: 0
influxql-max-select-series: 0
key-name: ""
log-level: info
metrics-disabled: false
nats-max-payload-bytes: 1048576
nats-port: -1
no-tasks: false
pprof-disabled: false
query-concurrency: 1024
query-initial-memory-bytes: 0
query-max-memory-bytes: 0
query-memory-bytes: 9223372036854775807
query-queue-size: 1024
reporting-disabled: false
secret-store: bolt
session-length: 60
session-renew-disabled: false
sqlite-path: ""
storage-cache-max-memory-size: 1073741824
storage-cache-snapshot-memory-size: 26214400
storage-cache-snapshot-write-cold-duration: 10m0s
storage-compact-full-write-cold-duration: 4h0m0s
storage-compact-throughput-burst: 50331648
storage-max-concurrent-compactions: 0
storage-max-index-log-file-size: 1048576
storage-no-validate-field-size: false
storage-retention-check-interval: 30m0s
storage-series-file-max-concurrent-snapshot-compactions: 0
storage-series-id-set-cache-size: 0
storage-shard-precreator-advance-period: 30m0s
storage-shard-precreator-check-interval: 10m0s
storage-tsm-use-madv-willneed: false
storage-validate-keys: false
storage-wal-fsync-delay: 0s
storage-wal-max-concurrent-writes: 0
storage-wal-max-write-delay: 10m0s
storage-write-timeout: 10s
store: disk
testing-always-allow-setup: false
tls-cert: ""
tls-key: ""
tls-min-version: "1.2"
tls-strict-ciphers: false
tracing-type: ""
ui-disabled: false
vault-addr: ""
vault-cacert: ""
vault-capath: ""
vault-client-cert: ""
vault-client-key: ""
vault-client-timeout: 0s
vault-max-retries: 0
vault-skip-verify: false
vault-tls-server-name: ""
vault-token: ""
Any help or suggestions on optimizing the config file so the server can come back down to a reasonable size (4 CPUs and 32 GB of RAM) would be appreciated.
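From reading the documentation, I am guessing the query-concurrency, query-queue-size, and query memory settings are the ones to tune, along with storage-cache-max-memory-size. Something along these lines is what I am considering trying; the values below are only illustrative guesses on my part, not tested recommendations:

query-concurrency: 32                        # down from 1024; caps how many queries run at once
query-queue-size: 256                        # queries beyond the concurrency limit wait here
query-initial-memory-bytes: 10485760         # ~10 MiB reserved per query at start (guess)
query-memory-bytes: 536870912                # ~512 MiB ceiling for a single query (guess)
query-max-memory-bytes: 8589934592           # ~8 GiB ceiling across all concurrent queries (guess)
storage-cache-max-memory-size: 1073741824    # keep the 1 GiB default write cache

I have not confirmed whether these are sensible numbers for my workload, so corrections are welcome.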
Look at the cardinality of your data. I found a tag in mine that often had random contents, which created an effectively unbounded number of series. Keeping that index in memory is likely what is exhausting your RAM and causing the swapping.
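You can get a quick read on per-bucket series cardinality with the influxdb.cardinality() Flux function. For example (the bucket name and time range here are placeholders, adjust for your setup):

import "influxdata/influxdb"

// Rough series count for one bucket over the last 30 days; a very large
// or fast-growing number here usually points at a runaway tag.
influxdb.cardinality(bucket: "example-bucket", start: -30d)

If the count is in the millions and climbing, look for tags whose values are unique per write (IDs, timestamps, random strings) and move them to fields instead.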