Greetings
I have an InfluxDB cluster (v1.8.0) running in our self-hosted Kubernetes cluster. It works fine in general, but I am seeing a strange issue with its CPU usage: the pods intermittently jump to very high CPU usage, which stays high for a while and then comes back down on its own.
```
date;kubectl -n influxdb top pod --use-protocol-buffers | grep -E 'user-0|user-1'
Wed Sep 14 11:25:01 UTC 2022
user-0 8077m 30128Mi
user-1 7607m 18279Mi

date;kubectl -n influxdb top pod --use-protocol-buffers | grep -E 'user-0|user-1'
Wed Sep 14 11:29:31 UTC 2022
user-0 13012m 30128Mi
user-1 18310m 18278Mi
```
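To see how long each spike lasts, a minimal polling loop along these lines can capture the same samples over time (the namespace and pod-name filter are taken from the commands above; the 30-second interval and the log file name are arbitrary choices):

```
# Sketch: poll pod CPU/memory every 30 seconds and append to a log file.
# Namespace "influxdb" and the pod filter come from the commands above;
# the interval and the log file name are just examples.
while true; do
  date
  kubectl -n influxdb top pod --use-protocol-buffers | grep -E 'user-0|user-1'
  echo
  sleep 30
done >> influxdb-cpu-samples.log
```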
I am relatively new to InfluxDB, so I would appreciate it if someone could share pointers on fixing this. Thanks!
Anaisdg
September 20, 2022, 7:36pm
Hello @prasadkris ,
Welcome!
I’m not sure, but it looks like it could be related to:
GitHub issue, opened 10:05 AM, 10 Sep 2020 UTC:
Experienced performance issues with InfluxDB after upgrading from InfluxDB v1.8.0 to v1.8.1 or v1.8.2. Currently the problem is temporarily handled by downgrading back to 1.8.0.
__Steps to reproduce:__
In this environment, nothing more is needed than:
1. Upgrade the InfluxDB version to 1.8.1 or 1.8.2 (a package-manager sketch follows below)
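For reference, on a yum-managed CentOS 7 host like the one described under Environment info below, the upgrade and the temporary downgrade would look roughly like this; the exact package versions available in the repository are an assumption, not something verified from the report:

```
# Sketch only: assumes the influxdb package is installed from the InfluxData
# yum repository and that the 1.8.0 package is still available there.
sudo yum install influxdb-1.8.1      # reproduce: upgrade to 1.8.1 (or 1.8.2)
sudo systemctl restart influxdb

sudo yum downgrade influxdb-1.8.0    # temporary workaround: go back to 1.8.0
sudo systemctl restart influxdb
```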
__Expected behavior:__
InfluxDB updates to a newer version and continues to work without issues.
__Actual behavior:__
CPU utilization spiking to ~100%; some databases are being queried successfully, whereas at least one of the largest databases (~65 GB) is returning HTTP POST 500 / timeout.
Other metrics, such as memory & disk metrics are not affected drastically.
Screenshots from Grafana (not reproduced here) visualize the CPU utilization after upgrading to v1.8.1.

__Environment info:__
* System info: Linux 3.10.0-1127.18.2.el7.x86_64 x86_64
* InfluxDB version: InfluxDB v1.8.0 (git: 1.8 781490d)
* Host VM specs:
CentOS 7
8 vCPUs
52 GB RAM
500 GB SSD disk
* Other environment details:
No other heavy workloads running on the servers other than InfluxDB. Grafana front-end.
~65 databases
~150 GB of data (~85% raw data, ~15% downsampled)
Config settings are at defaults, apart from some directory settings and TSI indexing being turned on (see the sketch after this list)
WAL and data directories are located on the same storage device
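For illustration, the relevant settings might look like the excerpt below. The file path and directory values are assumptions for a stock 1.x package install, not the reporter's actual configuration; `index-version = "tsi1"` is how TSI indexing is enabled in 1.x:

```
# Hypothetical excerpt of /etc/influxdb/influxdb.conf (path assumed for a
# package install); the directory values are placeholders:
#
#   [data]
#     dir = "/var/lib/influxdb/data"
#     wal-dir = "/var/lib/influxdb/wal"   # same device as the data dir, per the notes above
#     index-version = "tsi1"              # TSI indexing turned on
#
# Quick check that the config on disk has TSI enabled:
grep -E '^\s*(index-version|dir|wal-dir)\s*=' /etc/influxdb/influxdb.conf
```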
__Logs:__
Example error line from journalctl:
```
Sep 10 10:01:04 influxdb02 influxd[22667]: ts=2020-09-10T07:01:04.768040Z lvl=error msg="[500] - \"timeout\"" log_id=0P9h7VtW000 service=httpd
```
Example error line from HTTP access log:
```
x.x.x.x - telegraf [10/Sep/2020:10:01:00 +0300] "POST /write?db=all_operated HTTP/1.1" 500 20 "-" "Go-http-client/1.1" 62720985-f333-11ea-ac27-42010ae8030c 10609871
```
__Other notes:__
CPU load is also constantly high; these issues are most likely linked.

There were memory issues caused by CQs too, but disabling the CQs on the largest database resolved them (see the sketch after the CQ listing below). CPU utilization and load were not affected by this.
Retention policies in use:
```
> show retention policies
name    duration  shardGroupDuration replicaN default
----    --------  ------------------ -------- -------
autogen 0s        168h0m0s           1        false
raw     336h0m0s  24h0m0s            1        true
agg     9600h0m0s 168h0m0s           1        false
```
Data is downsampled from "raw" to "agg" RP with continuous queries.
```
> show continuous queries
name: <database>
name         query
----         -----
cq_aggregate CREATE CONTINUOUS QUERY cq_aggregate ON <database> BEGIN SELECT mean(*) INTO <database>.agg.:MEASUREMENT FROM <database>.raw./.*/ GROUP BY time(5m), * END
```
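For reference, "disabling" a CQ in 1.x means dropping it (and recreating it later if needed). A sketch using the `influx` CLI against the `cq_aggregate` query shown above, with `<database>` as a placeholder for the real database name:

```
# Sketch: drop the CQ to stop its periodic aggregation work; <database> is a
# placeholder for the actual database name.
influx -execute 'DROP CONTINUOUS QUERY cq_aggregate ON "<database>"'

# To re-enable it later, recreate it with the same definition listed above.
```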
Are you a new user? Why have you decided to use 1.x instead of 2.x?
Thank you!