Hi everyone,
We are running into an issue with InfluxDB 2.7.4 on an Ubuntu Linux VM.
After an inconsistent amount of time, delete operations start to time out and CPU usage gradually increases.
We access InfluxDB from a C# application, and there are no intermediate network proxies between the application and InfluxDB.
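For context, here is a minimal sketch of the kind of delete call our application makes (placeholder URL, token, org, bucket, and predicate; the real code differs, but it ultimately hits the same POST /api/v2/delete endpoint shown in the metrics below, and the timeout surfaces on the client side of this call):

```csharp
using System;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

class DeleteExample
{
    static async Task Main()
    {
        // Placeholder endpoint and credentials, for illustration only.
        using var http = new HttpClient
        {
            BaseAddress = new Uri("http://localhost:8086"),
            Timeout = TimeSpan.FromSeconds(30) // the timeouts we see surface here, as TaskCanceledException
        };
        http.DefaultRequestHeaders.Add("Authorization", "Token my-token");

        // Delete a day of data for one measurement (placeholder predicate).
        var body = new StringContent(
            "{\"start\":\"2024-01-01T00:00:00Z\"," +
            "\"stop\":\"2024-01-02T00:00:00Z\"," +
            "\"predicate\":\"_measurement=\\\"my_measurement\\\"\"}",
            Encoding.UTF8, "application/json");

        // InfluxDB answers 204 No Content on success, matching the histogram below.
        var response = await http.PostAsync("/api/v2/delete?org=my-org&bucket=my-bucket", body);
        Console.WriteLine((int)response.StatusCode);
    }
}
```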
Investigating the logs and metrics shows all recorded delete operations finishing in a relatively short time, with no obvious errors.
http_api_request_duration_seconds_bucket{handler="platform",method="POST",path="/api/v2/delete",response_code="204",status="2XX",user_agent="unknown",le="0.025"} 105993
http_api_request_duration_seconds_bucket{handler="platform",method="POST",path="/api/v2/delete",response_code="204",status="2XX",user_agent="unknown",le="0.05"} 117921
http_api_request_duration_seconds_bucket{handler="platform",method="POST",path="/api/v2/delete",response_code="204",status="2XX",user_agent="unknown",le="0.1"} 120840
http_api_request_duration_seconds_bucket{handler="platform",method="POST",path="/api/v2/delete",response_code="204",status="2XX",user_agent="unknown",le="0.25"} 122566
http_api_request_duration_seconds_bucket{handler="platform",method="POST",path="/api/v2/delete",response_code="204",status="2XX",user_agent="unknown",le="0.5"} 122990
http_api_request_duration_seconds_bucket{handler="platform",method="POST",path="/api/v2/delete",response_code="204",status="2XX",user_agent="unknown",le="1"} 123320
http_api_request_duration_seconds_bucket{handler="platform",method="POST",path="/api/v2/delete",response_code="204",status="2XX",user_agent="unknown",le="2.5"} 123394
http_api_request_duration_seconds_bucket{handler="platform",method="POST",path="/api/v2/delete",response_code="204",status="2XX",user_agent="unknown",le="5"} 123396
http_api_request_duration_seconds_bucket{handler="platform",method="POST",path="/api/v2/delete",response_code="204",status="2XX",user_agent="unknown",le="10"} 123396
http_api_request_duration_seconds_bucket{handler="platform",method="POST",path="/api/v2/delete",response_code="204",status="2XX",user_agent="unknown",le="+Inf"} 123396
Since these metrics are only written after the response has been sent, and the histogram above shows that none of the 123,396 recorded delete requests took longer than 5 seconds, this leads me to believe the timeouts are delete requests that get stuck somewhere inside InfluxDB and never make it back to the HTTP layer, and that CPU usage climbs because something internal is babysitting a growing number of stuck requests/goroutines.
Our only workaround right now is to restart InfluxDB periodically, but this doesn't mitigate the problem entirely.
Has anyone seen similar behavior, and does anyone have a suggestion for a fix or workaround?
Thanks,
-Steven