Delete Timeouts on Influx 2.7.4

Hi everyone,

We are running into an issue with InfluxDB 2.7.4 running on an Ubuntu Linux VM.
After an inconsistent amount of time, delete operations start to time out and CPU usage gradually increases.

We are accessing Influx from a C# application and there are no intermediate network proxies between the application and influx.

Investigating the logs and metrics shows all recorded delete operations finishing in a relatively short time, with no obvious errors.

http_api_request_duration_seconds_bucket{handler="platform",method="POST",path="/api/v2/delete",response_code="204",status="2XX",user_agent="unknown",le="0.025"} 105993
http_api_request_duration_seconds_bucket{handler="platform",method="POST",path="/api/v2/delete",response_code="204",status="2XX",user_agent="unknown",le="0.05"} 117921
http_api_request_duration_seconds_bucket{handler="platform",method="POST",path="/api/v2/delete",response_code="204",status="2XX",user_agent="unknown",le="0.1"} 120840
http_api_request_duration_seconds_bucket{handler="platform",method="POST",path="/api/v2/delete",response_code="204",status="2XX",user_agent="unknown",le="0.25"} 122566
http_api_request_duration_seconds_bucket{handler="platform",method="POST",path="/api/v2/delete",response_code="204",status="2XX",user_agent="unknown",le="0.5"} 122990
http_api_request_duration_seconds_bucket{handler="platform",method="POST",path="/api/v2/delete",response_code="204",status="2XX",user_agent="unknown",le="1"} 123320
http_api_request_duration_seconds_bucket{handler="platform",method="POST",path="/api/v2/delete",response_code="204",status="2XX",user_agent="unknown",le="2.5"} 123394
http_api_request_duration_seconds_bucket{handler="platform",method="POST",path="/api/v2/delete",response_code="204",status="2XX",user_agent="unknown",le="5"} 123396
http_api_request_duration_seconds_bucket{handler="platform",method="POST",path="/api/v2/delete",response_code="204",status="2XX",user_agent="unknown",le="10"} 123396
http_api_request_duration_seconds_bucket{handler="platform",method="POST",path="/api/v2/delete",response_code="204",status="2XX",user_agent="unknown",le="+Inf"} 123396
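
To make that concrete: the buckets are cumulative, so differencing adjacent buckets gives the per-latency-range counts. A quick sketch using the numbers above:

# Cumulative histogram counts for /api/v2/delete, copied from the metrics above.
cumulative = {
    0.025: 105993, 0.05: 117921, 0.1: 120840, 0.25: 122566,
    0.5: 122990, 1: 123320, 2.5: 123394, 5: 123396,
    10: 123396, float("inf"): 123396,
}

prev = 0
for le, count in cumulative.items():
    print(f"<= {le}s: {count - prev} requests")  # requests that landed in this bucket
    prev = count

# Only 76 of the 123,396 recorded deletes took longer than 1 s, and none took longer than 5 s.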

Since these metrics are only recorded after the response to the web request is written, it leads me to believe the timeouts are delete requests that somehow get stuck internally in Influx and never make it back to the web layer, and that CPU usage climbs because something inside the system is babysitting more and more stuck requests / goroutines.
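
I haven’t verified this yet, but one way I plan to check the theory is to watch the standard Go go_goroutines gauge that the /metrics endpoint already exposes and see whether it climbs as deletes start timing out. A rough sketch (the URL is a placeholder for our instance):

import time
import urllib.request

# Placeholder URL; adjust host/port, and add an "Authorization: Token <token>"
# header if your instance requires auth on /metrics.
METRICS_URL = "http://localhost:8086/metrics"

def goroutine_count():
    # /metrics is Prometheus text format; go_goroutines is the standard Go runtime gauge.
    with urllib.request.urlopen(METRICS_URL) as resp:
        for line in resp.read().decode().splitlines():
            if line.startswith("go_goroutines "):
                return int(float(line.split()[1]))
    return None

while True:
    print(time.strftime("%H:%M:%S"), "go_goroutines =", goroutine_count())
    time.sleep(60)  # a count that climbs and never recovers would support the stuck-goroutine theory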

Our only workaround right now is to restart Influx periodically, but this doesn’t mitigate the problem entirely.

Has anyone seen similar behavior and have a suggestion on a fix or workaround?

Thanks,
-Steven


Hello @Steven_Rychlik,
First I want to apologize for the delay. I was out of office.
Hmmmm, how large are the deletes you are trying to perform? What does the delete query look like? The distribution of data across shards can impact delete efficiency. Is VM throttling a concern? Have you changed any of InfluxDB’s cache sizes or write settings?
I’d start by reviewing the storage engine configuration options, and you could try different TSM cache settings, for example the ones listed below.
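
For reference, the options I usually point people at (names are from the influxd configuration documentation; double-check the defaults for your version) are roughly:

storage-cache-max-memory-size                # max size a shard's in-memory cache can reach before writes are rejected
storage-cache-snapshot-memory-size           # cache size at which the engine snapshots the cache to a TSM file
storage-cache-snapshot-write-cold-duration   # how long a cold cache sits before being snapshotted
storage-compact-full-write-cold-duration     # how long before a full compaction of a shard that has stopped receiving writes/deletes
storage-max-concurrent-compactions           # cap on concurrent compactions (0 = let influxd decide)
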
That being said, InfluxDB 2.x is notorious for problematic deletes, although the concerns I usually hear are about memory consumption rather than CPU.
Are you running other tasks as well?

Thanks for the feedback.

The deletes are usually capped at 24 hours of 1-second data (often smaller).
It is usually something like:
delete
  start = {}
  stop = {}
  predicate = measurement = {} and tag = {a}
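
For completeness, what our C# code sends boils down to a POST to /api/v2/delete; here is a rough Python equivalent with placeholder org/bucket/token/measurement/time-range values (not our real ones):

import requests  # pip install requests

# Placeholders only -- not our real values.
URL = "http://localhost:8086/api/v2/delete"
TOKEN = "<token>"

resp = requests.post(
    URL,
    params={"org": "<org>", "bucket": "<bucket>"},
    headers={"Authorization": f"Token {TOKEN}"},
    json={
        "start": "2024-03-13T00:00:00Z",   # at most 24 hours of 1-second data
        "stop": "2024-03-14T00:00:00Z",
        "predicate": '_measurement="<measurement>" AND <tag>="<value>"',
    },
    timeout=30,  # the client-side timeout is what fires when the server never responds
)
print(resp.status_code)  # 204 on success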

It’s likely that the data for different delete statements lives in different shards, but probably not the data for a single delete. We keep all of our data in one bucket.

We haven’t changed cache sizes.

There are many other Influx queries running during this time, but nothing else is happening on the VM.