InfluxDB 2.1.1 sudden slowdown and write timeouts

I have a home-grown InfluxDB cluster with 3 servers running InfluxDB 1.8.4 and 2 running InfluxDB 2.1.1. In front of all these servers is a custom application that duplicates the incoming write requests so that every server receives exactly the same data.

In the 2 weeks since I added the InfluxDB 2.1.1 servers, one of them has twice suddenly slowed to a crawl and returned {"code":"internal error","message":"unexpected error writing points to database: timeout"} to almost all requests. Once the server slows down, it doesn’t recover on its own. A simple restart “fixes” the issue, and after the restart the server happily processes all the data that couldn’t be written. The whole time, the other 2.1.1 server and all the 1.8.4 ones were running just fine. The only difference between the 2.1.1 server that had this issue and the other one is that the “problematic” one also receives a small number of queries, while the other one receives writes only.

I have enabled debug logging on both servers but unfortunately that didn’t help as everything looks quite normal (to my eyes) and more or less the same between these two 2.1.1 servers. Any suggestions/ideas on how I can investigate this strange slowdown? Thanks.
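Since debug logs look normal, one avenue is to grab Go runtime profiles from the stuck server before restarting it. Below is a minimal sketch, assuming the standard Go pprof HTTP endpoints under /debug/pprof/ are reachable on the InfluxDB 2.x HTTP port (they can be disabled in the configuration, so this may not apply to every setup). The base URL, output file names, and the helper names pprof_url and save_profile are made up for illustration.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def pprof_url(base_url, profile, **params):
    """Build a URL for a Go pprof profile such as 'goroutine' or 'heap'."""
    url = f"{base_url.rstrip('/')}/debug/pprof/{profile}"
    return f"{url}?{urlencode(params)}" if params else url


def save_profile(base_url, profile, out_path, timeout=60, **params):
    """Download one profile and write the raw response body to out_path."""
    with urlopen(pprof_url(base_url, profile, **params), timeout=timeout) as resp:
        body = resp.read()
    with open(out_path, "wb") as out:
        out.write(body)
    return len(body)


# Hypothetical usage against a server on localhost:8086 (an assumption):
#   save_profile("http://localhost:8086", "goroutine", "goroutines.txt", debug=2)
#   save_profile("http://localhost:8086", "heap", "heap.pb.gz")
```

A full goroutine dump (debug=2) taken while writes are timing out would show where goroutines are blocked, and comparing it with a dump from the healthy 2.1.1 server might narrow down the difference.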

Another thing that seems fishy: about 5 hours after the InfluxDB 2.1.1 server started to time out on everything, it hit the max open files limit.

ts=2022-03-04T07:27:58.157183Z lvl=info msg="http: Accept error: accept tcp accept4: too many open files; retrying in 1s" log_id=0Z~pVheG000 service=http
ts=2022-03-04T07:27:58.393647Z lvl=info msg="Error writing snapshot from compactor" log_id=0Z~pVheG000 service=storage-engine engine=tsm1 op_name=tsm1_cache_snapshot error="compaction in progress: open /data/influx/engine/data/0c0662e3701e086e/autogen/66/000003673-000000001.tsm.tmp: too many open files"
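To see whether open files climb steadily after the timeouts start, or only spike at the end, the descriptor count can be compared against the process limit. This is a rough sketch assuming Linux with /proc mounted and enough permissions to read the influxd process entries (typically root or the same user). The function names are hypothetical.

```python
import os


def open_fd_count(pid):
    """Count file descriptors currently open by pid (Linux /proc only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def parse_soft_nofile_limit(limits_text):
    """Extract the soft 'Max open files' limit from /proc/<pid>/limits text.

    Returns None if the limit is 'unlimited' or the line is missing.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            soft = line.split()[3]  # fields: Max open files <soft> <hard> files
            return None if soft == "unlimited" else int(soft)
    return None


def fd_usage(pid):
    """Return (open_fds, soft_limit) for pid."""
    with open(f"/proc/{pid}/limits") as f:
        limit = parse_soft_nofile_limit(f.read())
    return open_fd_count(pid), limit
```

Calling fd_usage(pid) with the influxd PID every minute or so and logging the result would show how quickly the server approaches the limit.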

During these 5 hours the CPU was mostly idle, but InfluxDB’s memory usage crept up from 10GB to 55GB before the restart.
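The memory growth can be quantified the same way by sampling the resident set size. The sketch below assumes Linux, where /proc/<pid>/status reports VmRSS in kB. The helper names are hypothetical, and the figures in the example are just the 10GB and 55GB values above treated as GiB.

```python
def parse_vm_rss_kib(status_text):
    """Extract VmRSS (in KiB) from /proc/<pid>/status text, or None if absent."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return None


def read_rss_kib(pid):
    """Read the current resident set size of pid in KiB (Linux /proc only)."""
    with open(f"/proc/{pid}/status") as f:
        return parse_vm_rss_kib(f.read())


def rss_growth_kib_per_hour(samples):
    """Average growth rate between the first and last (unix_seconds, rss_kib) sample."""
    (t0, r0), (t1, r1) = samples[0], samples[-1]
    if t1 <= t0:
        return 0.0
    return (r1 - r0) * 3600.0 / (t1 - t0)


# Example with the numbers from this incident: 10 GiB -> 55 GiB over 5 hours.
GIB_IN_KIB = 1024 * 1024
rate = rss_growth_kib_per_hour([(0, 10 * GIB_IN_KIB), (5 * 3600, 55 * GIB_IN_KIB)])
print(rate / GIB_IN_KIB)  # 9.0 (GiB per hour)
```

A steady growth of roughly 9 GB per hour with an idle CPU would be consistent with requests piling up behind something that is blocked, though that is only a guess until a heap or goroutine profile confirms it.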