I have a home-grown InfluxDB cluster with 3 servers running InfluxDB 1.8.4 and 2 running InfluxDB 2.1.1. In front of all of these servers there is a custom application that duplicates every incoming write request, so all the servers receive exactly the same data.
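In case the setup matters, the fan-out layer does essentially the following (a simplified sketch in Go of what it does, not the actual code; hostnames, db/bucket/org names and auth handling are placeholders):

```go
// Simplified sketch of the fan-out layer: every incoming write request is
// replayed verbatim to each backend, so all servers see the same points.
package main

import (
	"bytes"
	"io"
	"log"
	"net/http"
)

// Placeholder backend write endpoints; the real list mixes
// 1.8 (/write) and 2.x (/api/v2/write) URLs.
var backends = []string{
	"http://influx-18-a:8086/write?db=mydb",
	"http://influx-21-a:8086/api/v2/write?bucket=mydb&org=myorg",
}

func fanOut(w http.ResponseWriter, r *http.Request) {
	body, err := io.ReadAll(r.Body)
	if err != nil {
		http.Error(w, "read error", http.StatusBadRequest)
		return
	}
	// Duplicate the exact payload to every backend.
	for _, url := range backends {
		req, _ := http.NewRequest(http.MethodPost, url, bytes.NewReader(body))
		req.Header.Set("Content-Type", r.Header.Get("Content-Type"))
		req.Header.Set("Authorization", r.Header.Get("Authorization"))
		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			log.Printf("write to %s failed: %v", url, err)
			continue
		}
		resp.Body.Close()
	}
	w.WriteHeader(http.StatusNoContent)
}

func main() {
	http.HandleFunc("/write", fanOut)
	http.HandleFunc("/api/v2/write", fanOut)
	log.Fatal(http.ListenAndServe(":8086", nil))
}
```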
In the 2 weeks since I added the InfluxDB 2.1.1 servers, it has happened twice that one of them suddenly slowed to a crawl and started returning {"code":"internal error","message":"unexpected error writing points to database: timeout"}
to almost all requests. Once the server slows down it doesn’t recover on its own. A simple restart “fixes” the issue, and after the restart it happily catches up on all the data that couldn’t be written. The whole time, the other 2.1.1 server and all the 1.8.4 ones kept running just fine. The only difference between the 2.1.1 server that has this issue and the other one is that the “problematic” one also receives a small amount of query traffic, while the other one receives writes only.
I have enabled debug logging on both 2.1.1 servers, but unfortunately that didn’t help: everything looks quite normal (to my eyes) and more or less the same between the two. Any suggestions/ideas on how I can investigate this strange slowdown? Thanks.
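(For reference, debug logging was turned on via influxd’s log-level setting; the --log-level=debug flag or INFLUXD_LOG_LEVEL=debug environment variable should be equivalent.)

```toml
# influxd config file (InfluxDB 2.x)
log-level = "debug"
```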