I’ve run into this issue. We are sending hundreds of millions of requests to the InfluxDB 2 API, and it had been running perfectly up until a couple of weeks ago.
This is from our Sentry. You can see quite clearly that the API was running perfectly and then started returning errors on the 17th of August.
I don’t really think that retrying is a reasonable solution here. Returning a 500 is an error and should be corrected. The API was running perfectly up until the middle of August, so something clearly changed within the InfluxDB infrastructure at that point!
I don’t think just catching the 500 and retrying is acceptable for a paid service! The Influx API was returning so many errors that we blew out our Sentry quota in August!
That’s really really interesting.
I just checked our AWS logs and we see the same pattern.
I hadn’t realised this!
Here is a chart from the last two months that counts every instance of the (503) error.
The errors start on 17th August.
I’ve now raised an issue on GitHub.
You might like to visit it and confirm it’s affecting you too, to help get this sorted.
Thanks @dabeeeenster. I thought I was going mad!
Hello @asmith and @dabeeeenster,
I’ve created an issue for the storage team to take a look at your questions and concerns. I’ll also bug someone if they haven’t responded in a couple of days. I appreciate your detailed questions, and thanks for your patience.
I appreciate all your help over the years @Anaisdg .
By the way, if you can help with some stub code that uses urllib3 to write line protocol data to InfluxDB 2.0 Cloud, then I could test whether the failures also happen with the urllib3 library, which I think is what the Python API client uses under the hood.
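To show the kind of stub I mean, here’s a rough sketch of how I imagine a raw urllib3 write would look (the URL, org, bucket and token are just placeholders, and I haven’t verified this against our Cloud account):

```python
import urllib3

# Placeholders: replace with your own Cloud region URL, org, bucket and token.
INFLUX_URL = "https://eu-central-1-1.aws.cloud2.influxdata.com"
ORG = "my-org"
BUCKET = "my-bucket"
TOKEN = "my-token"

http = urllib3.PoolManager()

# One record of line protocol: measurement,tag=value field=value timestamp(ns)
line = "cpu,host=server01 usage=0.64 1630000000000000000"

resp = http.request(
    "POST",
    f"{INFLUX_URL}/api/v2/write?org={ORG}&bucket={BUCKET}&precision=ns",
    body=line.encode("utf-8"),
    headers={
        "Authorization": f"Token {TOKEN}",
        "Content-Type": "text/plain; charset=utf-8",
    },
)
print(resp.status)  # the write endpoint returns 204 on success
```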
I managed to get the requests library working, but the following doesn’t work:
@Anaisdg ,
There have been some pretty worrying developments with this problem.
The errors have been absolutely flooding our systems today (we got 30,000 failures of the (503) type).
Can you give me any updates on how the storage team are getting on with finding and solving the issue?
Hi @Anaisdg , thanks for pointing out the status link. Is there an API way of checking status, so that I can create log events in our architecture that help with diagnosing performance issues from a dashboard?
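To give an idea of what I mean, something as simple as polling a health endpoint and logging the result would do (a rough sketch only; the Cloud URL is a placeholder, and I’m assuming the standard /ping endpoint is exposed on Cloud):

```python
import logging
import requests

logging.basicConfig(level=logging.INFO)

# Placeholder: replace with your own Cloud region URL.
INFLUX_URL = "https://eu-central-1-1.aws.cloud2.influxdata.com"

def influx_is_up(timeout: float = 5.0) -> bool:
    """Return True if the InfluxDB /ping endpoint answers with a 2xx status."""
    try:
        resp = requests.get(f"{INFLUX_URL}/ping", timeout=timeout)
        return resp.status_code < 300
    except requests.RequestException as exc:
        logging.warning("InfluxDB ping failed: %s", exc)
        return False

# Example: emit a status event that our dashboard can pick up.
logging.info("influxdb_available=%s", influx_is_up())
```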
Also, I can report that the (503) errors have stopped now.
May I ask what the storage team found and fixed?
I want to learn more about the types of failure of the InfluxDB Cloud platform that we may have to mitigate in the future. Until we observed the 503 errors, in my naivety, I had no concept that InfluxDB Cloud could be temporarily unavailable. I now need to go away and design some AWS components that can store the data temporarily when InfluxDB has issues. Of course, nothing has 100% uptime, eh?
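My first thought is something very simple, e.g. parking failed writes on an SQS queue and replaying them once InfluxDB is healthy again (a rough sketch; the region and queue URL are placeholders):

```python
import boto3

# Placeholders: swap in your own region and queue URL.
sqs = boto3.client("sqs", region_name="eu-west-1")
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/influx-write-buffer"

def buffer_failed_write(line_protocol: str) -> None:
    """Park a failed line-protocol payload on SQS so it can be replayed later."""
    # A single SQS message is capped at 256 KB, so large batches would need
    # to be split up or written to S3 instead.
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=line_protocol)
```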
Again, please can you clarify what the storage team found and fixed?
And pass on my congrats and thanks for fixing it!!
Hi @Anaisdg , I think we got our wires crossed. I was asking about the 503 errors that have been occurring for about a month, and whether they have been fixed and what caused them.
I think your reply was referring to the outage on 17th Aug, which is maybe a different issue and not the cause of the 503 errors in this topic.
So, maybe I should check again … have the 503 errors been fixed? Today I was getting timeout failures on InfluxDB writes, which is another symptom of InfluxDB Cloud not being available, and other customers are seeing the same. See the GitHub link.
Hello @asmith,
Thanks for helping me understand; I didn’t read carefully enough.
I believe engineering has upgraded many of their 3rd-party services, which was the initial cause, but they are still investigating.