Write to InfluxDB cloud fails with ApiException: (503) Reason: Service Unavailable; upstream connect error or disconnect/reset before headers. reset reason: connection failure

Hi,

I’ve run into this issue. We are sending hundreds of millions of requests to the InfluxDB2 API, and it had been running perfectly until a couple of weeks ago.

[Image: Sentry error chart]

This is from our Sentry. You can see quite clearly that the API was running perfectly and then started returning errors on the 17th of August.

I don’t really think that making a retry is a reasonable solution here. Returning a 500 is an error, and should be corrected. It was also clearly running absolutely perfectly up until the middle of August. Clearly something changed within the InfluxDB infra at this point!

I don’t think just catching the 500 and retrying is acceptable for a paid service! The Influx API was returning so many errors that we blew out our Sentry quota in August!

That’s really really interesting.
I just checked our AWS logs and we see the same pattern.
I hadn’t realised this!
Here is a chart from the last 2 months that counts every instance of the (503) error.
The errors start on 17th August.

In addition I notice that it happened for a very short period on 28th July and 6th August.
But nothing outside of this

Hi @Anaisdg ,
I think that maybe I need to do as you said and submit an issue in the client library repo.
Do you mean here: GitHub - influxdata/influxdb-client-python (InfluxDB 2.0 Python client)?

Is this specific to the Python SDK?!

No. I’m getting the same failures with a backup client I wrote that uses plain HTTP requests.

I’ve now raised an issue on GitHub.
You might like to visit and confirm it’s affecting you to get it sorted.
Thanks @dabeeeenster. I thought I was going mad!

Hello @asmith and @dabeeeenster,
I’ve created an issue for the storage team to take a look at your question and concerns. I’ll also bug someone if they haven’t responded in a couple days. I appreciate your detailed questions and thanks for your patience.

I appreciate all your help over the years @Anaisdg .

By the way, if you can help with some stub code that uses urllib3 to write line data to InfluxDB 2.0 Cloud, then I could test whether the failures also happen with the urllib3 library, which I think is what the Python API client uses.

I managed to get the requests library working but the following doesn’t work:

import json
import urllib3


headers={'Content-Type': 'application/vnd.flux','Authorization': 'Token xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx=='}
url = 'https://eu-central-1-1.aws.cloud2.influxdata.com/api/v2/write?orgID=xxxxxxxxxxxxxx&bucket=TEST_bucket&precision=ms'
payload = 'cpu_load_short,host=server01,region=us-west value1=99.64\ncpu_load_short,host=server01,region=us-west value2=5.64\n'

#import requests
#x = requests.post(url, data=payload, headers=headers) # this succeeds with [204]

# THIS FAILS
http = urllib3.PoolManager()
r = http.request(
    'POST',
    url,
    body=payload,
    headers=headers
    )
returnval = r.data.decode('utf-8')
print(f"Request returned:{returnval}\n")

(uncommenting the two requests lines works)

The error returned is:

HTTPSConnectionPool(host='eu-central-1-1.aws.cloud2.influxdata.com', port=443): Max retries exceeded with url: /api/v2/write?orgID=xxxxxxxxxxxx&bucket=TEST_bucket&precision=ms (Caused by SSLError(SSLError("bad handshake: Error([('SSL routines', 'tls_process_server_certificate', 'certificate verify failed')])")))

Clearly, using the urllib3 library is not as simple as using the requests library.

Hello @asmith,
Sure I’ll give it a try and get back to you end of day.

hey thanks @Anaisdg
Cheers.

Thanks to @bednar
Here is code to test if the problem persists when using the urllib3 library

import urllib3
import certifi

https_url = 'https://eu-central-1-1.aws.cloud2.influxdata.com'
# DON'T FORGET THE HTTPS
org = 'my_org'
token = 'my_token'
bucket = 'my_bucket'

# Line protocol is plain text; 'application/vnd.flux' is the content type for Flux queries
headers = {'Content-Type': 'text/plain; charset=utf-8', 'Authorization': ('Token %s' % token)}
url = '%s/api/v2/write?org=%s&bucket=%s&precision=ms' % (https_url, org, bucket)
payload = 'cpu_load_short,host=server01,region=us-west value1=99.64\ncpu_load_short,host=server01,region=us-west value2=5.64\n'

https = urllib3.PoolManager(ca_certs=certifi.where())
r = https.request(
    'POST',
    url,
    body=payload,
    headers=headers
)

print(f"Response status: '{r.status}', success: {r.status == 204}\n")
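For anyone who does decide to retry on 503 while the underlying issue is investigated, here is a minimal sketch of retrying with exponential backoff. It is not part of the client library. `send` is a hypothetical callable (for example, a thin wrapper around the `https.request` call above that returns `r.status`), and the attempt count and delays are arbitrary assumptions.

```python
import time


def write_with_retry(send, payload, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call send(payload) until it returns 204 or the attempts run out.

    `send` is a hypothetical callable that performs the POST and returns
    the HTTP status code. Returns the last status code seen.
    """
    status = None
    for attempt in range(max_attempts):
        status = send(payload)
        if status == 204:
            return status
        if status not in (429, 503):
            # Not a transient error, so retrying will not help.
            break
        if attempt < max_attempts - 1:
            # Wait 0.5s, 1s, 2s, ... before the next attempt.
            sleep(base_delay * (2 ** attempt))
    return status
```

The `sleep` parameter is only there so the delay can be swapped out in tests; in normal use the default `time.sleep` applies.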

@Anaisdg ,
There have been some pretty worrying developments to this problem.
The errors have been absolutely flooding our systems today: we got 30,000 failures of the (503) type.

Can you give me any updates on how the storage team are getting on with finding and solving the issue?

Thanks for your help.

Hello @asmith,
There was an outage. Are you still having trouble?
In the future try checking:

As far as the 503s go, the issue is known and Engineering is working hard to fix it as a high priority. They have already applied a mitigation.

Hi @Anaisdg , thanks for pointing out the status link. Is there a way of checking the status through an API, so that I can create log events in our architecture that help with diagnosing performance issues from a dashboard?

Also, I can report that the (503) errors have stopped now.
May I ask what the storage team found and fixed?
I want to learn more about the types of failure of the InfluxDB Cloud platform that we may have to mitigate in the future. Until we observed the 503 errors, in my naivety, I had no idea that InfluxDB Cloud could be temporarily unavailable. I now need to go away and design some AWS components that can store the data temporarily when InfluxDB has issues. Of course nothing has 100% uptime, eh?
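As a rough illustration of the "store the data temporarily" idea, here is a minimal in-memory sketch. It is an assumption about one possible design, not anything InfluxDB provides. `send` is a hypothetical callable that performs the write and returns the HTTP status code, and a real deployment would more likely use a durable queue than process memory.

```python
from collections import deque


class WriteBuffer:
    """Hold line-protocol batches in memory until a write succeeds."""

    def __init__(self, send, max_batches=10000):
        self._send = send  # hypothetical callable: payload -> HTTP status code
        # When the buffer is full, the oldest batches are dropped first.
        self._pending = deque(maxlen=max_batches)

    def write(self, payload):
        """Queue a batch, then try to deliver everything that is pending."""
        self._pending.append(payload)
        return self.flush()

    def flush(self):
        """Send pending batches in order and stop at the first failure.

        Returns the number of batches delivered.
        """
        sent = 0
        while self._pending:
            if self._send(self._pending[0]) != 204:
                break
            self._pending.popleft()
            sent += 1
        return sent

    def __len__(self):
        return len(self._pending)
```

Note that the test payloads in this thread carry no timestamps, so InfluxDB would stamp buffered points with the time they are finally delivered. Anything that is buffered should include an explicit timestamp in the line protocol.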

Again, please can you clarify what the storage team found and fixed?

And pass on my congrats and thanks for fixing it!! :grinning_face_with_smiling_eyes:

Hello @asmith,
I don’t think there’s a public API for checking the status, but I’ll ask.

I don’t remember what was causing the errors anymore tbh.

Hi @Anaisdg , I think we got our wires crossed. I was asking about the 503 errors that have been occurring for about a month. I was asking whether they have been fixed and what caused them.

I think your reply was referring to the outage on 17th Aug which is maybe a different issue and not the cause of the 503 errors in this topic.

So, maybe I should check again … Have the 503 errors been fixed? Because today I was getting Timeout failures on InfluxDB writes, which is another symptom of InfluxDB Cloud not being available, and other customers are seeing the same. See the GitHub link.

Hello @asmith,
Thanks for helping me understand. I didn’t read carefully enough.
I believe engineering has upgraded many of their third-party services, which was the initial cause, but they are still investigating.

Hello @asmith,
I was wrong, please take a look at the subscribe button on the status page, which gives various options including webhooks and RSS feeds:
https://status.influxdata.com/
There’s a limited API with details here: InfluxDB Cloud Status - API
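To turn the status page into log events, something like the sketch below might work. Both the endpoint path and the JSON layout are assumptions, based on the summary endpoint that hosted status pages commonly expose, and have not been confirmed in this thread. Check the API page mentioned above before relying on either.

```python
import json
import urllib.request

# ASSUMPTION: the status page exposes a summary document at this path.
STATUS_URL = "https://status.influxdata.com/api/v2/status.json"


def parse_status(body):
    """Return (indicator, description) from a status JSON document.

    ASSUMPTION: the document looks like
    {"status": {"indicator": "none", "description": "..."}}.
    Missing fields fall back to "unknown" and an empty string.
    """
    status = json.loads(body).get("status", {})
    return status.get("indicator", "unknown"), status.get("description", "")


def fetch_status(url=STATUS_URL, timeout=5):
    """Fetch and parse the status document; network errors propagate."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_status(resp.read().decode("utf-8"))
```

A scheduled job could call `fetch_status()` and log the returned pair alongside your own write-failure counts, so the dashboard shows whether a spike in errors lines up with a reported incident.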