Write to InfluxDB cloud fails with ApiException: (503) Reason: Service Unavailable; upstream connect error or disconnect/reset before headers. reset reason: connection failure

dabeeeenster · September 7, 2021, 9:51am

Hi,

I’ve run into this issue. We are sending hundreds of millions of requests to the InfluxDB2 API and it had been running perfectly up until a couple of weeks ago.

This is from our Sentry. You can see quite clearly that the API was running perfectly and then started returning errors on the 17th of August.

I don’t really think that making a retry is a reasonable solution here. Returning a 500 is an error, and should be corrected. It was also clearly running absolutely perfectly up until the middle of August. Clearly something changed within the InfluxDB infra at this point!

I don’t think just catching the 500 and retrying is acceptable for a paid service! The Influx API was returning so many errors that we blew out our Sentry quota in August!

asmith · September 7, 2021, 1:28pm

That’s really really interesting.
I just checked our AWS logs and we see the same pattern.
I hadn’t realised this!
Here is a chart from the last 2 months that counts every instance of the error (503)
The errors start on 17th August

In addition I notice that it happened for a very short period on 28th July and 6th August.
But nothing outside of this

asmith · September 7, 2021, 1:44pm

Hi @Anaisdg ,
I think that maybe I need to do as you said and submit an issue in the client library repo.
Do you mean here: GitHub - influxdata/influxdb-client-python (InfluxDB 2.0 python client)?

dabeeeenster · September 7, 2021, 1:46pm

Is this specific to the Python SDK?!

asmith · September 7, 2021, 2:00pm

No. I’m getting the same failures with a backup client I wrote that uses HTTP directly.

asmith · September 7, 2021, 2:01pm

I’ve now raised an issue on Github.
You might like to visit and confirm it’s affecting you to get it sorted.
Thanks @dabeeeenster. I thought I was going mad!

github.com/influxdata/influxdb-client-python

Write to InfluxDB cloud fails with ApiException: (503) Reason: Service Unavailable; upstream connect error or disconnect/reset before headers. reset reason: connection failure

opened 01:59PM - 07 Sep 21 UTC

closed 07:40AM - 26 Oct 21 UTC

gefaila

wontfix

__Steps to reproduce:__
Using the Python client make repeated calls to the writ…e API.
Do this simultaneously from 2 or 3 different clients (not sure if this is relevant but this is the fail condition).
Write rates are a call to the API every second or so.
I'm using the SYNCHRONOUS calls

`influx_client = InfluxDBClient(url=os.environ['influx_url'], token=os.environ['token'],retries=retries)`
`write_api = influx_client.write_api(write_options=SYNCHRONOUS)`

(repeatedly)

`influx_returns = write_api.write(InfluxBucket, my_org, Influx_lines,'ms')`

__Expected behavior:__
Until 17th Aug we were seeing the expected behaviour. All the writes succeeded. Data was written reliably.

__Actual behavior:__
From 17th August we started getting

> ApiException: (503)
> Reason: Service Unavailable; upstream connect error or disconnect/reset before headers.
> reset reason: connection failure

The errors definitely started on 17th Aug and I know we didn't change anything because everyone was on holiday!
I have AWS logs which show when this started
![image](https://user-images.githubusercontent.com/69196723/132357356-01a72c1b-97a4-4154-9cef-5968fe878b4e.png)
You can see the discussion [here](https://community.influxdata.com/t/write-to-influxdb-cloud-fails-with-apiexception-503-reason-service-unavailable-upstream-connect-error-or-disconnect-reset-before-headers-reset-reason-connection-failure/21501/5) that it's affecting multiple users now

__Specifications:__
- Python 3.8
- InfluxDB Version: Cloud v2
- Platform: AWS Lambda written in Python

Anaisdg · September 7, 2021, 3:06pm

Hello @asmith and @dabeeeenster,
I’ve created an issue for the storage team to take a look at your question and concerns. I’ll also bug someone if they haven’t responded in a couple days. I appreciate your detailed questions and thanks for your patience.

asmith · September 7, 2021, 3:24pm

I appreciate all your help over the years @Anaisdg .

By the way, if you can help with some stub code that uses urllib3 to write line data to InfluxDB 2.0 cloud, then I could test whether the failures also happen with the urllib3 library, which I think matches what the Python API client uses.

I managed to get the requests library working but the following doesn’t work:

import json
import urllib3


headers={'Content-Type': 'application/vnd.flux','Authorization': 'Token xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx=='}
url = 'https://eu-central-1-1.aws.cloud2.influxdata.com/api/v2/write?orgID=xxxxxxxxxxxxxx&bucket=TEST_bucket&precision=ms'
payload = 'cpu_load_short,host=server01,region=us-west value1=99.64\ncpu_load_short,host=server01,region=us-west value2=5.64\n'

#import requests
#x = requests.post(url, data=payload, headers=headers) # this succeeds with [204]

# THIS FAILS
http = urllib3.PoolManager()
r = http.request(
    'POST',
    url,
    body=payload,
    headers=headers
    )
returnval = r.data.decode('utf-8')
print(f"Request returned:{returnval}\n")

asmith · September 7, 2021, 3:25pm

(uncommenting the two requests lines works)

asmith · September 7, 2021, 3:27pm

The error returned is:

HTTPSConnectionPool(host='eu-central-1-1.aws.cloud2.influxdata.com', port=443): Max retries exceeded with url: /api/v2/write?orgID=xxxxxxxxxxxx&bucket=TEST_bucket&precision=ms (Caused by SSLError(SSLError("bad handshake: Error([('SSL routines', 'tls_process_server_certificate', 'certificate verify failed')])")))

Clearly, using the urllib3 library is not as simple as using the requests library.

Anaisdg · September 7, 2021, 3:28pm

Hello @asmith,
Sure I’ll give it a try and get back to you end of day.

asmith · September 7, 2021, 3:29pm

hey thanks @Anaisdg
Cheers.

asmith · September 8, 2021, 9:35am

Thanks to @bednar
Here is code to test if the problem persists when using the urllib3 library

import urllib3
import certifi

https_url = 'https://eu-central-1-1.aws.cloud2.influxdata.com'
# DON'T FORGET THE HTTPS
org = 'my_org'
token = 'my_token'
bucket = 'my_bucket'

headers = {'Content-Type': 'application/vnd.flux', 'Authorization': ('Token %s' % token)}
url = '%s/api/v2/write?org=%s&bucket=%s&precision=ms' % (https_url, org, bucket)
payload = 'cpu_load_short,host=server01,region=us-west value1=99.64\ncpu_load_short,host=server01,region=us-west value2=5.64\n'

https = urllib3.PoolManager(ca_certs=certifi.where())
r = https.request(
    'POST',
    url,
    body=payload,
    headers=headers
)

print(f"Response status: '{r.status}', success: {r.status == 204}\n")
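
A single request only shows one outcome, and the 503s reported in this thread were intermittent. Below is a minimal sketch of how the test could be repeated and the status codes tallied. The names `tally_statuses` and `send` are hypothetical and not part of urllib3 or the InfluxDB client; `send` is assumed to wrap the `https.request(...)` call above and return `r.status`.

```python
import time
from collections import Counter
from typing import Callable


def tally_statuses(send: Callable[[], int], attempts: int, pause_s: float = 1.0) -> Counter:
    """Call send() `attempts` times and count the HTTP status codes it returns."""
    counts = Counter()
    for i in range(attempts):
        try:
            counts[send()] += 1
        except Exception as exc:
            # Timeouts or connection errors raised by the HTTP library
            # are counted under the exception's class name.
            counts[type(exc).__name__] += 1
        if i < attempts - 1:
            time.sleep(pause_s)  # roughly one write per second, as in the issue report
    return counts
```

With the urllib3 code above, `send` could be `lambda: https.request('POST', url, body=payload, headers=headers).status`. A run without problems would contain only the key 204.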

asmith · September 17, 2021, 5:41pm

@Anaisdg ,
There have been some pretty worrying developments to this problem.
The errors have been absolutely flooding our systems today (we got 30,000 failures of the (503) type).

Can you give me any updates on how the storage team are getting on with finding and solving the issue?

Thanks for your help.

Anaisdg · September 17, 2021, 6:25pm

Hello @asmith,
There was an outage. Are you still having trouble?
In the future try checking the status page.

As far as the 503s go, the issue is known and Engineering is working hard to fix it as a high priority. They have already applied a mitigation.

asmith · September 23, 2021, 7:54pm

Hi @Anaisdg , thanks for pointing out the status link. Is there a way of checking the status through an API, so that I can create log events in our architecture that help with diagnosing performance issues from a dashboard?

Also, I can report that the (503) errors have stopped now.
May I ask what the storage team found and fixed?
I want to learn more about the types of failure of the InfluxDB cloud platform that we may have to mitigate in the future. Until we observed the 503 errors, in my naivety, I had no concept that InfluxDB Cloud could be temporarily unavailable. I now need to go away and design some AWS components that can store the data temporarily when InfluxDB has issues. Of course nothing has 100% uptime, eh?

Again, please can you clarify what the storage team found and fixed?

And pass on my congrats and thanks for fixing it!!
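
The idea of storing data temporarily while InfluxDB Cloud is unavailable could be sketched roughly as follows. This is only an illustration under stated assumptions: `BufferedWriter` is a hypothetical name, the `write` callable is assumed to POST a batch of line protocol and return the HTTP status code, and the in-memory queue stands in for whatever durable storage would be used in a real AWS design.

```python
from collections import deque
from typing import Callable, Deque, List


class BufferedWriter:
    """Keep batches that could not be written and retry them on the next call."""

    def __init__(self, write: Callable[[str], int], max_batches: int = 10_000):
        self._write = write  # assumed: sends line protocol, returns the HTTP status
        # In-memory stand-in for durable storage; the oldest batch is dropped when full.
        self._pending: Deque[str] = deque(maxlen=max_batches)
        self.rejected: List[str] = []  # batches the server refused with a client error

    def submit(self, lines: str) -> int:
        """Queue a batch, then try to write everything pending, oldest first."""
        self._pending.append(lines)
        return self.flush()

    def flush(self) -> int:
        """Write pending batches in order; return how many are still queued."""
        while self._pending:
            try:
                status = self._write(self._pending[0])
            except Exception:
                break  # network-level failure: keep the batch and stop for now
            if status == 204:
                self._pending.popleft()  # written successfully
            elif 400 <= status < 500 and status != 429:
                # A malformed or unauthorized request will not succeed on retry.
                self.rejected.append(self._pending.popleft())
            else:
                break  # 503, 429 and similar: treat as transient and retry later
        return len(self._pending)
```

Stopping at the first transient failure keeps the batches in their original order, at the cost of holding newer data back until the oldest batch goes through.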

Anaisdg · September 28, 2021, 6:36pm

Hello @asmith,
I don’t think there’s a public API for checking the status, but I’ll ask.

I don’t remember what was causing the errors anymore tbh.

asmith · September 28, 2021, 9:36pm

Hi @Anaisdg , I think we got our wires crossed. I was asking about the 503 errors that have been occurring for about a month. I was asking whether they had been fixed and what caused them.

I think your reply was referring to the outage on 17th Aug which is maybe a different issue and not the cause of the 503 errors in this topic.

So, maybe I should check again … Have the 503 errors been fixed? Because today I was getting Timeout failures on InfluxDB writes, which is another symptom of Influx cloud not being available, and other customers are seeing the same. See the GitHub link.

Anaisdg · September 29, 2021, 6:25pm

Hello @asmith,
Thanks for helping me understand. I didn’t read carefully enough.
I believe the initial cause was in the 3rd party services that engineering uses, and they have upgraded many of them, but they are still investigating.

Anaisdg · September 30, 2021, 3:35pm

Hello @asmith,
I was wrong. Please take a look at the subscribe button on the status page, which gives various options including webhooks and RSS feeds:
https://status.influxdata.com/
There’s also a limited API, with details here: InfluxDB Cloud Status - API
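
For logging the platform status from your own code, a rough sketch might look like the one below. It rests on an assumption that is not confirmed in this thread: that the status page is served in the Statuspage v2 format, with a document at `/api/v2/status.json` containing `status.indicator` and `status.description`. The function names are hypothetical. Please check the API page mentioned above before relying on the URL or the field names.

```python
import json
import urllib.request
from typing import Tuple

# Assumed Statuspage-style endpoint; verify against the status page's API documentation.
STATUS_URL = "https://status.influxdata.com/api/v2/status.json"


def parse_status(payload: dict) -> Tuple[str, str]:
    """Return (indicator, description) from a Statuspage-style status document."""
    status = payload.get("status", {})
    return status.get("indicator", "unknown"), status.get("description", "")


def fetch_status(url: str = STATUS_URL, timeout_s: float = 5.0) -> Tuple[str, str]:
    """Fetch the status document and summarise it; raises on network or JSON errors."""
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        return parse_status(json.load(resp))
```

The result of `fetch_status()` could be written to your logs alongside each failed write, so a dashboard can show whether a burst of errors coincided with a reported incident.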
