Write to InfluxDB cloud fails with ApiException: (503) Reason: Service Unavailable; upstream connect error or disconnect/reset before headers. reset reason: connection failure

We use AWS to pipe our data via Lambda functions (Python) to write synchronously to InfluxDB 2.0 Cloud using

write_api.write()

We write about once every 2 sec from two or three simultaneously running Lambda clients.

Roughly once every 1000 successful calls (approximately 50 times / day) the write() call fails with

ApiException: (503) Reason: Service Unavailable; upstream connect error or disconnect/reset before headers. reset reason: connection failure

Because of the way AWS Lambda functions work, we can never be sure how many Lambda functions are actually running. As load increases, more Lambda functions will be started and may be running / writing simultaneously.

I was under the impression that InfluxDB 2.0 Cloud could handle millions of writes / sec from multiple clients. So I'm puzzled how 2 or 3 concurrent Python clients calling write_api.write() are causing problems.

Are there any limitations I'm overstepping that I should know about?

@asmith,
Can you please share your script?
@bednar do you know of any limitations?


Thanks @Anaisdg
Can I ask what happens if two Python clients try to write simultaneously to the same bucket, i.e. one client calls write_api.write() and, before it has returned (i.e. 200-300 ms later), the other client calls write_api.write()?

import json
import boto3
from influxdb_client import InfluxDBClient
from influxdb_client.client.write_api import SYNCHRONOUS
import re
import dateutil.parser
from datetime import datetime, timezone
import os
import sys, traceback
import logging

logger = logging.getLogger()
logger.setLevel(logging.WARNING)

#SET UP THE THINGS THAT NEED ONLY CALLING ONCE

# SET UP THE INFLUX DB CONNECTION
InfluxBucket = os.environ['InfluxBucket']
influx_client = InfluxDBClient(url=os.environ['influx_url'], token=os.environ['token'])
write_api = influx_client.write_api(write_options=SYNCHRONOUS)
epoch = dateutil.parser.parse('1970-01-01T00:00:00.000Z')
Influx_TAGS_shadowname = os.environ['TAGS_shadowname']
shadow_client = boto3.client('iot-data', region_name='eu-central-1')
 
code_link='https://eu-central-1.console.aws.amazon.com/lambda/home?region=eu-central-1#/functions/MQTT_InfluxDB?tab=code'

def get_JSON_logger(message, logLevel, do_trace, event, context, err_no):
    logmessage = {}
    logmessage["logLevel"] = logLevel
    logmessage["Error Number"] = err_no
    logmessage["message"] = message
    if do_trace:
        logmessage["traceback"] = traceback.format_exc(limit=1, chain=True)
    else:
        logmessage["traceback"] = None
    logmessage["event"] = event
    logmessage["function"] = context.function_name
    logmessage["code_link"] = code_link
    return logmessage
        
def lambda_handler(event, context):

    Influx_lines = []
    influx_returns = None
    # I've removed the code that prepares Influx_lines
    # ...

    try:
        if len(Influx_lines) > 0:
            influx_returns = write_api.write(InfluxBucket, "xxxxxxxxxxxxxxxxx", Influx_lines, 'ms')

        else:
            logmessage = get_JSON_logger('received event, but no InfluxDB data could be constructed from it', 'ERROR', True, event, context, 2001)
            logmessage['influx_returns'] = influx_returns
            logmessage['Influx_lines'] = Influx_lines
            logger.error(json.dumps(logmessage, indent=5))
            return { 'Lambda FAIL': 400 }

    except Exception:
        logmessage = get_JSON_logger('exception', 'ERROR', True, event, context, 2000)
        logmessage['influx_returns'] = influx_returns
        logmessage['Influx_lines'] = Influx_lines
        # add other elements here
        logger.error(json.dumps(logmessage, indent=5))
        return { 'Lambda FAIL': 400 }

    ###########
    return {
        'SUCCESS': 200
    }

The error returned is caught in the except block:

"traceback": "Traceback (most recent call last):
File "/var/task/lambda_function.py", line 109, in lambda_handler\n influx_returns = write_api.write(InfluxBucket, "xxxxxxxxxxxxxxxx", Influx_lines, 'ms')
influxdb_client.rest.ApiException: (503)
Reason: Service Unavailable
HTTP response headers: HTTPHeaderDict({'Date': 'Thu, 02 Sep 2021 01:10:44 GMT', 'Content-Type': 'text/plain', 'Content-Length': '91', 'Connection': 'keep-alive', 'Strict-Transport-Security': 'max-age=15724800; includeSubDomains'})
HTTP response body: upstream connect error or disconnect/reset before headers. reset reason: connection failure"

The code is pretty mature and has been running for over a year. As we connect more machines we are seeing higher data rates, but this code is only being executed once every 2 sec or so. However, AWS does create 2-3 copies of the Lambda function which run simultaneously.

What would really help is to test whether the problem goes away when using plain HTTPS requests.
I have not been able to find a Python example of sending data to InfluxDB over raw HTTPS. If you have something, I can try an HTTPS version.

@Anaisdg and @bednar
Thanks for engaging…

I've spent the day trying to implement an HTTPS solution but I'm not getting anywhere. I'm sure that a failure to write should cause retries, no? Not an exception. I don't really understand the InfluxDB Python client or the underlying HTTP / sockets connection.

Is that what is happening? The socket is getting closed and the client tries to write on a closed / dead socket? In that case, how would I

  • detect this and
  • prevent the exception and
  • make sure my data got written reliably

Should I be using the retries functionality in order to prevent this?
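
To make the question concrete, this is roughly the kind of manual fallback I have in mind (an untested sketch; the retried status codes, attempt count and delays are just guesses, and it reuses write_api / InfluxBucket from the Lambda module above):

import time
from influxdb_client.rest import ApiException

def write_with_retry(lines, max_attempts=5, backoff=0.2):
    # Hypothetical wrapper: retry transient 5xx failures a few times before giving up
    for attempt in range(1, max_attempts + 1):
        try:
            return write_api.write(InfluxBucket, "xxxxxxxxxxxxxxxxx", lines, 'ms')
        except ApiException as e:
            # Only retry server-side / transient errors; re-raise anything else immediately
            if e.status not in (502, 503, 504) or attempt == max_attempts:
                raise
            time.sleep(backoff * attempt)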

Hello @asmith,
Here are some examples using the requests library with Python:

I'm not sure about the rest. Yes, I would try to use retry options.
@bednar can you help here?

I'm not sure how often he's on the forums. You might want to just submit an issue in the client library repo.

Thanks @Anaisdg,
I've seen similar code, but what is always missing is the actual place where we put the Influx data:

measures,host="myhost" data1=34,data2=2.5 <timestamp>

Do you have any idea where this goes, since the "payload" above seems taken up with other information?

Thanks

Hello @asmith,
It should look something like this:

proto="http://"
domain="us-west-2-1.aws.cloud2.influxdata.com"
api_path="/api/v2/write/"
query_param = "write?org=YOUR_ORG&bucket=YOUR_BUCKET&precision=ns"
url = proto + domain + api_path + query_param

headers = {"Authorization": "Token " + my_token}
raw_data = 
 "
mem,host=host1 used_percent=23.43234543 1556896326
mem,host=host2 used_percent=26.81522361 1556896326
mem,host=host1 used_percent=22.52984738 1556896336
mem,host=host2 used_percent=27.18294630 1556896336
"
r = requests.post(url, headers=headers, data=raw_data)
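
A successful write to the /api/v2/write endpoint returns HTTP 204 (No Content), so it is worth checking the response, for example:

# 204 No Content means the line protocol was accepted
if r.status_code != 204:
    print("Write failed:", r.status_code, r.text)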

Ahaaa. Yes that looks right.
Regarding the use of Python requests
Do you think that this will make any difference to disconnections?
Because doesn't the

influxdb_client.client.write_api

use HTTPS under the hood anyway?

And did you have any thoughts about the effects of two clients calling write_api.write() at the same time?

The client supports synchronous writes.

I don't think it should, but it can't hurt to test it, I suppose.

Hi All, sorry for the late response, I was off the grid.

There isn't a known limitation for this.

Perform a second request to the InfluxDB server.

The "upstream connect error or disconnect/reset before headers" error comes from the AWS Lambda network stack. The "core" error is connection failure. I am not an AWS Lambda expert, but it could be the same kind of network issue.

Yes, you have to configure retries to handle this.

Regards


Thanks @bednar,
You mentioned that the error

upstream connect error or disconnect/reset before headers.

is a Lambda / AWS one. However, I have seen the same error when I'm on InfluxDB 2.0 Cloud and exploring data. Are we sure that it's not an availability issue that happens from time to time?

@bednar , I appreciate the suggestion of a solution (sending a second request if the first fails).
But to return to my question for a moment… are we saying that InfluxDB can cope with simultaneous requests, or that it can't (i.e. one of them will fail, and therefore it will always be necessary to anticipate the need for retries)?

It can.

As a solution, use a retry strategy. You can configure retries as described here: GitHub - influxdata/influxdb-client-python: InfluxDB 2.0 python client

Regards

@bednar that link is BRILLIANT!
Should be in the docs!
I can also recommend Utilities - urllib3 2.0.0a3 documentation for details of the parameters to use in retries.

I have used

retries = Retry(total=10, redirect=0, backoff_factor=0.1, raise_on_redirect=True, raise_on_status=True)

Because I'd like it to try 10 times over the next few seconds but raise an exception if it fails.
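
For anyone landing here later, this is roughly how I am passing that Retry object into the client (a sketch based on the client README; the influx_url and token environment variables are the same ones used in my Lambda code above):

import os
from urllib3 import Retry
from influxdb_client import InfluxDBClient
from influxdb_client.client.write_api import SYNCHRONOUS

# Retry transient failures up to 10 times with a short backoff,
# then raise so my except block still catches hard failures
retries = Retry(total=10, redirect=0, backoff_factor=0.1,
                raise_on_redirect=True, raise_on_status=True)

influx_client = InfluxDBClient(url=os.environ['influx_url'],
                               token=os.environ['token'],
                               retries=retries)
write_api = influx_client.write_api(write_options=SYNCHRONOUS)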

Can I ask why we would ever need redirect>0?
Is there any legitimate case in which an attempt to write to InfluxDB Cloud 2.0 would get redirected?
I know my API key gets stripped, but isn't it better security to say there are no redirects?

I am also testing an alternative function that uses only the requests library (no InfluxDB client) and doesn't use retries.

import requests
influx_returns = requests.post(url, data=Influx_lines, headers=headers)

So far it works but has two negative side effects:

  1. it takes 50% longer to write than the Python InfluxDB library (200 ms rather than 125 ms). With millions of writes / month and AWS billing Lambda code by the millisecond, this is significant.
  2. it screws up the use of the AWS boto3 library (some conflict between import requests and import boto3).

I will report back with my findings on the best fix once I have fully tested it (for others to find).
Cheers
Andrew

It is useful when you use an HTTP proxy.

I don't think so.

Check out this doc to disable stripping your key: GitHub - influxdata/influxdb-client-python: InfluxDB 2.0 python client