Diagnosing 500 error on server-to-server replication

I have two InfluxDB 2 servers running on separate nodes, with a remote replication relationship set up from A → B for one bucket. This had been working well for a while, but recently stopped replicating data into B. I’m trying to diagnose what’s going on, but there is very little information in the logs.

The destination server (B) has the following in its logs:

influxdb  | ts=2025-05-14T05:40:00.318278Z lvl=warn msg="internal error not returned to client" log_id=0wSD7~Y0000 handler=error_logger error=EOF
influxdb  | ts=2025-05-14T05:40:00.318428Z lvl=debug msg=Request log_id=0wSD7~Y0000 service=http method=POST host=influx.hostb.lan path=/api/v2/write query="bucket=df70c0b76e6a4805&org=9cda1162aaa46c08" proto=HTTP/1.1 status_code=500 response_size=88 content_length=0 referrer= remote=10.0.2.164:50118 user_agent=influxdb-oss-replication authenticated_id=0edf4d87c0642000 user_id=0c39febadc5ed000 took=0.699ms error="internal error" error_code="internal error"

On the source server (A) the logs are equally unhelpful:

influxdb  | ts=2025-05-14T05:40:00.338793Z lvl=debug msg="remote write error" log_id=0wSDZ~M0000 service=replications replication_id=0edc37d9b7035000 attempt=10 error message=msg status code=500
influxdb  | ts=2025-05-14T05:40:00.338842Z lvl=error msg="Error in replication stream" log_id=0wSDZ~M0000 service=replications replication_id=0edc37d9b7035000 error="invalid response code 500, must be 204: 500 Internal Server Error: An internal error has occurred - check server logs" retries=11

Any suggestions on how I can debug this further? I’ve recreated both the remote and the replication configuration between these servers, and it continues to fail with a 500 error code. Both servers are running 2.7.11 in Docker containers.
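For reference, I recreated the remote and the replication roughly like this (names, URLs, and IDs below are placeholders, not my real values):

```shell
# Register server B as a remote on server A
influx remote create \
  --name host-b \
  --remote-url https://influx.hostb.lan:8086 \
  --remote-api-token <B write token> \
  --remote-org-id <B org ID>

# Create the replication stream for the bucket
influx replication create \
  --name my-bucket-replication \
  --remote-id <remote ID from the command above> \
  --local-bucket-id <A bucket ID> \
  --remote-bucket-id <B bucket ID>
```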

Hello @rygregg,
One issue that’s common with replication failures is schema conflicts.
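For example, if a batch writes the same field as a float in one point and an integer in another, the destination will reject the write and the batch can get stuck in the queue. Here is a minimal sketch (not part of InfluxDB itself — the parsing is simplified and the sample batch is hypothetical) of how to scan line protocol for that kind of field type conflict:

```python
# Sketch: detect line-protocol field type conflicts that can cause a
# replication destination to reject a batch. Simplified parser; assumes
# points with no spaces inside tag or field values.

def field_type(value: str) -> str:
    """Classify a line-protocol field value by its wire type."""
    if value.startswith('"') and value.endswith('"'):
        return "string"
    if value in ("t", "T", "true", "True", "f", "F", "false", "False"):
        return "boolean"
    if value.endswith("i"):
        return "integer"
    return "float"

def find_conflicts(lines):
    """Return {field: set(types)} for fields written with more than one type."""
    seen = {}
    for line in lines:
        # line protocol: measurement[,tags] field=value[,field=value] [timestamp]
        fields_part = line.split(" ")[1]
        for pair in fields_part.split(","):
            key, value = pair.split("=", 1)
            seen.setdefault(key, set()).add(field_type(value))
    return {k: v for k, v in seen.items() if len(v) > 1}

batch = [
    "cpu,host=a usage=0.64",  # float
    "cpu,host=a usage=1i",    # integer -> conflicts with the float above
    "cpu,host=a idle=99.1",
]
print(find_conflicts(batch))  # reports "usage" written as both float and integer
```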
This also might be useful:

Yep, ‘drop-non-retryable-data’ fixed it for me! I had been experimenting with writing batch data, made a mistake, and that held up the queue. Once I updated the replication with that flag, I could see the sync queue size dropping, and eventually the data came through to the cloud.
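If stuck non-retryable data is what’s blocking your queue, the flag mentioned above can be set on an existing replication, roughly like this (the ID is a placeholder):

```shell
# Tell the replication stream to drop data the remote permanently rejects
# (e.g. schema conflicts) instead of retrying it forever
influx replication update \
  --id <replication ID> \
  --drop-non-retryable-data
```

Note that this discards the rejected points rather than delivering them, so it’s worth understanding why they were rejected first.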
Also, you might want to monitor your replication queue size to see whether data is getting stuck:

influx replication list --org-id <OSS org ID> --token <OSS token>
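To watch the queue over time rather than taking a single snapshot, something like this loop should work (interval and credentials are just examples):

```shell
# Re-run the listing every 10 seconds; if replication is healthy the
# queue/remaining-bytes figures in the output should trend downward
watch -n 10 "influx replication list --org-id <OSS org ID> --token <OSS token>"
```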

Also you might want to look at: