Replication from a local InfluxDB database to another instance (including a paid cloud instance) is slow. If we ever have a network issue, it takes a long time for replication to catch up.
Other than this slow throttling, I really like the feature…
There are several topics on this from over the years (I can only post 2 with a new forum account):
None of these include any support or even acknowledgement of the issue. I have also contacted support and just received a link on how to monitor the replication stream.
Is this slow replication, which seems like it should be quite simple to speed up, ever going to be acknowledged and improved?
The best practice for writing to InfluxDB 2.x to maximize replication speed is to write in large batches to the original database: try 5,000 points per write as a starting point. Recommendations for debugging replication problems are here.
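To illustrate that advice, here is a minimal sketch (plain Python, no client library; the measurement name and values are made up) of grouping points into large line-protocol payloads, so each write request to the local instance carries up to 5,000 points instead of one:

```python
import time

def build_batches(measurement, values, batch_size=5000):
    """Group points into newline-joined line-protocol payloads of up to
    batch_size lines each, so one HTTP write carries one large batch."""
    now_ns = time.time_ns()
    lines = [
        f"{measurement} value={v} {now_ns + i}"  # one line-protocol point
        for i, v in enumerate(values)
    ]
    # Slice into payloads of up to batch_size points each.
    return [
        "\n".join(lines[i:i + batch_size])
        for i in range(0, len(lines), batch_size)
    ]

payloads = build_batches("cpu_load", [0.1 * i for i in range(12_000)])
# 12,000 points -> 3 payloads (5000 + 5000 + 2000 points)
```

Each payload would then be POSTed to the usual `/api/v2/write` endpoint in a single request.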
We always welcome community PRs to improve InfluxDB. At this moment, we are not planning on significant changes to data replication, but well-written and tested PRs are gratefully reviewed. Tag me in them to be sure I see them.
Thank you for the responses. Apologies, I didn’t provide much information in the original post.
I am using influxdb:2.7.11 on the edge devices, and then replicating to the influx cloud service.
Normally replication is real-time and everything works great. The issue stems from a prolonged network outage on the edge devices: in that case it takes a long time for replication to catch up. If you have a network outage of, say, 1 day, it may be 2 days before replication has caught back up to real-time, since both the old and the new writes are queued and throttled together.
To the best of my understanding, at least in InfluxDB 2.x, replication follows the data log step by step. Pushing a 50-byte record to the bucket queues it for replication as a separate entry, and regardless of the queue length there is no mechanism for batching multiple entries into one replication request to the remote server. So every 50 bytes of data gets wrapped in ~2 KB of HTTP headers, plus TLS handshaking (there is no multipart or other trick to reuse HTTP streams, and I doubt the lower-level TCP/TLS connections are being reused either). This is why the replication “line” can show decent traffic while barely shipping any actual data, and the growing queue never empties.
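Taking the numbers above at face value (both are rough estimates from the post, not measured values), a quick back-of-envelope calculation shows why the queue drains so slowly:

```python
# Hypothetical per-entry figures from the description above:
payload_bytes = 50      # actual line-protocol data per replication entry
overhead_bytes = 2048   # approx. HTTP headers wrapped around each entry

# Fraction of bytes on the wire that is real data.
efficiency = payload_bytes / (payload_bytes + overhead_bytes)
# -> roughly 2.4%, i.e. ~98% of the replication traffic is overhead
```

So even a link showing steady replication traffic is moving mostly headers, which matches the observation that the queue keeps growing.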
The only solution I’ve found for this is client-side batching. When a batch is pushed to the DB, it is queued for replication as a single entity (there is code to split large requests, but my knowledge of Go isn’t enough to figure out whether it is used, since playing with the hard-coded values didn’t seem to change the behaviour). Of course, this comes at the cost of losing buffered data when the client fails, as well as a gap between the client committing the data and its appearance in the DB/reports/alerts.
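The client-side batching described above can be sketched roughly like this (plain Python; `write_fn` is a stand-in for your real InfluxDB write call, and the thresholds are illustrative, not recommendations):

```python
import time

class ClientBatcher:
    """Accumulate line-protocol points and hand them to the writer as one
    payload, so the whole batch becomes a single replication-queue entry.
    Note the trade-off: points buffered here are lost if the client dies
    before flush()."""

    def __init__(self, write_fn, max_points=5000, max_age_s=10.0):
        self.write_fn = write_fn      # stand-in for the real write call
        self.max_points = max_points  # flush when this many points queued
        self.max_age_s = max_age_s    # ...or when the oldest point is this old
        self._buf = []
        self._first_ts = None

    def add(self, line):
        if self._first_ts is None:
            self._first_ts = time.monotonic()
        self._buf.append(line)
        if (len(self._buf) >= self.max_points
                or time.monotonic() - self._first_ts >= self.max_age_s):
            self.flush()

    def flush(self):
        if self._buf:
            # One write call -> one entry in the replication queue.
            self.write_fn("\n".join(self._buf))
            self._buf = []
            self._first_ts = None

sent = []
batcher = ClientBatcher(sent.append, max_points=3)
for i in range(7):
    batcher.add(f"m value={i}")
batcher.flush()  # drain the remainder
# sent now holds 3 payloads: two full batches of 3 points and one of 1
```

The `max_age_s` threshold bounds the commit-to-visible gap mentioned above: data appears in the DB at most that many seconds after the client produced it.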
Yes, replicating writes can be slow to catch up when there have been connectivity gaps. As others have suggested, batching incoming writes is the path to efficient replication.