Strategy for Availability and Recovery with Possibly Unreliable Internet Service?

uptime
backup
influxdb
#1

We are building up a sensor system around InfluxDB. In these early stages we’re simply running a local instance of InfluxDB as a Docker container.

Looking down the road we are considering our options and strategy to maximize availability and recovery. Monitoring certain readings is fairly critical for us. Losing alarm systems built around InfluxDB for even a handful of hours could lead to a costly failure in the right circumstances.

On the one hand, relying on Influx Cloud eliminates much of our systems management overhead but introduces the real complication of Internet service outages at our facility (and to a lesser extent also Influx Cloud outages — Amazon’s recent experience will strike again). On the other hand, a local instance (likely involving Influx’s open source Relay project) eliminates outside outage issues but significantly increases our systems management overhead and overall risk.

We recognize that no architecture is bulletproof. Further, even if we did maintain everything on-premise but lost Internet service we still must address the issue of getting actual alarm conditions out of the facility.

What’s a good way to think about this? Is there a hybrid approach that’s not entirely unwieldy? How does one evaluate and deal with possible Internet service outages if relying on a cloud-based system? Concentrate on redundant Internet service? Engineer a hybrid local / cloud data system? Skip technical solutions and statistically model downtime in order to engineer the business around the consequences?

Perspectives and recommendations appreciated.

#2

@mkarlesky This is a question without a good answer right now. What is needed is an easy way to ship data from one Influx instance to another. I have seen some Influx users use Kapacitor to ship the data (i.e. caching it locally and sending it up once the network partition has healed), but the concerns you raise about availability at the remote site are understandable.
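
To make the cache-locally-and-ship-later idea concrete, here is a minimal sketch in Python. It is not how Kapacitor does it. `StoreAndForward`, the `send` callable, and `max_buffered` are all hypothetical names for illustration, and `send` stands in for whatever actually writes a batch to the remote instance. The sketch assumes an unreachable upstream shows up as an `OSError` (which covers `ConnectionError` and socket timeouts).

```python
from collections import deque
from typing import Callable, Deque, List


class StoreAndForward:
    """Buffer points locally and forward them once the upstream is reachable."""

    def __init__(self, send: Callable[[List[str]], None], max_buffered: int = 100_000):
        # `send` is a hypothetical callable that writes a batch upstream
        # and raises OSError when the link is down.
        self._send = send
        # Bounded queue: once full, the oldest points are dropped first.
        self._buffer: Deque[str] = deque(maxlen=max_buffered)

    def write(self, point: str) -> None:
        """Queue a point, then try to forward everything queued so far."""
        self._buffer.append(point)
        self.flush()

    def flush(self) -> bool:
        """Return True if the buffer is empty after the attempt."""
        if not self._buffer:
            return True
        batch = list(self._buffer)
        try:
            self._send(batch)
        except OSError:
            # Upstream unreachable: keep the points for the next attempt.
            return False
        self._buffer.clear()
        return True

    @property
    def pending(self) -> int:
        return len(self._buffer)
```

The bounded buffer is a deliberate trade-off: during a long outage it loses the oldest readings rather than exhausting memory. A real deployment would more likely spool to disk.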

Most people I have seen work on this seriously have solved it with a combination of:

Engineer a hybrid local / cloud data system
Engineer the business to tolerate occasional connectivity interruption between remote sites and cloud

What would you like in an ideal world?

#3

Thanks for the response.

In an ideal world, I think we’d like to be able to “hitch” an on-premise instance of InfluxDB to an Influx Cloud cluster. If the Cloud cluster is unavailable the local instance would handle writes and queries and then sync up with the Cloud once it came back online. Similarly, if the on-premise instance goes down it would sync up with the Cloud instance once it was back up and running. It’s perhaps a little RAID-like? Obviously, there’s a host of issues to resolve — matching versions of the database, dealing with any throughput limitations of the lesser on-premise instance.

In our case, we’re probably more likely to lose Internet connectivity than Influx Cloud. And other enterprise-y systems of ours might well necessitate a backup Internet connection anyway. So we fully acknowledge we may be looking at engineering a solution that isn’t really warranted. Perhaps we should instead concentrate on redundant Internet connectivity and put policies in place for working through brief outages.

Again, any perspective is appreciated.

#4

Personally I’d like to have Telegraf output to multiple instances simultaneously, and be able to do selective restores to fill in the gaps between systems when one is recovered.
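
For the selective-restore part, the first step is knowing which time ranges the recovered instance is missing. Here is a rough sketch in Python, assuming you can pull the timestamps for a series from both instances as plain integers. `find_gaps` and its parameters are hypothetical names, not part of Telegraf or InfluxDB.

```python
from typing import Iterable, List, Optional, Tuple


def find_gaps(reference: Iterable[int], recovered: Iterable[int]) -> List[Tuple[int, int]]:
    """Return (first, last) timestamp ranges present in `reference`
    but absent from `recovered`."""
    have = set(recovered)
    gaps: List[Tuple[int, int]] = []
    start: Optional[int] = None
    last_missing: Optional[int] = None
    for ts in sorted(set(reference)):
        if ts not in have:
            # Open a new gap, or extend the one already open.
            if start is None:
                start = ts
            last_missing = ts
        elif start is not None:
            # A timestamp both sides have closes the open gap.
            gaps.append((start, last_missing))
            start = None
    if start is not None:
        gaps.append((start, last_missing))
    return gaps


# The healthy instance has a point every 10 s; the recovered one missed two stretches.
healthy = range(0, 100, 10)
recovered = [0, 10, 20, 60, 70]
print(find_gaps(healthy, recovered))  # [(30, 50), (80, 90)]
```

Each returned range could then drive an export from the healthy instance and an import into the recovered one, limited to that window.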

#5

We ended up solving our problem by replicating Influx. We use a cloud-based Influx cluster for all our data across all facilities for all time. Each facility has a local instance of Influx running with a retention policy that limits storage to six weeks. We stream measurements from each local facility to both the cloud-based and local Influx instances. Each facility’s dashboards and alarms draw from the local instance of Influx. Our heavy-lifting reporting and data exploration draws against the cloud instance. Backups are run for both the cloud and local instances.

This setup gives us flexibility and redundancy in the event of outages. After the rare outage we’re able to (manually) fill in data gaps, and depending on circumstances we can route some services to failover options or around maintenance needs. The division of responsibilities is working well for us.
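
For anyone wanting to try the same dual-write pattern, here is a minimal Python sketch of streaming each measurement to both a local and a cloud instance. It is not our production code. `FanOutWriter` and the writer callables are hypothetical stand-ins for real client calls, and the sketch assumes a failed write raises `OSError`.

```python
from typing import Callable, Dict, List

# A writer takes one point (e.g. a line-protocol string) and raises OSError on failure.
Writer = Callable[[str], None]


class FanOutWriter:
    """Send every point to all targets; a failure on one never blocks the others."""

    def __init__(self, targets: Dict[str, Writer]):
        self._targets = targets
        # Points each target missed, kept so gaps can be filled in later.
        self.missed: Dict[str, List[str]] = {name: [] for name in targets}

    def write(self, point: str) -> List[str]:
        """Write to every target; return the names of targets that accepted the point."""
        accepted: List[str] = []
        for name, writer in self._targets.items():
            try:
                writer(point)
                accepted.append(name)
            except OSError:
                self.missed[name].append(point)
        return accepted
```

With targets named, say, `"local"` and `"cloud"`, the local alarms keep receiving data through an Internet outage, and `missed["cloud"]` is the list to replay once the link is back.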