Using influxdb-relay for high availability

influxdb
telegraf
#1

Sometimes our InfluxDB goes down, which causes data loss. So I came across influxdb-relay for high availability.

    ┌─────────────────┐                 
    │writes & queries │                 
    └─────────────────┘                 
             │                          
             ▼                          
         ┌───────────────┐                  
         │               │                  
┌────────│ Load Balancer │─────────┐        
│        │               │         │        
│        └──────┬─┬──────┘         │        
│               │ │                │        
│               │ │                │        
│        ┌──────┘ └────────┐       │        
│        │ ┌─────────────┐ │       │┌──────┐
│        │ │/write or UDP│ │       ││/query│
│        ▼ └─────────────┘ ▼       │└──────┘
│  ┌──────────┐      ┌──────────┐  │        
│  │ InfluxDB │      │ InfluxDB │  │        
│  │ Relay    │      │ Relay    │  │        
│  └──┬────┬──┘      └────┬──┬──┘  │        
│     │    │              │  │     │        
│     │  ┌─┼──────────────┘  │     │        
│     │  │ └──────────────┐  │     │        
│     ▼  ▼                ▼  ▼     │        
│  ┌──────────┐      ┌──────────┐  │         
│  │          │      │          │  │        
└─▶│ InfluxDB │      │ InfluxDB │◀─┘        
   │          │      │          │           
   └──────────┘      └──────────┘           

I have only one InfluxDB server.
I don’t think this works with a single server, does it?
Can I prevent data loss when that server is down?

It does not seem so, because when that server is down the relay says it is unable to write points and returns a 503:

luvpreet@DHARI-Inspiron-3542:/etc$ curl -i -XPOST 'http://localhost:9096/write?db=tester' --data-binary 'glass,host=server01,region=us-west value=0.64 1434055562000000000'

HTTP/1.1 503 Service Unavailable
Content-Length: 35
Content-Type: application/json
Date: Mon, 17 Apr 2017 12:48:33 GMT

{"error":"unable to write points"}
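That 503 is consistent with the relay being a pure fan-out: it forwards each write to every configured backend and only answers the client with success if at least one backend accepted the batch. A toy Python model of that decision (not the actual Go implementation; the backend callables are made up for illustration):

```python
# Toy model of influxdb-relay's write fan-out. The relay POSTs the batch to
# every backend and reports success to the client if at least one backend
# accepted it; if all backends fail, the client gets a 503. Backends are
# plain callables here so no server is needed to see the behavior.

def relay_write(points, backends):
    """Return an HTTP-style status: 204 if any backend took the write, else 503."""
    statuses = [backend(points) for backend in backends]
    if any(200 <= s < 300 for s in statuses):
        return 204  # InfluxDB's usual success code for /write
    return 503      # "unable to write points" -- every backend failed

def up_backend(points):
    return 204      # pretend this InfluxDB accepted the batch

def down_backend(points):
    return 503      # pretend this InfluxDB is unreachable

print(relay_write(["glass value=0.64"], [up_backend, down_backend]))  # 204
print(relay_write(["glass value=0.64"], [down_backend]))              # 503
```

With a single backend that is down, every write falls into the all-failed branch, which is exactly the 503 shown above.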

InfluxDB Relay not buffering failed writes
#2

@Luv You need at least two servers for HA.

#3

Basically, the relay stores the data in RAM when the InfluxDB server is down. If RAM is available and the relay is there, why is there a need for a second server?

What functionality does the second server provide in this process?

Below are the lines from the GitHub page of influxdb-relay:

Let's say one of the InfluxDB servers goes down for an hour on 2016-03-10. Once midnight UTC rolls over, all InfluxDB processes are now writing data to the shard for 2016-03-11 and the file(s) for 2016-03-10 have gone cold for writes. We can then restore things using these steps:

  • Tell the load balancer to stop sending query traffic to the server that was down (this should be done as soon as an outage is detected to prevent partial or inconsistent query returns.)
  • Create backup of 2016-03-10 shard from a server that was up the entire day
  • Restore the backup of the shard from the good server to the server that had downtime
  • Tell the load balancer to resume sending queries to the previously downed server

During this entire process the Relays should be sending current writes to all servers, including the one with downtime.

In these lines, where is the functionality of the relay described?

I hope you can clear up my misconception about this concept.

#4

@jackzampolin What if I open a socket by any means (say, a Django server) on port 8998, and give the relay this port 8998 as my second InfluxDB server? I tell my Django server to always return a 200 success code whenever a /write URI comes in. Will this work for the relay as the second InfluxDB server?

Basically I have only one InfluxDB server, so I am trying to figure out a way to use this relay to prevent data loss, as InfluxDB sometimes goes down. If I cannot use this method, can you please tell me the significance of the second InfluxDB server?

#5

Honestly, that’s not the intended use-case of the relay. We have a relay in our environment but we have multiple nodes that it’s sending information to.

Spinning up a Django interface to listen isn’t really helpful unless you’re somehow storing that information off somewhere to replay into InfluxDB when it comes back up. Not to mention the single point of failure for the device itself…

If anything, spin up two InfluxDB instances on two separate ports writing to separate filesystem locations if you truly want redundancy on the same box. Otherwise, classic HA rules apply and you’ll need a second node.
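For what that second same-box instance might look like, here is a sketch of its config, assuming InfluxDB 1.x defaults; every port and path is a placeholder and must simply differ from the first instance's. The relay would then list both HTTP endpoints (e.g. :8086 and :8087) as backends.

```toml
# Hypothetical config for a second InfluxDB 1.x instance on the same box.
bind-address = ":8089"        # RPC port (first instance uses the default :8088)

[meta]
  dir = "/var/lib/influxdb2/meta"

[data]
  dir = "/var/lib/influxdb2/data"
  wal-dir = "/var/lib/influxdb2/wal"

[http]
  bind-address = ":8087"      # HTTP write/query port (first instance uses :8086)
```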

Happy to discuss more if you’re curious…

#6

@sebito91
Here is what I think,

The relay uses main memory to keep the data when the InfluxDB server is down. If the InfluxDB server is down, it should keep storing the data in main memory and push it back into InfluxDB when the server comes up again. Why would it need another InfluxDB server then? Doesn't it automatically store data in main memory? Please tell me the significance of this second server: what does it do?

I think the second server is needed just to send the success response to the user.

Sorry if I got it totally wrong.

#7

You’re really just talking about buffering, not redundancy (a requirement for high availability, aka HA). Just keep in mind that this could be a large buffer depending on the amount of data you have coming in.
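To put a rough number on "large buffer", here is a back-of-the-envelope sizing with made-up but plausible figures (ingest rate, point size, and outage length are all assumptions):

```python
# Rough buffer sizing: RAM needed to hold every point written during an outage.
points_per_second = 10_000   # assumed ingest rate
bytes_per_point   = 100      # assumed size of one line-protocol point
outage_seconds    = 3600     # assumed one-hour outage

buffer_bytes = points_per_second * bytes_per_point * outage_seconds
print(buffer_bytes / 1e9)    # 3.6 -- about 3.6 GB of RAM for a single hour
```

Scale any of the three inputs up and the buffer grows linearly, which is why buffering in RAM only covers short blips.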

To be clear though, buffering is not something this relay was ever intended to handle (see #L212). This was meant as a pure forwarding mechanism, not a store-and-forward (think haproxy or nginx). The responsibility is on the backend to be up and ready to receive data, which is why you’d need >1 backend to enable HA in the traditional sense.

If you’re stuck with only one node, you won’t be able to implement a traditional HA setup since you have a single point of failure in your primary node. You could implement buffering + retries into the relay if you so desired, but that’s not the current implementation of this relay. It’s not that it can’t be done, it’s just not done at the moment.
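If someone did want to bolt buffering + retries onto a relay, the core of it could look roughly like the sketch below. This is explicitly NOT how influxdb-relay works; `send` is a stand-in for an HTTP POST to /write, and all names are made up:

```python
from collections import deque

class BufferingForwarder:
    """Sketch of a store-and-forward front end: failed batches are kept in a
    bounded in-memory queue and retried, oldest first, before newer writes.
    This is the hypothetical buffering+retry behavior being discussed, not
    the current relay implementation."""

    def __init__(self, send, max_buffered=10_000):
        self.send = send                          # callable(batch) -> bool
        self.buffer = deque(maxlen=max_buffered)  # oldest batches evicted when full

    def write(self, batch):
        self.buffer.append(batch)
        self.flush()

    def flush(self):
        while self.buffer:
            if not self.send(self.buffer[0]):
                return                    # backend still down; keep buffering
            self.buffer.popleft()         # delivered; drop from the buffer

# Simulate a backend that is down for the first two attempts, then recovers.
attempts = []
def flaky_backend(batch):
    attempts.append(batch)
    return len(attempts) > 2              # fails twice, then accepts everything

fwd = BufferingForwarder(flaky_backend)
for point in ["a", "b", "c", "d"]:
    fwd.write(point)
print(list(fwd.buffer))   # [] -- everything was eventually delivered in order
```

Even this toy version shows the trade-off: delivery order is preserved, but the queue is finite and a long outage will overflow it.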

In all honesty, the relay we use is custom, though inspired by influxdb-relay. We don’t implement retries, but we do batch data points to send to the backend and have multiple nodes for HA.

1 Like
#8

So, what is in the current relay? Here is what I think; correct me if I am wrong.

It uses two InfluxDB servers, which write to the same shards. So if one of them fails, the other one can write the data to the shard. Is that right?

It is written that it retries sending the data when the server comes up again; that is why I thought of it as a buffer in RAM. Please clarify those lines as well.

#9

Technically the writes are sent to both backends, but the shards could be different so they are not an exact copy of one another.

Each influxdb instance will maintain its own shard set so depending on the machine and the concurrent uptime the shard enumeration could be off. That’s not a big deal provided all of the data is intact, but just good to know overall. If you were to try to copy data from one host to another, you should look to use the influxdb backup tool.

The Buffering section of the README explains how the relay works for periodic, short drops. The point is that you should not rely on RAM to maintain all data points until your backend is back up. If it’s a quick blip, fine, the buffer should be able to save you; if it’s a longer outage, like a bricked backend node, then RAM will fill up quickly and all subsequent incoming points will be dropped.
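That overflow failure mode is easy to see with a fixed-size buffer. The sizes below are tiny and made up, and whether the oldest or the newest points get dropped depends on the implementation; this sketch evicts the oldest:

```python
from collections import deque

# A 3-point buffer faces a long outage while 5 points arrive and nothing
# drains. deque(maxlen=...) silently evicts the oldest entries when full.
buffer = deque(maxlen=3)

for point in ["p1", "p2", "p3", "p4", "p5"]:
    buffer.append(point)   # backend is down the whole time

print(list(buffer))        # ['p3', 'p4', 'p5'] -- p1 and p2 are gone for good
```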

The key point once again is that to implement a traditional high-availability service, you would need more than one backend OR more than one InfluxDB process running on a single node. The former is the true HA setup; the latter still presents a single point of failure.

#10

For HA and horizontal scale-out we recommend InfluxEnterprise, which we continue to design, build, and operate. Yes, this is our commercial offering.

Learn more here:
https://www.influxdata.com/products/editions/