InfluxDB Clustering

pauldix · June 29, 2017, 5:27am

Originally published at: InfluxDB Clustering - High Availability and Scalability | InfluxData

It has been a while since I last blogged about InfluxDB clustering, so it is time for an update on this feature, which is only available in our commercial product offerings: InfluxCloud and InfluxEnterprise.

Architectural Overview

[Figure: InfluxDB Clustering Overview]

InfluxEnterprise lets you run InfluxDB as a cluster made up of two separate software processes: data nodes and meta nodes. Both are required to run an InfluxDB cluster.

The meta nodes expose an HTTP API that the influxd-ctl command uses. System operators use this command to perform operations on the cluster, such as adding and removing servers, moving shards (large blocks of data) around the cluster, and other administrative tasks. The meta nodes communicate with each other through a TCP and Protobuf protocol and form a Raft consensus group.

Data nodes communicate with each other through a TCP and Protobuf protocol. Within a cluster, all meta nodes must communicate with all other meta nodes. All data nodes must communicate with all other data nodes and all meta nodes.
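To make the connectivity rules above concrete, here is a minimal Python sketch that enumerates the node pairs that need to reach each other. The function name `required_links` and the host names are hypothetical and only serve the illustration; this is not part of any InfluxEnterprise tooling.

```python
from itertools import combinations


def required_links(meta_nodes, data_nodes):
    """Return the set of node pairs that must be able to reach each other."""
    links = set()
    # Every meta node communicates with every other meta node.
    links.update(combinations(meta_nodes, 2))
    # Every data node communicates with every other data node.
    links.update(combinations(data_nodes, 2))
    # Every data node communicates with every meta node.
    links.update((d, m) for d in data_nodes for m in meta_nodes)
    return links


# Hypothetical host names for a 3 meta node, 2 data node cluster.
meta = ["meta-1", "meta-2", "meta-3"]
data = ["data-1", "data-2"]
links = required_links(meta, data)

for a, b in sorted(links):
    print(f"{a} <-> {b}")
print(f"{len(links)} links in total")
```

A list like this can be handy as a checklist when writing firewall or security group rules, since a single missing link between two nodes breaks the requirements described above.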

The meta nodes keep a consistent view of the metadata that describes the cluster. The meta cluster uses the HashiCorp implementation of Raft as the underlying consensus protocol. This is the same implementation that HashiCorp uses in Consul. The meta nodes can run on very modestly sized VMs (a t2.micro is sufficient in most cases).

The data nodes replicate data and query each other via a Protobuf protocol over TCP. Details on replication and querying are covered in the documentation. Data nodes are responsible for handling all writes and queries. Sizing is dependent on your schema and your write and query load.

Optimal Server Counts

For optimal InfluxDB Clustering, you’ll need to choose how many meta and data nodes to configure and connect. You can think of InfluxEnterprise as two separate clusters that communicate with each other: a cluster of meta nodes and one of data nodes.

Meta Nodes: The magic number is 3!

The number of meta nodes is driven by the number of meta node failures the cluster needs to be able to handle, while the number of data nodes scales based on your storage and query needs.

The consensus protocol requires a quorum (a strict majority of the meta nodes) to perform any operation, so there should always be an odd number of meta nodes. For almost all use cases, 3 meta nodes is the correct number, and such a cluster will operate normally even with the loss of 1 meta node. A cluster with 4 meta nodes can still only survive the loss of 1 node. Losing a second node means the remaining two nodes can only gather two votes out of a possible four, which is not a majority. Since a cluster of 3 meta nodes can also survive the loss of a single meta node, adding a fourth node achieves no extra redundancy and only complicates cluster maintenance. At higher numbers of meta nodes the communication overhead increases significantly, so a configuration of 5 meta nodes is likely the maximum you’d ever want to have.
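The quorum arithmetic can be checked with a few lines of Python. This is only a sketch of the majority rule described above; the function names `quorum` and `tolerated_failures` are made up for the example.

```python
def quorum(meta_nodes: int) -> int:
    """Smallest number of votes that forms a strict majority."""
    return meta_nodes // 2 + 1


def tolerated_failures(meta_nodes: int) -> int:
    """How many meta nodes can be lost while a quorum is still reachable."""
    return meta_nodes - quorum(meta_nodes)


for n in range(1, 6):
    print(f"{n} meta nodes: quorum={quorum(n)}, "
          f"tolerated failures={tolerated_failures(n)}")
```

Running it shows that 3 and 4 meta nodes both tolerate a single failure, while 5 meta nodes tolerate two.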

Data Nodes: Based on scalability requirements

Data nodes hold the actual time series data. You need at least 1 data node and can scale up from there. Generally, you’ll want to run a number of data nodes that is evenly divisible by your replication factor. For instance, if you have a replication factor of 2, you’ll want to run 2, 4, 6, 8, 10, etc. data nodes. However, that’s not a hard and fast rule, particularly because you can have different replication factors in different retention policies.
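As a quick illustration of the divisibility guideline, the following Python sketch lists the data node counts that divide evenly by a given replication factor. The helper names are hypothetical, and the check only encodes the general guideline above, not a hard requirement.

```python
def is_evenly_divisible(data_nodes: int, replication_factor: int) -> bool:
    """True if the data node count is a multiple of the replication factor."""
    return data_nodes % replication_factor == 0


def suggested_counts(replication_factor: int, limit: int) -> list[int]:
    """Data node counts up to `limit` that match the guideline."""
    return [n for n in range(1, limit + 1)
            if is_evenly_divisible(n, replication_factor)]


print(suggested_counts(2, 10))
print(suggested_counts(3, 10))
```

With a replication factor of 2 this yields the 2, 4, 6, 8, 10 sequence from the text; with a replication factor of 3 it yields 3, 6, 9.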

As a rule of thumb: an InfluxDB cluster should have 3 meta nodes and an even number of data nodes.