Cluster Basics

High availability features

A system can be characterized as having high availability if it meets several key criteria such as: having a high uptime, recovering smoothly, achieving zero data loss, and essentially handling and adapting to unexpected situations and scenarios.

The GraphDB cluster is designed for high availability and has several features that are crucial for achieving enterprise-grade highly available deployments. It is based on coordination mechanisms known as consensus algorithms. They allow a collection of machines to work as a coherent group that can survive the failures of some of its members and provide lower latency. In essence, such protocols define the set of rules for messaging between machines. Because of this, they play a key role in building reliable large-scale software systems.

Consensus algorithms aim to be fault-tolerant, where faults can be classified in two categories:

  • Crash failure: The component abruptly stops functioning and does not resume. The other components can detect the crash and adjust their local decisions in time.

  • Byzantine failure: The component behaves arbitrarily with no absolute conditions. It can send contradictory messages to other components or simply remain silent. It may look normal from outside.

The GraphDB cluster uses the Raft consensus algorithm for managing a replicated log on distributed state machines. It implements consensus by first electing a distinguished leader, then giving the leader complete responsibility for managing the replicated log. The leader accepts log entries from clients, replicates them on other servers, and tells servers when it is safe to apply log entries to their state machines. It can tolerate n 2m + 1 failures.

Quorum-based replication

The GraphDB cluster relies on quorum-based replication, meaning that the cluster should have over 50% alive nodes in order to be able to execute INSERT/DELETE operations. This ensures that there will always be a majority of GraphDB nodes that always have up-to-date data.

If there are unavailable nodes when an INSERT/DELETE operation is executed, but there are more than 50% alive nodes, the request will be accepted, distributed among the reachable alive nodes, and saved if everything is OK. Once the unavailable nodes come back online, the transactions will be distributed to them as well.

If there are fewer than 50% available nodes, any INSERT/DELETE operations will be rejected.

Internal and external proxy

Internal proxy

In normal working conditions, the cluster nodes have two states - leader and follower. The follower nodes can accept read requests, but cannot write any data. To make it easier for the user to communicate with the cluster, an integrated proxy will redirect all requests (with some exceptions) to the leader node. This ensures that regardless of which cluster node is reached, it can accept all user requests.

However, if a GraphDB cluster node is unavailable, you need to switch to another cluster node that will be on another URL. This means that you need to know all cluster node addresses and make sure that the reached node is healthy and online.

External proxy

For even better usability, the proxy can be deployed separately on its own URL. This way, you do not need to know where all cluster nodes are. Instead, there is a single URL that will always point to the leader node.

The externally deployed proxy will behave like a regular GraphDB instance, including opening and using the Workbench. It will always know which one the leader is and will always serve all requests to the current leader.

Query load balancer

In order to achieve maximum efficiency, the GraphDB cluster distributes the incoming read queries to all nodes, prioritizing the ones that have fewer running queries. This ensures the optimal hardware resource utilization of all nodes.

Cluster roles

As mentioned above, the GraphDB cluster is made up of two basic node types: leaders and followers. Usually, it comprises an odd number of nodes in order to tolerate failures. At any given time, each of the nodes is in one of four states:

  • Leader: Usually, there is one leader that handles all client requests, i.e., if a client contacts a follower, the follower redirects it to the leader.

  • Follower: A cluster is made up of one leader and all other servers are followers. They are passive, meaning that they issue no requests on their own but simply respond to requests from leaders and candidates.

  • Candidate: This state is used when electing a new leader.

  • Restricted: In this state, the node cannot respond to requests from other nodes and cannot participate in election. A node goes into this state when there is a license issue, i.e., invalid or expired license.

Terms

The Raft algorithm divides time into terms of arbitrary length. Terms are numbered with consecutive integers. Each term begins with an election, in which one or more candidates attempt to become leader. If a candidate wins the election, then it serves as leader for the rest of the term. In some situations an election will result in a split vote. In this case the term will end with no leader; a new term (with a new election) will commence. Raft ensures that there is at most one leader in a given term.

Different servers may observe the transitions between terms at different times. Raft terms act as a logical clock in Raft, and they allow servers to detect obsolete information such as stale leaders. Each server stores a current term number, which increments with term passings. Current terms are exchanged whenever servers communicate; if one server’s current term is smaller than the other’s, then it updates its current term to the larger value. If a candidate or leader discovers that its term is out of date, it immediately reverts to the follower state. If a server receives a request with a stale term number, it rejects the request.

Log replication

The GraphDB cluster nodes communicate using remote procedure calls (RPCs), and the basic consensus algorithm requires only two types of RPCs:

  • RequestVote: RPCs that are initiated by candidates during elections

  • AppendEntries: RPCs that are initiated by leaders to replicate log entries and to provide a form of heartbeat

Servers retry RPCs if they do not receive a response in a timely manner, and they issue RPCs in parallel for best performance.

The log replication resembles a two-phase commit where:

  1. The user sends a commit transaction request.

  2. The transaction is replicated in local transaction log.

  3. The transaction is replicated to other followers in parallel.

  4. The leader waits until enough members (total/2 + 1) have replicated the entry.

  5. The leader start applying the entry to GraphDB successfully.

  6. The leader sends a heartbeat until successful commit in GraphDB.

  7. The leader sends a second RPC informing followers to apply log entry to GraphDB.

  8. The leader informs the client that the transaction is successful.

_images/cluster-log-replication.png

Note

If followers crash or run slowly, or if network packets are lost, the leader retries AppendEntries RPCs indefinitely in parallel (even after it has responded to the client) until all followers eventually store all log entries.

Only updates relating to repositories, data manipulation, and access are replicated between logs. This includes adding/deleting repositories, user right changes, SQL views, smart updates, and standard repository updates.

Leader election

Raft uses a heartbeat mechanism to trigger leader election. When nodes start up, they begin as followers. A node remains in the follower state as long as it is receiving valid RPCs from a leader or candidate. Leaders send periodic heartbeats (AppendEntries RPCs that carry no log entries) to all followers in order to maintain their authority. If a follower receives no communication over a period of time called the election timeout, then it assumes there is no viable leader and begins an election to choose a new leader. A candidate wins an election if it receives votes from a majority of the servers in the full cluster for that term. Each node can vote for at most one candidate in a given term, on a first-come-first-served basis.

If the cluster gets into a situation where only one node is left, that node will switch to read-only mode. The state shown in its cluster status will switch to candidate, as it cannot achieve a quorum with other cluster nodes when new data is written.

The leader election process goes as follows:

  1. After the initial configuration request has been sent, one of the nodes will be set as leader at random.

  2. If the current leader node stops working for some reason, a new election is being held to promote the most voted follower nodes to candidate status, and one of those candidates will become the new leader.

  3. The leader node sends a constant heartbeat (a form of node status check to see if the node is present and able to perform its tasks).

  4. If only one node is left active for some reason, its status will change to candidate and it will switch to read-only mode to prevent further tinkering with it, until more nodes appear in the cluster group.

_images/cluster-automatic-leader-election.png

Image source: Stanford Digital Repository