Cluster failures

Note

This section presumes that you have familiarized yourself with Cluster Topologies, and are aware of the pros and cons of your current topology.

Failures

Dead worker

When a worker node dies, the cluster will continue serving read and write operations, as long as there are at least two more workers in OK status. One of the live workers will be available for serving replications, backups, and read requests while staying in replication server mode, and the remaining workers will process read and write queries.

If for any reason multiple workers die and only one worker remains, the cluster will not be able to self-recover or perform backups without stopping write operations. During a backup or replication, the last operational worker will enter read-only mode and stop serving writes.

Hint

The behavior of the last functional worker is controlled by the ProtectSingleWorker flag. By default, the parameter value is false, and the master allows starting a backup or replication even if this will stop the writes.
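
The flag is typically toggled through the master's JMX interface. The following is a minimal sketch of doing so from Java; the JMX address (master-host:8089), the repository id myrepo, and the ReplicationCluster:name=ClusterInfo/<repository> object name are assumptions made for illustration, so verify the actual object name and JMX port of your master (for example with JConsole) before using it.

    import javax.management.Attribute;
    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class ProtectSingleWorkerToggle {
        public static void main(String[] args) throws Exception {
            // JMX address of the master node -- host and port are placeholders.
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://master-host:8089/jmxrmi");

            try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
                MBeanServerConnection mbsc = connector.getMBeanServerConnection();

                // Assumed object name of the master's cluster MBean for the
                // "myrepo" repository; check your deployment for the real name.
                ObjectName cluster = new ObjectName(
                        "ReplicationCluster:name=ClusterInfo/myrepo");

                // Read the current value, then enable the protection so that the
                // master no longer starts a backup or replication at the price of
                // stopping writes on the last remaining worker (the default,
                // false, allows it).
                Object current = mbsc.getAttribute(cluster, "ProtectSingleWorker");
                System.out.println("ProtectSingleWorker = " + current);

                mbsc.setAttribute(cluster,
                        new Attribute("ProtectSingleWorker", Boolean.TRUE));
            }
        }
    }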

Dead master

In the event of a master hardware failure, the client API either delays the failure or routes read and write queries to a properly functioning master; the exact behavior varies between the different topologies.
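
As an illustration of the second behavior, the sketch below shows a client that sends a SPARQL query to the primary master and falls back to a second master if the request fails. The endpoint URLs, the port 7200, and the repository id myrepo are placeholders; which masters may actually serve a given query depends on your topology.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.List;

    public class FailoverQueryClient {

        // Hypothetical master endpoints: the first is the primary master, the
        // second is another master that can serve queries if the primary fails.
        private static final List<String> ENDPOINTS = List.of(
                "http://master1:7200/repositories/myrepo",
                "http://master2:7200/repositories/myrepo");

        private static final HttpClient HTTP = HttpClient.newHttpClient();

        // Sends a SPARQL SELECT query, falling back to the next master on failure.
        static String query(String sparql) throws Exception {
            Exception last = null;
            for (String endpoint : ENDPOINTS) {
                try {
                    HttpRequest request = HttpRequest.newBuilder(URI.create(endpoint))
                            .header("Content-Type", "application/sparql-query")
                            .header("Accept", "application/sparql-results+json")
                            .POST(HttpRequest.BodyPublishers.ofString(sparql))
                            .build();
                    HttpResponse<String> response =
                            HTTP.send(request, HttpResponse.BodyHandlers.ofString());
                    if (response.statusCode() == 200) {
                        return response.body();
                    }
                    last = new IllegalStateException(
                            "HTTP " + response.statusCode() + " from " + endpoint);
                } catch (Exception e) {
                    last = e;   // connection refused, timeout, etc. -- try the next master
                }
            }
            throw last;
        }

        public static void main(String[] args) throws Exception {
            System.out.println(query("SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }"));
        }
    }

Write (SPARQL UPDATE) requests can be routed the same way, provided the fallback master is one that accepts writes in your topology.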

The primary master dies

In case of primary master failure, the read-only master will continue to serve read queries. You can also manually switch the second master to primary so that it starts accepting writes.
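
A minimal sketch of such a manual switch via JMX is shown below. The ReadOnly attribute name, the object name, and the JMX address are assumptions used purely for illustration; inspect the attributes actually exposed by your master's cluster MBean before relying on them.

    import javax.management.Attribute;
    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class PromoteSecondMaster {
        public static void main(String[] args) throws Exception {
            // JMX address of the surviving (read-only) master -- a placeholder.
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://master2-host:8089/jmxrmi");
            try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
                MBeanServerConnection mbsc = connector.getMBeanServerConnection();
                ObjectName cluster = new ObjectName(
                        "ReplicationCluster:name=ClusterInfo/myrepo");

                // Assumed attribute: clearing the read-only flag lets this master
                // start accepting writes while the failed primary is down.
                mbsc.setAttribute(cluster, new Attribute("ReadOnly", Boolean.FALSE));
            }
        }
    }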

The read-only master dies

If the read-only master dies, the primary master will no longer be able to synchronize its transaction log with it, and all writes will stop unless the read-only master is unpeered.
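
Unpeering is an administrative operation on the primary master's cluster MBean. Because the exact operation name can differ between versions, the hedged sketch below only connects over JMX and lists the operations the MBean exposes, so the unpeering operation can be identified and then called with MBeanServerConnection.invoke(); the JMX address and object name are placeholders.

    import javax.management.MBeanInfo;
    import javax.management.MBeanOperationInfo;
    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class ListClusterOperations {
        public static void main(String[] args) throws Exception {
            // JMX address of the primary master -- a placeholder.
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://master1-host:8089/jmxrmi");
            try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
                MBeanServerConnection mbsc = connector.getMBeanServerConnection();
                ObjectName cluster = new ObjectName(
                        "ReplicationCluster:name=ClusterInfo/myrepo");

                // Print every operation the cluster MBean exposes; look for the
                // one that removes/unpeers the dead read-only master.
                MBeanInfo info = mbsc.getMBeanInfo(cluster);
                for (MBeanOperationInfo op : info.getOperations()) {
                    System.out.println(op.getName() + " : " + op.getDescription());
                }
            }
        }
    }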

The muted master dies

When a muted master dies, the primary master will continue serving read and write operations. All workers connected to the failed master will stop receiving updates.

Rolling upgrade

The rolling upgrade scenario is not related to a hardware fault, but to the need for production environments to be upgraded with little or no downtime. It involves shutting down the worker and master nodes (in a particular order), upgrading them to a newer GraphDB version, and starting all nodes again, while the cluster keeps handling the read/write load. Different topologies provide different levels of failover:

  • no failover: the simplest case (upgrade with downtime);

  • delayed writes: read queries are handled by the cluster, while write queries are accepted but not executed until the upgrade is finished;

  • no downtime: both read and write queries are handled during the upgrade.