Replication¶
Normal operations¶
During normal operations, the master node keeps the cluster synchronized by processing updates.
When the master detects an out-of-sync worker (for example, one that was offline or unreachable due to network problems), it tries to bring the problem worker node to the same state as the other worker nodes. The master checks whether the signature of the out-of-sync worker corresponds to a known state recorded earlier in the transaction log. If so, it replays the missing transactions to the problem worker node until it reaches the same state as the other worker nodes. During this process, the cluster remains in read-write mode and can process updates.
Replication¶
When the master detects an out-of-sync worker whose signature does not correspond to a known state in the transaction log, it initiates replication. It chooses an up-to-date worker that is further ahead in executing the transaction log and replicates it to the problem worker node. Replicating a worker involves shutting down both workers and initiating a binary transfer of the good worker’s storage folder directly to the storage folder of the bad worker. When replication is complete, both nodes become available again.
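The signature check that separates the two cases above can be sketched as follows. This is only an illustration and not GraphDB's implementation: it assumes the transaction log can be viewed as an ordered list of state signatures, and the names `find_known_state` and `recovery_plan` are hypothetical.

```python
from typing import List, Optional, Tuple


def find_known_state(log_signatures: List[str], worker_signature: str) -> Optional[int]:
    """Return the index of the most recent log state matching the worker's
    signature, or None if the signature is unknown (hypothetical helper)."""
    for index in range(len(log_signatures) - 1, -1, -1):
        if log_signatures[index] == worker_signature:
            return index
    return None


def recovery_plan(log_signatures: List[str], worker_signature: str) -> Tuple[str, List[int]]:
    """Decide how an out-of-sync worker could be brought up to date.

    Returns ("incremental", missing_indexes) when the worker is at a known
    state, so the missing transactions can be replayed, and
    ("replicate", []) when the state is unknown and a full copy is needed.
    """
    position = find_known_state(log_signatures, worker_signature)
    if position is None:
        return ("replicate", [])
    missing = list(range(position + 1, len(log_signatures)))
    return ("incremental", missing)


# Assumed example log: the state signature after each of four transactions.
signatures = ["s0", "s1", "s2", "s3"]
print(recovery_plan(signatures, "s1"))       # worker is two transactions behind
print(recovery_plan(signatures, "unknown"))  # no matching state in the log
```

The sketch covers only the signature lookup. When the state is known, the master still weighs incremental updates against replication by estimated duration, as described under "The logic behind".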
Note that setting the parameter IncrementalUpdateLimit to a number of missing updates is no longer supported.
The logic behind¶
When the signature of the problem worker corresponds to a known state, the master automatically chooses between replaying the missing transactions (incremental updates) and replicating the worker, based on which option will be faster.
Let the estimated duration of the incremental update be incrementalDurationS and the estimated duration of replication be replicationDurationS. GraphDB EE prefers replication when both of the following are true:

- incrementalDurationS > replicationDurationS * FullReplicationTimeFactor: replication must be faster than the incremental update by at least this speed-up factor.
- incrementalDurationS > MinTimeToConsiderFullReplicationS: this handles the case when the relative difference is big but the absolute difference is small, e.g., 1 sec for full replication vs. 2 sec for incremental updates.
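The two conditions can be expressed as a small predicate. This is a minimal sketch, not GraphDB's code: the function name `prefers_replication` is hypothetical, and the default arguments are the default values listed in the Parameters section.

```python
def prefers_replication(
    incremental_duration_s: float,
    replication_duration_s: float,
    full_replication_time_factor: float = 1.3,
    min_time_to_consider_full_replication_s: float = 600,
) -> bool:
    """Return True when full replication is preferred over incremental updates."""
    # Condition 1: replication is faster by at least the speed-up factor.
    faster_by_factor = (
        incremental_duration_s
        > replication_duration_s * full_replication_time_factor
    )
    # Condition 2: the incremental update is long enough in absolute terms
    # for a full replication to be worth considering at all.
    long_enough = incremental_duration_s > min_time_to_consider_full_replication_s
    return faster_by_factor and long_enough


# 2 sec incremental vs. 1 sec replication: large ratio, tiny absolute gain.
print(prefers_replication(2, 1))        # False
# 1 hour incremental vs. 20 minutes replication: both conditions hold.
print(prefers_replication(3600, 1200))  # True
# 25 minutes incremental vs. 20 minutes replication: ratio below 1.3.
print(prefers_replication(1500, 1200))  # False
```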
Parameters¶
There are three parameters that control the smart replication process. They are set via the JMX bean ReplicationCluster:name=ClusterInfo/{$MASTER} and are persisted in the master’s configuration file.
| Parameter | Type | Default value | Description |
|---|---|---|---|
| | bits/sec (long) | 104857600 (100Mbps) | The network speed. Used to estimate the time for full replication. |
| FullReplicationTimeFactor | ratio (float) | 1.3 | Speed-up ratio. |
| MinTimeToConsiderFullReplicationS | seconds (long) | 600 (10 minutes) | Minimum absolute time. |
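To illustrate how the network speed parameter can feed the estimate for full replication, the sketch below divides the size of the storage folder by the configured speed. The formula is an assumption made for illustration: the source only states that the network speed is used for the estimate, and the name `estimate_replication_duration_s` and the 10 GiB storage size are hypothetical.

```python
def estimate_replication_duration_s(
    storage_size_bytes: int,
    network_speed_bits_per_s: int = 104857600,  # default value from the table
) -> float:
    """Estimate the transfer time of a storage folder in seconds.

    Assumes the transfer is limited only by the configured network speed;
    shutdown, startup and compression effects are ignored.
    """
    return storage_size_bytes * 8 / network_speed_bits_per_s


# Assumed example: a 10 GiB storage folder at the default network speed.
storage_size = 10 * 1024**3
estimate = estimate_replication_duration_s(storage_size)
print(f"Estimated replication time: {estimate:.1f} s")  # about 819.2 s
```

With the default MinTimeToConsiderFullReplicationS of 600 seconds, an estimate like this one would only lead to replication if the incremental update were expected to take noticeably longer than the transfer.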
Log messages showing the reasons for starting replication¶
Other log messages¶
Compression for replications¶
Hint
The example below for compressing the repositories uses Snappy.
Note
Enable the compression on all masters and all workers. A mixed setup will result in problems with the replication, including backup and restore.
Compression is enabled by default. To disable replication compression, add this setting:
-Dreplication.compression=false
Expected speed-up¶
The following table shows replication times measured in a local area network, with and without compression.
Repository size: 41 gigabytes

Network link | Without compression | With compression (default) |
---|---|---|
1Gb/s | 6m30.074s | 3m10.519s |
100Mb/s | 57m38.210s | 23m58.711s |
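The speed-up implied by these measurements can be computed directly from the table. The snippet below is a small helper for that arithmetic; the function name `parse_duration_s` is hypothetical, and the timings are copied from the table above.

```python
import re


def parse_duration_s(text: str) -> float:
    """Convert a duration such as '6m30.074s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text)
    if match is None:
        raise ValueError(f"Unrecognized duration: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


# (without compression, with compression) per network link, from the table.
timings = {
    "1Gb/s": ("6m30.074s", "3m10.519s"),
    "100Mb/s": ("57m38.210s", "23m58.711s"),
}

speedups = {
    link: parse_duration_s(plain) / parse_duration_s(compressed)
    for link, (plain, compressed) in timings.items()
}

for link, ratio in speedups.items():
    print(f"{link}: {ratio:.2f}x faster with compression")
```

For this repository, compression makes replication roughly 2.0 times faster on the 1Gb/s link and roughly 2.4 times faster on the 100Mb/s link, so the benefit is larger on the slower network.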