Replication¶
Normal operations¶
During normal operations, the master node keeps the cluster synchronised by processing updates.
When the master detects an out-of-sync or slow worker, for example because the worker was offline or unreachable due to network problems, it tries to bring that worker to the same state as the other worker nodes. The master checks whether the signature of the out-of-sync worker corresponds to a known state that was reached earlier in the transaction log. If so, it replays the missing transactions to the problem worker until it reaches the same state as the other worker nodes. During this time, the cluster remains in read-write mode and can process updates.
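The catch-up check described above can be sketched as follows. This is only an illustration under stated assumptions: the transaction log is modelled as an ordered list of (signature, transaction) pairs, and the names LogEntry and missing_transactions are hypothetical, not part of GraphDB.

```python
from typing import List, Optional, Tuple

# Hypothetical model: each log entry holds the cluster signature reached
# after the transaction was applied, plus the transaction itself.
LogEntry = Tuple[str, str]


def missing_transactions(log: List[LogEntry],
                         worker_signature: str) -> Optional[List[str]]:
    """Return the transactions the worker still has to replay, or None
    if its signature does not match any known state in the log."""
    for position, (signature, _) in enumerate(log):
        if signature == worker_signature:
            # Everything after the matching state must be replayed.
            return [tx for _, tx in log[position + 1:]]
    return None


log = [("sig-1", "tx-1"), ("sig-2", "tx-2"), ("sig-3", "tx-3")]
print(missing_transactions(log, "sig-1"))  # ['tx-2', 'tx-3']
print(missing_transactions(log, "sig-x"))  # None
```

A None result corresponds to the case where the signature is unknown and replaying transactions is not possible.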
Replication¶
When the master detects an out-of-sync or a slow worker whose signature does not correspond to a known state in the transaction log, it initiates replication. It chooses an up-to-date worker that is further in the execution of the transaction logs to replicate to the problem worker node. Replicating a worker involves shutting down both workers and initiating a binary transfer of the good worker’s storage folder directly to the storage folder of the bad worker. When replication is complete, both nodes become available.
Note
Replication is always automatic. Manual and remote replication are no longer supported.
Note
Setting the parameter IncrementalUpdateLimit to a number of missing updates is no longer supported.
The logic behind¶
When the signature of the problem worker corresponds to a known state, the master automatically chooses between replaying the missing transactions (incremental updates) and replicating the given worker, based on which one will be faster.
Let’s say that the estimated duration of the incremental update is incrementalDurationS and the estimated duration of replication is replicationDurationS. GraphDB EE prefers replication when both of these are true:
- incrementalDurationS > replicationDurationS * FullReplicationTimeFactor: this is the speed-up. Replication must be faster than the incremental update by at least this factor.
- incrementalDurationS > MinTimeToConsiderFullReplicationS: this handles the case when the relative difference is big but the absolute difference is small, e.g., 1s for full replication vs. 2s for incremental updates.
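The two conditions above can be written as a small predicate. This is a minimal sketch, not GraphDB code: the function name prefers_replication is hypothetical, and the default values are the ones listed for the corresponding parameters in this document.

```python
def prefers_replication(incremental_duration_s: float,
                        replication_duration_s: float,
                        full_replication_time_factor: float = 1.3,
                        min_time_to_consider_full_replication_s: float = 600) -> bool:
    """Return True when full replication is preferred over replaying
    the missing transactions (both conditions must hold)."""
    # Condition 1: replication is faster by at least the configured factor.
    fast_enough = (incremental_duration_s
                   > replication_duration_s * full_replication_time_factor)
    # Condition 2: the incremental update is long enough in absolute terms.
    long_enough = (incremental_duration_s
                   > min_time_to_consider_full_replication_s)
    return fast_enough and long_enough


print(prefers_replication(2, 1))        # False: relative gain, but only seconds
print(prefers_replication(3600, 1800))  # True: both conditions hold
print(prefers_replication(700, 600))    # False: 700 is not > 600 * 1.3
```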
Parameters¶
There are three parameters that control the smart replication process. They are set via the JMX bean ReplicationCluster:name=ClusterInfo/{$MASTER} and are persisted in the master’s configuration file.
Parameter | Type | Default value | Description |
---|---|---|---|
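NetSpeedBitsPerSec | bits/sec (long) | 104857600 (100Mbps) | The network speed. Used to estimate the time for full replication. |
FullReplicationTimeFactor | ratio (float) | 1.3 | Speed-up ratio. Replication is preferred only if the incremental update estimate exceeds the replication estimate multiplied by this factor. |
MinTimeToConsiderFullReplicationS | seconds (long) | 600 (10 minutes) | Minimum absolute time. Replication is considered only if the incremental update estimate exceeds this value. |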
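The exact formula GraphDB uses to estimate the full replication time is not given here. As a rough illustration only, the sketch below assumes the estimate is the storage size in bits divided by NetSpeedBitsPerSec, ignoring compression and protocol overhead. The function name and the 41 GiB example size are assumptions.

```python
# Default NetSpeedBitsPerSec from the table above (100Mbps).
NET_SPEED_BITS_PER_SEC = 104_857_600


def estimate_replication_duration_s(storage_size_bytes: int,
                                    net_speed_bits_per_sec: int = NET_SPEED_BITS_PER_SEC) -> float:
    """Assumed estimate: time to push the whole storage folder over the
    network at the configured speed, with no compression or overhead."""
    return storage_size_bytes * 8 / net_speed_bits_per_sec


size_bytes = 41 * 1024**3  # hypothetical 41 GiB storage folder
seconds = estimate_replication_duration_s(size_bytes)
print(f"{seconds:.0f} s (about {seconds / 60:.0f} minutes)")
```

With the default network speed, this assumed formula gives about 56 minutes for a 41 GiB storage folder.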
Log messages showing the reasons for starting replication¶
Other log messages¶
Compression for replications¶
Hint
The example given for compressing the repositories uses Snappy.
Note
Enable the compression on all masters and all workers. A mixed setup results in problems with the replication, including backup and restore.
Compression is enabled by default. To disable replication compression, add this setting:
-Dreplication.compression=false
Expected speed up¶
The following table shows the replication times measured in a local area network, with and without compression.
Repository size: 41GB
Network link | Without compression | With compression (default) |
---|---|---|
1Gb/s | 6m30.074s | 3m10.519s |
100Mb/s | 57m38.210s | 23m58.711s |
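To make the speed-up explicit, the measured times in the table can be converted to seconds and divided. The helper names below are illustrative. Only the four durations come from the table.

```python
import re


def parse_duration_s(text: str) -> float:
    """Convert a duration such as '6m30.074s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text)
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


def speedup(without_compression: str, with_compression: str) -> float:
    """Ratio of uncompressed to compressed replication time."""
    return parse_duration_s(without_compression) / parse_duration_s(with_compression)


# Durations copied from the table above (41GB repository).
measurements = {
    "1Gb/s": ("6m30.074s", "3m10.519s"),
    "100Mb/s": ("57m38.210s", "23m58.711s"),
}
for link, (without, with_c) in measurements.items():
    print(f"{link}: {speedup(without, with_c):.2f}x faster with compression")
```

For these measurements, compression roughly halves the replication time on the 1Gb/s link and gives a somewhat larger gain (about 2.4x) on the slower 100Mb/s link.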