Company
Date Published
Author
David Taylor
Word count
2319
Language
English
Hacker News points
None

Summary

The CockroachDB Physical Cluster Replication (PCR) feature allows organizations to recover from disasters by adding "warm standby" clusters to their most critical deployments, providing an added layer of resilience and disaster preparedness. The primary design goal for PCR is a recovery solution for full-cluster loss events with a much lower Recovery Time Objective (RTO) than restoring from backups. To achieve this, the system replicates every write from an active (primary) cluster to a standby cluster, so that the standby can serve as a low-RTO failover target. Replication operates at scale, is asynchronous, and provides transactional consistency after recovery. The PCR process runs almost entirely on the standby cluster, which divides the primary cluster's data into partitions and assigns them to standby nodes. Each standby node then connects to a node in the primary cluster to receive notifications of every change to every row in its assigned partition. The standby node buffers and applies these changes, tracks the timestamp up to which it knows it has applied all changes, and forwards that timestamp to a central coordinator process. On failover, the system can "revert" any partially applied changes that violate transactional consistency: it finds and removes every KV stored with an MVCC timestamp greater than the timestamp to which the cluster is failing over, restoring the contents of the cluster as of that timestamp.
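
The timestamp tracking and revert steps described above can be illustrated with a toy model. This is a minimal sketch, not CockroachDB's implementation: `ToyMVCCStore`, `replicated_frontier`, the partition names, and the integer timestamps are all hypothetical. It also assumes that the coordinator picks the failover timestamp as the minimum of the per-partition timestamps, which the summary does not state.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMVCCStore:
    """Hypothetical in-memory stand-in for the standby's MVCC key-value data."""

    # key -> {mvcc_timestamp: value}
    versions: dict[str, dict[int, str]] = field(default_factory=dict)

    def apply(self, key: str, ts: int, value: str) -> None:
        """Apply one replicated change as a new version of the key."""
        self.versions.setdefault(key, {})[ts] = value

    def revert_to(self, cutoff: int) -> None:
        """Remove every version whose timestamp is greater than the cutoff."""
        for key in list(self.versions):
            kept = {ts: v for ts, v in self.versions[key].items() if ts <= cutoff}
            if kept:
                self.versions[key] = kept
            else:
                del self.versions[key]

    def read_latest(self, key: str):
        """Return the newest surviving value for the key, or None."""
        history = self.versions.get(key)
        if not history:
            return None
        return history[max(history)]


def replicated_frontier(applied_through: dict[str, int]) -> int:
    """Assumed coordinator rule: the cluster as a whole is only complete up to
    the slowest partition, so take the minimum reported timestamp."""
    return min(applied_through.values())


store = ToyMVCCStore()
store.apply("a", 10, "a-v1")
store.apply("b", 10, "b-v1")
# A transaction at timestamp 20 wrote both "a" and "b" on the primary, but
# only the partition holding "a" has applied its half on the standby so far.
store.apply("a", 20, "a-v2")

# Each partition reports the timestamp through which it has applied all changes.
applied_through = {"p1": 20, "p2": 15}

cutoff = replicated_frontier(applied_through)
store.revert_to(cutoff)

print(cutoff, store.read_latest("a"), store.read_latest("b"))
# prints: 15 a-v1 b-v1
```

In this sketch, reverting to timestamp 15 discards the half-applied transaction at timestamp 20, so both keys read as they did at a single consistent point in time.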