Friday, May 5, 2023

 

The preceding post discussed the Bayou writes and the write stability and commitment. This article discusses a conflict resolution system with an inspiration from the Bayou model.

The service we look under the hood is Cosmos DB whose update propagation, conflict resolution and causality tracking have a frame of reference in Bayou system. Some transformations were applied because Cosmos DB is massively scalable and comes with resource governance, multiple consistency levels, and stringent and comprehensive SLAs.

Cosmos DB makes use of a partition set that is distributed across multiple regions and allows multi-region writes. A replication protocol replicates data among the physical partitions comprising a given partition-set. Each physical partition accepts writes and serves reads from clients typically co-located in the same region. Writes accepted by a physical partition within a region are durably committed and made highly available before they are acknowledged to the client. The set of writes is exchanged with other partitions within the partition-set and resolved based on the virtual time. Clients can request either tentative or committed writes by passing a request header. This exchange is dynamic and based on the topology of the partition-set, regional proximity of the physical partitions and the configured consistency levels. The commit scheme used by Cosmos DB is the primary commit. Knowledge of which writes have committed, and in which order they were committed then propagates to the other partitions. The primary behaves like any other partition in all other aspects. Different replicated data collections can have a different partition designated as primary. The primary is dynamic and integral part of the reconfiguration of the partition-set based on the topology of the overlay. The committed writes including multi-row or batched updates are guaranteed to be ordered by virtue of the vector clock.

Vector clocks are used to encode virtual time and region ID during each exchange for consensus with the replica set and partition set respectively. This helps with causality tracking and version vectors help resolve the update conflict because the partition with the timestamp that is lowest among all the available clocks can execute a stable write. With the help of version vectors, the update conflicts are resolved while the topology and peer selection algorithm are designed to ensure fixed and minimal storage and minimal network overhead of version vectors. This algorithm guarantees the strict convergence property.

The Azure Cosmos DB databases configured with multiple write regions allows interchangeable conflict resolution policies for the clients and these include ‘Last-Write-Wins’ and application defined ‘Custom’ conflict resolution policies.  The Last-Write-Wins allows customization of any numerical property to override the system-defined timestamp property-based clock protocol. The Custom conflict resolution policy allows application defined merge procedure as part of the commitment protocol with the guarantee that these procedures are executed exactly once per write-write conflict. All these procedures take the same input parameters of ‘incomingItem’ for the write that generates the conflicts, the ’existingitem’for the currently persisted item, ‘isTombstone’ to indicate if the incomingItem conflicts with a previously deleted item and ‘conflictingItems’ that includes the list of persisted versions of all items that are conflicting with the incomingItem.  A simple merge conflict procedure could simply let deleted item to win, existing item to win, existing conflicting item to win or incoming item to win in that order.

Reference: https://medium.com/paxos-algorithm/paxos-algorithm-71b76aaeb6b

No comments:

Post a Comment