
Saturday, May 6, 2023

 

The data in an Azure Cosmos DB container is horizontally partitioned into many replica-sets, which replicate writes within each region. Replica-sets durably commit writes using a majority quorum. Each region contains all data partitions of a container, so if the account is distributed across N regions, there are N x 4 copies of all the data. Within a datacenter, Cosmos DB runs on massive stamps of machines, each with dedicated local storage. Many clusters are involved, and the machines within a cluster may be spread across 10-20 fault domains for high availability within a region. A machine may host several replicas, each of which comprises a Resource Governor, Transport, Admission Control, and Database Engine. The database engine in turn comprises a Resource Manager, Language Runtime Hosts, Query Processor, RSM, Index Manager, Storage Engine, Log Manager, and I/O Manager.

Global distribution in Azure Cosmos DB is turnkey: regions can be added to or dropped from a database at any time. A database consists of one or more containers, and a container serves as the logical unit of distribution and scalability. Containers are schema-agnostic and provide a scope for a query. In a given region, data within a container is distributed by a partition key and transparently managed by the underlying physical partitions; this is the local distribution pattern. Each physical partition is also replicated across geographical regions, which is the global distribution pattern.
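As a rough illustration of local distribution, the following sketch uses the azure-cosmos Python SDK to create a container with a partition key; the account endpoint, key, and database and container names are hypothetical.

```python
from azure.cosmos import CosmosClient, PartitionKey

# Hypothetical account endpoint and key.
client = CosmosClient("https://myaccount.documents.azure.com:443/",
                      credential="<account-key>")
database = client.create_database_if_not_exists("scheduling")

# The partition key path decides how items are spread across the
# container's underlying physical partitions within a region.
container = database.create_container_if_not_exists(
    id="meetings",
    partition_key=PartitionKey(path="/roomId"),
)
```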

A physical partition is implemented by a group of replicas, called a replica-set. Each machine hosts hundreds of replicas, which correspond to physical partitions that are dynamically placed and load-balanced across machines within a cluster and across datacenters within a region. Local or global distribution does not matter to queries, because Cosmos DB automatically indexes everything upon ingestion and users do not have to deal with schema or index management.

Cosmos DB’s global distribution relies on two key abstractions: replica-sets and partition-sets. A replica-set is a modular Lego block for co-ordination, and a partition-set is a dynamic overlay of one or more geographically distributed physical partitions.

Cosmos DB makes use of a partition-set that is distributed across multiple regions and allows multi-region writes. A replication protocol replicates data among the physical partitions comprising a given partition-set. Each physical partition accepts writes and serves reads from clients that are typically co-located in the same region. Writes accepted by a physical partition within a region are durably committed and made highly available before they are acknowledged to the client. The set of writes is exchanged with the other partitions within the partition-set and resolved based on virtual time. Clients can request either tentative or committed writes by passing a request header. This exchange is dynamic, based on the topology of the partition-set, the regional proximity of the physical partitions, and the configured consistency levels. The commit scheme used by Cosmos DB is primary commit: knowledge of which writes have committed, and in which order, then propagates to the other partitions. The primary behaves like any other partition in all other respects, and different replicated data collections can have different partitions designated as primary. The primary is dynamic and an integral part of the reconfiguration of the partition-set based on the topology of the overlay. Committed writes, including multi-row or batched updates, are guaranteed to be ordered by virtue of the vector clock.

Friday, May 5, 2023

 

The preceding post discussed Bayou writes, write stability, and commitment. This article discusses a conflict resolution system that draws its inspiration from the Bayou model.

The service we look at under the hood is Cosmos DB, whose update propagation, conflict resolution, and causality tracking have a frame of reference in the Bayou system. Some transformations were applied because Cosmos DB is massively scalable and comes with resource governance, multiple consistency levels, and stringent and comprehensive SLAs.

Cosmos DB makes use of a partition-set that is distributed across multiple regions and allows multi-region writes. A replication protocol replicates data among the physical partitions comprising a given partition-set. Each physical partition accepts writes and serves reads from clients that are typically co-located in the same region. Writes accepted by a physical partition within a region are durably committed and made highly available before they are acknowledged to the client. The set of writes is exchanged with the other partitions within the partition-set and resolved based on virtual time. Clients can request either tentative or committed writes by passing a request header. This exchange is dynamic, based on the topology of the partition-set, the regional proximity of the physical partitions, and the configured consistency levels. The commit scheme used by Cosmos DB is primary commit: knowledge of which writes have committed, and in which order, then propagates to the other partitions. The primary behaves like any other partition in all other respects, and different replicated data collections can have different partitions designated as primary. The primary is dynamic and an integral part of the reconfiguration of the partition-set based on the topology of the overlay. Committed writes, including multi-row or batched updates, are guaranteed to be ordered by virtue of the vector clock.

Vector clocks are used to encode virtual time and region ID during each exchange, for consensus within the replica-set and partition-set respectively. This helps with causality tracking, and version vectors help resolve update conflicts: the partition whose timestamp is lowest among all the available clocks can execute a stable write. While version vectors resolve the update conflicts, the topology and peer-selection algorithm are designed to keep the storage and network overhead of version vectors fixed and minimal. This algorithm guarantees the strict convergence property.
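A minimal sketch of the version-vector bookkeeping described above, keyed by region ID; this illustrates the general technique, not Cosmos DB’s actual encoding.

```python
def merge(vc_a: dict, vc_b: dict) -> dict:
    """Combine two version vectors by taking the per-region maximum."""
    return {r: max(vc_a.get(r, 0), vc_b.get(r, 0)) for r in vc_a.keys() | vc_b.keys()}

def happened_before(vc_a: dict, vc_b: dict) -> bool:
    """vc_a causally precedes vc_b if it is <= component-wise and not equal."""
    regions = vc_a.keys() | vc_b.keys()
    return (all(vc_a.get(r, 0) <= vc_b.get(r, 0) for r in regions)
            and vc_a != vc_b)

def in_conflict(vc_a: dict, vc_b: dict) -> bool:
    """Two updates conflict when neither causally precedes the other."""
    return not happened_before(vc_a, vc_b) and not happened_before(vc_b, vc_a)
```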

Azure Cosmos DB databases configured with multiple write regions offer clients interchangeable conflict resolution policies, including ‘Last-Write-Wins’ and application-defined ‘Custom’ policies. Last-Write-Wins allows any numerical property to be designated as the clock, overriding the system-defined timestamp property. The Custom policy allows an application-defined merge procedure to run as part of the commitment protocol, with the guarantee that the procedure executes exactly once per write-write conflict. These procedures take the same input parameters: ‘incomingItem’ for the write that generates the conflict, ‘existingItem’ for the currently persisted item, ‘isTombstone’ to indicate whether the incomingItem conflicts with a previously deleted item, and ‘conflictingItems’, the list of persisted versions of all items that conflict with the incomingItem. A simple merge procedure could let the deleted item win, then the existing item, then an existing conflicting item, and finally the incoming item, in that order.
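In the service itself such custom policies are registered as application-defined merge procedures; the following Python sketch only illustrates the precedence order just described, with hypothetical item arguments.

```python
def resolve_conflict(incoming_item, existing_item, is_tombstone, conflicting_items):
    """Illustrative precedence: a prior delete wins, then the currently
    persisted item, then an already-persisted conflicting version, and
    finally the incoming write."""
    if is_tombstone:
        return None                  # the delete wins; drop the incoming write
    if existing_item is not None:
        return existing_item         # keep what is currently persisted
    if conflicting_items:
        return conflicting_items[0]  # keep a persisted conflicting version
    return incoming_item             # no competitors: the incoming write wins
```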

Reference: https://medium.com/paxos-algorithm/paxos-algorithm-71b76aaeb6b

Thursday, May 4, 2023

 

A write is said to be stable at a server when it has been executed by that server for the last time. Writes may be executed more than once because newly received writes can be ordered before them, requiring earlier results to be undone and the writes reapplied; the outcome of a given write therefore depends on the server’s execution history. A write becomes stable when the set of writes that precede it in the server’s write log is fixed, meaning the server has already received and executed any write that could possibly be ordered before it.

The corresponding concept at the application level is termed commitment. In the meeting room scheduling application, for example, users may try to schedule separate meetings for the same time in the same room. Only when one of the users discovers that the write has become stable, and his schedule still shows that he has reserved the room for the desired time, can he be sure that the tentative reservation has been confirmed.

Since clients need to know when a write has stabilized, the Bayou API provides a means of inquiring about the stability of a specific write. The answer may depend on which server is asked, and Bayou also provides support for clients that choose to access only stable data.

There are many approaches to determining when a write is stable. One approach requires the servers to exchange information about the set of writes they have seen, along with the current value of the clock each uses to timestamp new writes. With suitable assumptions about the propagation order of writes, a server can then determine that a write is stable when its timestamp is lower than the clocks of all the servers. The main drawback of this approach is that a single server that remains disconnected can prevent writes from stabilizing, which could cause a large number of writes to roll back when it reconnects.
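A sketch of that stability test, assuming each server’s logical clock is a simple integer and that writes propagate between servers in timestamp order:

```python
def is_stable(write_timestamp: int, server_clocks: dict) -> bool:
    """A write is stable once every known server's clock has advanced past
    its timestamp, so no write that could be ordered before it can still
    arrive. A single disconnected server holds min() down and stalls
    stabilization, which is the drawback described above."""
    return write_timestamp < min(server_clocks.values())
```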

Since Bayou does not require network connectivity to stay up all the time, one way to speed up the rate at which writes stabilize is a commit procedure: a write becomes stable when it is explicitly committed. Committed writes, in commit order, are placed ahead of any tentative writes in each server’s write log. This, along with the information exchanged between servers, provides stability.

When a server designated as the primary takes responsibility for committing updates, the scheme is called primary commit. Knowledge of which writes have committed, and in which order, then propagates to the other servers. The primary behaves like any other server in all other respects, and different replicated data collections can have different servers designated as primary. Any commit protocol that prevents different groups of servers from committing updates in different orders would meet Bayou’s needs.
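A minimal sketch of primary commit, assuming the primary stamps writes with monotonically increasing commit sequence numbers that fix the total commit order; the class and field names are hypothetical.

```python
class Primary:
    """The designated primary stamps writes with monotonically increasing
    commit sequence numbers, fixing a single total commit order that then
    propagates to the other servers during their exchanges."""
    def __init__(self):
        self.next_csn = 0
        self.commit_log = []  # (csn, write) pairs, in commit order

    def commit(self, write):
        csn = self.next_csn
        self.next_csn += 1
        self.commit_log.append((csn, write))
        return csn
```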

 

Wednesday, May 3, 2023

 

Much of the Bayou model can be explained by its use in a meeting room scheduler, whose purpose is to facilitate reserving meeting rooms once the requestors have agreed upon a room and a set of acceptable times. At most one person can reserve the room for any given period. Users interact with a UI that shows the schedule of a room along with the times that are already reserved. A user simply selects an available time to reserve it, but because she may be requesting a time that has already been reserved by someone else and does not yet show up on her display, she can select several acceptable times instead of just one; eventually one of them will be confirmed. A reservation may remain “tentative” for a while, and in this stage it may be rescheduled as other conflicting reservations become known.

Data is persisted in the form of a data collection that is replicated in full at several servers. Applications running as clients interact with the servers through the Bayou API, which supports two basic operations: Read and Write. Read serves queries, and Write inserts, modifies, and deletes data items. Access to a single server is sufficient for a client to perform useful work.

A Bayou write is processed with three input parameters: the update to perform, a dependency check supplied by the application, and a merge procedure, also supplied by the application. The server evaluates the dependency check to see whether the actual result differs from the expected result. If there is a mismatch, it evaluates the merge procedure to determine the resolved update; otherwise the resolved update is the given update. The resulting update is then applied to the storage.
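A sketch of that processing, with hypothetical storage.query and storage.apply helpers standing in for Bayou’s tuple store:

```python
def execute_write(storage, update, dependency_check, merge_procedure):
    """Run the application-supplied dependency check; on a mismatch, let the
    application-supplied merge procedure compute the resolved update."""
    actual = storage.query(dependency_check.query)
    if actual != dependency_check.expected_result:
        resolved_update = merge_procedure(storage, update)
    else:
        resolved_update = update
    storage.apply(resolved_update)
```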

In the meeting room scheduler, a Bayou write consists of, first, an update that originates from a form in which the user provides the date and time, duration, and purpose of a new reservation request. Second, it includes a dependency check in the form of a SQL-like query, such as selecting the keys of meetings on the given day whose times lie between a start and an end, with the expected result being an empty set. Third, it includes a merge procedure that iterates over the alternative time slots: for each choice, it checks whether there would be a conflict by running a query like the dependency check for that alternative’s slot, skipping the alternative when there is a conflict, or otherwise creating a new update and breaking out of the loop. If no alternative is acceptable, the resolved update remains the tentative one the user provided. The write operation returns the resulting update.
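A sketch of such a write for the scheduler, where Check, Write, and Reservation are hypothetical types and alternates is a list of (day, start, end) slots the user finds acceptable:

```python
def overlap_query(day, start, end):
    return (f"SELECT key FROM meetings WHERE day = '{day}' "
            f"AND time BETWEEN '{start}' AND '{end}'")

def make_reservation_write(room, alternates, purpose):
    day, start, end = alternates[0]
    # Dependency check: the first-choice slot must still be unreserved.
    check = Check(query=overlap_query(day, start, end), expected_result=[])

    def merge_procedure(storage, update):
        # Walk the user's alternates and reserve the first free slot.
        for day, start, end in alternates:
            if not storage.query(overlap_query(day, start, end)):
                return Reservation(room, day, start, end, purpose)
        return update  # no alternate fits; keep the tentative request

    return Write(update=Reservation(room, day, start, end, purpose),
                 dependency_check=check,
                 merge_procedure=merge_procedure)
```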

When applying a set of Bayou writes to the database, the system must distinguish whether the writes originated from a client or from another server. When a write is received from a client, the server’s logical clock is incremented, the write is timestamped and inserted at the end of the write log in the tentative state, and the Bayou write is executed with that record’s update, dependency check, and merge procedure. When the write set originates from another server, the order of appending and incrementing the timestamp is the opposite of the client case: the write set is ordered, the insertion point in the log is determined and recorded as the point to roll the tuple store back to, and the writes are inserted. Each write after the insertion point is then rolled forward by executing the Bayou write with that record’s update, dependency check, and merge procedure, and the server’s logical clock is advanced to cover the received writes.
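A sketch of both paths, reusing the execute_write helper from the earlier sketch; the write-log and clock representations are hypothetical:

```python
def receive_writes(server, writes, from_client):
    """Client writes are timestamped and appended as tentative; a write set
    from another server is merged into log order, rolling the tuple store
    back to the insertion point and re-executing every later write."""
    if from_client:
        for w in writes:
            server.clock += 1                        # increment, then append
            w.timestamp = (server.clock, server.id)
            w.state = "tentative"
            server.write_log.append(w)
            execute_write(server.storage, w.update, w.check, w.merge)
    else:
        for w in sorted(writes, key=lambda x: x.timestamp):
            # Find the insertion point that keeps the log in timestamp order;
            # it is also the point to roll the tuple store back to.
            i = next((k for k, e in enumerate(server.write_log)
                      if e.timestamp > w.timestamp), len(server.write_log))
            server.storage.rollback_to(i)
            server.write_log.insert(i, w)
            for later in server.write_log[i:]:       # roll forward in order
                execute_write(server.storage, later.update, later.check, later.merge)
            server.clock = max(server.clock, w.timestamp[0])  # clock closure
```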

Tuesday, May 2, 2023

 

This is a continuation of a discussion on consensus algorithms that appears here. Bayou differs from both Paxos and Raft in its support for a weakly consistent storage system over an environment with less-than-ideal network connectivity. Users can read and write any accessible replica. Bayou focuses on application-specific mechanisms to detect and resolve update conflicts. It includes a novel method for conflict detection called dependency checks and per-write conflict resolution based on client-provided merge procedures. It can roll back previously executed writes and redo them according to a global serialization order. Bayou also permits clients to observe the results of all writes received by a server, including tentative writes whose conflicts have not yet been resolved.

The Bayou storage system provides an infrastructure for collaborative applications that manages the conflicts introduced by concurrent activity while relying only on the weak connectivity available for mobile computing. Bayou emphasizes application-specific conflict detection and resolution: applications specify what constitutes a conflict along with the policy for resolving it, while the system implements the mechanism for reliably detecting conflicts.

The system includes two mechanisms for automatic conflict detection and resolution that are intended to support arbitrary applications: dependency checks and merge procedures. These mechanisms let clients indicate, for every write, what constitutes a conflict and the steps to resolve it. Each write operation includes a dependency check consisting of an application-specified query and its expected result. A conflict is detected if the query, when run at a server against its current copy of the data, does not return the expected result. The dependency check is thus a precondition for performing the update included in the write operation.

An example of such a write can be described in terms of the meeting room scheduling application. The write attempts to reserve an hour-long time slot, and its dependency check is a SQL-like query that returns information about any previously reserved meetings overlapping that slot, with an empty set as the expected result. Dependency checks can detect the Write-Write conflicts that version vectors and timestamps conventionally capture in distributed systems, but they can also go beyond them to detect Read-Write conflicts: each write explicitly specifies the expected values of any data items on which the update depends, including items that were read but are not being updated. This is similar to the optimistic style of concurrency control used in some databases. Because dependency queries can read any data in the server’s replica, dependency checks can enforce arbitrary, multi-item integrity constraints on the data.
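A sketch of the two kinds of checks, using a hypothetical Check type that pairs an application-specified query with its expected result:

```python
from dataclasses import dataclass

@dataclass
class Check:
    query: str
    expected_result: object

# Write-Write check: the hour-long slot must still be free.
overlap_check = Check(
    query="SELECT key FROM meetings WHERE day = '2023-05-02' "
          "AND time BETWEEN '14:00' AND '15:00'",
    expected_result=[],
)

# Read-Write check: the write also depends on data it read but does not
# update, e.g. the chosen room must still have the capacity that was read.
capacity_check = Check(
    query="SELECT capacity FROM rooms WHERE room = 'R42'",
    expected_result=[12],
)
```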

Once a conflict is detected, a merge procedure is run by the Bayou server to resolve it. The merge procedure is responsible for resolving any conflicts detected by its dependency check and for producing a revised update to apply. The end-to-end process of detecting a conflict, running a merge procedure, and applying a revised update is performed atomically at each server as part of executing the write. Separating the dependency check from the merge procedure lets servers skip the merge procedure when the check does not detect a conflict. When a conflict cannot be resolved automatically, the merge procedure still completes, but it logs enough information for proper resolution by manual intervention later.

Reference: https://medium.com/paxos-algorithm/paxos-algorithm-71b76aaeb6b

 

Monday, May 1, 2023

 

A brief survey of distributed consensus algorithms in cloud services.

Many cloud services and computing infrastructure are concerned with providing consistency levels and redundancy for the changes made to the data. For example, Azure Cosmos DB provides five consistency levels: Eventual, Consistent Prefix, Session, Bounded Staleness, and Strong, in increasing order of strength. Each level provides availability and performance tradeoffs: relaxing the consistency level from stronger to weaker increases availability and throughput and reduces latency. The PACELC theorem proves this holds for any distributed system: in the case of network partitioning (P) in a distributed computer system, one has to choose between availability (A) and consistency (C), as per the CAP theorem; but else (E), even when the system is running normally in the absence of partitions, one has to choose between latency (L) and consistency (C).
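For instance, with the azure-cosmos Python SDK a client can opt into a weaker level than the account default (the endpoint and key below are hypothetical):

```python
from azure.cosmos import CosmosClient

# Hypothetical endpoint and key; a client may request a level weaker than
# the account default, trading consistency for latency as PACELC predicts.
client = CosmosClient(
    "https://myaccount.documents.azure.com:443/",
    credential="<account-key>",
    consistency_level="Eventual",
)
```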

Another example of such a cloud service is Service Fabric. It is often compared to Kubernetes, which is also a container orchestration framework that hosts applications. Kubernetes extends the app-plus-container idea all the way up to the hosts, which become the nodes of a cluster. Cluster membership is maintained by requiring kubelets to keep a connection to the API server and send periodic heartbeats. Service Fabric avoids this with a decentralized ring and failure detection. It is motivated by the adage that distributed consensus is at the heart of numerous co-ordination problems, but consensus has a trivially simple solution if a reliable failure detection service is available.

Consensus algorithms, which help participating processes that need co-ordination to agree on some data value needed during computation, are critical to providing these consistency levels, availability, and performance. Algorithms like Paxos, Raft, and Bayou have demonstrated their use in a variety of production and mission-critical systems. This article surveys these approaches.

Paxos and Raft both take a similar approach, differing mainly in their mechanism for leader election. Raft only allows servers with up-to-date logs to become leader, whereas Paxos allows any server to become leader provided it then updates its log to ensure it is up to date. Raft is more efficient, since it does not require log entries to be exchanged during leader elections.

The Raft algorithm was proposed to address the long-standing issues with the understandability of the widely studied Paxos algorithm. It has a clear abstraction and presentation and can be viewed as a simplified version of the original Paxos algorithm. Specifically, Paxos divides terms between servers, whereas Raft allows any follower to become a candidate in any term, with followers voting for only one candidate per term. Paxos followers will vote for any candidate, whereas Raft followers will only vote for a candidate whose log is at least as up to date as their own. If a leader has uncommitted log entries from a previous term, Paxos will replicate them in the current term, whereas Raft will replicate them in their original term. Raft’s leader election is therefore lightweight compared to that of Paxos.
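A sketch of Raft’s vote-granting rule, which encodes the “at least as up to date” comparison as a lexicographic comparison of (last log term, last log index); term bookkeeping, such as resetting voted_for when a higher term is seen, is omitted:

```python
def grant_vote(follower, candidate_term, candidate_id,
               last_log_term, last_log_index):
    """Grant at most one vote per term, and only to a candidate whose log is
    at least as up to date as the follower's own."""
    if candidate_term < follower.current_term:
        return False                    # stale candidate
    if follower.voted_for not in (None, candidate_id):
        return False                    # already voted for someone else this term
    my_last_term = follower.log[-1].term if follower.log else 0
    my_last_index = len(follower.log)
    up_to_date = (last_log_term, last_log_index) >= (my_last_term, my_last_index)
    if up_to_date:
        follower.voted_for = candidate_id
    return up_to_date
```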