Saturday, May 6, 2023

 

The data in an Azure Cosmos DB container is horizontally partitioned into many replica-sets, which replicate writes in each region. Replica-sets durably commit writes using a majority quorum.  Each region contains all data partitions of an Azure Cosmos DB container. If the account is distributed across N regions, there will be N x 4 copies of all the data. Within a datacenter, CosmosDB comprises massive stamps of machines, each with dedicated local storage. There are many clusters involved and machines within a cluster might spread across 10-20 fault domains for high availability within a region. A machine may contain replicas each of which will comprise of Resource Governor, Transport, Admission Control and Database Engine. The database engine comprises of Resource Manager, Language Runtime Hosts, Query Processor, RSM, Index Manager, Storage Engine, Log Manager, and I/O Manager.

Global distribution in Azure Cosmos DB is turnkey and regions can be added or dropped from a database and a database consists of one or more containers. A container serves as a logical unit of distribution and scalability. They are schema-agnostic and provide a scope for a query.  In a given region, data within a container is distributed by using a partition key and is transparently managed by the underlying physical partitions. This is the local distribution pattern. Each physical partition is also replicated across geographical regions in a global distribution manner.

A physical partition is implemented by a group of replicas, called a replica-set. Each machine hosts hundreds of replicas and they correspond to physical partitions that are dynamically placed, and load balanced across machines within a cluster and across datacenters within a region. Local or global distribution does not matter to queries because Cosmos DB automatically indexes everything upon ingestion and users do not have to deal with schema or index management.

Cosmos DB’s global distribution relies on two key abstractions – replica sets and partition sets. A replica-set is a modular Lego block for co-ordination, and a partition set is a dynamic overlay of one or more geographically distributed physical partitions.

Cosmos DB makes use of a partition set that is distributed across multiple regions and allows multi-region writes. A replication protocol replicates data among the physical partitions comprising a given partition-set. Each physical partition accepts writes and serves reads from clients typically co-located in the same region. Writes accepted by a physical partition within a region are durably committed and made highly available before they are acknowledged to the client. The set of writes is exchanged with other partitions within the partition-set and resolved based on the virtual time. Clients can request either tentative or committed writes by passing a request header. This exchange is dynamic and based on the topology of the partition-set, regional proximity of the physical partitions and the configured consistency levels. The commit scheme used by Cosmos DB is the primary commit. Knowledge of which writes have committed, and in which order they were committed then propagates to the other partitions. The primary behaves like any other partition in all other aspects. Different replicated data collections can have a different partition designated as primary. The primary is dynamic and integral part of the reconfiguration of the partition-set based on the topology of the overlay. The committed writes including multi-row or batched updates are guaranteed to be ordered by virtue of the vector clock.

No comments:

Post a Comment