The data in an Azure Cosmos DB container is
horizontally partitioned into many replica-sets, which replicate writes in each
region. Replica-sets durably commit writes using a majority quorum. Each region contains all data partitions of
an Azure Cosmos DB container. If the account is distributed across N regions,
there will be N x 4 copies of all the data. Within a datacenter, Cosmos DB
is deployed on massive stamps of machines, each with dedicated local storage.
There are many clusters involved, and the machines within a cluster may be
spread across 10-20 fault domains for high availability within a region. A
machine may host several replicas, each of which comprises a Resource Governor,
Transport, Admission Control, and a Database Engine. The database engine in
turn consists of a Resource Manager, Language Runtime Hosts, Query Processor,
RSM, Index Manager, Storage Engine, Log Manager, and I/O Manager.
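The quorum and copy-count arithmetic above can be sketched in a few lines of Python. This is only an illustration, not Cosmos DB code: the function names are made up, and the default of four replicas per replica-set is inferred from the "N x 4" figure rather than stated directly in the text.

```python
def majority_quorum(replica_count: int) -> int:
    """Smallest number of acknowledgements that forms a strict majority."""
    if replica_count < 1:
        raise ValueError("a replica-set needs at least one replica")
    return replica_count // 2 + 1


def is_durably_committed(acks: int, replica_count: int) -> bool:
    """A write counts as committed once a majority of replicas acknowledged it."""
    return acks >= majority_quorum(replica_count)


def total_copies(region_count: int, replicas_per_set: int = 4) -> int:
    """Copies of the data across all regions (the N x 4 figure).

    replicas_per_set=4 is an assumption inferred from "N x 4".
    """
    return region_count * replicas_per_set


if __name__ == "__main__":
    print(majority_quorum(4))          # quorum size for a four-replica set
    print(is_durably_committed(2, 4))  # two acknowledgements out of four
    print(total_copies(3))             # account spread over three regions
```

With four replicas the majority quorum is three, so two acknowledgements are not enough to commit, and an account spread over three regions holds 12 copies of the data.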
Global distribution in Azure Cosmos DB is turnkey:
regions can be added to or dropped from a database. A database consists of
one or more containers, and a container serves as the logical unit of
distribution and scalability. Containers are schema-agnostic and provide a
scope for a query. In a given region, data within a container is distributed
by partition key and transparently managed by the underlying physical
partitions. This is the local distribution pattern. Each physical partition is
also replicated across geographical regions. This is the global distribution
pattern.
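To make local distribution by partition key concrete, here is a minimal sketch that assumes a hash-based placement. The SHA-256 hash, the modulo mapping, and the names `physical_partition_for` and `place_items` are stand-ins chosen for illustration; the text does not describe the scheme Cosmos DB actually uses internally.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, List


def physical_partition_for(partition_key: str, partition_count: int) -> int:
    """Map a partition key to a physical partition index (illustrative only)."""
    if partition_count < 1:
        raise ValueError("need at least one physical partition")
    # A stable hash keeps the same key on the same partition across calls.
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partition_count


def place_items(partition_keys: Iterable[str],
                partition_count: int) -> Dict[int, List[str]]:
    """Group partition keys by the physical partition they land on."""
    placement: Dict[int, List[str]] = defaultdict(list)
    for key in partition_keys:
        placement[physical_partition_for(key, partition_count)].append(key)
    return dict(placement)


if __name__ == "__main__":
    keys = ["tenant-1", "tenant-2", "tenant-3", "tenant-1"]
    for partition, members in sorted(place_items(keys, 4).items()):
        print(partition, members)
```

Items that share a partition key always land on the same physical partition, which is what lets the service manage placement transparently for the client.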
A physical partition is implemented by a group of
replicas, called a replica-set. Each machine hosts hundreds of replicas that
belong to different physical partitions. These replicas are dynamically placed
and load balanced across the machines within a cluster and across the
datacenters within a region. Whether data is distributed locally or globally
makes no difference to queries, because Cosmos DB automatically indexes
everything upon ingestion, and users do not have to deal with schema or index
management.
Cosmos DB’s global distribution relies on two key
abstractions: replica-sets and partition-sets. A replica-set is a modular Lego
block for coordination, and a partition-set is a dynamic overlay of one or
more geographically distributed physical partitions.
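One way to picture the two abstractions is as a small data model. The sketch below is hypothetical: the class names, fields, and methods are invented for illustration and do not mirror any Cosmos DB API. It models a partition-set as an overlay that gains or loses one regional replica-set at a time.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ReplicaSet:
    """The group of replicas that implements one physical partition in a region."""
    region: str
    replicas: List[str]


@dataclass
class PartitionSet:
    """Overlay of the physical partitions that hold the same data across regions."""
    key_range: str
    members: Dict[str, ReplicaSet] = field(default_factory=dict)

    def add_region(self, replica_set: ReplicaSet) -> None:
        # Adding a region adds one more physical partition to the overlay.
        self.members[replica_set.region] = replica_set

    def remove_region(self, region: str) -> None:
        # Dropping a region removes its physical partition from the overlay.
        self.members.pop(region, None)

    def regions(self) -> List[str]:
        return sorted(self.members)


if __name__ == "__main__":
    overlay = PartitionSet(key_range="range-0")
    overlay.add_region(ReplicaSet("westus", ["r1", "r2", "r3", "r4"]))
    overlay.add_region(ReplicaSet("eastus", ["r5", "r6", "r7", "r8"]))
    overlay.remove_region("westus")
    print(overlay.regions())
```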
Cosmos DB uses a partition-set that is
distributed across multiple regions and allows multi-region writes. A
replication protocol replicates data among the physical partitions comprising a
given partition-set. Each physical partition accepts writes and serves reads,
typically for clients co-located in the same region. Writes accepted by a
physical partition within a region are durably committed and made highly
available before they are acknowledged to the client. At this point the writes
are tentative: they are then exchanged with the other physical partitions
within the partition-set and resolved based on virtual time. This exchange is
dynamic and depends on the topology of the partition-set, the regional
proximity of the physical partitions, and the configured consistency level.
Clients can request either tentative or committed writes by passing a request
header. The commit scheme used by Cosmos DB is the
primary commit scheme, in which one physical partition in the partition-set is
designated as the primary and is responsible for committing writes. Knowledge
of which writes have committed, and in which order they were committed, then
propagates to the other partitions. The primary behaves like any other
partition in all other respects. Different replicated data collections can
have a different partition designated as primary. The choice of primary is
dynamic and an integral part of the reconfiguration of the partition-set,
based on the topology of the overlay. Committed writes, including multi-row
or batched updates, are guaranteed to be ordered by virtue of the vector clock.
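The ordering role of a vector clock can be illustrated with a textbook implementation. This is a generic sketch, not Cosmos DB's internal encoding: the region names and the helpers `tick`, `merge`, and `compare` are assumptions introduced here for illustration.

```python
from typing import Dict

# One logical counter per region; a missing region is treated as 0.
VectorClock = Dict[str, int]


def tick(clock: VectorClock, region: str) -> VectorClock:
    """Return a copy of the clock advanced by one local write in `region`."""
    updated = dict(clock)
    updated[region] = updated.get(region, 0) + 1
    return updated


def merge(a: VectorClock, b: VectorClock) -> VectorClock:
    """Combine knowledge from two partitions by taking the per-region maximum."""
    return {r: max(a.get(r, 0), b.get(r, 0)) for r in a.keys() | b.keys()}


def compare(a: VectorClock, b: VectorClock) -> str:
    """Classify a relative to b: 'before', 'after', 'equal', or 'concurrent'."""
    regions = a.keys() | b.keys()
    a_le_b = all(a.get(r, 0) <= b.get(r, 0) for r in regions)
    b_le_a = all(b.get(r, 0) <= a.get(r, 0) for r in regions)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"  # neither write has seen the other


if __name__ == "__main__":
    west = tick({}, "westus")                      # write accepted in one region
    east = tick({}, "eastus")                      # independent write elsewhere
    later = tick(merge(west, east), "westus")      # write made after the exchange
    print(compare(west, east))
    print(compare(west, later))
```

Writes accepted independently in two regions compare as concurrent and need to be resolved, while a write made after the exchange is ordered after both of them.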