Cluster computing

Wednesday, September 24, 2014

NoSQL Databases and scalability.

Traditional databases such as SQL Server have a rich history in scalability and performance considerations for user data storage such as horizontal and vertical table and index partitioning, data compression, resource governor, IO resource governance, partition table parallelism, multiple file stream containers, NUMA aware large page memory and buffer array allocation, buffer pool extension, In-memory OLTP, delayed durability etc. In NoSQL databases, the unit of storage is the key-value collection and each row can have different number of columns from a column family. The option for performance and scalability has been to use sharding and partitions. Key Value pairs are organized according to the key. Keys in turn are assigned to a partition. Once a key is assigned to a partition, it cannot be moved to a different partition. Since it is not configurable, the number of partitions in a store is decided upfront. The number of storage nodes in use by the store can however be changed. When this happens, the store undergoes reconfiguration, the partitions are balanced between new and old shards, redistribution of partition between one shard and another takes place. The more the number of partitions the more granularities there are for the reconfiguration. It is typical to have ten to twenty partitions per shard. Since the number of partitions cannot be changed, this is often a design consideration.

The number of nodes belonging to a shard called its replication factor improves the throughput. The higher the replication factor, the faster the read because of the availability but the same is not true for writes since there is more copying involved. Once the replication factor is set, the database takes care of creating the appropriate number of replication nodes for each shard.

A topology is a collection of storage nodes, replication nodes, and the associated services. At any point of time, a deployed store has one topology. The initial topology is such that it minimizes the possibility of a single point of failure for any given shard. If the storage node hosts more than one replication node, those replication nodes will not be from the same shard. If the host machine goes down, the shard can continue for reads and writes.

Performance tuning on the NoSQL databases is an iterative process. Topologies, replication factor, shards and partitions can be changed to improve performance or to adjust for the storage nodes.

Courtesy:

Microsoft and Oracle documentations.

Cluster computing

Wednesday, September 24, 2014

No comments:

Post a Comment