Today we continue to review NoSQL design. These systems are key-value stores that run on a large number of commodity machines, with data partitioned and replicated between machines. We discussed the infrastructure, which consists of physical and virtual nodes. Data is partitioned based on consistent hashing: cachepoints are arranged along a circle depicting the key range, and each caches the objects corresponding to its range. With multiple copies of the data, we need a consistency model. The choice of consistency model determines how a client request is dispatched to a replica and how the replicas propagate and apply updates. The replication topology can be classified as a single-master model, a multi-master or no-master model, or a quorum-based 2PC model that provides strict consistency.
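As a rough illustration of how keys map to cachepoints on the circle, here is a minimal consistent hashing sketch in Python; the node names, the md5 hash, and the number of virtual nodes are illustrative assumptions, not part of any particular store:

import hashlib
import bisect

def hash_ring_position(value):
    # Map any string to a position on the hash circle (0 .. 2^32 - 1).
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % (2**32)

class ConsistentHashRing:
    def __init__(self, nodes, vnodes=3):
        # Each physical node is given several virtual nodes (cachepoints) on the circle.
        self.ring = {}
        self.sorted_positions = []
        for node in nodes:
            for i in range(vnodes):
                pos = hash_ring_position("%s#%d" % (node, i))
                self.ring[pos] = node
                bisect.insort(self.sorted_positions, pos)

    def node_for_key(self, key):
        # Walk clockwise from the key's position to the first cachepoint.
        pos = hash_ring_position(key)
        idx = bisect.bisect(self.sorted_positions, pos) % len(self.sorted_positions)
        return self.ring[self.sorted_positions[idx]]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
print(ring.node_for_key("user:42"))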
The syncing of states is achieved with techniques such as traditional 2PC. To improve availability, quorum-based 2PC was introduced. In this model the coordinator asks all the replicas to write the data but waits only for acknowledgements from a quorum of W of them. When reading, a minimum set of R replicas is queried and the value with the latest timestamp is returned. As long as the read and write quorums overlap (W + R > N), this technique provides strict consistency.
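Here is a minimal sketch of the quorum idea, assuming an in-memory list of replicas and synchronous writes purely for illustration; a real coordinator would collect acknowledgements asynchronously:

import time

class Replica:
    def __init__(self):
        self.value, self.timestamp = None, 0

N, W, R = 5, 3, 3          # W + R > N guarantees the read and write quorums overlap
replicas = [Replica() for _ in range(N)]

def quorum_write(value):
    ts = time.time()
    acks = 0
    for r in replicas:     # the coordinator sends the write to all replicas
        r.value, r.timestamp = value, ts
        acks += 1
    return acks >= W       # ...but only W acknowledgements are needed to succeed

def quorum_read():
    answers = [(r.timestamp, r.value) for r in replicas[:R]]  # query any R replicas
    return max(answers)[1]  # return the value carrying the latest timestamp

quorum_write("v1")
print(quorum_read())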
If we don't require strict consistency, we can go with a gossip protocol that propagates updates via message exchanges. A vector clock is used to help with the messaging. Both the client and the replicas maintain vector clocks, and a local vector clock is merged with an incoming one by taking the element-wise maximum of the timestamps. The client does not share its vector clock; only the replicas do.
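A small sketch of how vector clocks merge and compare, assuming a plain dictionary of node-to-counter entries as the representation:

def vc_merge(a, b):
    # Element-wise maximum of two vector clocks (dicts of node -> counter).
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in sorted(set(a) | set(b))}

def vc_precedes(a, b):
    # a precedes b if every entry of a is <= the corresponding entry of b
    # and the clocks are not identical.
    return all(a.get(n, 0) <= b.get(n, 0) for n in set(a) | set(b)) and a != b

replica1 = {"r1": 2, "r2": 1}
replica2 = {"r1": 1, "r2": 3}
print(vc_merge(replica1, replica2))      # {'r1': 2, 'r2': 3}
print(vc_precedes(replica1, replica2))   # False: the clocks are concurrent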
We now discuss the gossip protocol in more detail. It comes in useful to propagate changes in either a state transfer model or an operation transfer model.
In the state transfer model, each replica maintains a state version tree which contains all the conflicting updates. When the client sends its state along with its vector clock, the replica checks whether the client's state precedes any of its current versions and, if so, discards it. When a replica receives updates from other replicas via gossip, it merges the version trees.
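A minimal sketch of merging version trees, where each tree is assumed to be a flat list of (vector clock, value) pairs and any version dominated by another is discarded:

def vc_precedes(a, b):
    # Same helper as in the earlier sketch: a precedes b element-wise and a != b.
    return all(a.get(n, 0) <= b.get(n, 0) for n in set(a) | set(b)) and a != b

def merge_version_trees(tree_a, tree_b):
    # Each tree is a list of (vector_clock, value) pairs holding conflicting versions.
    candidates = tree_a + tree_b
    merged = []
    for vc, value in candidates:
        # Keep a version only if no other candidate strictly dominates it.
        if not any(vc_precedes(vc, other_vc) for other_vc, _ in candidates if other_vc != vc):
            if (vc, value) not in merged:
                merged.append((vc, value))
    return merged

tree_a = [({"r1": 2, "r2": 1}, "A")]
tree_b = [({"r1": 1, "r2": 1}, "B"), ({"r1": 1, "r2": 3}, "C")]
print(merge_version_trees(tree_a, tree_b))   # "B" is dropped; "A" and "C" still conflict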
In the operation transfer model, each replica has to apply all operations corresponding to the cause before those corresponding to the effect. This is necessary to keep the operations in the same sequence on all replicas, and it is achieved by adding another entry to the vector clock, a V-state, that represents the time of the last updated state. To maintain this causal order, each replica buffers an update operation until it can be applied to the local state. A tuple of two timestamps, one from the client's view and another from the replica's local view, is associated with every submitted operation; this lets the replica collect all the pending operations in its queue. Once its vector clock is larger than that of the client, the replica can proceed to execute the operations. The same is true for the background exchange, when the replicas gossip with each other's vector clocks: they exchange their logs, and each replica checks whether a given operation can be applied because its dependencies have already been applied. Since the operations are sorted by vector clock in causal order, simultaneous operations can also be serialized: the replica performing the first update picks a sequence number higher than all others, and everyone follows this sequence number in their vector clocks. Since operations are in different stages of processing on different replicas, a replica will not discard the operations it has completed until it sees from the vector clocks of all the other replicas that they have applied those operations as well.
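A minimal sketch of buffering operations until their causal dependencies are satisfied; the queue layout and the way the V-state is advanced are illustrative assumptions:

class OpTransferReplica:
    def __init__(self, name, peers):
        self.name = name
        self.clock = {p: 0 for p in peers}   # local vector clock (the V-state)
        self.pending = []                    # buffered operations not yet applicable
        self.log = []                        # operations applied, in causal order

    def submit(self, op, client_clock):
        # Buffer the operation with the client's view of the clock as its dependency.
        self.pending.append((dict(client_clock), op))
        self.apply_ready()

    def apply_ready(self):
        # Apply every buffered operation whose dependency clock precedes the local clock.
        progressed = True
        while progressed:
            progressed = False
            for dep_clock, op in list(self.pending):
                if all(dep_clock.get(n, 0) <= self.clock.get(n, 0) for n in dep_clock):
                    self.pending.remove((dep_clock, op))
                    self.log.append(op)
                    self.clock[self.name] += 1   # local state advanced by one operation
                    progressed = True

replica = OpTransferReplica("r1", ["r1", "r2"])
replica.submit("set x=1", {"r1": 0, "r2": 0})
replica.submit("set x=2", {"r1": 5, "r2": 0})   # stays buffered: depends on unseen state
print(replica.log, replica.pending)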
As a specific example, let us look at a map-reduce operation over an input key list. The map and reduce operations are sent to all the nodes whose replicas own the key ranges. When the map operations complete, the reduce operations are executed to aggregate the results. This is different from scatter-gather because the latter requires a single point of consolidation and map-reduce doesn't. Moreover, that single point is usually the node receiving the request, whereas in map-reduce it may be a different node from the one servicing the originating request. Scatter-gather is a pattern for multiprogramming or parallel programming where the data is sent along with the split requests. In map-reduce, only the computation is sent to the nodes where the data resides, and each node applies it to its local key range. All operations, including map-reduce, occur in causal order.
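A minimal word-count style sketch of shipping map and reduce to nodes that each own a key range; the node layout and the map/reduce functions are illustrative assumptions:

from collections import defaultdict

# Each node owns a key range and holds only its local portion of the data.
nodes = {
    "node-a": {"k1": "a b a", "k2": "b c"},
    "node-b": {"k3": "a c c"},
}

def map_fn(key, value):
    return [(word, 1) for word in value.split()]

def reduce_fn(word, counts):
    return (word, sum(counts))

def map_reduce(nodes, map_fn, reduce_fn):
    # Only the computation (map_fn) is shipped; each node maps over its local keys.
    intermediate = defaultdict(list)
    for local_data in nodes.values():
        for key, value in local_data.items():
            for out_key, out_val in map_fn(key, value):
                intermediate[out_key].append(out_val)
    # Reduce aggregates the mapped results per output key.
    return dict(reduce_fn(k, v) for k, v in intermediate.items())

print(map_reduce(nodes, map_fn, reduce_fn))   # {'a': 3, 'b': 2, 'c': 3}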
Deletes are somewhat different, because when an object is deleted its timestamp may also be lost. To handle this case, the associated metadata and timestamp information of the deleted object is kept around until the deletion has been confirmed by all the other replicas.
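A small sketch of keeping a tombstone with its timestamp until every replica has confirmed the delete; the acknowledgement tracking is an illustrative assumption:

import time

class TombstoneStore:
    def __init__(self, replica_names):
        self.data = {}
        self.tombstones = {}          # key -> (delete timestamp, set of acks)
        self.replicas = set(replica_names)

    def delete(self, key):
        # Keep the timestamp of the deleted object instead of dropping it outright.
        self.tombstones[key] = (time.time(), set())
        self.data.pop(key, None)

    def ack_delete(self, key, replica):
        ts, acks = self.tombstones[key]
        acks.add(replica)
        if acks == self.replicas:     # all replicas confirmed: safe to forget
            del self.tombstones[key]

store = TombstoneStore(["r1", "r2"])
store.data["k"] = "v"
store.delete("k")
store.ack_delete("k", "r1")
store.ack_delete("k", "r2")
print(store.tombstones)               # {} once every replica has confirmed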
Storage for the map-reduce may also be in the form of distributed files or even databases. The advantage of using databases in such cases is that they provide not only copy-on-write but also multi-version concurrency control. A Bloom filter or an index can be used to determine whether a key exists in a set. All operations are logged, and all data and logs are flushed to disk so the node can recover.
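A minimal Bloom filter sketch for the key-existence check; the bit-array size and number of hashes are illustrative assumptions, and the filter can report false positives but never false negatives:

import hashlib

class BloomFilter:
    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, key):
        # Derive num_hashes bit positions from salted md5 digests of the key.
        for i in range(self.num_hashes):
            digest = hashlib.md5(("%d:%s" % (i, key)).encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # True means the key may exist; False means it definitely does not.
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
bf.add("user:42")
print(bf.might_contain("user:42"), bf.might_contain("user:99"))  # True; second is almost certainly False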
It may be interesting to note that distributed-systems concepts such as Lamport timestamps and the Byzantine generals problem operate on a level different from the data store and the data-related operations at a local node. For example, a distributed query can work with other instances of databases over a network and with OLE DB data sources; in other words, it can be internalized as a database-specific feature. The charm of NoSQL is the ability to replicate, scale out, and add or remove commodity servers without impacting ongoing operations. There is more than one way to access data in a distributed manner; the blockchain from Bitcoin is also a distributed database. Also, the scale-out capability of NoSQL, together with the requirements for transactions and strict consistency, was the motivating factor behind the NewSQL data architecture.
Courtesy horicky blog and Ricky Ho
#codingexercise http://ideone.com/2pBEvk