Sunday, November 15, 2015

Parallel and Concurrent Programming in NoSQL databases.

In NoSQL databases, the preferred method of parallelizing tasks is the MapReduce algorithm. It is a simple scatter-gather programming model in which every mapper applies the same operation to its own localized partition of the data and a reducer aggregates the results from all the mappers. This applies well to algorithms that calculate sums over the entire data range.
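As a minimal sketch of this scatter-gather idea (not a real MapReduce framework; PLINQ is used here only as an illustration, and the sum-of-squares computation is an arbitrary choice):

```csharp
using System;
using System.Linq;

class ScatterGatherSum
{
    static void Main()
    {
        int[] data = Enumerable.Range(1, 1000000).ToArray();

        // Scatter: AsParallel partitions the array across cores.
        // Map: every worker applies the same transform to its partition.
        // Reduce: Sum gathers and aggregates the per-partition results.
        long total = data.AsParallel()
                         .Select(x => (long)x * x)
                         .Sum();

        Console.WriteLine("sum of squares = " + total);
    }
}
```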
Here, however, we look at dataflow parallelism, which is neither as strict nor as simplistic and enables even more algorithms to be applied to Big Data. Dataflow processing provides a foundation for message passing and for parallelizing CPU-intensive and I/O-intensive applications that need high throughput and low latency. It also gives you explicit control over how data is buffered and moved around the system. Traditionally, we required callbacks and synchronization objects, such as locks, to coordinate tasks and access to shared data; with this programming model, we instead create dataflow objects that process data as it becomes available. You only declare how the data is handled when it becomes available, along with any dependencies between the data. Requirements to synchronize shared data and other maintenance activities are common across algorithms and can be provided by a library. Moreover, such a library can schedule the asynchronous arrival of data, so throughput and latency can be improved. This kind of stream processing enables algorithms that are difficult to express in summation form.
This framework consists of dataflow blocks, which are data structures that buffer and process data. There are three kinds of dataflow blocks: source blocks, target blocks, and propagator blocks. A source block acts as a source of data and can be read from. A target block acts as a receiver of data and can be written to. A propagator block acts as both a source block and a target block, and can be read from and written to. Interfaces can be defined separately for sources, targets, and propagators, with the propagator interface inheriting from both source and target. By implementing these interfaces, we define dataflow blocks that form pipelines, which are linear sequences of dataflow blocks, or networks, which are graphs of dataflow blocks. A pipeline is a simple case of a network, one where a source asynchronously propagates data to targets. As blocks are added and removed, the blocks handle all thread-safety concerns of linking and unlinking. Just like protocol handlers, blocks can choose to accept or reject messages, which serves as a filtering mechanism and allows specific data to be earmarked for processing.
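To make the pipeline idea concrete, here is a small sketch in C# assuming .NET's TPL Dataflow library (the System.Threading.Tasks.Dataflow package, which is where this post's block vocabulary comes from). It links two blocks into a pipeline and uses a link predicate to accept or reject messages as described above; the block names and the three-word threshold are illustrative:

```csharp
using System;
using System.Threading.Tasks.Dataflow; // NuGet package: System.Threading.Tasks.Dataflow

class PipelineSketch
{
    static void Main()
    {
        // Propagator block: transforms each line into its word count.
        var countWords = new TransformBlock<string, int>(
            line => line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries).Length);

        // Terminal target block: prints the counts the link lets through.
        var print = new ActionBlock<int>(n => Console.WriteLine("{0} words", n));

        // LinkTo wires the pipeline; the predicate rejects counts under 3,
        // mirroring the accept/reject filtering described above.
        countWords.LinkTo(print,
            new DataflowLinkOptions { PropagateCompletion = true },
            n => n >= 3);
        // Rejected messages need a destination, or they block the source.
        countWords.LinkTo(DataflowBlock.NullTarget<int>());

        countWords.Post("a quick brown fox");
        countWords.Post("hi there");
        countWords.Complete();
        print.Completion.Wait();
    }
}
```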
This programming model uses the concept of message passing, where independent components of a program communicate with one another by sending messages. One way to propagate messages is to call, say, a Post method, which acts asynchronously. Source blocks offer data to target blocks by calling the OfferMessage method. The target block responds to an offered message with an Accepted, Declined, or Postponed status. When a target block later requires a postponed message, it calls ConsumeMessage to retrieve it, or ReleaseReservation to give back a message it had reserved.
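A short sketch of this handshake, again assuming TPL Dataflow: Post returns immediately with a declined result when a bounded buffer is full, while SendAsync lets the block postpone the offer and consume the message later through the OfferMessage/ConsumeMessage protocol under the covers. The capacity of 2 is an arbitrary choice:

```csharp
using System;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

class MessagePassingSketch
{
    static void Main()
    {
        // A bounded buffer declines or postpones offers when it is full.
        var buffer = new BufferBlock<int>(
            new DataflowBlockOptions { BoundedCapacity = 2 });

        Console.WriteLine(buffer.Post(1)); // True: accepted
        Console.WriteLine(buffer.Post(2)); // True: accepted
        Console.WriteLine(buffer.Post(3)); // False: declined, buffer full

        // SendAsync allows the block to postpone the offer and consume
        // the message later via the OfferMessage/ConsumeMessage handshake.
        Task<bool> pending = buffer.SendAsync(3);

        Console.WriteLine(buffer.Receive()); // 1: frees a slot
        Console.WriteLine(pending.Result);   // True: postponed message consumed
    }
}
```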
Dataflow blocks also support the concept of completion. A dataflow block that is in the completed state does not perform any further work. Each dataflow block has an associated task that represents the completion status of the block. Because we can wait for a task to finish, we can wait for the terminal nodes of a dataflow network to finish. The task surfaces an exception or a fault on failure and thus denotes success otherwise. As with tasks, dataflow blocks can be canceled, which helps make pipelines robust and reliable.
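For instance, still assuming TPL Dataflow, the sketch below marks a block complete and waits on its associated Completion task; passing a CancellationToken in the block options is how cancellation would be wired up (the token here is purely illustrative):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks.Dataflow;

class CompletionSketch
{
    static void Main()
    {
        var cts = new CancellationTokenSource();

        // The block observes the token; cancelling it would transition the
        // block's Completion task to the Canceled state.
        var worker = new ActionBlock<int>(
            n => Console.WriteLine("processed {0}", n),
            new ExecutionDataflowBlockOptions { CancellationToken = cts.Token });

        worker.Post(1);
        worker.Post(2);

        worker.Complete();        // the block stops accepting new messages
        worker.Completion.Wait(); // wait on the block's associated task;
                                  // a faulting delegate would surface its
                                  // exception here as an AggregateException
    }
}
```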

Some dataflow blocks come predefined: for example, there are buffering blocks, execution blocks, and grouping blocks, which relax the fixed counts of mappers and reducers found in the conventional MapReduce model. The degree of parallelism can also be explicitly specified for these dataflow blocks, or it can be handled automatically.
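Putting the predefined kinds together in one hedged sketch: BufferBlock as a buffering block, TransformBlock as an execution block with an explicit degree of parallelism, and BatchBlock as a grouping block. The parallelism of 4 and batch size of 5 are arbitrary choices:

```csharp
using System;
using System.Threading.Tasks.Dataflow;

class PredefinedBlocksSketch
{
    static void Main()
    {
        // Buffering block: a FIFO queue of messages.
        var buffer = new BufferBlock<int>();

        // Execution block with an explicit degree of parallelism:
        // up to four "mappers" run concurrently, no fixed count required.
        var square = new TransformBlock<int, int>(
            x => x * x,
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 4 });

        // Grouping block: collects results into arrays of five.
        var batch = new BatchBlock<int>(5);

        var print = new ActionBlock<int[]>(
            b => Console.WriteLine(string.Join(",", b)));

        var link = new DataflowLinkOptions { PropagateCompletion = true };
        buffer.LinkTo(square, link);
        square.LinkTo(batch, link);
        batch.LinkTo(print, link);

        for (int i = 1; i <= 10; i++) buffer.Post(i);
        buffer.Complete();
        print.Completion.Wait();
    }
}
```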

Let us take an example of the summation form to compute the mean.
For example, if we take 3, 1, 5, 7, 9, the mean is 25/5 = 5.
If we then add 10, 15, 17, 19, 24, whose sum is 85, the mean becomes (25 + 85)/10 = 11.
We could also arrive at this incrementally: the mean of the second range is 85/5 = 17,
and since the two ranges have equal element counts, the combined mean is (5 + 17)/2 = 11; for unequal ranges we would weight each partial mean by its element count.
In this case we reuse the previous computation and don't have to revisit that data again.
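A small self-contained sketch of this incremental computation (the Combine helper is a name made up for illustration): each partial result carries its element count and mean, and two results merge by weighting each mean by its count, reproducing the arithmetic above:

```csharp
using System;

class IncrementalMean
{
    // Combine two partial results by weighting each mean by its element
    // count, so earlier data never needs to be revisited.
    static (long Count, double Mean) Combine((long Count, double Mean) a,
                                             (long Count, double Mean) b)
    {
        long count = a.Count + b.Count;
        double mean = (a.Mean * a.Count + b.Mean * b.Count) / count;
        return (count, mean);
    }

    static void Main()
    {
        var first  = (Count: 5L, Mean: 25.0 / 5); // 3, 1, 5, 7, 9      -> mean 5
        var second = (Count: 5L, Mean: 85.0 / 5); // 10, 15, 17, 19, 24 -> mean 17

        var total = Combine(first, second);
        Console.WriteLine(total.Mean); // 11: matches (5 + 17) / 2 for equal ranges
    }
}
```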
