Parallel and Concurrent Programming in NoSQL databases.
In NoSQL databases, the preferred method of parallelizing
tasks is the MapReduce algorithm. It is a simple scatter-gather programming
model in which every mapper performs the same operation on its own localized
subset of the data, and a reducer aggregates the results from all the
mappers. This applies well to algorithms
that compute sums over the entire data range.
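As an illustrative sketch of this scatter-gather pattern (the chunk size and the names mapper and reducer are our own, not from any particular NoSQL engine), a parallel sum might look like:

```python
from concurrent.futures import ThreadPoolExecutor

def mapper(chunk):
    # Each mapper computes a partial sum over its local subset of the data.
    return sum(chunk)

def reducer(partials):
    # The reducer aggregates the results from all mappers.
    return sum(partials)

data = list(range(1, 101))                                   # 1..100
chunks = [data[i:i + 25] for i in range(0, len(data), 25)]   # scatter

with ThreadPoolExecutor() as pool:
    partial_sums = list(pool.map(mapper, chunks))            # map phase

total = reducer(partial_sums)                                # gather/reduce
print(total)  # 5050
```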
However, here we look at dataflow parallelism, which is less
strict and less simplistic, and enables even more algorithms to be applied to Big
Data. Dataflow processing provides a foundation for message passing and for
parallelizing CPU-intensive and I/O-intensive applications that require high
throughput and low latency. It also gives you explicit control over how data is
buffered and moved around the system. Traditionally, we required callbacks
and synchronization objects, such as locks, to coordinate tasks and access to
shared data; with this programming model, we instead create dataflow objects
that process data as it becomes available. You only declare how the data is handled when
it becomes available, along with any dependencies between the data. Requirements
to synchronize shared data and other maintenance activities are common
across algorithms and can be handled by a library. Moreover, such a library
can schedule the asynchronous arrival of data, so throughput and latency can be
improved. This kind of stream processing enables algorithms that
are difficult to express in summation form.
This framework consists of dataflow blocks, which are data
structures that buffer and process data. There are three kinds of dataflow
blocks: source blocks, target blocks, and propagator blocks. A source block acts
as a source of data and can be read from. A target block acts as a receiver of
data and can be written to. A propagator block acts as both a source block and a
target block, and can be read from and written to. Interfaces can be defined
separately for sources, targets, and propagators; the propagator interface
should inherit from both the source and the target interfaces. By implementing
these interfaces, we define dataflow blocks that form pipelines, which are
linear sequences of dataflow blocks, or networks, which are graphs of dataflow
blocks. A pipeline is a simple case of a network, one where a source
asynchronously propagates data to targets. As blocks are added and
removed, the blocks handle all thread-safety concerns of linking and
unlinking. Just like protocol handlers, blocks can choose to accept or reject
messages, as in a filtering mechanism. This enables specific data blocks to be
earmarked for processing.
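The ideas above can be sketched with a toy propagator block in Python. This is a hypothetical illustration (the class Block and its methods link_to and post are our own, not the real dataflow API): each block buffers messages in a queue, applies a transform, optionally filters via an accept predicate, and forwards results to linked targets, with None used as a completion signal.

```python
import queue
import threading

class Block:
    """Toy propagator block: buffers messages, applies a transform,
    and forwards results to linked targets. A sketch, not a real API."""

    def __init__(self, transform=lambda m: m, accept=lambda m: True):
        self.inbox = queue.Queue()
        self.transform = transform
        self.accept = accept          # filter: the block rejects messages this declines
        self.targets = []
        self.results = []             # terminal blocks just collect output here
        self.thread = threading.Thread(target=self._run)
        self.thread.start()

    def link_to(self, target):
        self.targets.append(target)

    def post(self, message):
        # None is the completion signal; otherwise apply the filter.
        if message is None or self.accept(message):
            self.inbox.put(message)

    def _run(self):
        while True:
            msg = self.inbox.get()
            if msg is None:           # completed: propagate and stop
                for t in self.targets:
                    t.post(None)
                return
            out = self.transform(msg)
            if self.targets:
                for t in self.targets:
                    t.post(out)       # forward along the pipeline
            else:
                self.results.append(out)

# A linear pipeline: square each number, then keep only even squares.
square = Block(transform=lambda x: x * x)
evens = Block(accept=lambda x: x % 2 == 0)
square.link_to(evens)

for x in range(1, 6):
    square.post(x)
square.post(None)                     # signal completion
evens.thread.join()
print(evens.results)                  # [4, 16]
```

The queue gives each block its buffering and thread safety, so the code that posts data never coordinates directly with the code that processes it.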
This programming model
uses the concept of message passing, where independent components of a program communicate with one
another by sending messages. One way to propagate messages is to call, say, a
Post method, which acts asynchronously. Source blocks offer data to target
blocks by calling the OfferMessage method. The target block responds to an
offered message by returning an Accepted, Declined, or Postponed status. When
the target block later requires a postponed message, it calls ConsumeMessage; if it
reserved the message but no longer needs it, it calls ReleaseReservation.
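This offer-and-respond protocol can be mimicked in a few lines of Python. The classes Source and Target below are a hypothetical sketch of the handshake, not the real API: the source keeps a postponed message buffered until the target either consumes it or releases it, and the target's filter declines unwanted payloads.

```python
from enum import Enum

class Status(Enum):
    ACCEPTED = 1
    DECLINED = 2
    POSTPONED = 3

class Source:
    """Toy source block: offers messages and buffers any that are postponed."""
    def __init__(self):
        self.buffer = {}
        self.next_id = 0

    def post(self, payload, target):
        msg_id, self.next_id = self.next_id, self.next_id + 1
        status = target.offer_message(msg_id, payload, self)
        if status is Status.POSTPONED:
            self.buffer[msg_id] = payload   # keep it until consumed or released
        return status

    def consume_message(self, msg_id):
        return self.buffer.pop(msg_id)      # hand the payload over

    def release_reservation(self, msg_id):
        self.buffer.pop(msg_id)             # target no longer wants it

class Target:
    """Toy target block: declines negatives, postpones while busy."""
    def __init__(self):
        self.busy = False
        self.postponed = []
        self.received = []

    def offer_message(self, msg_id, payload, source):
        if payload < 0:                     # filtering: reject this message
            return Status.DECLINED
        if self.busy:
            self.postponed.append((msg_id, source))
            return Status.POSTPONED
        self.received.append(payload)
        return Status.ACCEPTED

    def drain_postponed(self):
        for msg_id, source in self.postponed:
            self.received.append(source.consume_message(msg_id))
        self.postponed.clear()

src, tgt = Source(), Target()
src.post(1, tgt)              # accepted immediately
tgt.busy = True
src.post(2, tgt)              # postponed; the source keeps it buffered
tgt.busy = False
tgt.drain_postponed()         # target pulls the postponed message
src.post(-5, tgt)             # declined by the filter
print(tgt.received)           # [1, 2]
```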
Dataflow blocks also support the concept of completion. A dataflow block that is in the completed
state performs no further work. Each dataflow block has an associated
task that represents the completion
status of the block. Because we can wait for a task to finish, we can wait for
the terminal nodes of a dataflow network to finish. On failure, a task can throw an
exception or return an error; otherwise, completion denotes success. As
with most tasks, dataflow blocks can be paused, resumed, or canceled, thus
enabling robustness and reliability.
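A minimal sketch of completion, assuming a block of our own invention (CompletableBlock is hypothetical): the block exposes its completion status as a threading.Event, analogous to the completion task described above, and records any fault so callers can distinguish success from failure after waiting.

```python
import queue
import threading

class CompletableBlock:
    """Toy block whose completion status is exposed as an event,
    analogous to a completion task. A sketch, not a real API."""
    def __init__(self, work):
        self.inbox = queue.Queue()
        self.completion = threading.Event()
        self.error = None
        self.work = work
        threading.Thread(target=self._run, daemon=True).start()

    def post(self, msg):
        self.inbox.put(msg)

    def complete(self):
        self.inbox.put(None)          # request completion

    def _run(self):
        try:
            while (msg := self.inbox.get()) is not None:
                self.work(msg)
        except Exception as exc:      # a faulted block records its error
            self.error = exc
        finally:
            self.completion.set()     # completed: no further work is done

out = []
block = CompletableBlock(out.append)
for i in range(3):
    block.post(i)
block.complete()
block.completion.wait()               # wait for the terminal node to finish
print(out, block.error)               # [0, 1, 2] None
```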
Some dataflow blocks come predefined; for example, there are
buffering blocks, execution blocks, and grouping blocks, which relax the
fixed roles of mappers and reducers in the conventional use of
MapReduce algorithms. The degree of parallelism can also be explicitly
specified for these dataflow blocks, or it can be handled automatically.
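As a sketch of an execution block with an explicit degree of parallelism (the factory make_execution_block and its parameter name are our own, modeled loosely on the idea above), a thread pool of configurable width can stand in for the block's worker tasks:

```python
from concurrent.futures import ThreadPoolExecutor

def make_execution_block(func, max_degree_of_parallelism=1):
    """Hypothetical 'execution block': applies func to each posted item
    using at most max_degree_of_parallelism concurrent workers."""
    pool = ThreadPoolExecutor(max_workers=max_degree_of_parallelism)
    def post_all(items):
        # map preserves input order even when work runs concurrently
        return list(pool.map(func, items))
    return post_all

double = make_execution_block(lambda x: 2 * x, max_degree_of_parallelism=4)
print(double([1, 2, 3]))   # [2, 4, 6]
```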
Let us take an example of using the summation form to compute a mean.
If we take 3, 1, 5, 7, 9, the mean is 25 / 5 = 5.
If we then add 10, 15, 17, 19, 24, the mean becomes (25 + 85) / 10 = 11.
We could also arrive at this from the partial means: the second batch has mean 85 / 5 = 17,
and taking (5 + 17) / 2 = 11 works for two equal-sized ranges; otherwise we weight each partial mean by its element count.
In this case we are reusing the previous computation of the data and we don't have to revisit that data again.
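The weighted combination above is a one-liner in code. This sketch (combine_means is our own name) merges two partial means using only their values and element counts, so the raw data never needs to be revisited:

```python
def combine_means(mean_a, count_a, mean_b, count_b):
    """Combine two partial means, weighting each by its element count."""
    return (mean_a * count_a + mean_b * count_b) / (count_a + count_b)

first = [3, 1, 5, 7, 9]            # mean 25 / 5 = 5
second = [10, 15, 17, 19, 24]      # mean 85 / 5 = 17
m1 = sum(first) / len(first)
m2 = sum(second) / len(second)
print(combine_means(m1, len(first), m2, len(second)))   # 11.0
```

Because the ranges here happen to be equal-sized, the weighted formula reduces to the simple average (5 + 17) / 2.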