Monday, November 25, 2013

In this post, I want to take a break to talk about sorting data from very large files. For data sets that fit in memory, we can use one of several in-memory sorting algorithms. For data that is too big to reside in memory, we will look at techniques for sorting externally and for implementing dictionaries or B-Trees.
In external sorting, we implement what is known as replacement selection to establish the initial runs, followed by a poly-phase merge sort to merge the runs into one sorted file. A run is a sequence of records that is already sorted.
To create the initial runs, a buffer is allocated in memory to act as a holding place for several records. Initially the buffer is filled; then the following steps are repeated until the input is exhausted.
Select the record with the smallest key that is >= the key of the last record written.
If all the keys are smaller than the key of the last record written, then we have reached the end of the run.
In that case, select the record with the smallest key as the first record of the next run.
Write the selected record.
Replace the selected record with the next record from the input.
With a two-record buffer, this is easy to illustrate. We load the buffer and write the record with the smallest key to the destination. We replace this record with the next one from the input and repeat the process. This continues until every key remaining in the buffer is smaller than the last key written. At that point, we terminate the run and start another.
This strategy simply utilizes an intermediate buffer to hold values until the appropriate time for output.
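To make this concrete, here is a minimal sketch of replacement selection in Python. It is only an illustration under my own assumptions: the buffer is a small min-heap, the input and output are plain Python lists, and names like replacement_selection and buffer_size are made up; a real external sort would stream records to and from disk.

import heapq

def replacement_selection(records, buffer_size):
    # Split a stream of keys into sorted runs using replacement selection.
    it = iter(records)
    heap = []
    # Initially fill the buffer (the holding place) from the input.
    for _ in range(buffer_size):
        try:
            heap.append(next(it))
        except StopIteration:
            break
    heapq.heapify(heap)

    runs, current_run, frozen = [], [], []
    while heap:
        # Select and write the smallest key that is >= the last key written.
        smallest = heapq.heappop(heap)
        current_run.append(smallest)
        # Replace the written record with a new record from the input.
        try:
            record = next(it)
            if record >= smallest:
                heapq.heappush(heap, record)   # still usable in this run
            else:
                frozen.append(record)          # must wait for the next run
        except StopIteration:
            pass
        # When every buffered key is smaller than the last key written,
        # the current run ends and the frozen records seed the next one.
        if not heap:
            runs.append(current_run)
            current_run, heap, frozen = [], frozen, []
            heapq.heapify(heap)
    return runs

runs = replacement_selection([5, 3, 8, 1, 9, 2, 7, 4, 6], buffer_size=2)
print(runs)   # [[3, 5, 8, 9], [1, 2, 4, 6, 7]]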
Initially all the data is in one place. The source is read and the runs are distributed to the other places. After the initial runs are created, they are merged. We distribute a Fibonacci number of runs to each input so that the number of passes can be minimized. Fibonacci numbers have the property that each number is the sum of the two preceding numbers. When we merge the runs from the two inputs onto the third, we are left with run counts that step down through smaller and smaller numbers in this sequence. Each merge phase takes the counts down by a notch. As mentioned, this reduces the number of passes over the data, which is very helpful when we are restricted to sequential access. We want optimized access because we want to be efficient on a large dataset.
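As a rough sketch of how the run counts are chosen, the helper below (a name I am making up for illustration) finds the pair of consecutive Fibonacci numbers to aim for in the three-storage case; real implementations pad with empty "dummy" runs when the actual run count falls short of the target.

def fibonacci_distribution(total_runs):
    # Find consecutive Fibonacci numbers a <= b with a + b >= total_runs.
    # With three storages, placing a runs on one source and b on the other
    # lets every merge phase consume one source completely.
    a, b = 1, 1
    while a + b < total_runs:
        a, b = b, a + b
    return a, b

# With 13 initial runs the sources get 5 and 8 runs. Each phase merges the
# smaller count, so the pair walks back down the sequence:
# 8+5 -> 5+3 -> 3+2 -> 2+1 -> 1+1 -> one sorted run.
print(fibonacci_distribution(13))   # (5, 8)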
In the poly-phase merge, we work with blocks of data at a time: in each phase we take runs that are sequential on both sources and merge them into longer runs on a third storage. Then we relabel the storages so that the merged data acts as a source, and merge it with the remainder of one of the original sources. Finally, we write the result back to the destination.
The Fibonacci run counts let us merge perfectly, in the sense that no source is left with extra runs at the end of a phase. That is why the step of creating and distributing the initial runs this way helps.
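Below is a simplified sketch of the merge phases over in-memory lists of runs, again only as an illustration; it assumes the two sources hold a consecutive Fibonacci pair of runs (for example, from the distribution above), so that each phase empties exactly one source before the storages are relabeled.

import heapq

def polyphase_merge(tape_a, tape_b):
    # tape_a and tape_b are lists of sorted runs (each run a list of keys).
    tape_c = []
    while len(tape_a) + len(tape_b) > 1:
        # Merge pairs of runs until the shorter source is exhausted.
        while tape_a and tape_b:
            tape_c.append(list(heapq.merge(tape_a.pop(0), tape_b.pop(0))))
        # Relabel: the emptied source becomes the next destination and the
        # storage that just received the merged runs becomes a source.
        tape_a, tape_b, tape_c = (tape_a or tape_b), tape_c, []
    return (tape_a or tape_b)[0]

# Merging the two runs produced by the replacement-selection example:
print(polyphase_merge([[3, 5, 8, 9]], [[1, 2, 4, 6, 7]]))
# [1, 2, 3, 4, 5, 6, 7, 8, 9]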
courtesy: Niemann
 
