Thursday, June 7, 2018

We were discussing file system improvements and cloud operating systems. We mentioned Xen and OSv. Xen is a hypervisor technology, and OSv tries to reduce the software layers for a single application on a virtual machine.
This means a general purpose operating system is no longer necessary. The runtime that the application requires is difficult to port to different hardware environments but manageable to port to hypervisors. This is what cloud operating systems propose.
#codingexercise
Get the count of square submatrices of size k that are all zero:
int GetCountSquareSubMatrixAll0Sizek(int[,] A, int rows, int cols, int k)
{
    int total = 0;
    for (int i = 0; i + k <= rows; i++) {
        for (int j = 0; j + k <= cols; j++) {
            // use (i, j) as the top-left corner of a k x k submatrix
            int count = 0;
            for (int x = i; x < i + k; x++)
                for (int y = j; y < j + k; y++)
                    if (A[x, y] == 0)
                        count++;
            if (count == k * k) total += 1;
        }
    }
    return total;
}
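The same counting logic can be sketched in Python for quick verification (the function name here is mine, not from the post):

```python
def count_zero_submatrices(a, k):
    """Count k x k submatrices whose entries are all zero."""
    rows = len(a)
    cols = len(a[0]) if rows else 0
    total = 0
    for i in range(rows - k + 1):
        for j in range(cols - k + 1):
            # (i, j) is the top-left corner of a candidate submatrix
            if all(a[x][y] == 0
                   for x in range(i, i + k)
                   for y in range(j, j + k)):
                total += 1
    return total
```

For a 3x3 matrix with a zero block in the top-left and bottom-right, only those two 2x2 corners qualify.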

Wednesday, June 6, 2018

We will start reviewing file system trends in Linux. This is relevant because there are new demands from cloud computing that the traditional file system did not anticipate. While we will be reviewing a paper from wisc.edu titled "A Study of Linux File System Evolution", which covers eight years of Linux file system changes across thousands of patches, we find relevance in emerging trends in cloud computing. For example, on one hand, organizations are building low-footprint Linux operating systems with a container framework on top, and on the other hand, cloud operating systems such as OSv are emerging, where OS kernel facilities involving threads, processes and drivers are removed from the virtual machine. Finally, users are being asked to forget about management chores and offload them to managed services from the cloud for their files. Replication and backup are now part of the offerings from these cloud managed services. Moreover, file systems are also being stretched over clusters, as in the case of OneFS, and at the same time traditional file system protocols are improving their usage.
One of the common themes seen in these emerging trends is the applicability of a general purpose platform versus a specialized product or service. Cloud operating systems are designed to run a single application within a single virtual machine. This means a general purpose operating system is no longer necessary. The runtime that the application requires is difficult to port to different hardware environments but manageable to port to hypervisors. This is what cloud operating systems propose. A single kernel-and-application image is then termed a "unikernel". Windows' "picoprocess" framework via Drawbridge and Xen via OSv are some examples.
Cloud operating systems do not necessarily compete with Linux. Xen, for example, still relies on Linux, and Linux has wide hardware support and continues to dominate cloud computing. All that cloud operating systems propose is to remove layers of the software stack, improving performance and security. Xen is popular for its paravirtualization, a term used to refer to the simple and idealized interface for I/O presented to the guest. While Xen provides hypervisor technology, it is also trying to come out with its own cloud operating system called MirageOS, which it refers to as an exokernel.

Tuesday, June 5, 2018

Let us code the solution to the problem we discussed yesterday: splitting a positive integer array into two parts with sums as nearly equal as possible.
int TryGetPartition(List<int> A)
{
    Debug.Assert(A != null);
    Debug.Assert(A.All(x => x > 0));
    int index = 0;
    if (A.Count <= 1) return index;
    int n = A.Count;
    int left = A[0];
    int right = A[n - 1];
    int i = 1;
    int j = n - 2;
    int retry = 0;
    while (i < j)
    {
        if (left + A[i] <= right + A[j]) {
            right += A[j];
            j = j - 1;
            continue;
        }
        if (left + A[i] <= right) { left += A[i]; i = i + 1; continue; }
        if (left > right + A[j]) { right += A[j]; j = j - 1; continue; }
        if (left > right) { right += A[j]; j = j - 1; continue; }
        retry = retry + 1;
        if (retry > 3) { break; }
    }
    if (i == j) {
        if (left + A[i] <= right)
            index = i + 1;
        else
            index = i;
    }
    if (i > j) {
        if (left <= right)
            index = i - 1;
        else
            index = i;
    }
    return index;
}
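A direct Python transliteration of the two-pointer heuristic above (hypothetical name) is handy for sanity-checking it against small inputs:

```python
def try_get_partition(a):
    """Two-pointer heuristic from the post: grow the end sums inward
    and report the split index where the two sides meet."""
    assert a is not None and all(x > 0 for x in a)
    if len(a) <= 1:
        return 0
    n = len(a)
    left, right = a[0], a[n - 1]
    i, j = 1, n - 2
    retry = 0
    while i < j:
        if left + a[i] <= right + a[j]:
            right += a[j]
            j -= 1
            continue
        if left + a[i] <= right:
            left += a[i]
            i += 1
            continue
        if left > right + a[j] or left > right:
            right += a[j]
            j -= 1
            continue
        retry += 1
        if retry > 3:
            break  # no rule applied; give up to avoid spinning
    if i == j:
        return i + 1 if left + a[i] <= right else i
    if i > j:
        return i - 1 if left <= right else i
    return 0
```

For [5, 2, 1, 3] this returns 1, splitting into [5] and [2, 1, 3], which is indeed the most balanced split.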

Monday, June 4, 2018

We were discussing an interesting coding exercise where we wanted to split a positive integer array such that the left and right are as nearly balanced in their sum as possible.
We discussed a two-pass solution where we find the sums from the left and the right ends and then compare them at each candidate position. The sums from the left and the right help us evaluate whether a split at the specified index will result in equal sums. There may be many such splits possible, but we pick the index that best matches the criterion in our iteration.
Then we refined each candidate position by adjusting the left and the right if we know the entire sum
We provided a code sample for this yesterday. What we didn't discuss was whether there is a one-pass solution possible where the iteration from left to right gives the desired partition.
One way to try to do it in one pass would be to keep track of the average from either end and grow the left or the right toward the solution based on the difference in the averages. This way we attempt to make incremental progress from start to finish, where each incremental state is guaranteed to improve on the previous state. It is progress because we correctly reduce the array by eliminating an element from either the left or the right. Whether we take the average or the sum, the question is whether we should pass a candidate position without including all the elements in either the left sum or the right sum. Subsequent elements we encounter may actually make this index position a viable split.
Let us take an example:
5 2 1 3
If we take the left initialized to 5 and the right initialized to 3, then we grow the right because it is imbalanced and evaluate based on the succeeding element. In this case we pick between 2 and 1, choose 1, and add it to the right, so we now have the left as 5 and the right as 4. Similarly, we look at the succeeding element, 2, and add it to the right. This way we have achieved the desired split at index = 1.
The greedy strategy may not work here because we might skip a candidate index based on the succeeding elements; however, the undiscovered subarray might side entirely with the left irrespective of the adjacent element. Hence it is not proper to skip the index based on the adjacent element only. The qualification or disqualification of a method also varies based on whether the array can have positive or negative integers.



Sunday, June 3, 2018

We have another interesting array problem to discuss.
We want to split a positive integer array such that the left and right parts are as nearly balanced as possible.
We discussed a two-pass solution where we find the sums from the left and the right ends and then compare them at each candidate position.
We also discussed evaluating each candidate position by adjusting the left and the right if we know the entire sum.


int GetPartition(List<int> A)
{
    int n = A.Count;
    int sum = A.Sum();  // O(N)

    int min = sum;
    int index = 0;
    int left = 0;
    int right = sum;
    for (int i = 0; i < n - 1; i++) {
        left = left + A[i];
        right = right - A[i];
        if (Math.Abs(left - right) < min) {
           min = Math.Abs(left - right);
           if (right < left) {
               index = i + 1;
            } else {
               index = i;
            }
        }
    }
    return index;
}


This is still not one-pass because the initialization of the sum variable is a linear operation, but it is definitely simpler than the earlier prefix and suffix bookkeeping. The key takeaway here is that we need the sum from both sides to determine the partition.
When we already have the sum for the elements encountered so far, we can look back to determine the partition. Each additional element encountered as the array grows requires a check for the partition in the already known array.
This makes one-pass difficult. For one-pass we may need additional statistics, such as the minimum average encountered so far.
For 5 3 1 6,
the running averages would be 5, 4, 3, 3.75, and we could take the point of split as zero-based index 2. This assumes there is only one local minimum. In some cases we may have more than one minimum. In such cases we may consider the last occurrence of the most common average, or some other heuristic that aligns with the goal of reducing the badness of the split, measured as the square of the difference between each array element and the acceptable average. The easiest way to test this is to try all permutations of three distinct positive integers.
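A minimal harness for that experiment can be sketched in Python (names are mine; the convention here is that split index i puts a[:i] on the left):

```python
from itertools import permutations

def optimal_split(a):
    """Index i (1 <= i < len(a)) minimizing |sum(a[:i]) - sum(a[i:])|,
    used as the ground truth that any one-pass heuristic must match."""
    total = sum(a)
    left = 0
    best_i, best_diff = 1, None
    for i in range(1, len(a)):
        left += a[i - 1]
        diff = abs(2 * left - total)  # |left - (total - left)|
        if best_diff is None or diff < best_diff:
            best_i, best_diff = i, diff
    return best_i

# The test the post suggests: all permutations of three distinct
# positive integers, each checked against the ground truth.
results = {p: optimal_split(list(p)) for p in permutations((1, 3, 6))}
```

For example, (1, 3, 6) splits best after two elements while (6, 1, 3) splits best after one, which is exactly the kind of asymmetry a one-pass heuristic has to get right.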

Saturday, June 2, 2018

Recently I came across an interesting calculation problem and I want to share it in this blog post.
Let us say we are given a series of byte range sequences in the form of <offset, length> pairs which have been used up. We now want to calculate the used byte size.
How do we do this ?
There may be overlap in the byte range sequences.
One approach is that we calculate positive summations and negative summations.
For example, if we have
offset 5, length 10
offset 10, length 10
then the used space size is 10 + 10 - overlap  = 20 - 5 = 15
The idea of using summations is that we can leverage the summation model as a neat aggregation that works in both batch and stream processing.
All we need to do is make an entry for (offset 10, length -5)
As long as these entries are made in the system, the query processing becomes much simpler.
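A minimal sketch of this summation model in Python (entry values taken from the example above): the overlap is recorded as a negative-length correction entry, and the used size is then just the sum of all lengths, which works equally well in batch or stream processing.

```python
# Each entry is an (offset, length) pair; overlaps are compensated
# by negative-length correction entries.
entries = [
    (5, 10),    # offset 5, length 10
    (10, 10),   # offset 10, length 10
    (10, -5),   # correction for the 5-byte overlap at offset 10
]

def used_size(entries):
    """Used bytes = plain summation of lengths, batch or streaming."""
    return sum(length for _, length in entries)
```

Here used_size(entries) gives 10 + 10 - 5 = 15, matching the hand calculation.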
Let us now consider another approach, where we try to find the used space size without making negative references.
We calculate it on the fly. For example, we take overlapping intervals and translate them to non-overlapping contiguous intervals.
One example of this is to adjust the above as follows:
offset 5, length 5
offset 10, length 10
In this case we still keep the summation as a useful aggregation in the query but now we have to edit the original entries in the table.
Let us take a third approach and shift the emphasis on querying without modifying the entries.
Here we sort the byte ranges by offset, and as we encounter subsequent entries with overlaps, we massage them into non-overlapping ranges to accumulate the used space.
For example,
List<Interval> Merge(List<Interval> slices)
{
    var s = new Stack<Interval>();
    slices.Sort((a, b) => a.Start.CompareTo(b.Start)); // by starting offset
    if (slices.Count == 0) return new List<Interval> { new Interval(0, 0) };
    if (slices.Count == 1) return new List<Interval> { slices[0] };
    s.Push(slices[0]);
    for (int i = 1; i < slices.Count; i++)
    {
        var top = s.Peek();
        if (top.End < slices[i].Start)
        {
            // no overlap: start a new non-overlapping range
            s.Push(slices[i]);
        }
        else
        {
            // overlap: extend the range already on top of the stack
            top.End = Math.Max(top.End, slices[i].End);
        }
    }
    return s.ToList();
}
offset 5, length 10
offset 10, length 10
implies offset 5, length 15

offset 5, length 10
offset 20, length 5
implies the same two ranges, since there is no overlap
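The same merge can be sketched in Python, together with the used-size query it enables (names are mine, not from the post):

```python
def merge(slices):
    """Merge (offset, length) ranges into non-overlapping
    (offset, length) ranges, sorted by offset."""
    intervals = sorted((off, off + length) for off, length in slices)
    merged = []
    for start, end in intervals:
        if merged and start <= merged[-1][1]:
            # overlap or adjacency: extend the last range
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(start, end - start) for start, end in merged]

def used_bytes(slices):
    """Used space = sum of lengths of the merged ranges."""
    return sum(length for _, length in merge(slices))
```

The two examples above come out as expected: (5, 10) and (10, 10) merge into (5, 15), while (5, 10) and (20, 5) stay separate, and both total 15 used bytes.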





Friday, June 1, 2018

We mentioned BigTable and BigQuery in earlier posts where we discussed the purpose of each as forms of storage and processing. It may be interesting to note that for a developer, a table and a query represent common notions of processing. It is the equivalent of data structures and algorithms, or storage and compute, at any level of the software stack. The only refinement is that the table represents transactional operations on entries, which are separated from the read-only analytical processing, some of which may involve heavy aggregations. These are performed independently of each other, and consequently they do not interfere with each other. One of the advantages of this separation is that the analytical costs are borne by the users of the query. But that is not all. The query parameters can be requested all the way from the end users of the system, even via web APIs. This means that the logic can be used over and over again for different callers. New queries can be written when earlier ones don't suffice. It becomes cheap to write queries for the desired results without affecting any of the existing data-gathering operations. The ability to expose query parameters and logic to the end users results in new use cases that the system does not need to know about or implement, while providing as detailed information to those users as possible.
The use of cloud databases and their abilities significantly improves the appeal for the storage of data because there are no limits while providing the guarantees of a database. In addition, the developer notion of storage gets as simplified as possible with all the associated routines maintained as much as possible.
Many systems cannot afford the luxury of cloud databases because they are too low-level in the system architecture or they cannot afford the tools and processes surrounding data handling for a database. In such cases, we can consider the notion of a table as a data source and use alternate forms of storage. These data sources are essentially a collection, and we need only as many guarantees around the collection as the producer and the consumer of the data want. There is no limitation to the storage as long as we can separate the processing of the data into read-only operations.
Finally, there is a universal appeal to keeping a collection of entries in that it becomes easy to implement logic in the form of standard query operators, with methods such as .Where() and .Sum() for predicates and aggregations. The acceptance of these operators across stacks and languages indicates that there are common expressions of querying logic, which enables us to automatically use the most flexible form for future growth.
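For illustration, the predicate-and-aggregation style the post attributes to .Where() and .Sum() looks like this in Python (a sketch over an in-memory collection; the field names are hypothetical):

```python
# A collection of usage entries; any list of records works as a data source.
entries = [
    {"user": "a", "bytes": 100},
    {"user": "b", "bytes": 250},
    {"user": "a", "bytes": 50},
]

# Predicate (.Where) followed by aggregation (.Sum), as a read-only
# query over the collection -- no changes to the underlying data.
total_for_a = sum(e["bytes"] for e in entries if e["user"] == "a")
```

The same expression can be written with LINQ in C#, streams in Java, or array methods in JavaScript, which is the cross-stack commonality the paragraph above points to.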
#codingexercise
We discussed yesterday a technique to partition the array into the most nearly balanced sub-arrays possible by making two passes over the array.
Another way to do this is to evaluate each candidate position of the split to see if we can get the best result.