Wednesday, March 16, 2016

We continue reading the paper "Erasure Coding in Windows Azure Storage" by Cheng Huang, Huseyin Simitci, Yikang Xu, Aaron Ogus, Brad Calder, Parikshit Gopalan, Jin Li, and Sergey Yekhanin.
Erasure coding is a data protection method where data is broken into fragments, expanded and encoded with redundant data pieces, and stored across a different set of locations or media. This paper talks about erasure coding in Windows Azure Storage (WAS). We were discussing how to construct the coding equations. The paper gives four coding equations, and the coding coefficients are determined by considering three cases: none of the four parities fails, only one of px and py fails, and both px and py fail. The coefficients of the first set are chosen among the elements whose lower-order 2 bits are zero, so that the sum of any two of them also has its lower-order 2 bits equal to zero. The coefficients of the second set are chosen among the elements whose higher-order 2 bits are zero, so that their sums likewise keep the higher-order 2 bits zero. Hence a coefficient, or a sum of coefficients, from one set can never equal one from the other. The advantage is that this way of constructing the coding equations requires only a very small finite field, GF(2^4), which makes the implementation practical, and it lets LRC decode arbitrary three failures. Whether a given failure pattern is information-theoretically decodable at all is the criterion discussed in the March 14 post below.
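As a concrete check of this construction, here is a short C# sketch, my own illustration rather than code from the paper. It treats GF(2^4) elements as 4-bit integers with XOR as field addition, enumerates the two coefficient sets from the bit conditions above, and brute-force verifies the non-singularity conditions listed in the March 15 post below:

using System;
using System.Linq;

static void VerifyCoefficientSets()
{
    // Nonzero elements of GF(2^4) whose lower-order 2 bits are zero: {4, 8, 12}.
    var alphas = Enumerable.Range(1, 15).Where(e => (e & 0b0011) == 0).ToList();
    // Nonzero elements whose higher-order 2 bits are zero: {1, 2, 3}.
    var betas = Enumerable.Range(1, 15).Where(e => (e & 0b1100) == 0).ToList();

    // No coefficient is zero (Range starts at 1), and no alpha equals any beta.
    bool distinct = alphas.All(a => betas.All(b => a != b));

    // No sum (XOR in GF(2^4)) of two distinct alphas equals a sum of two distinct betas.
    bool sumsDistinct = (from a1 in alphas
                         from a2 in alphas
                         where a1 != a2
                         from b1 in betas
                         from b2 in betas
                         where b1 != b2
                         select (a1 ^ a2) != (b1 ^ b2)).All(ok => ok);

    Console.WriteLine($"distinct: {distinct}, sums distinct: {sumsDistinct}");  // both True
}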

Tuesday, March 15, 2016

We continue reading the paper "Erasure Coding in Windows Azure Storage" by Cheng Huang, Huseyin Simitci, Yikang Xu, Aaron Ogus, Brad Calder, Parikshit Gopalan, Jin Li, and Sergey Yekhanin.
Erasure coding is a data protection method where data is broken into fragments, expanded and encoded with redundant data pieces, and stored across a different set of locations or media. This paper talks about erasure coding in Windows Azure Storage (WAS). We were discussing the design decisions, such as why the Reed-Solomon code was not sufficient and why the paper introduced a new one. We also saw how the Local Reconstruction Code introduces group fragment parities and improves the reconstruction cost. This comes with some storage overhead but provides more options with a new trade-off point. We discussed the coding equations and LRC's maximally recoverable property, which is a way of saying LRC can decode any failure pattern that is theoretically possible to reconstruct.
The challenge was to find a set of coding equations that achieves the maximally recoverable property.
These coding equations were found as follows:
LRC can tolerate arbitrary three failures by choosing two sets of coding coefficients, a0, a1, a2 for group x and b0, b1, b2 for group y. Within the parities, each data fragment appears in one of three forms: multiplied by its coefficient, multiplied by the square of its coefficient, or with no coefficient at all. The same holds for the other group with its own coefficients. This gives six coefficients across the two groups and four coding equations.
The LRC's coding equations are then: p0, the sum of the data fragments of both groups weighted by the linear coefficients; p1, the same sum weighted by the squared coefficients; px, the plain (coefficient-free) sum of group x; and py, the plain sum of group y.
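To make the four equations concrete, here is a small C# sketch over single-symbol fragments. This is my own illustration rather than code from the paper: it assumes GF(2^4) elements are represented as 4-bit integers, field addition is XOR, and field multiplication is carry-less multiplication reduced by the polynomial x^4 + x + 1; real fragments would be arrays of such symbols.

// Multiply two elements of GF(2^4) represented as 4-bit integers.
static int GfMul(int a, int b)
{
    int p = 0;
    for (int i = 0; i < 4; i++)
    {
        if ((b & 1) != 0) p ^= a;         // add a for this bit of b
        b >>= 1;
        a <<= 1;
        if ((a & 0x10) != 0) a ^= 0x13;   // reduce modulo x^4 + x + 1
    }
    return p;
}

// x, y: the two groups of three data fragments; a, b: their coding coefficients.
static int[] ComputeParities(int[] x, int[] y, int[] a, int[] b)
{
    int p0 = 0, p1 = 0, px = 0, py = 0;
    for (int i = 0; i < 3; i++)
    {
        px ^= x[i];                                   // plain sum of group x
        py ^= y[i];                                   // plain sum of group y
        p0 ^= GfMul(a[i], x[i]) ^ GfMul(b[i], y[i]);  // linear coefficients
        p1 ^= GfMul(GfMul(a[i], a[i]), x[i])          // squared coefficients
            ^ GfMul(GfMul(b[i], b[i]), y[i]);
    }
    return new[] { p0, p1, px, py };
}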
These coefficients can then be found by considering the following failure scenarios:
None of the four parities fails.
Only one of px and py fails.
Both px and py fail.
To ensure all these cases are decodable, all the corresponding decoding matrices must be non-singular, which leads to the following conditions:
None of the coefficients is zero.
None of the coefficients of group x equals a coefficient of group y.
The sum of any two coefficients of group x must not equal the sum of any two coefficients of group y.
#codejam problem
Determine the sum of each list in a list of integer lists.
// Extension method (must be declared in a static class); requires System.Linq
// for Sum(). Maps each inner list to the sum of its elements.
public static List<int> Sums(this List<List<int>> numbers)
{
    var ret = new List<int>();
    numbers.ForEach(x => ret.Add(x.Sum()));
    return ret;
}
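A quick usage example, assuming the extension method above is declared in a static class:

var totals = new List<List<int>> { new List<int> { 1, 2 }, new List<int> { 3, 4, 5 } }.Sums();
// totals is [3, 12]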
Could we consider implementing extension methods in C++?
C++ has no direct equivalent. If we cannot inherit from a class or struct, we can still wrap it in another class, or, more idiomatically, add behavior through non-member functions that take the object as their first argument. Free functions and function objects therefore play the role that extension methods play in C#.

Monday, March 14, 2016

We continue reading the paper "Erasure Coding in Windows Azure Storage" by Cheng Huang, Huseyin Simitci, Yikang Xu, Aaron Ogus, Brad Calder, Parikshit Gopalan, Jin Li, and Sergey Yekhanin.
Erasure coding is a data protection method where data is broken into fragments, expanded and encoded with redundant data pieces, and stored across a different set of locations or media. This paper talks about erasure coding in Windows Azure Storage (WAS). We were discussing the design decisions, such as why the Reed-Solomon code was not sufficient and why the paper introduced a new one. We also saw how the Local Reconstruction Code introduces group fragment parities and improves the reconstruction cost. This comes with some storage overhead but provides more options with a new trade-off point.
We now look at fault tolerance. We saw which data fragments are used to compute each parity in LRC. We now see how the parities are computed from the data fragments. These are called the coding equations. They have the maximally recoverable property. By this property, LRC can decode any failure pattern which is theoretically decodable. For example, a (6,2,2) LRC contains four parity fragments and can tolerate up to four failures, but these failures cannot be arbitrary. If the failures are on x1, x2, x3, and px, then this failure pattern is non-decodable: the only two available parities are the global ones, py cannot help, and it is impossible to decode three missing data fragments from merely two parity fragments. Such failure patterns are called information-theoretically non-decodable.
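The counting argument in this example can be sketched in a few lines of C#. This is my own illustration for the (6,2,2) code, not the paper's algorithm: it assumes a surviving local parity can stand in for at most one erased data fragment in its own group, and that every remaining erasure must be covered by a distinct surviving global parity.

using System;

// Returns whether a failure pattern is information-theoretically decodable.
static bool IsDecodable(int lostInGroupX, bool pxLost, int lostInGroupY, bool pyLost, int lostGlobals)
{
    int survivingGlobals = 2 - lostGlobals;
    int uncoveredX = Math.Max(0, lostInGroupX - (pxLost ? 0 : 1));  // erasures px cannot absorb
    int uncoveredY = Math.Max(0, lostInGroupY - (pyLost ? 0 : 1));
    return uncoveredX + uncoveredY <= survivingGlobals;
}

// The example above, x1, x2, x3 and px lost: IsDecodable(3, true, 0, false, 0)
// evaluates 3 + 0 <= 2, which is false, i.e. non-decodable.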

#codejam problem
We were discussing sums generated from number sequences.
This has a bottom-up approach as well, based on tracking the range of sums that can be generated.
In general, if the range of sums that can be generated using the first k numbers (in non-decreasing order) is 0 to r, and the next number has value x, then either:
• x <= r + 1, and it is possible to generate any sum from 0 to r + x using the first k + 1 numbers, or
• x > r + 1, and it is not possible to generate the sum r + 1, since x and all the following numbers have greater values.

// Returns the smallest sum that cannot be generated from the given numbers,
// applying the observation above to the numbers in non-decreasing order.
static int MaxSumImpossible(List<int> numbers)
{
    numbers.Sort();
    int r = 0;                  // every sum in 0..r is currently generatable
    foreach (var x in numbers)
    {
        if (x > r + 1) break;   // r + 1 can no longer be reached
        r += x;                 // the generatable range extends to 0..r+x
    }
    return r + 1;
}

Sunday, March 13, 2016

We continue reading the paper "Erasure Coding in Windows Azure Storage" by Cheng Huang, Huseyin Simitci, Yikang Xu, Aaron Ogus, Brad Calder, Parikshit Gopalan, Jin Li, and Sergey Yekhanin.
Erasure coding is a data protection method where data is broken into fragments, expanded and encoded with redundant data pieces, and stored across a different set of locations or media. This paper talks about erasure coding in Windows Azure Storage (WAS). We were discussing the design decisions, such as why the Reed-Solomon code was not sufficient and why the paper introduced a new one. Today we continue to read about Local Reconstruction Codes.
The Reed-Solomon code illustrates the concept of reconstruction cost. In that method, each parity is computed from all of the data fragments, so a code with k data fragments requires all k of them to compute a parity. When any fragment goes down, the number of fragments required to reconstruct it is k. This is the reconstruction cost.
In LRC, the cost is reduced by reducing the number of fragments used to compute a parity: a subset of the data fragments is used to compute some of the parities. For 6 data fragments, LRC generates 4 parities instead of the 3 a (6,3) Reed-Solomon code would use. The first two are global parities and are computed from all the data fragments. For the other two, LRC divides the data fragments into two equal-size groups and computes one local parity for each group. Instead of reading a global parity and all of the remaining data fragments, it is now enough to read the local parity of the group and the two surviving data fragments in that group to reconstruct a missing fragment. The reconstruction of any single data fragment then requires only three fragments, half the number required by the Reed-Solomon code. The additional parity block adds storage overhead, but this is an altogether different trade-off point from before.
To generalize the subgroups and the extra parity needed, a (k,l,r) LRC divides k data fragments into l groups, with k/l data fragments in each group. It computes one local parity for each group. In addition, it computes r global parities from all the data fragments. The reconstruction cost reduces from k to k/l, with a storage overhead of (k+l+r)/k.
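Here is a minimal sketch of that local reconstruction, under the simplifying assumption (consistent with the coding equations discussed in the posts above) that the local parity is a plain XOR of the equal-length fragments in its group. A single missing fragment is recovered from the group's surviving fragments plus the local parity:

using System.Collections.Generic;

// XOR out each surviving fragment from the local parity; what remains is the
// missing fragment. This reads k/l fragments instead of the k Reed-Solomon needs.
static byte[] ReconstructFromGroup(List<byte[]> survivingFragments, byte[] localParity)
{
    var missing = (byte[])localParity.Clone();
    foreach (var fragment in survivingFragments)
        for (int i = 0; i < missing.Length; i++)
            missing[i] ^= fragment[i];
    return missing;
}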
#codingexercise
We discussed the codejam problem yesterday (recovering the original set from its powerset sums) and mentioned a solution such as:
// a: the sorted powerset sums; relies on the Combinations() and Sums()
// extension methods to enumerate the subset sums generated so far.
List<int> GetOriginalSet(List<int> a)
{
    var ret = new List<int>();
    for (int i = 0; a.Count > 0 && i < a[a.Count - 1]; i++)
    {
        // take i as a new element only if it appears among the given sums
        // and is not already explained by sums of the elements found so far
        if (ret.Combinations().Sums().Contains(i) == false && a.Contains(i))
        {
            ret.Add(i);
        }
    }
    return ret;
}
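The snippet relies on a Combinations() helper that is not defined in these posts. A sketch of what it is assumed to do, namely enumerate every subset of the list (the power set), including the empty set:

// Enumerate all subsets of the list using one bitmask per subset.
public static List<List<int>> Combinations(this List<int> items)
{
    var result = new List<List<int>>();
    for (int mask = 0; mask < (1 << items.Count); mask++)
    {
        var subset = new List<int>();
        for (int bit = 0; bit < items.Count; bit++)
            if ((mask & (1 << bit)) != 0) subset.Add(items[bit]);
        result.Add(subset);
    }
    return result;
}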


Saturday, March 12, 2016

Today we continue reading the paper "Erasure Coding in Windows Azure Storage" by Cheng Huang, Huseyin Simitci, Yikang Xu, Aaron Ogus, Brad Calder, Parikshit Gopalan, Jin Li, and Sergey Yekhanin.
Erasure coding is a data protection method where data is broken into fragments, expanded and encoded with redundant data pieces, and stored across a different set of locations or media. This paper talks about erasure coding in Windows Azure Storage (WAS). We saw that there are huge cost savings with this method because it replaces the three full copies and uses less storage overhead. The trade-off for using erasure coding instead of keeping three full copies is performance. The performance hit comes when dealing with a lost or offline data fragment and with hot storage nodes. In erasure coding, data is broken into k data fragments and a set of parity fragments. In WAS, a data fragment may be lost due to a disk, node, or rack failure. In addition, cloud services are frequently upgraded, which may cause data fragments to go offline for seconds to a few minutes. An online read would then need to be served by dynamically reconstructing the data being asked for. This reconstruction needs to be as fast as possible, use as little network bandwidth and as few I/Os as possible, and meet the required SLAs. Separately, if many reads target a fragment on a specific storage node, that node can become hot, which leads to latency. One way to reduce this would be to age the fragments appropriately and tier the data between memory and external storage. Another way would be to load balance. But until the fragment stops being hot, read performance can suffer. Instead, if the reconstruction could be done in parallel while the fragment was being fetched from the hot storage node, then whichever result returns earlier can be served to the client.
The problem is that the reconstruction is only as fast as the slowest storage node to respond, and storage costs must be kept low as well. We could use Reed-Solomon codes that keep 12 data fragments and 4 code fragments. However, reconstruction would then require all of those 12 data fragments, which incurs higher network costs and I/Os, as well as a higher chance of hitting a hot node.
Therefore a new family of codes was required, with the following design requirements: first, reduce the minimal number of fragments that need to be read to reconstruct a data fragment; second, provide a significant reduction in storage overhead while maintaining higher durability than a system that keeps three copies of the data.
The authors call this family the Local Reconstruction Codes (LRC).
#codejam problem
Given a set of integers, the powerset is the set of all subsets, including the empty set and the set itself.
To generate the powerset of the elements we can use the combinations method, combining different numbers of elements; we just have to include the empty set and the set itself.
Now we are given the sums of each individual subset of a powerset, in sorted order, together with the frequency of each sum. Can we determine the original elements?
For example, a powerset with sums
0, 1, 2, 3, 4, 5, 6, 7 and frequencies
1, 1, 1, 1, 1, 1, 1, 1
results in the set 1, 2, 4.
Solution:
The result set is non-decreasing, like the original powerset sums.
The number of zero elements in the result, beyond the one zero sum contributed by the empty set, can be read off from the frequency of the first sum.
For each sum given, we can expand the frequencies so that every subset sum appears once in a working multiset.
Then we proceed incrementally with a running list of the subset sums generated by the elements found so far, initialized with the empty sum. The smallest sum in the multiset that is not yet accounted for must be the next element. When an element is found, every previously generated sum spawns a new sum offset by that element, and those are removed from the multiset. Repeating this until the multiset is empty yields the original elements.
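A C# sketch of this reconstruction, assuming strictly positive elements and that the sums have already been expanded from their frequencies into a non-decreasing list (the names here are my own):

using System.Collections.Generic;
using System.Linq;

static List<int> RecoverSet(List<int> sums)
{
    // Multiset of all 2^n subset sums.
    var pool = new SortedDictionary<int, int>();
    foreach (var s in sums) pool[s] = pool.TryGetValue(s, out var c) ? c + 1 : 1;
    Remove(pool, 0);                       // the empty set contributes one 0
    var generated = new List<int> { 0 };   // subset sums of the elements found so far
    var result = new List<int>();
    while (pool.Count > 0)
    {
        int x = pool.Keys.First();         // smallest unexplained sum is the next element
        result.Add(x);
        var next = new List<int>(generated);
        foreach (var g in generated)
        {
            next.Add(g + x);               // each old sum spawns one offset by x
            Remove(pool, g + x);           // ...and is now accounted for
        }
        generated = next;
    }
    return result;
}

static void Remove(SortedDictionary<int, int> m, int key)
{
    if (--m[key] == 0) m.Remove(key);
}

For the example above, RecoverSet(new List<int> { 0, 1, 2, 3, 4, 5, 6, 7 }) returns 1, 2, 4.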

Friday, March 11, 2016

Today we start reading the paper "Erasure Coding in Windows Azure Storage" by Cheng Huang, Huseyin Simitci, Yikang Xu, Aaron Ogus, Brad Calder, Parikshit Gopalan, Jin Li, and Sergey Yekhanin.
Erasure coding is a data protection method where data is broken into fragments, expanded and encoded with redundant data pieces, and stored across a different set of locations or media. This paper talks about erasure coding in Windows Azure Storage (WAS), where customers can store unlimited data for any amount of time and pay only for what they use. Erasure codes help with the durability of the data and keep the costs low. The paper introduces a new set of codes for erasure coding called Local Reconstruction Codes (LRC). LRC reduces the number of erasure coding fragments that need to be read when reconstructing data fragments that are offline. It also reduces the storage overhead involved. LRC reduces the bandwidth and I/O required for repair reads over prior codes; this, together with the low overhead, is the main benefit of LRC. LRC also demonstrates consistently low read latencies.
WAS uses LRC. WAS stores all of its data in an append-only distributed file system called the stream layer. Data is appended to the end of active extents, which are replicated three times by the underlying stream layer. These three copies remain until the extent gets filled. Once extents are full, they are sealed and never modified again. A background process lazily generates erasure codes for the sealed extents and deletes the three copies, which are no longer needed.
Erasure coding can reduce the cost of storage by 50%, and given that the size of cloud storage is on the order of exabytes, this is a significant saving.
#codejam question
Given two ominoes, find the minimum number of rearrangements needed to transform one omino into the other.
Ignore rotations and translations of both ominoes. Ominoes are groups of unit squares joined together along one or more edges.

Solution: assume the two ominoes are properly oriented and centered for comparison.
We proceed row by row from bottom to top, comparing counts. If the number of unit squares in the bottom row falls short of the target, squares are borrowed from the row above, or returned otherwise. When the counts are equal, the sequence of squares within the row still has to be made to match.

Thursday, March 10, 2016

Today we continue reading the paper "Horton: Online Query Execution Engine For Large Distributed Graphs" by Sarwat, Elnikety, He and Kliot.
This paper talks about large graphs that don't fit on a single server. The paper introduces Horton, a graph query execution engine. We were reviewing its architecture.
It has four components - the graph client library, the graph coordinator, graph partitions, and the graph manager. We saw that the implementation was in C#, using the Codebook graphs. Codebook is a social network application that represents software engineers, software components, and their interactions in large software projects. It manages information about source code with its revisions, documentation, and the organizational structure of developers. It can answer queries such as "Who wrote this function?", "What libraries depend on this library?", and "Whom should I contact to fix this bug?"
We review the query processing phases. For example,
"Who wrote the specification for the MethodSquare code?" translates to a query such as
C:\ Horton -query *(Code MethodSquare) MentionedBy WordDocument AuthoredBy Person*
and "Who is the manager of the person who closed or created work item bug #673?" translates to
C:\Horton -query "Person Manages Person (Closed | Created) (WorkItemRevision #673)"
The finite state machine for the latter query makes the following transitions:
Node_Type = Person
Edge_Type = Manages
Node_Type = Person
and then branches into two alternatives:
Edge_Type = Closed, Node_Type = WorkItemRevision, Node_ID = #673
Edge_Type = Created, Node_Type = WorkItemRevision, Node_ID = #673
The finite state machine thus has a disjunction for the Closed and Created cases.
If the query were to be optimized, an additional option would be supplied along with '-query'.
The system reports the execution time of each query. The query result is in the form of graph paths (a sequence of graph nodes).
#codejam
Given two orientations of the same omino as in the earlier posts, determine whether the clockwise or the anticlockwise rotation is the shorter transition from one to the other.
// Count quarter-turn rotations in each direction until the first orientation
// matches the second; rotate_matrix and compareTo are assumed helpers from
// the earlier posts. Returns -1 if anticlockwise is shorter, +1 if clockwise
// is shorter, and 0 for a tie.
int countl = 0;
for (int i = 90; i <= 360; i += 90)
{
    var temp = rotate_matrix(first, i);      // anticlockwise by i degrees
    countl += 1;
    if (compareTo(second, temp) == 0) break;
}
int countr = 0;
for (int i = 90; i <= 360; i += 90)
{
    var temp = rotate_matrix(first, -i);     // clockwise by i degrees
    countr += 1;
    if (compareTo(second, temp) == 0) break;
}
if (countl < countr) return -1;
if (countl > countr) return +1;
return 0;