Sunday, March 20, 2016

We continue reading the paper "Erasure coding in Windows Azure storage"  by Cheng Huang, Huseyin Simitci, Yikang Xu, Aaron Ogus, Brad Calder, Parikshit Gopalan, Jin Li, and Sergey Yekhanin.
Erasure coding is a data protection method in which data is broken into fragments, expanded and encoded with redundant pieces, and stored across a set of different locations or media. This paper discusses erasure coding in Windows Azure Storage (WAS) and introduces new codes called Local Reconstruction Codes. A (k,l,r) LRC divides k data fragments into l local groups. It encodes l local parities, one for each local group, and r global parities.

We now look at the cost and performance trade-offs. Each (k,l,r) set yields one combination of reliability, reconstruction cost, and storage overhead; varying the parameters gives many such combinations. Because each fragment has to be placed in a different fault domain, the number of fault domains in a cluster limits the total number of fragments in the code. The authors use 20 as the limit, since their storage stamps have up to 20 fault domains. They further keep only those parameter sets that yield reliability equal to or higher than 3-replication, and then plot the storage overhead and reconstruction cost of the remaining sets. Each individual point represents one set of coding parameters, which in turn represents a certain trade-off between storage cost and reconstruction performance. Since these points vary widely, the authors consider only the lower bound of the trade-off curve.
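To make that enumeration concrete, here is a minimal sketch, not from the paper, that walks candidate (k,l,r) parameters under the 20-fragment limit and prints two of the quantities the authors plot: the storage overhead (k+l+r)/k and the reads needed to reconstruct a single data fragment, which is k/l when the local groups are equal-sized. The reliability filter against 3-replication is omitted here because it needs the paper's reliability model.

// Sketch only: assumes k divisible by l (equal local groups) and that a single
// data-fragment repair reads the k/l fragments of its local group.
using System;

class LrcTradeoffSketch
{
    static void Main()
    {
        const int maxFragments = 20; // storage stamps have up to 20 fault domains
        for (int k = 2; k <= maxFragments; k++)
            for (int l = 1; l <= k; l++)
                for (int r = 1; r <= maxFragments; r++)
                {
                    if (k % l != 0) continue;               // keep local groups equal-sized
                    if (k + l + r > maxFragments) continue; // one fragment per fault domain
                    double overhead = (double)(k + l + r) / k;  // storage cost relative to the data
                    int reconstructionCost = k / l;             // reads for a single data-fragment repair
                    Console.WriteLine($"({k},{l},{r}): overhead={overhead:F2}x, reconstruction={reconstructionCost} reads");
                }
    }
}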

#coding exercise:
Given a set of paths in a graph, normalize them so that they do not contain repetitions.
We represent vertices as characters and paths as strings,
then we look for matching segments between each pair of paths.
using System.Collections.Generic;
using System.Linq;

List<string> Normalize(List<string> paths)
{
    var ret = new List<string>();
    for (int i = 0; i < paths.Count; i++)
       for (int j = 0; j < paths.Count; j++)
       {
          // don't compare me to myself
          if (i != j)
          {
              // List<string>.Intersect() may give non-contiguous paths, so collect contiguous overlaps instead
              var matches = GenerateIntersections(paths[i].ToList(), paths[j].ToList());
              foreach (var match in matches)
              {
                  var overlap = new string(match.ToArray());
                  if (!ret.Contains(overlap))
                      ret.Add(overlap);
              }
          }
       }
    return ret;
}
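As a quick check, a hypothetical call with two overlapping paths would behave like this:

var paths = new List<string> { "abcdef", "cdefgh" };
var overlaps = Normalize(paths);
// overlaps now holds every shared contiguous run between the pair: "cdef", "def", "ef", "f"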

GenerateIntersections does not need to be O(n^2), where n is the length of the paths. It can be linear, because the finite automaton built over both strings captures exactly the overlaps we are interested in; there are string-matching algorithms that construct such an automaton in linear time, and the automaton is smaller than the combined length of the paths.
With the given automaton, we can find the overlaps and use them to find the overlapping paths.
Assuming the automaton is passed in as character lists to the function below, we get the intersections as follows (a linear-time sketch also follows after the code):
List<List<char>> GenerateIntersections(List<char> a, List<char> b)
{
    var ret = new List<List<char>>();
    for (int i = 0; i < a.Count; i++)
       for (int j = 0; j < b.Count; j++) // we can also make it linear by skipping ahead.
       {
          int m = i;
          int n = j;
          var match = new List<char>();
          // extend the match while we stay in bounds and the characters agree
          while (m < a.Count && n < b.Count && a[m] == b[n])
          {
              match.Add(a[m]);
              m++;
              n++;
          }
          // List<T>.Contains compares references, so compare contents with SequenceEqual (needs System.Linq)
          if (match.Count > 0 && !ret.Any(x => x.SequenceEqual(match)))
              ret.Add(match);
       }
    return ret;
}
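As one concrete, hedged instance of the linear-time idea above (not the method used in the exercise code), the KMP failure function finds in O(|a| + |b|) the longest suffix of one path that is a prefix of the other, which is the repeated piece to drop when stitching the two paths together:

// Sketch: longest overlap where a suffix of `a` matches a prefix of `b`,
// computed with the KMP failure (prefix) function over b + "#" + a.
// Assumes '#' does not occur in either path.
int LongestSuffixPrefixOverlap(string a, string b)
{
    string s = b + "#" + a;
    var fail = new int[s.Length];
    for (int i = 1; i < s.Length; i++)
    {
        int k = fail[i - 1];
        while (k > 0 && s[i] != s[k]) k = fail[k - 1];
        if (s[i] == s[k]) k++;
        fail[i] = k;
    }
    return fail[s.Length - 1]; // length of the overlap
}

For example, LongestSuffixPrefixOverlap("abcde", "cdefg") returns 3, the length of the shared segment "cde".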
