Wednesday, July 6, 2016

#codingexercise
Remove duplicate characters from a list
List<char> removeDuplicates(List<char> input)
{
    // Distinct keeps the first occurrence of each character.
    return input.Distinct().ToList();
}
void removeDuplicatesInPlace(ref List<char> input)
{
    for (int i = 0; i < input.Count; i++)
    {
        // nextindex returns the index of the next repetition of input[i] after i, or -1 if none.
        int rep = nextindex(input, i);
        while (rep != -1)
        {
            // Shift the remaining characters left over the repetition and pad the end with a null character.
            shift_chars_left_and_end_with_null(ref input, rep);
            rep = nextindex(input, i);
        }
    }
}
void removeDupsSorted(ref List<char> str)
{
    if (str.Count == 0) return;
    int result = 1;
    int index = 1;
    while (index < str.Count)
    {
        // In a sorted list, duplicates are adjacent; keep a character only when it differs from its predecessor.
        if (str[index] != str[index - 1])
        {
            str[result] = str[index];
            result++;
        }
        index++;
    }
    // Drop the unused tail instead of terminating with '\0'.
    str.RemoveRange(result, str.Count - result);
}

We were discussing deduplication with data differencing:
Let us introduce the operator d for deduplication.
If the original file is represented by alpha, deduplication yields
beta = d(alpha)
After some time, when alpha has undergone edits to become alpha', deduplication yields
beta' = d(alpha')
Let the differencing operator be delta.
Differencing now proceeds on the deduplicated forms as delta(beta' - beta).
It can be shown that the delta operation is lossless. This follows from deduplication itself being lossless: the original file alpha can be reconstructed as alpha = d-inverse(beta), where d-inverse is the reconstitution function.
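To make the lossless claim concrete, here is a minimal sketch of segment-based deduplication and reconstitution. The 2048-byte segment size, the SHA-256 keying, and the class and method names are illustrative assumptions, not the actual implementation discussed above; the point is only that reconstituting the recipe reproduces the original byte sequence exactly, which is what the d-inverse notation relies on.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;

class DedupStore
{
    const int SegmentSize = 2048;
    readonly Dictionary<string, byte[]> segments = new Dictionary<string, byte[]>();

    // d(alpha): split the file into fixed-size segments and return the ordered list of segment keys.
    public List<string> Deduplicate(byte[] file)
    {
        var recipe = new List<string>();
        using (var sha = SHA256.Create())
        {
            for (int offset = 0; offset < file.Length; offset += SegmentSize)
            {
                int len = Math.Min(SegmentSize, file.Length - offset);
                var segment = new byte[len];
                Array.Copy(file, offset, segment, 0, len);
                string key = Convert.ToBase64String(sha.ComputeHash(segment));
                if (!segments.ContainsKey(key))
                    segments[key] = segment;   // each unique segment is stored exactly once
                recipe.Add(key);
            }
        }
        return recipe;
    }

    // d-inverse(beta): concatenate the stored segments in recipe order to recover alpha.
    public byte[] Reconstitute(List<string> recipe)
    {
        return recipe.SelectMany(key => segments[key]).ToArray();
    }
}

With this, beta = store.Deduplicate(alpha) and store.Reconstitute(beta) returns a byte-for-byte copy of alpha.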
We don't attempt to generate all possible sequences by manipulating every byte of a full-length segment, because that would be on the order of 255^2048 possibilities. It is not the size that is the deterrent for such a lookup table, because that lookup table could be made universal; rather, most occurrences of real data fall within a fairly small subset, with large variations between consecutive edits over time. Instead we rely on previously encountered patterns that are repeated.
Coming back to the discussion of the lossless operation: the differencing operation can detect all the differences between the original files using their deduplicated transformations, because the segments into which the files are divided are each read for comparison without omission.
Furthermore, the differencing proceeds segment by segment, just as it would if the original file were read in byte ranges.
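As a sketch of how differencing might walk the deduplicated forms segment by segment, assume the deduplicated form of each file is an ordered list of segment keys, as in the previous sketch; the Differencer and SegmentDiff names are hypothetical.

using System;
using System.Collections.Generic;

// A hypothetical record of one differing segment position between beta and beta'.
class SegmentDiff
{
    public int Position;
    public string OldKey;
    public string NewKey;
}

static class Differencer
{
    // delta(beta' - beta): walk the two recipes position by position without omission,
    // so every change in the original files shows up as at least one differing segment key.
    public static List<SegmentDiff> Diff(List<string> beta, List<string> betaPrime)
    {
        var diffs = new List<SegmentDiff>();
        int n = Math.Max(beta.Count, betaPrime.Count);
        for (int i = 0; i < n; i++)
        {
            string oldKey = i < beta.Count ? beta[i] : null;            // segment present only in beta (removed)
            string newKey = i < betaPrime.Count ? betaPrime[i] : null;  // segment present only in beta' (appended)
            if (oldKey != newKey)
                diffs.Add(new SegmentDiff { Position = i, OldKey = oldKey, NewKey = newKey });
        }
        return diffs;
    }
}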
Now let us consider that we can generate the differencing codes by separating the original dataset into two equal-sized groups x and y, for which we compute two sets of differencing coefficients k and r. The purpose of separating the data fragments into two new groups is that it helps generate the differences both globally and locally, so that even if we treat some of the data fragments as missing and skip over them, the result is still sufficient to reconstruct the target. In a way this works like error-correction codes, in that the transformations ensure that some of the original data can be tolerated as missing or deliberately skipped. Again, 'data' here refers to deduplicated segments and not the actual file content. The idea is that we can aid differencing by tolerating up to m skipped fragments, and use that allowance to skip differencing of data fragments that we guess to be low-entropy. A sketch of this grouping idea follows.
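The following sketch illustrates only the intent of the grouping: the even/odd split standing in for groups x and y, the low-entropy guess predicate, and the tolerance parameter m are assumptions layered on the earlier sketches, not the actual coefficients k and r described above. It reuses the SegmentDiff class from the previous sketch.

using System;
using System.Collections.Generic;

static class GroupedDifferencer
{
    // Split the recipe into two groups (here simply by even/odd position), diff each group
    // independently, and allow up to m positions guessed to be low-entropy to be skipped.
    public static List<SegmentDiff> Diff(List<string> beta, List<string> betaPrime,
                                         Func<int, bool> guessLowEntropy, int m)
    {
        var diffs = new List<SegmentDiff>();
        int skipped = 0;
        foreach (int parity in new[] { 0, 1 })   // group x = even positions, group y = odd positions
        {
            int n = Math.Max(beta.Count, betaPrime.Count);
            for (int i = parity; i < n; i += 2)
            {
                if (skipped < m && guessLowEntropy(i)) { skipped++; continue; }  // tolerate up to m skips
                string oldKey = i < beta.Count ? beta[i] : null;
                string newKey = i < betaPrime.Count ? betaPrime[i] : null;
                if (oldKey != newKey)
                    diffs.Add(new SegmentDiff { Position = i, OldKey = oldKey, NewKey = newKey });
            }
        }
        return diffs;
    }
}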

#codingexercise
static List<int> Distinct(this List<int> nums)
{
    // Extension methods must be declared in a static class.
    var h = new HashSet<int>();
    for (int i = 0; i < nums.Count; i++)
    {
        if (h.Contains(nums[i]) == false)
            h.Add(nums[i]);
    }
    return h.ToList();
}
