Monday, July 4, 2016

Data differencing with deduplication

Data differencing is the technique of producing the difference between two sets of data: a source and a target. The resulting difference can be applied to the source data to reconstruct the target data. Data compression can be considered a special case of data differencing in which the source data is empty, so the compressed file corresponds to a “difference from nothing”. Compression usually produces a single file that can be uncompressed to get back the target.
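As a minimal sketch (not any particular delta format), a difference can be represented as copy and insert instructions against the source, and applying those instructions to the source reconstructs the target. The producer below is deliberately naive and only reuses a shared prefix of the source; the DeltaOp and Differ names are illustrative.

using System.Collections.Generic;
using System.Linq;

class DeltaOp
{
    public bool IsCopy;      // true: copy a range from the source; false: insert literal bytes
    public int Offset;       // copy: start offset within the source
    public int Length;       // copy: number of bytes to copy
    public byte[] Literal;   // insert: bytes that are not taken from the source
}

static class Differ
{
    // Produce a (non-optimal) delta: copy the longest common prefix, insert the rest.
    public static List<DeltaOp> Diff(byte[] source, byte[] target)
    {
        int prefix = 0;
        while (prefix < source.Length && prefix < target.Length && source[prefix] == target[prefix])
            prefix++;
        var ops = new List<DeltaOp>();
        if (prefix > 0)
            ops.Add(new DeltaOp { IsCopy = true, Offset = 0, Length = prefix });
        if (prefix < target.Length)
            ops.Add(new DeltaOp { IsCopy = false, Literal = target.Skip(prefix).ToArray() });
        return ops;
    }

    // Reconstruct the target from the source plus the delta.
    public static byte[] Apply(byte[] source, List<DeltaOp> delta)
    {
        var bytes = new List<byte>();
        foreach (var op in delta)
        {
            if (op.IsCopy)
                bytes.AddRange(source.Skip(op.Offset).Take(op.Length));
            else
                bytes.AddRange(op.Literal);
        }
        return bytes.ToArray();
    }
}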

Deduplication is about finding segments of data that appear more than once, storing each unique segment only once, and using references (aliases) to those segments to reconstruct the target. It differs from compression in that it does not look for short repeated substrings but for large sections of files, or whole files, that are identical, and it replaces those with references to the single stored copy. Deduplication therefore works at the granularity of large file sections or whole files. Certain backup systems attempt a similar optimization by using file references when files have not changed, but neither backup nor compression reduces the storage footprint of data as effectively as deduplication.
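As a hedged sketch of what that looks like with fixed-size segments (the 4 KB chunk size, SHA-256 hashing, and the ChunkStore name are illustrative choices, not anything prescribed above): split the data into segments, hash each one, keep a single copy per unique hash, and record only the list of references needed to rebuild the file.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;

class ChunkStore
{
    const int ChunkSize = 4096;                                   // illustrative fixed segment length
    readonly Dictionary<string, byte[]> chunks = new Dictionary<string, byte[]>();

    // Store each unique segment once and return the references that stand in for the data.
    public List<string> Deduplicate(byte[] data)
    {
        var refs = new List<string>();
        using (var sha = SHA256.Create())
        {
            for (int offset = 0; offset < data.Length; offset += ChunkSize)
            {
                int length = Math.Min(ChunkSize, data.Length - offset);
                byte[] chunk = data.Skip(offset).Take(length).ToArray();
                string key = Convert.ToBase64String(sha.ComputeHash(chunk));
                if (!chunks.ContainsKey(key))
                    chunks[key] = chunk;                          // first and only copy of this segment
                refs.Add(key);                                    // reference used in place of the data
            }
        }
        return refs;
    }
}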

Deduplication is generally performed on all kinds of data, including backups and compressed files, as a post-processing step by certain deduplication and archival appliances. However, deduplication can also be intrinsic to data differencing, or can even be done prior to differencing.

Consider it the other way around, where deduplication is done on a snapshot of a file. In this case the entire file is available after deduplication, but each subsequent snapshot still increases storage even with deduplication. This is where differencing helps.
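One way to picture differencing on top of deduplication, reusing the hypothetical chunk references from the sketch above (same usings as that sketch): the delta between two snapshots is simply the set of references, and the chunks behind them, that the earlier snapshot does not already account for.

// Sketch: segment-level delta between two snapshots, each given as its list of chunk references.
static List<string> SnapshotDelta(List<string> previousRefs, List<string> currentRefs)
{
    var known = new HashSet<string>(previousRefs);
    // References already present in the earlier snapshot cost nothing new;
    // only the remaining ones point at data that actually has to be stored.
    return currentRefs.Where(r => !known.Contains(r)).ToList();
}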

Deduplication is widely understood to be done on fixed-length segments that can be used to traverse data extents, exposing redundancies at a finer granularity than either backup or compression alone.

One concern about using deduplication directly on primary storage is that there is a significant cost involved in the full reconstitution of files, a process also known as rehydration. However, backups and data differencing are generally not applied to data that needs to be packed and unpacked frequently.
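Rehydration, in terms of the illustrative ChunkStore sketched earlier, is just looking up every referenced segment and concatenating the results; the cost mentioned above amounts to one store lookup per segment.

// Reconstitute the full byte stream from its chunk references (added to the ChunkStore sketch).
public byte[] Rehydrate(List<string> refs)
{
    var bytes = new List<byte>();
    foreach (string key in refs)
        bytes.AddRange(chunks[key]);   // every segment lookup contributes to the rehydration cost
    return bytes.ToArray();
}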

Moreover, differencing applied to deduplicated data can be scaled to achieve faster and more space-efficient storage, because changes are tracked not per file but per segment, or even per collection of segments.

This scaling could possibly be expressed as scaling equations that use a set of coefficients to determine the groups of segments.

Remove spaces from a string
using System.Text;

void RemoveSpaces(StringBuilder str)
{
    int count = 0;                       // next write position for a non-space character
    for (int i = 0; i < str.Length; i++)
    {
        if (str[i] != ' ')
        {
            str[count] = str[i];         // compact non-space characters toward the front
            count++;
        }
    }
    str.Length = count;                  // truncate; no '\0' terminator is needed for StringBuilder
}
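For example:

var s = new StringBuilder("a b  c");
RemoveSpaces(s);
Console.WriteLine(s);   // prints "abc"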

Find the longest repeated substring in text
Build a suffix tree over the text, then traverse it to find the deepest internal node.
void DoTraversal(Node root, int h, ref int max, ref int start)
{
    if (root == null) return;
    if (root.IsInternalNode())
    {
        // Recurse into every child, accumulating the string depth h along the way.
        for (int i = 0; i < MAX_CHAR; i++)
        {
            if (root.Children[i] != null)
                DoTraversal(root.Children[i], h + EdgeLength(root.Children[i]), ref max, ref start);
        }
    }
    else if (root.IsLeaf() && max < h - EdgeLength(root))
    {
        // h - EdgeLength(root) is the string depth of this leaf's parent internal node,
        // i.e. a substring shared by at least two suffixes.
        max = h - EdgeLength(root);
        start = root.Index;              // suffix index recorded on the leaf
    }
}
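Assuming the suffix tree for the text has already been built with the Node shape used above (a root node, leaf suffix indexes, and an EdgeLength helper), the driver would be something like:

int max = 0, start = 0;
DoTraversal(root, 0, ref max, ref start);
string longestRepeat = text.Substring(start, max);   // empty when no substring repeats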

We extend the suffix tree online, one character at a time, as we scan the text. The following is still pseudocode, but shows the shape of each extension step:

// Pseudocode for adding the next character c to the suffix tree built so far.
void ExtendTree(char c, ref Node root)
{
    while (there are remaining suffixes to add)
    {
        if (there is no outgoing edge for c at the current position)
            add a new leaf edge labeled c;        // a new suffix becomes explicit
        else
            extend along the existing edge;       // the suffix is already represented
        reduce the count of remaining suffixes to add;
    }
}
