Monday, July 4, 2016

Data differencing with deduplication

Data differencing is the technique of producing the difference between two sets of data  - a source and a target. The resulting difference data can be used together with the source data to reconstruct the target data.  Data compression can be considered a special case of data differencing where the source data is empty and the resulting compressed file corresponds to a “difference from nothing”. Compression usually results in a single file and can be uncompressed to get back a target.

Deduplication is about finding segments of data that may appear repeated and packing them once with their aliases used to reconstruct the target.  It differs from compression in the sense that it does not look at short repeated substrings but large sections of files that are identical, finds these unique chunks of data and uses their references instead of large data. Therefore deduplication works with larger sections of files or whole files itself. Certain backup systems attempt to use this optimization by using file references if the files have not changed but neither backup nor compression effectively reduces the storage footprint of data as much as deduplication.

Deduplication is generally peformed on all kinds of data including backups and compressed files as an after effect by certain deduplication and archival appliances. However deduplication can be intrinsic to data differencing or can even be done prior to differencing.

Consider the other way around where deduplication is done on a snapshot of a file. In this case, the entire file is available after the deduplication. Subsequent snapshots will increase storage even with deduplication. This is where differencing helps

Deduplication is widely understood to be done in the form fixed length segments that can be used to traverse data extents giving greater articulation of redundancies than either backup or compression alone.

One concern about using deduplication directly on primary storage is that there is a significant cost involved in the full reconstitution of files – a process also known as rehydration.  However, backups and data differencing are generally not applied to data that needs to be packed and unpacked frequently.

Moreover differencing applied to deduplicated data can be scaled to achieve faster and more efficient storage because we are not treating them in terms of changes to files but in segments or even with collections of segments.

This scaling could possibly be done in the form of scaling equations that uses a set of coefficients to determine the groups.

Remove spaces from a string
Void removeSpaces(stringbuilder str)
{
Int count = 0;
For(int i =0; i < str.count; i++)
{
If (str[i] != ' '){
     Str[count] = str[i];
     Count++;
}
}
Str[count] ='/0';
}

Find the longest repeated substring in text
Use traversal by building trie
Void doTraversal(node root, int h, ref int max, ref int start)
{
If (root==null) return;
Int i =0;
If root.isInternalNode(){
For(int i =0; i < max_char; i++)
{
If (root.children[i])
    DoTraversal( root.children[i], h +edge(root.children[i]), ref max. Ref start);
}
}
Else if (root.isLeaf&& max < h-edge(root)){
  Max = h-edge(root);
  Start = root.index;
}
  }

We extend the suffix tree as we encounter chars
Void extendTrie(char n, ref node root){
while( next_char to add){
If ( there is no outgoing edge) {
 Add new}
Else{
Extend
}
Reduce remaining char to add
}

}

Sunday, July 3, 2016

#codingexercise
Find an element in a sorted and rotated array
Int getindex(list<int> a, int start, int end, int val)
{
If (end < start) return -1;
If (a[start] <= a[end])
 Return binary_chop(a, start, end, val);
If (end == start +1){
If (a[start] == val) return start;
If (a[end] == val) return end;
Return -1;
}
Int mid = (start+end)/2;
int left = getindex(a, start,mid, val);
If (left != -1) return left;
Int right = getindex(a, mid+1, end, val);
If (right != -1) return right;
Return -1;
}
Find the minimum element in a sorted rotated array
Int getmin(list<int> a, int start, int end)
{
If (end < start) return a[0];
If (end == start) return a[start];
Int mid = (start +end)/2;
If (a[mid+1]<a[mid]) return a[mid+1];
If (a[mid]<a[mid-1]) return a[mid];
If (a[start] < a[mid]) return getmin(a,start,mid);
Return getmin(a, mid+1,end);
}
// alternative getindex using getmin
Int findindex(list<int> a, int val);
{
Int min = getmin(a, 0, a.count-1);
Int mindex = a.indexOf(min);
Int left = binary_chop(a,0, mindex,val);
If (left != -1) return left;
Int right = binary_chop(a, mindex+1,end, val);
If (right != -1) return right;
Return -1;
}
Int binary_chop(List<int> a, int start, int end, int val)
{
If (val < a[start] || val > a[end]) return -1;
If (a[start] == val) return start;
If(a[end] == val) return end;
Int mid = (start+end)/2;

Int left = binary_chop(a,0, mid,val);
If (left != -1) return left;
Int right = binary_chop(a, mid+1,end, val);
If (right != -1) return right;
Return -1;
}

Saturday, July 2, 2016

#codingexercise
Remove spaces from a string
Void removeSpaces(stringbuilder str)
{
Int count = 0;
For(int i =0; i < str.count; i++)
{
If (str[i] != ' '){
     Str[count] = str[i];
     Count++;
}
}
Str[count] ='/0';
}
Given a string that can be formed from the letters arranged as follows:
A B C D E
F G H  I  J
K L M N O
P Q  R  S T
U V  W X Y
Z
and with only the following positions

Void getPath(string s)
{
Int len = 0;
Int curx = 0;
Int cury = 0;
for( int i = 0; i < s.length; i++)
{
    Int nextx = (str[i] - 'A')/5;
    Int nexty = (str[i] - 'A')%5;
    While (curx > nextx){
        Curx--;
         len++;
     }
While (curx < nextx){
        Curx++;
         len++;
     }
While (cury < nexty){
        Cury++;
         len++;
     }
While (cury > nexty){
        Cury--;
         len++;
     }
}
Return len;
}
Get  longest Common Prefix in strings
String walkTrie(ref TrieNode root, int k)
{
Node cur = root;
Int index=0;
String prefix;
While(countChildren(cur, ref index) > k && isleaf(cur) == false){
Cur = cur.children[index];
 Prefix+= cur.data;

}
Return prefix;

}

Convert a binary tree into a binary search tree

void ConvertBTtoBST(ref Node root)
{
if (root == null) return null
var inorder = new List<int>();
InorderTraversal(root, ref inorder);
inorder.sort();
InorderToTree(inorder, ref root);
}

Friday, July 1, 2016

Given stock price of a company for some consecutive days. Need to find the maximum span of each day’s stock price. Span is the amount of days before the given day where the stock price is less than or equal to that of given day

void GetSpan(List<int> price, int n, ref List<int> span)
{
var st = new Stack<int>(); // stores indexes
st.push(0);
span[0] = 1;
for (int i = 1; i < n; i++)
{
while (!st.empty() && price[st.top()] < =price[i])
    st.pop();
span[i] = (st.empty()) ? (i+1) : (i - st.top());
st.push(i);

}
}
#Data day presentation topic
Backup services in a private cloud

A private cloud has several valuable assets in the form of virtual machine instances. A backup of these VMs helps with restore to point of time, disaster recovery and such other benefits. Backups are different from snapshots because they may include storage volumes and application data. Both backup and snapshots involve large files to the tune of  several hundred Gigabytes.

Many cloud providers offer their own backup products. However, a private cloud may be made up with more than one cloud provider. Moreover the private cloud may shift weight on its cloud vendors. It doesn't seem flexible to use one or the other of these cloud specific products. A combination of these products could work but each would introduce a new variable to the clients. Some commercial backup products successfully claim to provide a central solution against heterogenous clouds but they may impose their own restrictions on the backup activities. Consequently a Do-it-yourself backup product that can do all that we want ib a private cloud becomes the only satisfactory solution.
As is conventional with offerings from the private cloud, a service seems better than a product. In addition if this is written in a microservice model, it becomes much more manageable and tested given that this is a dedicated services
Moreover clients can niw call via command line, user interface or direct api calls. Each client offers a two kinds of back up service.  Backup service via scheduled jobs and an infrequent on demand push button service.  They differ in differentiating the workloads  from a customer and make it easier customer to realize their backups. A  
 frequent schedule alleviates the onus on the customer to take repeated actions and provides
differentiation to the workloads that can be siphoned off to automation.

#codingexercise
The GetSpan also has a simple compute only processing:
void GetSpan(List<int> price, int n, ref List<int> span)
{

for (int i = 0; i < n; i++)
{
for (int j =i; j >= 0; j--)
   if (price[j] <= price[i])
        span[i] = j-i+1;
}

}

Thursday, June 30, 2016

We were discussing maximum sum rectangle in a 2d array. There is a technique called Kadane's algorithm that we can use to find this sum. The algorithm returns the maximum sum  and stores the starting and ending indexes
int kadane(List<int>num, ref int start, ref int end, int n)
{
int sum = 0;
int max = INT_MIN;
end = -1;
int cur = 0,
for (int i = 0; i < n; i++)
{
  sum += num[i];
  if sum < 0){
     sum = 0;
     cur =  i + 1;
}else if (sum > max ){
   max = sum;
   start  = cur;
   end  = i;
}
}
if (end != -1)
    return max;
max = num[0];
start = 0;
end = 0;
// find the max element
for (int i = 1; i < n; i++)
  if (num[i] > max){
      max = num[i];
      start = i;
      end = i;
   }
return max;
}
This can also be written as
def max_subarray(A):
      max_ending_here = max_so_far = 0;
      for x in A:
           max_ending_here = max(0, max_ending_here+x)
           max_so_far = max(max_so_far, max_ending_here)
      return max_so_far
#courtesy online resources

2) Check if two strings are anagrams of each other

bool AreAnagrams(String first, String second)
{
var h1 = new HashTable()
var h2 = new HashTable()
for (int i =0;i < first.length; i++)
      h1[first[i]] +=1;
for (int i =0;i < second.length; i++)
      h2[second[i]] +=1;
return h1.IsEqual(h2);
}

Bool getRepeatedChars(string A)
{
var h1 = new HashTable()
for (int i =0;i < first.length; i++)
      h1[first[i]] +=1;
Var ret = new List<int>();
For (int i =0;i < h1.keys.length; i++)
{
If h1.vals[i] > 1
     Ret.add(keys[i]);
}
Return ret.sort();
}
We were discussing maximum sum rectangle in a 2d array. There is a technique called Kadane's algorithm that we can use to find this sum. The algorithm returns the maximum sum  and stores the starting and ending indexes
int kadane(List<int>num, ref int start, ref int end, int n)
{
int sum = 0;
int max = INT_MIN;
end = -1;
int cur = 0,
for (int i = 0; i < n; i++)
{
  sum += num[i];
  if sum < 0){
     sum = 0;
     cur =  i + 1;
}else if (sum > max ){
   max = sum;
   start  = cur;
   end  = i;
}
}
if (end != -1)
    return max;
max = num[0];
start = 0;
end = 0;
// find the max element
for (int i = 1; i < n; i++)
  if (num[i] > max){
      max = num[i];
      start = i;
      end = i;
   }
return max;
}
This can also be written as
def max_subarray(A):
      max_ending_here = max_so_far = 0;
      for x in A:
           max_ending_here = max(0, max_ending_here+x)
           max_so_far = max(max_so_far, max_ending_here)
      return max_so_far
#courtesy online resources

Wednesday, June 29, 2016

#codingexercises 
Write  a method to delete all the nodes from a binary tree that lie on a path whose sum from root to leaf is less than a given value K.  
Void delPathSums(Node root, int K, ref int cur) 
If (Root == null) return; 
Cur+= root.data; 
Int left = cur; 
Int right = cur; 
Root.left = delPathSums(root.left , K, ref int left); 
Root.right = delPathSums(root.right, K, ref int right); 
Cur += max(left, right); 
If (cur <k){ 
Tree_delete(root); 
Root = null; 
Return root; 
}
Merge k sorted arrays

List<int> mergeSorted(List<List<int>> sorted, int k, int m)
{
Var result = new List<int>();
Var cur = new List<int> (); // local index
For(int k =0; k <  m×k ; k++)
{
Var min = getMin(sorted, ref cur);
Result.add(min.first);
Cur[min.second]++;
}
Return result;
}

Tuple<int,int> GetMin(List<List<int>> sorted, ref List<int>cur)
{
var min = new List<int>();
for (int I = 0; I < cur.Count; i++)
{
if (cur[i] < sorted[i].Count)
min.Add(sorted[i][cur[i]]);
else{
min.Add(INT_MAX);
}
}
int val = min.min();
int index = min.index(x = > x == val);
return new Tuple<int, int>(val, index);
}

This could also be solved with a heap
List<int> mergeSortedWithHeap(List<List<int>> sorted, int k)
{
var ret = new List<int>();
var h = new Heap();
for (int i =0; i  < n*k; i++)
{
var root = h.getMin();
ret[i] = root.data;
if (root.localindex < n)
{
root.data = sorted[root.globalindex][root.localindex];
root.localindex += 1;
}
else root.element = INT_MAX;
h.replaceMin(root);
}
return ret;
}

Given a 2d array find the maximum sum rectangle
int FindMaxSum(int[,] nums, int row, int col)
{
int max = 0;
for (int i =0; i < row; i++)
  for (int j = 0; j < col; j++)
  {
     int cur = 0;
     var start = new Tuple<int, int> (i,j);
     // choose end to be one of the remaining points to the right and bottom of start
      for (int x = i; x < row; x++)
        for (int y = j; y < col; y++)
            if (x !==i && y !== j) 
                cur = GetSum(nums, i, j, x,y); 
                 if (cur > max) max = cur;
  }
}
return max;