Friday, April 19, 2019

The predicates on sequences 

Queries on sequences involve selection, projection and aggregation. In the case for selection, the number of sequences to select from, may be very large. A simple iteration-based selection itself can become very costly. Unless the range is restricted, selection can become sequential and slow. At the same time, partitioning the sequences can help parallelize the task. The degree of parallelism can be left to the system to decide based on range and the number of ranges assuming a fixed cost per range.  Selection can be helpful if we can discount sequences that do not have a particular element. Bloom filters help with this purpose. 

Queries for projection involve elements of the sequences or the attributes of the sequences. In these cases, it is helpful to use the elements as a column. Since the elements do not have to be limited to a small set, any collection to hold the elements encountered is sufficient. If the elements can be accessed by a well-known position in each sequence, it is helpful but this is rarely the case. Therefore, there is a transformation of a range of sequences in a matrix of elements which then makes it easier to operate on.  

A reverse lookup based on the inverted list of elements helps when there is a limited number of elements to process. All standard query operations may be performed on the lists against the elements. This is useful for aggregations such as counting. 

Aggregations can be performed using map-reduce method. Since they work on partitions, it is better than serial. Aggregations have the advantage that the results are smaller than the data so they consume less space. 

Prefix trees help with sequences comparisons based on prefixes. Unlike joins, were the values have to match, prefix trees help with unrelated comparisons. Prefix trees also determine the levels of the match between the sequences and this is helpful to determine how close two sequences are.  The distance between two sequences is the distance between the leaves of the prefix trees.  This notion of similarity measure is also helpful to give a quantitative metric that can be used for clustering. Common techniques for clustering involve assigning sequences to the nearest cluster and forming cohesive cluster by reducing the sum of squares of errors. 

#codingexercise
Node Remove(Node root, int k) 
{ 
If (root ==null) return null; 
If (k < root.data) 
Root.left = Remove(root.left, k); 
Else if (k > root.data) 
Root.right =  Remove(root.right, k); 
Else 
{ 
Node left = root.left; 
Node right = root.right; 
Delete root; 
If (!right) return left; 
Node min = FindMin(right); 
Min.right = Removemin(min); 
Min.left = left; 
Return Balance(min); 
} 
Return Balance(root); 

} 

No comments:

Post a Comment