Sunday, March 17, 2013

Longest common subsequence

When there is a recursive algorithm based on subproblems and the total number of subproblems is not too great, it is possible to cache all their solutions in a table. In the case of longest common subsequence detection, enumerating all the subsequences would take exponential time. Instead we can define a recursion based on the following:
Let c[i, j] be the length of the longest common subsequence of the prefixes X[1..i] and Y[1..j]. Recursion:
c[i, j] = 0                               if i = 0 or j = 0
c[i, j] = c[i-1, j-1] + 1                 if i, j > 0 and x_i = y_j
c[i, j] = max( c[i, j-1], c[i-1, j] )     otherwise
This way we reuse the solutions we have already computed. We compute the table c[i, j] bottom-up and also store a value b[i, j] that records whether c[i, j-1], c[i-1, j] or c[i-1, j-1] + 1 gave the optimum. We then recover the longest common subsequence by walking backwards along the path recorded in b from the last cell to the first.
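To make the bottom-up computation concrete, here is a minimal Python sketch; the function name lcs, and recovering the subsequence directly from the c table instead of keeping a separate b table, are my own illustrative choices.

def lcs(X, Y):
    # Longest common subsequence of sequences X and Y via the bottom-up table.
    m, n = len(X), len(Y)
    c = [[0] * (n + 1) for _ in range(m + 1)]   # c[i][j] = LCS length of X[:i], Y[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if X[i - 1] == Y[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
            else:
                c[i][j] = max(c[i][j - 1], c[i - 1][j])
    # Walk backwards from the last cell to recover one longest common subsequence.
    out, i, j = [], m, n
    while i > 0 and j > 0:
        if X[i - 1] == Y[j - 1]:
            out.append(X[i - 1])
            i, j = i - 1, j - 1
        elif c[i - 1][j] >= c[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]

print(lcs("ABCBDAB", "BDCABA"))   # prints one longest common subsequence, of length 4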

Saturday, March 16, 2013

decision trees


From the comparisons made by a sorting algorithm, we can derive a decision tree. A decision tree can be visualized as a node for each comparison with left and right sub-trees for the yes and no outcomes. Each execution of the algorithm traces a downward path in the tree. Worst-case time is lower-bounded by the longest downward path (the height) of this binary tree. A tree of height h has at most 2^h leaves, so if a tree has L leaves, its height is at least log L.
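To connect the leaf count back to sorting, a standard step worth making explicit: a comparison sort of n elements must be able to produce every one of the n! orderings, so its decision tree has at least n! leaves, and therefore

\[
h \;\ge\; \log_2(n!) \;\ge\; \log_2\!\left(\left(\tfrac{n}{2}\right)^{n/2}\right) \;=\; \tfrac{n}{2}\log_2\tfrac{n}{2} \;=\; \Omega(n \log n).
\]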

Greedy algorithms have no general recipe. In general, a greedy algorithm assumes there is an objective function f(x1, x2, ..., xn) to optimize (say maximize) that depends on a sequence of choices. Second, there is a way to estimate the contribution of each choice to the final value, without taking into account how that choice will constrain later choices. The algorithm is greedy if it always makes the choice with the best estimated contribution. A greedy algorithm is frequently not optimal, but sometimes it is; one classic case where it is optimal is sketched below.
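A concrete instance (my own illustrative example, not from the material above) is activity selection: the objective is the number of non-overlapping activities chosen, and the greedy choice is "earliest finishing time first", which in this particular problem is provably optimal.

def select_activities(activities):
    # Greedy activity selection: repeatedly pick the activity that finishes
    # earliest and discard everything that overlaps it.
    chosen = []
    last_finish = float("-inf")
    for start, finish in sorted(activities, key=lambda a: a[1]):
        if start >= last_finish:          # does not overlap what we already chose
            chosen.append((start, finish))
            last_finish = finish
    return chosen

print(select_activities([(1, 4), (3, 5), (0, 6), (5, 7), (8, 9), (5, 9)]))
# -> [(1, 4), (5, 7), (8, 9)]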

There are algorithmic problems for which nothing is better than trying out all possibilities. This is especially practical when the number of possibilities is bounded. For example, we can use exhaustive search to find a maximum independent set in a graph. An independent set is a set of vertices no two of which are joined by an edge, and a maximum independent set is a largest such set. The decision version of the independent set problem is NP-complete, so an exponential-time search is not surprising; a brute-force sketch follows.
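A minimal brute-force sketch, assuming the graph is small and is given as a vertex list plus an edge list (the names and representation are my own):

from itertools import combinations

def max_independent_set(vertices, edges):
    # Try every subset of vertices, largest first, and return the first one
    # containing no edge - exponential, but fine for small graphs.
    edge_set = {frozenset(e) for e in edges}
    for size in range(len(vertices), 0, -1):
        for subset in combinations(vertices, size):
            if all(frozenset(pair) not in edge_set
                   for pair in combinations(subset, 2)):
                return set(subset)
    return set()

# Path graph 1-2-3-4: one maximum independent set is {1, 3}.
print(max_independent_set([1, 2, 3, 4], [(1, 2), (2, 3), (3, 4)]))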

Courtesy: study material uploaded by Peter Gacs, Boston University.

Friday, March 15, 2013

How do you design a text mining system?


We begin by encapsulating the core concept of a text mining system, which is a concordance. Next we support a management layer over the concordance. We streamline insertions, deletions and updates. We encapsulate every parameter the system depends on. We build supportability and stack trace capture for diagnosis. We write unit-tests side by side with the implementation. Sounds exciting.
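A minimal sketch of the concordance idea, assuming a simple in-memory map from term to the (document, position) pairs where it occurs; the class name and structure are my own illustration, not the production design.

from collections import defaultdict

class Concordance:
    # Maps each term to the (doc_id, position) pairs where it occurs.
    def __init__(self):
        self.index = defaultdict(list)

    def insert(self, doc_id, text):
        for position, term in enumerate(text.lower().split()):
            self.index[term].append((doc_id, position))

    def remove(self, doc_id):
        for term in list(self.index):
            self.index[term] = [(d, p) for d, p in self.index[term] if d != doc_id]
            if not self.index[term]:
                del self.index[term]

    def lookup(self, term):
        return self.index.get(term.lower(), [])

c = Concordance()
c.insert(1, "the quick brown fox")
c.insert(2, "the lazy dog")
print(c.lookup("the"))   # [(1, 0), (2, 0)]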

Thursday, March 14, 2013

Red black trees revisited

Among the data structures in a computer science textbook, red-black trees possibly capture the imagination most, with their colorful transformations during rotations and the recolorings of sub-trees.
So let's revisit this data structure and the insert and delete operations for these.
For a red-black tree:
1)  Every node is either red or black.
2) The root node is black.
3) The sentinel nodes are black.
4) If a node is red, then both its children are black.
5) For each node, all paths from that node down to the leaves have the same number of black nodes.

Left Rotate operation: the right child becomes the parent, and the left subtree of this child becomes the right subtree of the displaced node.

Right Rotate operation: the left child becomes the parent, and the right subtree of this child becomes the left subtree of the displaced node.
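A sketch of left rotation in Python, assuming nodes carry left, right and parent pointers and the tree uses a nil sentinel; right rotation is symmetric with left and right exchanged.

def left_rotate(T, x):
    # Make x's right child y the new parent of x; y's left subtree becomes
    # x's right subtree.
    y = x.right
    x.right = y.left
    if y.left is not T.nil:
        y.left.parent = x
    y.parent = x.parent
    if x.parent is T.nil:            # x was the root
        T.root = y
    elif x is x.parent.left:
        x.parent.left = y
    else:
        x.parent.right = y
    y.left = x
    x.parent = y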

Insert operation: goes very much like a binary search tree insert followed by an RB-tree fixup. To insert a node z into a tree T, we walk down the tree with x starting at the root and y trailing as x's parent. Then we append z as the left or right child of y and color the new node z red. Then we fix up the tree. Fixup iterates as long as z's parent is red. If z's uncle is red, we recolor the nodes and move the pointer up the tree. If z and its parent are both red but z's uncle is black, then if z is a right child of z.p we perform a left rotation about z.p to reduce to the remaining case; when z is a left child of z.p, we recolor z.p and z's grandparent and perform a right rotation about the grandparent.
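A sketch of this fixup loop in Python, assuming RED and BLACK color constants, parent pointers, a nil sentinel, and the rotation routines above; the two halves mirror each other with left and right exchanged.

def rb_insert_fixup(T, z):
    while z.parent.color == RED:
        if z.parent is z.parent.parent.left:
            y = z.parent.parent.right            # z's uncle
            if y.color == RED:                   # case 1: recolor and move z up
                z.parent.color = BLACK
                y.color = BLACK
                z.parent.parent.color = RED
                z = z.parent.parent
            else:
                if z is z.parent.right:          # case 2: reduce to case 3
                    z = z.parent
                    left_rotate(T, z)
                z.parent.color = BLACK           # case 3: recolor and rotate
                z.parent.parent.color = RED
                right_rotate(T, z.parent.parent)
        else:
            # same as above with "left" and "right" exchanged
            y = z.parent.parent.left
            if y.color == RED:
                z.parent.color = BLACK
                y.color = BLACK
                z.parent.parent.color = RED
                z = z.parent.parent
            else:
                if z is z.parent.left:
                    z = z.parent
                    right_rotate(T, z)
                z.parent.color = BLACK
                z.parent.parent.color = RED
                left_rotate(T, z.parent.parent)
    T.root.color = BLACK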

Delete operation: also goes very much like a binary search tree delete followed by an RB-tree fixup. If z is the node to be deleted from a tree T, we maintain a node y as the node either to be removed or to be moved within the tree, and we remember y's original color since it may change. We also maintain x as the node that moves into y's original position, so that we can transplant it. If z has fewer than two children, y is z itself and x is z's sole child. If z has both children, we choose y as the tree minimum of z's right subtree and x as y's right child, transplant x in place of y, then transplant y in place of z and give y z's color. If the original color of y was black, we perform the RB-tree fixup for delete. We begin the fixup with x and iterate while x is not the root and x's color is black. In each iteration we take w as x's sibling and perform rotations and recolorings. Suppose x is a left child. If w is red, we color w black, color x's parent red and left-rotate x's parent, so that the new sibling is black. If both of w's children are black, we simply color w red and move x up to its parent. Otherwise, if only w's right child is black, we color w's left child black, color w red, right-rotate w and take the new sibling to the right of x's parent. Finally, if w is black and w's right child is red, we give w the color of x's parent, color both x's parent and w's right child black, left-rotate x's parent and set x to the root to end the loop. Since we started with x being a left child, the same cases apply when x is a right child with left and right exchanged.
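A sketch of the delete fixup loop under the same assumptions (RED/BLACK constants, nil sentinel, parent pointers and the rotation routines above):

def rb_delete_fixup(T, x):
    while x is not T.root and x.color == BLACK:
        if x is x.parent.left:
            w = x.parent.right                   # x's sibling
            if w.color == RED:                   # case 1: make the sibling black
                w.color = BLACK
                x.parent.color = RED
                left_rotate(T, x.parent)
                w = x.parent.right
            if w.left.color == BLACK and w.right.color == BLACK:
                w.color = RED                    # case 2: recolor and move x up
                x = x.parent
            else:
                if w.right.color == BLACK:       # case 3: reduce to case 4
                    w.left.color = BLACK
                    w.color = RED
                    right_rotate(T, w)
                    w = x.parent.right
                w.color = x.parent.color         # case 4: fix and terminate
                x.parent.color = BLACK
                w.right.color = BLACK
                left_rotate(T, x.parent)
                x = T.root
        else:
            # same as above with "left" and "right" exchanged
            w = x.parent.left
            if w.color == RED:
                w.color = BLACK
                x.parent.color = RED
                right_rotate(T, x.parent)
                w = x.parent.left
            if w.right.color == BLACK and w.left.color == BLACK:
                w.color = RED
                x = x.parent
            else:
                if w.left.color == BLACK:
                    w.right.color = BLACK
                    w.color = RED
                    left_rotate(T, w)
                    w = x.parent.left
                w.color = x.parent.color
                x.parent.color = BLACK
                w.left.color = BLACK
                right_rotate(T, x.parent)
                x = T.root
    x.color = BLACK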

Data Structures

The heap data structure is very useful for organizing data for constant time retrieval of the maximum or minimum entry and logarithmic time insertion and extraction. The max-heapify method works like this: you take the left child, the right child and the root and find the maximum among the three. If the largest is not the root, you swap it with the root and recurse down into the subtree of the swapped node.

Insertion is also logarithmic: the new element is appended as the last leaf and floated up toward the root.
To build a heap from an unordered array, heapify the subtree rooted at each index from N/2 down to 1 (indices above N/2 are leaves).
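A minimal max-heap sketch in Python over a 0-based list (the prose above uses 1-based indices; the function names are my own):

def max_heapify(a, i, n):
    # Sift a[i] down within a[0:n] until the max-heap property holds.
    largest = i
    left, right = 2 * i + 1, 2 * i + 2
    if left < n and a[left] > a[largest]:
        largest = left
    if right < n and a[right] > a[largest]:
        largest = right
    if largest != i:
        a[i], a[largest] = a[largest], a[i]
        max_heapify(a, largest, n)

def build_max_heap(a):
    # Heapify from the last internal node down to the root.
    for i in range(len(a) // 2 - 1, -1, -1):
        max_heapify(a, i, len(a))

data = [4, 1, 3, 2, 16, 9, 10, 14, 8, 7]
build_max_heap(data)
print(data[0])   # 16, the maximum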

Wednesday, March 13, 2013

web API and mobile application clients

In my experience with working on a retail company's API for launching their mobile applications on Android and iOS devices, there were a few items worth recalling:
1) The APIs were external facing and hence they required authentication and authorization over the internet.
2) The APIs could be signed on to from mobile applications as well as desktop applications.
3) The APIs supported both encrypted and unencrypted access over the Internet.
4) The APIs were versioned and had a versioning policy.
5) The APIs were categorized based on services they offered. They exposed resources for these services.
6) The APIs had a cache for improving performance. The cache used URI hashing as well as periodic refreshes from data sources.
7) The APIs were published through a web proxy that provided monitoring services.
8) The APIs were called by clients with different API keys.
9) The APIs were implemented to access data from different data sources.
10) The APIs had SLAs defined and honored. These covered timeouts and availability.
11) The APIs were performance tested with different workloads.
12) The API URIs were qualified with the resource and query parameters necessary for shopping.

Tuesday, March 12, 2013

keyword extraction using naive Bayes

Keywords can be considered local to the document they appear in. Consequently, keywords are characterized not only by term frequency but also by their appearance in a given document as opposed to others. This has been utilized in papers such as Yasin et al. on keyword extraction using naive Bayes to identify whether a word belongs to the class of ordinary words or keywords. The metric is called TFxIDF, which combines Term Frequency and Inverse Document Frequency: TF*IDF(W, D) = P(word in D is W) x [ -log P(W appears in a document) ]. Assuming feature values are independent, a naive Bayes classifier has been proposed in this paper with the following model:
P(key | T, D, PT, PS) = P(T|key) x P(D|key) x P(PT|key) x P(PS|key) x P(key) / P(T, D, PT, PS), where P(key) denotes the prior probability that the word is a key, P(T|key) denotes the probability of having TFxIDF score T given the word is a key, P(D|key) denotes the probability of having distance D to the previous occurrence of the same word given the word is a key, and P(PT|key) and P(PS|key) are defined analogously for the remaining two features PT and PS.
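A toy Python sketch of the scoring step, assuming the per-feature likelihoods and the prior have already been estimated from training data and the feature values have been discretized into bins; all names, bins and numbers below are my own illustration, not the paper's implementation.

import math

def tf_idf(word, doc, docs):
    # TF*IDF(W, D) = P(word in D is W) * -log P(W appears in a document),
    # with doc a list of tokens and docs the whole collection.
    tf = doc.count(word) / len(doc)
    df = sum(1 for d in docs if word in d) / len(docs)
    return tf * -math.log(df) if df > 0 else 0.0

def key_score(features, likelihoods, prior_key):
    # Unnormalized naive Bayes score P(T|key)P(D|key)P(PT|key)P(PS|key)P(key).
    # The evidence P(T, D, PT, PS) is the same for both classes, so it can be
    # dropped when ranking candidate keywords.
    score = prior_key
    for name, value in features.items():
        score *= likelihoods[name].get(value, 1e-6)   # smooth unseen bins
    return score

# Hypothetical discretized features for one candidate word.
features = {"T": "high", "D": "near", "PT": "early", "PS": "first"}
likelihoods = {
    "T": {"high": 0.6, "low": 0.1},
    "D": {"near": 0.5, "far": 0.2},
    "PT": {"early": 0.4, "late": 0.3},
    "PS": {"first": 0.5, "rest": 0.2},
}
print(key_score(features, likelihoods, prior_key=0.05))   # 0.003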