Saturday, May 24, 2014

Friday, May 23, 2014

We explore pairwise relationships between words for text indexing and summarization using graph theory. Documents are often treated as a bag of words and represented by term vectors of limited dimensions. These vectors are then clustered to extract co-occurrences or semantic relationships between terms in a document, helping with keyword extraction, summarization, and classification. Such techniques have relied on latent and probabilistic latent semantic analysis. However, the words in a text do not only represent the content of a document; they also bear rankings and relationships with each other that can be represented with graphs. Graphs enable the discovery of connected salient words in a bag. While similar in nature to clustering, and occasionally used together with it, graphs can model both syntax and semantics, such as typed features and hierarchical ontologies. However, the focus of this article is the use of a graph to connect words within a document and to retrieve the salient words based on edge weights and minimum spanning trees. The assignment of weights enables keyword extraction, and the spanning of those keywords enables summarization.
First we lay out some premises on which we will model our approach. Text can be verbose or terse yet carry the same import. The same text can have more than one nuance, requiring disambiguation either with context or with categorization. Text need not have meaning and can be a clutter of jargon with no equivalent synonyms. That is why, no matter how we represent the semantics of the text, the model should be anchored to the terms in the text. In other words, we take the keywords from within the document even if we represent their semantic model with alternate words. We focus exclusively on a given language and leave the variations that arise in syntax or expression out of the scope of the samples we investigate. We take simple samples at first - say blog posts regarding a technical topic. We pay attention to the format of the text, in that we look at syntax if it helps our model, but we don't rely on distribution or clustering. We continue to treat the document as a collection of terms without emphasis on where the text is tiled, segmented, or paragraphed. We ignore the layout of the text and focus instead on the weights of the terms, their densities, and the divergence of topics. Initially we begin with a single graph for our document, while recognizing the possibility of several sub-graphs in a document. We will attempt to find a model and we will define metrics for our graph. For example, if we have two sets of keywords constituting two different graphs for the same document, we will keep the one that has the smaller minimum spanning tree. We know there are uses for our graph, but first we establish the criteria for constructing one.
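As a toy illustration of that criterion, here is a sketch of Prim's algorithm computing the minimum spanning tree weight of a small keyword graph. The keywords and co-occurrence weights below are invented for the example; in practice the weights would come from the term statistics discussed above.

```c
#include <limits.h>

#define N 5   /* number of keywords in this toy graph */

/* Hypothetical keywords; the weights below are invented co-occurrence
   distances (lower = more closely related). */
static const char *words[N] = { "graph", "keyword", "summary", "weight", "tree" };

/* Adjacency matrix; 0 means no edge. */
static int w[N][N] = {
    { 0, 2, 0, 6, 0 },
    { 2, 0, 3, 8, 5 },
    { 0, 3, 0, 0, 7 },
    { 6, 8, 0, 0, 9 },
    { 0, 5, 7, 9, 0 },
};

/* Prim's algorithm: grow the tree from vertex 0, always adding the
   cheapest edge out of the tree; returns the total MST weight. */
int mstWeight(void)
{
    int dist[N], inTree[N] = { 0 };
    int total = 0;
    for (int i = 0; i < N; i++)
        dist[i] = INT_MAX;
    dist[0] = 0;
    for (int k = 0; k < N; k++) {
        int u = -1;
        for (int v = 0; v < N; v++)     /* nearest vertex not yet in the tree */
            if (!inTree[v] && (u == -1 || dist[v] < dist[u]))
                u = v;
        inTree[u] = 1;
        total += dist[u];
        for (int v = 0; v < N; v++)     /* relax edges leaving u */
            if (w[u][v] && !inTree[v] && w[u][v] < dist[v])
                dist[v] = w[u][v];
    }
    return total;
}
```

With these weights the MST picks graph-keyword (2), keyword-summary (3), keyword-tree (5), and graph-weight (6), for a total of 16; between two candidate keyword graphs for the same document we would keep the one with the smaller such total.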

Thursday, May 22, 2014

When we discussed radix sort, we saw that it works well for keys varying over a bigger range. If the keys fall in a smaller range, then bucket sort will do. Bucket sort works something like this:
we make an array of r buckets
we make one pass through the records and insert the record with key k into bucket k
we make one pass through the array of buckets and read the records off in order.
The idea behind bucket sort is that we split the interval of the range of values into several buckets so that the records are randomly distributed across the buckets. With auxiliary storage to hold the items in a linked list for each bucket, we rely on the fact that no one bucket will have a large number of elements, so sorting each bucket is inexpensive and the overall cost stays linear.
The algorithm is as follows:
void bucketSort(A, n)
{
    for (int i = 0; i < n; i++)
    {
        Insert(A[i], B[A[i]]);
    }
    for (int j = 0; j < r; j++)
    {
        InsertionSort(B[j]);
    }
    Concatenate(B);
}
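A runnable version of this sketch might look as follows. It assumes keys are non-negative integers less than R, with each bucket a fixed-size array standing in for the linked list mentioned above; R and MAXN are arbitrary bounds chosen for the example.

```c
#define R 10      /* keys are assumed to lie in [0, R) */
#define MAXN 100  /* at most MAXN records per bucket in this sketch */

void bucketSort(int A[], int n)
{
    int bucket[R][MAXN];
    int count[R] = { 0 };
    for (int i = 0; i < n; i++)          /* one pass: record with key k goes to bucket k */
        bucket[A[i]][count[A[i]]++] = A[i];
    int k = 0;
    for (int j = 0; j < R; j++)          /* read the buckets off in order */
        for (int i = 0; i < count[j]; i++)
            A[k++] = bucket[j][i];
}
```

Since the keys here equal the bucket indices, no per-bucket sort is needed; with records carrying payloads we would insertion-sort each bucket as in the pseudocode above.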
The complexity of bucket sort is O(n + r).
In bucket sort we make an r-way decision at each step with the use of indirect addressing, which is a significant optimization over the comparison model.
Radix sort can work with a bigger range of r, i.e., r can be of the order of 10 to 256.
In radix sort the complexity is slightly different. If the strings are of length L, and we spend O(n) time at each level, then the complexity is O(nL). It could end a lot earlier due to empty buckets.
In most significant digit or character first radix sort, we distinguish the strings by just their first few characters. If we now ask how small L can be, given that we distinguish only by the first few characters, we just have to use the r-way split instead of the two-way split. The lower bound is then log_r(n), and if all the strings are roughly this length, then the complexity is O(n log_r(n)).
In the least significant first radix sort, the complexity is again different from most significant first.
Here the running time is O(L(r + n)), because we concatenate L times and at each stage the sorting is a bucket sort, which is O(r + n); when r is small relative to n each pass is O(n), giving the overall complexity O(nL).
The advantage of implementing least-significant-first radix sort is that it doesn't need any recursion.
At implementation time we might want to consider padding all strings to the right. This makes the strings comparable, and the padding does not affect the comparison value of a string, since the whitespace character has a lower value than the first literal 'a'. Also, space-optimized tries may require slightly more processing to navigate the pointers in the keys.
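Putting the padding idea together with the least-significant-first scheme, here is a sketch in C. MAXN and MAXL are arbitrary bounds for the example, all strings are assumed padded to the same length L, and each pass is a stable counting sort on one character position, playing the role of the bucket sort:

```c
#include <string.h>

#define RADIX 256   /* one bucket per byte value */
#define MAXN 64
#define MAXL 16

/* Least-significant-character-first radix sort for n strings of length L
   (shorter strings padded to the right with low-valued characters). */
void lsdRadixSort(char A[][MAXL], int n, int L)
{
    char B[MAXN][MAXL];
    for (int d = L - 1; d >= 0; d--) {
        int count[RADIX + 1] = { 0 };
        for (int i = 0; i < n; i++)              /* histogram of character d */
            count[(unsigned char)A[i][d] + 1]++;
        for (int r = 0; r < RADIX; r++)          /* running totals -> start offsets */
            count[r + 1] += count[r];
        for (int i = 0; i < n; i++)              /* stable distribution into B */
            memcpy(B[count[(unsigned char)A[i][d]]++], A[i], MAXL);
        for (int i = 0; i < n; i++)              /* copy back for the next pass */
            memcpy(A[i], B[i], MAXL);
    }
}
```

Each pass costs O(n + r) and there are L passes, matching the O(L(r + n)) bound above; the stability of the counting sort is what lets the least-significant-first order work.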

Wednesday, May 21, 2014

Tonight we will look at a data structure called trie.
A trie is an ordered tree data structure that can store a dynamic set or an associative array where the keys are usually strings. We search by prefixes, and hence tries are also referred to as prefix trees or radix trees. In a binary search tree, a node keeps the keys associated with that node. In a trie, this is not the case. What we use is the position of the node in the tree. This position is used to construct the prefix. All the descendants of a node have a common prefix of the string associated with that node, and the root is associated with the empty string. Values are normally not associated with every node, only with leaves and some inner nodes. This lends the trie to be stored in a space-optimized manner, in which case it is referred to as a compact prefix tree.
The term trie comes from retrieval. The use of prefixes implies that we can view the trie as a deterministic finite automaton without loops. The transitions from one node to another are forward only.
While characters can be used to represent the keys, this is not always necessary. They can instead be stored as bits, in which case we have a bitwise trie. A fixed size of bits, such as an integer or a memory address, is convenient for storage.
A trie is very often used as a replacement for a hash table. There are several advantages. First, lookup is faster because it is prefix based, which means we traverse down from the root in O(m) time, where m is the length of the key. Insertions and deletions are cheap.
There is no chaining and consequently no collisions.
This comes in very useful for applications such as autocomplete and spell checkers.
Tries can provide automatic ordering of keys, which comes in useful for autocomplete or predictive text. They also come in useful for approximation algorithms.
We could take radix sort as an example.
Tries are to radix sort what binary search trees are to quicksort.
In particular, each node has an array of r child pointers. When we store words in a trie, we keep the letters on the edges and walk down the tree reading off the words.
When we insert, we end each word with a $, because some strings are substrings of others. To do a lookup, we walk down the tree letter by letter and then check whether the node has the ending $ for the word. If we reach a null pointer, we know that the key is not there.
There are a couple of ways of making the trie more practical. In particular, we can:
compress paths that don't branch into a single edge
and only split when needed. These techniques optimize storage.
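The basic (uncompressed) trie described above can be sketched as follows; this version assumes lowercase keys only, the isWord flag plays the role of the $ terminator, and the names trieNew, trieInsert, and trieLookup are our own:

```c
#include <stdlib.h>

#define R 26   /* child pointers per node: lowercase letters in this sketch */

typedef struct Trie {
    struct Trie *child[R];
    int isWord;              /* plays the role of the $ end marker */
} Trie;

Trie *trieNew(void)
{
    return calloc(1, sizeof(Trie));
}

/* Walk down from the root letter by letter, creating nodes as needed,
   and mark the final node as a complete word. */
void trieInsert(Trie *root, const char *key)
{
    for (; *key; key++) {
        int c = *key - 'a';
        if (root->child[c] == NULL)
            root->child[c] = trieNew();
        root = root->child[c];
    }
    root->isWord = 1;
}

/* Lookup is O(m) in the key length; a null pointer means the key is absent,
   and a node without the end marker means the key is only a prefix. */
int trieLookup(const Trie *root, const char *key)
{
    for (; *key; key++) {
        root = root->child[*key - 'a'];
        if (root == NULL)
            return 0;
    }
    return root->isWord;
}
```

Note how a stored word that is a prefix of another ("ca" vs "cat") is distinguished purely by the end marker, exactly as with the $ convention above.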
Radix sort comes in different flavors. For example, there's the most significant digit/character first and the least significant digit/character first radix sort.
In the most-significant-first radix sort, we sort by the most significant digit/character to get the buckets. Then we recursively sort each bucket, starting with the next most significant digit/character, and so on.
In the least-significant-first radix sort, we sort by the least significant digit/character to get the buckets. Then we concatenate them and repeat with the next position. This method is trickier to follow but easy to implement. The reason we can go from the least significant position onwards is that each stage of the sorting is stable, i.e., it preserves the relative order produced by the previous iteration. We can prove by induction that after the pass on position d, the records are sorted on their last d positions.
int digit(int x, int d);
void InsertionSort(int A[], int n, int d);

void RadixSort(int A[], int n, int digits)
{
    for (int d = 1; d <= digits; d++)
        InsertionSort(A, n, d);
}

/* digit: the d-th decimal digit of x, d = 1 being the least significant */
int digit(int x, int d)
{
    while (--d > 0)
        x /= 10;
    return x % 10;
}

/* stable insertion sort on digit d only; the strict > comparison
   preserves the order produced by the previous pass */
void InsertionSort(int A[], int n, int d)
{
    for (int i = 1; i < n; i++)
    {
        int x = A[i];
        int j = i - 1;
        while (j >= 0 && digit(A[j], d) > digit(x, d))
        {
            A[j + 1] = A[j];
            j = j - 1;
        }
        A[j + 1] = x;
    }
}

Let us discuss another Splunk application. We referred to a Splunk app that integrates Log Parser. We will discuss it a little more today.
Log Parser search queries, as we know, are SQL queries, whereas Splunk takes Unix-like search operators. Moreover, Splunk exposes a query language for its indexed data, whereas Log Parser can work with a variety of data sources. They say data is sticky, meaning that users like to keep using the same applications they have been using for their data unless there is a compelling advantage or the cost and risk of moving the data is negligible. In order for Splunk to work with data that belonged to Log Parser, let us look into translating queries to Log Parser so that we can avoid the data import.

I will return to this shortly, but I will digress to discuss an interesting data structure called the trie.

Tuesday, May 20, 2014

We will look at a few more examples of this regular expression search. We looked at how match called matchhere, and matchhere in turn was recursive or called matchstar. We will now look at, say, grep functionality.
grep is implemented this way: we read strings from a file a buffer at a time and null-terminate the buffer. We run match over this buffer, increment the match count, print the matching line, and continue.
In other words, grep is a continuous search of the match over the entire file, a buffer at a time.
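Following that description, a minimal grep might look like the sketch below; the match routine from these posts is reproduced here so the sketch stands alone, and grep is our own helper name:

```c
#include <stdio.h>
#include <string.h>

int matchhere(char *re, char *text);
int matchstar(int c, char *re, char *text);

/* The Kernighan-Pike matcher, reproduced so this compiles on its own. */
int match(char *re, char *text)
{
    if (re[0] == '^')
        return matchhere(re + 1, text);
    do {                                   /* must look at empty string */
        if (matchhere(re, text))
            return 1;
    } while (*text++ != '\0');
    return 0;
}

int matchhere(char *re, char *text)
{
    if (re[0] == '\0')
        return 1;
    if (re[1] == '*')
        return matchstar(re[0], re + 2, text);
    if (re[0] == '$' && re[1] == '\0')
        return *text == '\0';
    if (*text != '\0' && (re[0] == '.' || re[0] == *text))
        return matchhere(re + 1, text + 1);
    return 0;
}

int matchstar(int c, char *re, char *text)
{
    do {                                   /* a * matches zero or more instances */
        if (matchhere(re, text))
            return 1;
    } while (*text != '\0' && (*text++ == c || c == '.'));
    return 0;
}

/* A minimal grep: read the file a buffer (line) at a time, null-terminate it,
   run match over the buffer, count and print the matching lines. */
int grep(char *re, FILE *f, char *name)
{
    char buf[BUFSIZ];
    int nmatch = 0;
    while (fgets(buf, sizeof buf, f) != NULL) {
        buf[strcspn(buf, "\n")] = '\0';    /* null-terminate at the newline */
        if (match(re, buf)) {
            nmatch++;
            if (name != NULL)
                printf("%s:", name);
            printf("%s\n", buf);
        }
    }
    return nmatch;
}
```

This reads a line at a time with fgets rather than a raw buffer, which keeps the null-termination step simple.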
Now if we want matchstar to implement the leftmost longest match, instead of stopping at the first match we let the * consume as much of the text as it can and then back off until the rest of the pattern matches.
Suppose we were to implement the ^ and the $ handling as well.
In this case, we will have to watch out for the following cases:
^ - match at the beginning
$ - match at the end
if both ^ and $ surround a literal, then the string must match that literal exactly from beginning to end.
if both ^ and $ are specified and there is no literal between them, then we match even empty strings.
literals before ^ and after $ are not permitted.
Let us now look at a way to use a regex for validating a phone number. Note that phone numbers come in different forms, some with brackets, others with or without hyphens, and some missing the area code.
Given a ten-digit US phone number, where the three-digit area code can be optional, the regex would look something like this:
^(\(\d{3}\)|\d{3}[-]?)?\d{3}[-]?\d{4}$
Here the first character in the regex, the caret, anchors the match at the start of the string.
The first ( opens a group, while \( and \) match the literal parentheses around an optional area code.
\d matches a digit, repeated the number of times specified in the { } braces.
The | bar indicates that the area code may be present either with parentheses or without.
The ? after the group makes the whole area code optional, while each [-]? allows an optional hyphen, and $ matches the end of the line.

Monday, May 19, 2014

Regular Expressions provide a compact and expressive notation for describing patterns of text. Their algorithms are interesting and are helpful in a variety of applications. They are credited with making scripting an alternative to programming. They come in several flavors, but almost all use wildcards, which are special notations.
Some common wildcards include the following:
* for zero or more occurrences of the preceding character
. to denote any single character
$ for the end of a string
^ for the beginning of a string
^$ matches an empty string
[] to include a specific set of characters
+ for one or more occurrences of the preceding character
? for zero or one occurrence
\ as a prefix to quote a metacharacter
These building blocks are combined with parentheses for grouping.

Even if we take just the four elementary regular expression wildcards ^, $, . and *, we can use them for a modest regular expression search function. Kernighan and Pike showed that with functions for match and matchhere this can be simple and easy. match determines whether a string matches a regular expression. matchhere checks whether the text matches at each position in turn. As soon as we find a match, the search is over. If the expression begins with a ^, the text must begin with a match of the remainder of the pattern. Otherwise the characters are traversed one by one, invoking matchhere at each position. Even empty text needs to be examined, because a * can match the empty string.
Example 1:
Regular Expressions

by Brian W. Kernighan and Rob Pike
/* match: search for re anywhere in text */
int match(char *re, char *text)
{
   if (re[0] == '^')
      return matchhere(re+1, text);
   do {  /* must look at empty string */
      if (matchhere(re, text))
         return 1;
   } while (*text++ != '\0');
   return 0;
}

The matchhere method searches for the regular expression at the beginning of the text. We first check for the end of the pattern, which matches everything, including the empty string. Then we check whether the next pattern character is an asterisk, in which case we call matchstar to see how much of the text it can match; the delimiter there is either the null terminator or a character that doesn't match when the expression isn't a '.'. Next we check for a $ at the end of the pattern, which requires the null terminator in the text. Lastly, if there is more text whose character matches the pattern's (or the pattern has a '.'), we advance both and call matchhere again.

The matchstar method matches zero or more instances by calling matchhere on the text one character at a time, until the leading character no longer matches (unless that character is a dot, which matches any character) or the end of the text is reached.

The matchhere method calls the matchstar method in one of the four cases that it checks. For the other cases, it either returns a boolean corresponding to the end of the pattern or the end of the text being reached, or it calls itself recursively when the current character in the regular expression is a period or a literal match for the text.
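For reference, the two companion routines as Kernighan and Pike give them are:

```c
int matchstar(int c, char *re, char *text);

/* matchhere: search for re at beginning of text */
int matchhere(char *re, char *text)
{
    if (re[0] == '\0')                     /* end of pattern: a match */
        return 1;
    if (re[1] == '*')                      /* c* : defer to matchstar */
        return matchstar(re[0], re + 2, text);
    if (re[0] == '$' && re[1] == '\0')     /* $ at end: match end of text */
        return *text == '\0';
    if (*text != '\0' && (re[0] == '.' || re[0] == *text))
        return matchhere(re + 1, text + 1); /* literal or . match: advance both */
    return 0;
}

/* matchstar: search for c*re at beginning of text */
int matchstar(int c, char *re, char *text)
{
    do {                                   /* a * matches zero or more instances */
        if (matchhere(re, text))
            return 1;
    } while (*text != '\0' && (*text++ == c || c == '.'));
    return 0;
}
```

Note how the do-while in matchstar tries the zero-instance case first, which is why it returns the shortest match rather than the leftmost longest one.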

We will look at a few more examples of this regular expression search.