Saturday, May 24, 2014

In this post, we will cover some graph theory problems from the books by Douglas B. West and Reinhard Diestel.
A graph consists of a vertex set V(G), an edge set E(G), and a relation that associates with each edge two vertices called its endpoints.
If we have five vertices, say 1, 2, 3, 4, and 5, then there can be many edges associated with them, some redundant and some forming loops. A graph is complete when every pair of vertices is joined by an edge. A simple graph is one that contains no loops and no multiple edges. A simple graph may be either connected or disconnected. A connected graph is one in which there is a path from any one vertex to any other; a graph that is not connected is a disconnected graph.
Graphs have an interesting problem set where vertices can be colored. Together, these problems that involve associating a color with each vertex are called coloring problems.
If a graph can be properly colored with just two colors, its chromatic number is 2. Coloring is important because it lets us discern some vertices from others; we can then arrange or treat those vertices differently. For example, a bipartite graph, also called a bigraph, is one whose vertex set can be decomposed into two disjoint sets such that no two vertices within the same set are adjacent. By adjacent, we mean joined by an edge.
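To make this concrete, a bipartite graph is exactly one that can be 2-colored. Below is a minimal sketch in C of that check; the size N, the sample graph, and the function name isBipartite are mine, for illustration only:

#include <stdio.h>

#define N 4   /* number of vertices in the toy graph */

/* Attempt a 2-coloring with breadth-first search: a graph is bipartite
   exactly when no edge joins two vertices of the same color. */
int isBipartite(int adj[N][N])
{
    int color[N], queue[N], head = 0, tail = 0;
    for (int i = 0; i < N; i++)
        color[i] = -1;                        /* -1 means uncolored */
    for (int s = 0; s < N; s++) {
        if (color[s] != -1)
            continue;                         /* next component */
        color[s] = 0;
        queue[tail++] = s;
        while (head < tail) {
            int u = queue[head++];
            for (int v = 0; v < N; v++) {
                if (!adj[u][v])
                    continue;
                if (color[v] == -1) {
                    color[v] = 1 - color[u];  /* give neighbor the opposite color */
                    queue[tail++] = v;
                } else if (color[v] == color[u]) {
                    return 0;                 /* two adjacent vertices share a color */
                }
            }
        }
    }
    return 1;
}

int main(void)
{
    /* a 4-cycle 0-1-2-3-0, which is bipartite */
    int adj[N][N] = { {0,1,0,1}, {1,0,1,0}, {0,1,0,1}, {1,0,1,0} };
    printf("bipartite: %d\n", isBipartite(adj));
    return 0;
}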
In any case, returning to our discussion of why we wanted to visit graph theory for keyword extraction and text summarization: we started with the assumption that every keyword contributes meaning to the text while holding some relation to the others. This relation can be described as competition, weighted competition, or cooperation, or the keywords may be isolated and independent of each other. In other words, there are linkages between the keywords that we want to exploit with the help of graphs. Moreover, the graph for a given text can be partitioned based on the eigenvalues and eigenvectors of its adjacency matrix. To compute the eigenvalues we start with the weighted Laplacian matrix. A Laplacian matrix is one where each diagonal element is the degree of the corresponding node; for example, if node 1 has four edges, then the degree of node 1 is four. When node i is adjacent to node j, the entry is -1; otherwise the entries are zero. We could also normalize the Laplacian matrix. This matrix can serve as an input to an algorithm that partitions the graph. These ideas come from similar studies in biodiversity.
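As a small illustration of the (unweighted) Laplacian construction just described, here is a sketch in C; the toy adjacency matrix and the function name laplacian are made up for the example:

#include <stdio.h>

#define N 4   /* vertices in the toy graph */

/* Build the Laplacian L = D - A: each diagonal entry is the degree of
   that node, and the (i, j) entry is -1 when i and j are adjacent. */
void laplacian(int adj[N][N], int lap[N][N])
{
    for (int i = 0; i < N; i++) {
        int degree = 0;
        for (int j = 0; j < N; j++)
            degree += adj[i][j];              /* count edges at node i */
        for (int j = 0; j < N; j++)
            lap[i][j] = (i == j) ? degree : -adj[i][j];
    }
}

int main(void)
{
    int adj[N][N] = { {0,1,1,0}, {1,0,1,0}, {1,1,0,1}, {0,0,1,0} };
    int lap[N][N];
    laplacian(adj, lap);
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++)
            printf("%3d", lap[i][j]);
        printf("\n");
    }
    return 0;
}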
We will also attempt to explore directed graphs in the text.
For undirected graphs, what I wanted to do was establish relations between terms in a text using similarity or distance measures, such as those from a thesaurus or WordNet. These relations become the edges in the graph, and the higher the degree of a vertex, the stronger its candidature as a keyword.
Once we establish the set of keywords we want to work with, we could then work with that graph or partition it further.
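As a toy illustration of this degree-based candidature, consider the sketch below; the terms, the pairwise similarities, and the 0.5 edge threshold are all made up, standing in for scores that a thesaurus or WordNet might provide:

#include <stdio.h>

#define T 5            /* number of candidate terms (made up) */
#define THRESHOLD 0.5  /* similarity cutoff for drawing an edge (assumed) */

int main(void)
{
    const char *terms[T] = { "graph", "vertex", "edge", "color", "matrix" };
    double sim[T][T] = {
        /* made-up symmetric similarity scores between the terms */
        { 1.0, 0.8, 0.7, 0.2, 0.6 },
        { 0.8, 1.0, 0.9, 0.1, 0.3 },
        { 0.7, 0.9, 1.0, 0.4, 0.2 },
        { 0.2, 0.1, 0.4, 1.0, 0.1 },
        { 0.6, 0.3, 0.2, 0.1, 1.0 },
    };
    for (int i = 0; i < T; i++) {
        int degree = 0;
        for (int j = 0; j < T; j++)
            if (i != j && sim[i][j] >= THRESHOLD)
                degree++;      /* an edge when similarity clears the cutoff */
        printf("%-8s degree %d\n", terms[i], degree);
    }
    return 0;
}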

I'm taking a break for this long weekend.

Friday, May 23, 2014

We explore pairwise relationships between words for text indexing and summarization using graph theory. Documents are often treated as a bag of words and represented by term vectors with limited dimensions. They are then clustered to extract the co-occurrences or the semantic relationships between terms in a document, thus helping with keyword extraction, summarization, or classification. Such techniques have relied on latent and probabilistic latent semantic analysis. However, words in a text not only represent the content of a document; they also bear some ranking and relationship to each other that can be represented with graphs. Graphs enable the discovery of connected salient words in a bag. While similar in nature to clustering, and occasionally used together with it, graphs can model both syntax and semantics, such as typed features and hierarchical ontology. However, the focus of this article is the use of graphs to connect words within a document and to retrieve the salient words based on edge weights and minimum spanning trees. The assignment of weights enables keyword extraction, and the spanning of those keywords enables summarization.
First we lay out some premises on which we will model our approach. Text can be verbose or terse yet carry the same content in terms of import. The same text can have more than one nuance, thereby requiring disambiguation either with context or with categorization. Text need not have meaning and can be a clutter of jargon with no equivalent synonyms. That is why, no matter how we represent the semantics of the text, the model should be anchored to the terms in the text. In other words, we take the keywords from within the documents even if we represent their semantic model with alternate words. We focus exclusively on a given language and leave the variations that arise in syntax or expression out of the scope of the samples we investigate. We take simple samples at first, say blog posts on a technical topic. We pay attention to the format of the text; that is, we look at some syntax if it helps our model, but we don't rely on the distribution or clustering. We continue to treat the document as a collection of terms without emphasis on where the text is tiled, segmented, or paragraphed. We ignore the layout of the text and focus instead on the weights of the terms, their densities, and the divergence of topics. Initially we begin with a single graph for our document, while recognizing the possibility of several sub-graphs in a document. We will attempt to find a model, and we will have metrics for our graph. For example, if we have two sets of keywords constituting two different graphs for the same document, we will keep the one that has the smaller minimum spanning tree. We know there are uses for our graph, but first we establish the criteria to come up with one.

Thursday, May 22, 2014

When we discussed radix sort, we saw that it works well for keys varying over a bigger range. If the keys are in a smaller range, then bucket sort will do. Bucket sort works something like this:
we make an array of r buckets
we make one pass through the records, inserting the object with key k into bucket k
we make a pass through the array of buckets and read the records off in order as we go.
The idea behind bucket sort is that we split the interval of the range of values into several buckets so that the items are randomly distributed across the buckets. With auxiliary storage to hold the items in a linked list for each bucket, we rely on the fact that no one bucket will have a large number of elements, so sorting each bucket is less expensive and we keep the cost linear.
The algorithm is as follows:
void bucketSort(int A[], int n, int r)
{
    // B is the array of r buckets, each an auxiliary list of records
    // distribute: place each record into the bucket matching its key
    for (int i = 0; i < n; i++)
        Insert(B[A[i]], A[i]);

    // sort each of the r buckets; each is expected to hold few items
    for (int j = 0; j < r; j++)
        InsertionSort(B[j]);

    // read the buckets off in order
    Concatenate(B, r);
}
The complexity of bucket sort is O(n + r).
In bucket sort we make an r-way decision at each step with the use of indirect addressing, which is a significant optimization over the comparison model.
Radix sort can work with a bigger range of r, i.e., r can be on the order of 10 to 256.
In radix sort the complexity is slightly different. If the strings are of length L and we spend O(n) time at each level, then the complexity is O(nL). It could end a lot earlier due to empty buckets.
In most significant digit (or character) first radix sort, we distinguish the strings by just their first few characters. If we ask how small L can be, given that we distinguish only by the first few characters, we just have to use the r-way split instead of the two-way split. The lower bound is then log_r(n), and if all the strings are roughly this length, the complexity is O(n log_r(n)).
In least significant first radix sort, the complexity is again different from most significant first. Here the running time is O(L(r+n)) because we concatenate L times, and at each stage the sorting is a bucket sort, which is O(r+n), or simply O(n) when n dominates r, giving an overall complexity of O(nL).
The advantage of implementing least-significant-first radix sort is that it doesn't need any recursion.
At implementation time we might want to consider padding all strings to the right. This makes the strings comparable, and the padding will not affect the comparison order of the strings, since the white-space character has a lower value than any letter. Also, space-optimized tries may require slightly more processing to navigate the pointers in the keys.
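To make the least-significant-first procedure concrete, here is a minimal sketch in C; it assumes the n strings have already been padded to exactly L characters as suggested above, and it uses a counting sort per character position to play the role of the r buckets (the function name lsdRadixSort is mine):

#include <stdlib.h>

#define R 256   /* alphabet size: one bucket per possible byte value */

/* Sort n strings, each exactly L characters long, least significant
   (rightmost) character first. Each pass is a counting sort, which is
   stable, so the order from earlier passes is preserved. */
void lsdRadixSort(char *a[], int n, int L)
{
    char **aux = malloc(n * sizeof(char *));
    for (int d = L - 1; d >= 0; d--) {
        int count[R + 1] = {0};
        for (int i = 0; i < n; i++)               /* count frequencies */
            count[(unsigned char)a[i][d] + 1]++;
        for (int r = 0; r < R; r++)               /* frequencies to offsets */
            count[r + 1] += count[r];
        for (int i = 0; i < n; i++)               /* distribute stably */
            aux[count[(unsigned char)a[i][d]]++] = a[i];
        for (int i = 0; i < n; i++)               /* copy back for next pass */
            a[i] = aux[i];
    }
    free(aux);
}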

Wednesday, May 21, 2014

Tonight we will look at a data structure called the trie.
A trie is an ordered tree data structure that can store a dynamic set or an associative array where the keys are usually strings. We search by prefixes, and hence tries are also referred to as prefix trees or radix trees. In a binary search tree, a node keeps the keys associated with that node. In a trie, this is not the case: what we use is the position of the node in the tree, and this position is used to construct the prefix. All the descendants of a node have a common prefix of the string associated with that node, and the root has the empty string. Values are normally not associated with every node, only with the leaves and some inner nodes. This lends the trie to being stored in a space-optimized manner, in which case it is referred to as a compact prefix tree.
The term trie comes from retrieval. The use of prefixes implies that we can treat the trie as a deterministic finite automaton without loops; the transitions from one node to the next are forward only.
While characters can be used to represent the keys, this is not always necessary. They can instead be stored as bits, in which case we have a bitwise trie. A fixed number of bits, such as an integer or a memory address, is convenient for storage.
A trie is very often used as a replacement for hash tables, and there are several advantages. First, the lookup is faster because it is prefix-based, which means we traverse down from the root in O(m) time, where m is the length of the key. Insertions and deletions are cheap.
There is no chaining and consequently no collisions.
This comes in very useful for applications such as autocomplete and spell checking.
Tries can provide automatic ordering of keys, which comes in useful for autocomplete or predictive text. They also come in useful for approximation algorithms.
We could take the example of radix sort: tries are to radix sort as binary search trees are to quicksort.
In particular, each node has an array of r child pointers. When we store words in the tree, we keep the letters on the edges and walk down the tree reading off the words.
When we insert, we end each string with a $, because some strings are prefixes of others. To do a lookup, we just walk down the tree letter by letter and then see whether the node has the $ ending the word. If we get to a null pointer, we know that the key is not there.
There are a couple of ways of making the trie more practical. In particular, we can compress paths that don't branch into a single edge, and only split when needed. These techniques optimize storage.
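Here is a minimal sketch in C of the plain (uncompressed) trie described above, restricted to lowercase keys for brevity; the $ terminator is represented by an isEnd flag, and the names are mine:

#include <stdio.h>
#include <stdlib.h>

#define R 26   /* child pointers per node: lowercase letters only */

/* Each node holds an array of R child pointers; isEnd plays the role of
   the $ terminator, so keys that are prefixes of other keys are found. */
typedef struct Trie {
    struct Trie *child[R];
    int isEnd;
} Trie;

Trie *newNode(void)
{
    return calloc(1, sizeof(Trie));   /* zeroed: all children null */
}

void insert(Trie *root, const char *key)
{
    for (; *key; key++) {
        int c = *key - 'a';
        if (root->child[c] == NULL)
            root->child[c] = newNode();
        root = root->child[c];
    }
    root->isEnd = 1;                  /* mark the end of the word, like $ */
}

int lookup(Trie *root, const char *key)
{
    for (; *key; key++) {
        int c = *key - 'a';
        if (root->child[c] == NULL)
            return 0;                 /* null pointer: key is not there */
        root = root->child[c];
    }
    return root->isEnd;
}

int main(void)
{
    Trie *root = newNode();
    insert(root, "ten");
    insert(root, "tennis");
    /* prints 1 1 0: "te" is a path in the trie but not a stored key */
    printf("%d %d %d\n", lookup(root, "ten"), lookup(root, "tennis"), lookup(root, "te"));
    return 0;
}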
Radix sort comes in different flavors: there is most significant digit/character first and least significant digit/character first.
In most-significant-first radix sort, we sort by the most significant digit/character to get the buckets. Then we recursively sort each bucket, starting with the next most significant digit/character, and so on.
In least-significant-first radix sort, we sort by the least significant digit/character to get the buckets, and then we concatenate them. This method is trickier to follow, but it is easy to implement. The reason we can go from the least significant position onwards is that each pass is stable: at each stage of the sorting we preserve the results of the previous iteration. We can prove that the algorithm stays correct by induction on the number of passes.
// d-th digit of x, counting from 1 at the least significant position
int Digit(int x, int d)
{
    while (--d > 0)
        x /= 10;
    return x % 10;
}

// One stable pass: insertion sort on the d-th digit only
void InsertionSort(int A[], int n, int d)
{
    for (int i = 1; i < n; i++)
    {
        int x = A[i];
        int j = i - 1;
        // shift right the elements whose d-th digit is larger than x's
        while (j >= 0 && Digit(A[j], d) > Digit(x, d))
        {
            A[j + 1] = A[j];
            j = j - 1;
        }
        A[j + 1] = x;
    }
}

void RadixSort(int A[], int n, int digits)
{
    for (int d = 1; d <= digits; d++)   // least significant digit first
        InsertionSort(A, n, d);
}

Let us discuss another Splunk application. We referred earlier to a Splunk app that integrates Log Parser; we will discuss it a little more today.
Log Parser search queries, as we know, are SQL queries, whereas Splunk takes Unix-like search operators. Moreover, Splunk exposes a query language for its indexed data, whereas Log Parser can work with a variety of data sources. They say data is sticky, meaning that users like to keep using the same applications they have been using for their data unless there is a compelling advantage or the cost and risk of moving the data is negligible. So that Splunk may work with data that belonged to Log Parser, let us look into translating queries to Log Parser so that we can avoid the data import.

I will return to this shortly, but I will digress to discuss an interesting data structure called the trie.

Tuesday, May 20, 2014

We will look at a few more examples of this regular expression search. We looked at how match called matchhere, and how matchhere in turn was recursive or called matchstar. We will now look at, say, grep functionality.
grep is implemented this way: we read strings from a file a buffer at a time and null-terminate the buffer. We run match over this buffer, increment the match count, print the match, and continue.
In other words, grep is a continuous search for the match over the entire file, a buffer at a time.
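A minimal sketch of that loop in C is below, assuming the match routine from the earlier discussion; here fgets supplies the buffer-at-a-time reading and null-terminates for us, so each buffer is one line:

#include <stdio.h>
#include <string.h>

int match(char *regexp, char *text);   /* the matcher discussed earlier */

/* Read the file a buffer at a time, run match over each buffer,
   print and count the matches. */
int grep(char *regexp, FILE *f, char *name)
{
    int nmatch = 0;
    char buf[BUFSIZ];
    while (fgets(buf, sizeof buf, f) != NULL) {
        buf[strcspn(buf, "\n")] = '\0';   /* strip the trailing newline */
        if (match(regexp, buf)) {
            nmatch++;
            if (name != NULL)
                printf("%s: ", name);     /* prefix with the file name */
            printf("%s\n", buf);
        }
    }
    return nmatch;
}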
Now suppose we want matchstar to implement the leftmost longest match, and suppose we were to implement the ^ and $ signs as well.
In this case, we will have to watch out for the following cases:
^ - match at the beginning
$ - match at the end
if both ^ and $ are present, the literal between them must match the entire string, so that the beginning and the end coincide.
if both ^ and $ are specified and there is no literal, then even the empty string matches.
literals before ^ and after $ are not permitted.
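In the spirit of the Kernighan-Pike style matcher that match/matchhere suggest, here is a minimal sketch of how the two anchors can be handled; matchhere is assumed from the earlier discussion:

int matchhere(char *regexp, char *text);   /* from the earlier discussion */

/* match: search for regexp anywhere in text; a leading '^' anchors the
   search at the beginning, so we try matchhere only once. */
int match(char *regexp, char *text)
{
    if (regexp[0] == '^')
        return matchhere(regexp + 1, text);
    do {                        /* must look even if the string is empty */
        if (matchhere(regexp, text))
            return 1;
    } while (*text++ != '\0');
    return 0;
}

/* Inside matchhere, a '$' at the end of the pattern matches only at the
   end of the text:
       if (regexp[0] == '$' && regexp[1] == '\0')
           return *text == '\0';
   With both anchors present and no literal between them, ^$ therefore
   matches exactly the empty string. */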
Let us now look at a way to use a regex for validating a phone number. Note that phone numbers can come in different forms: some with brackets, others with or without hyphens, and some missing the area code, etc.
Given a ten-digit US phone number, where the 3-digit area code can be optional, the regex would look something like this:
^(\(\d{3}\)|\d{3}[-]?)?\d{3}[-]?\d{4}$
Here the leading caret denotes the start of the line with the phone number.
The first ( opens a group, while \( denotes a literal parenthesis in the optional area code.
\d matches a digit, with the number of occurrences as specified in the { } braces.
The | bar indicates that the area code may be present either with or without parentheses.
Then there are matching characters to close the ones opened above.
? makes the group optional, while $ matches the end of the line.
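To try the pattern out, here is a small sketch using the POSIX regcomp/regexec API; POSIX extended syntax has no \d shorthand, so the sketch spells out [0-9], and the sample numbers are made up:

#include <regex.h>
#include <stdio.h>

/* Validate a US phone number with an optional 3-digit area code. */
int isPhoneNumber(const char *s)
{
    static const char *pattern =
        "^(\\([0-9]{3}\\)|[0-9]{3}-?)?[0-9]{3}-?[0-9]{4}$";
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return 0;                                  /* pattern failed to compile */
    int ok = (regexec(&re, s, 0, NULL, 0) == 0);   /* 0 means a match */
    regfree(&re);
    return ok;
}

int main(void)
{
    const char *samples[] = { "(555)123-4567", "555-123-4567", "1234567", "12-34" };
    for (int i = 0; i < 4; i++)
        printf("%-15s -> %s\n", samples[i],
               isPhoneNumber(samples[i]) ? "valid" : "invalid");
    return 0;
}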