Monday, May 26, 2014

In this post, I want to talk about a coloring problem from graph theory. Take the problem of coloring a grid: no grid cell may touch another cell of the same color, each color has a cost when used, and the goal is to color the map as cheaply as possible. If we represent the cells by vertices, no adjacent vertices can have the same color. A coloring that uses the fewest possible colors is a minimum vertex coloring, and a graph that can be colored with k or fewer colors is said to be k-colorable. The chromatic number is the smallest number of colors needed. The four color theorem states that any planar graph has a chromatic number of at most 4. This was first conjectured in 1852, but it is difficult to prove by hand. Microsoft Research folks announced in 2004 that they had verified a proof by Robertson et al. by formulating the problem in equational logic and confirming the validity of each step.
Some more interesting facts about a simple planar graph:
The number of edges is at most 3V - 6 (for V >= 3).
There is always some vertex of degree at most 5.
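As a minimal sketch of the coloring constraint in Python: each vertex greedily takes the cheapest color not already used by a neighbor. Greedy coloring does not guarantee the cheapest overall coloring, and the 2x2 grid and cost list below are made up for illustration.
def greedy_color(adjacency, color_costs):
    # adjacency: dict vertex -> set of neighbors
    # color_costs: list of costs, cheapest colors first
    coloring = {}
    for v in adjacency:
        used = {coloring[u] for u in adjacency[v] if u in coloring}
        # pick the cheapest color not used by an already-colored neighbor
        coloring[v] = next(c for c in range(len(color_costs)) if c not in used)
    return coloring

# 2x2 grid: cells 0,1 on top and 2,3 below; edges join side-by-side cells
grid = {0: {1, 2}, 1: {0, 3}, 2: {0, 3}, 3: {1, 2}}
print(greedy_color(grid, color_costs=[1, 2, 3, 4]))  # {0: 0, 1: 1, 2: 1, 3: 0}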
The set E of the edges of a graph without any loops represents the adjacency relation on V. By definition such a relation is irreflexive (no vertex is adjacent to itself) and symmetric.
Edges can be traversed in paths, cycles, trails and circuits.
A walk is an alternating sequence of vertices and connecting edges.
A path is a walk that does not include any vertex twice, except that the first may also be the last.
A cycle is a path that begins and ends on the same vertex.
A trail is a walk that does not pass over the same edge twice though it can cross the same vertex again.
A circuit is a trail that begins and ends on the same vertex.
Along with spanning trees, other graph traversal methods include Eulerian and Hamiltonian cycles, network flows from source to sink, and graph colorings where edges or vertices are colored. The network flow problem can be solved in O(n^3) time.
An Eulerian cycle or circuit is a trail that starts and ends at the same vertex and uses each edge of the graph exactly once.
Euler showed that a connected graph has an Eulerian cycle iff it has no graph vertices of odd degree.
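A small Python sketch of Euler's condition, assuming the graph is given as a dict from vertex to neighbor set: every vertex must have even degree and the edges must all lie in one connected piece.
def has_eulerian_cycle(graph):
    # all degrees must be even
    if any(len(neighbors) % 2 for neighbors in graph.values()):
        return False
    # the vertices that carry edges must be connected: a simple DFS
    start = next(v for v in graph if graph[v])
    seen, stack = set(), [start]
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(graph[v] - seen)
    return all(not graph[v] or v in seen for v in graph)

square = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
print(has_eulerian_cycle(square))  # True: every vertex has degree 2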
A Hamiltonian cycle or circuit is a graph cycle that visits each vertex exactly once. A simple connected graph with only two vertices has no Hamiltonian cycle. The complete graph on three vertices has one Hamiltonian cycle, and the complete graph on four vertices has three.
The problem of finding a Hamiltonian cycle is NP-complete; for a general graph, finding one can require an exhaustive search. Rubin's method describes an efficient search procedure that reduces backtracking and guesswork in finding some or all Hamiltonian paths. All Platonic solids, also called regular solids, which are composed of congruent convex regular polygonal faces, have Hamiltonian paths.
With the above description of graphs, the coloring of the vertices and the different ways to traverse the graphs, we wanted to find the keywords in a text and to list them so that we can summarize the text.
However, the paths to the vertices and the traversals seem to be less useful than finding the keywords themselves, whether by coloring or by some other way of costing. When we used an ontology of words to find similarity measures, it considered only pairwise words along their semantic relationship, but it did not consider the co-occurrences available in the given text. If we find that words are similar because they appear in a thesaurus or can be categorized in a hierarchical classification, that is less useful than when they appear with other words as the author intended. If the text is not short, these co-occurrences matter more and could be considered more reliable than the graph method. So at the very least, we should not consider graph-based traversal of words a substitute for higher-dimensional clustering methods. That said, graph-based traversals need not rely on ontologies alone.
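As an illustration of what counting co-occurrences could look like, here is a minimal Python sketch; the sliding window size and the sample sentence are made up, and overlapping windows deliberately reinforce frequently co-located pairs.
from collections import Counter
from itertools import combinations

def cooccurrences(tokens, window=3):
    counts = Counter()
    for i in range(len(tokens)):
        # count every unordered pair inside the current window
        for a, b in combinations(sorted(set(tokens[i:i + window])), 2):
            counts[(a, b)] += 1
    return counts

tokens = "graph coloring assigns colors to graph vertices".split()
print(cooccurrences(tokens).most_common(3))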

Saturday, May 24, 2014

In this post, we will cover some graph theory problems from the book by Douglas B. West and another by Reinhard Diestel.
A graph consists of a vertex set V(G), an edge set E(G), and a relation that associates with each edge two vertices that are its endpoints.
If we have a vertex set {1, 2, 3, 4, 5}, then there can be many edges associated with these vertices, some redundant and some forming loops. A graph is considered complete when every pair of vertices is joined by an edge. A simple graph is one that contains no loops and no multiple edges. A simple graph may be either connected or disconnected. A connected graph is one in which there is a path from any one point to another point; a graph that is not connected is a disconnected graph.
Graphs have an interesting set of problems in which vertices can be colored. Together, these problems, which involve associating a color with each vertex, are called coloring problems.
Paths, for instance, can be colored with just two colors, so their chromatic number is 2. Why is coloring important? Because it lets us distinguish some vertices from others, and we can then arrange or treat those vertices differently. For example, a bipartite graph, also called a bigraph, is one whose vertices can be decomposed into two disjoint sets such that no two vertices within the same set are adjacent. By adjacent, we mean joined by an edge.
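A minimal Python sketch of checking bipartiteness by two-coloring with a breadth-first search; the example graphs are made up. A graph is bipartite exactly when such a two-coloring exists, i.e. when its chromatic number is at most 2.
from collections import deque

def is_bipartite(graph):
    color = {}
    for source in graph:
        if source in color:
            continue
        color[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            for u in graph[v]:
                if u not in color:
                    color[u] = 1 - color[v]   # opposite color to its neighbor
                    queue.append(u)
                elif color[u] == color[v]:    # same color across an edge: not bipartite
                    return False
    return True

path = {1: {2}, 2: {1, 3}, 3: {2}}            # a path is bipartite
triangle = {1: {2, 3}, 2: {1, 3}, 3: {1, 2}}  # an odd cycle is not
print(is_bipartite(path), is_bipartite(triangle))  # True False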
In any case, returning to our discussion of why we wanted to visit graph theory for keyword extraction or text summarization: we started with the assumption that every keyword contributes a meaning to the text while holding some relation with the others. This relation can take the form of competition, weighted competition, cooperation, or isolation and independence from one another. In other words, there are linkages between the keywords that we want to exploit with the help of graphs. Moreover, the graph for a given text can be partitioned based on the eigenvalues and eigenvectors of its adjacency matrix. To compute the eigenvalues we start with the weighted Laplacian matrix. In a Laplacian matrix, the diagonal element for each node is the degree of that node; for example, if node 1 has four edges then its degree is four. When node i is adjacent to node j, the off-diagonal entry is -1; otherwise the entries are zero. We could also normalize the Laplacian matrix. This matrix can serve as input to an algorithm that partitions the graph. These ideas come from similar studies in biodiversity.
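A sketch of this kind of partitioning, assuming numpy is available and the graph is given as an adjacency matrix: build the Laplacian L = D - A and split the vertices on the sign of the eigenvector belonging to the second-smallest eigenvalue (the Fiedler vector), which is one standard way to cut the graph in two.
import numpy as np

def spectral_partition(adjacency):
    degrees = adjacency.sum(axis=1)
    laplacian = np.diag(degrees) - adjacency
    eigenvalues, eigenvectors = np.linalg.eigh(laplacian)  # ascending eigenvalues
    fiedler = eigenvectors[:, 1]      # eigenvector of the second-smallest eigenvalue
    return fiedler >= 0               # boolean mask: which side each vertex falls on

# two triangles joined by a single edge between vertices 2 and 3
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]])
print(spectral_partition(A))  # the two triangles land on opposite sides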
We will also attempt to explore directed graphs in the text.
For undirected graphs, what I wanted to do was to establish relations between terms in a text using similarity or distance measures, such as with a thesaurus or WordNet. These relations become the edges of the graph, and the higher the degree of a vertex, the stronger its candidacy as a keyword.
Once we establish the set of keywords we want to work with, we could then work with that graph or partition it further.
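A toy sketch of this degree-based ranking in Python; the term-to-term edges here are hand-made stand-ins for the thesaurus or WordNet similarities described above.
from collections import Counter

edges = [("graph", "vertex"), ("graph", "edge"), ("vertex", "edge"),
         ("graph", "coloring"), ("coloring", "vertex"), ("summary", "keyword")]

degree = Counter()
for a, b in edges:
    degree[a] += 1    # each edge raises the degree of both endpoints
    degree[b] += 1

keywords = [term for term, _ in degree.most_common(3)]
print(keywords)  # ['graph', 'vertex', 'edge']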





I'm taking a break for this long weekend.

Friday, May 23, 2014

We explore pairwise relationships between words for text indexing and summarization using graph theory. Documents are often treated as a bag of words and represented by term vectors with limited dimensions. They are then clustered to extract the co-occurrences or the semantic relationships between terms in a document, thus helping with keyword extraction, summarization or classification. Such techniques have relied on latent and probabilistic latent semantic analysis. However, words in a text not only represent the content of a document but also bear some ranking and relationship to each other that can be represented with graphs. Graphs enable the discovery of connected salient words in a bag. While similar in nature to clustering and occasionally used together with it, graphs can be used to model both syntax and semantics, such as typed features and hierarchical ontology. However, the focus of this article is the use of graphs to connect words within a document and to retrieve the salient words based on edge weights and minimum spanning trees. The assignment of weights enables keyword extraction, and the spanning of those keywords enables summarization.
First we lay out some premises on which we will model our approach. Text can be verbose or terse yet have the same import. The same text can have more than one nuance, thereby requiring disambiguation either with context or with categorization. Text need not have meaning and can be a clutter of jargon with no equivalent synonyms. That is why, no matter how we represent the semantics of the text, the model should be anchored to the terms in the text. In other words, we take the keywords from within the document even if we represent their semantic model with alternate words. We focus exclusively on a given language and leave the variations that arise in syntax or expression out of the scope of the samples we investigate. We take simple samples at first, say blog posts on a technical topic. We look at some syntax if it helps our model, but we don't rely on the distribution or clustering. We continue to treat the document as a collection of terms without emphasis on where the text is tiled, segmented or paragraphed. We ignore the layout of the text and focus instead on the weights of the terms, the densities and the divergence of topics. Initially we begin with a single graph for the document, while recognizing the possibility of several sub-graphs in a document. We will attempt to find a model and we will have metrics for our graph. For example, if we have two sets of keywords constituting two different graphs for the same text, we will keep the one with the smaller minimum spanning tree. We know there are uses for our graph, but first we establish the criteria for constructing one.
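As a sketch of that selection rule, here is a small Prim's-algorithm computation of the minimum spanning tree weight of a candidate keyword graph in Python; the keywords and edge weights are invented for illustration, and in practice the weights would come from the similarity measures discussed above.
import heapq

def mst_weight(graph, start):
    # graph: dict vertex -> dict of neighbor -> weight
    visited, total = {start}, 0
    heap = [(w, u) for u, w in graph[start].items()]
    heapq.heapify(heap)
    while heap and len(visited) < len(graph):
        w, u = heapq.heappop(heap)
        if u in visited:
            continue
        visited.add(u)
        total += w
        for v, wv in graph[u].items():
            if v not in visited:
                heapq.heappush(heap, (wv, v))
    return total

candidate = {"graph": {"vertex": 1, "edge": 2},
             "vertex": {"graph": 1, "edge": 1},
             "edge": {"graph": 2, "vertex": 1}}
print(mst_weight(candidate, "graph"))  # 2: the tree keeps graph-vertex and vertex-edge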

Thursday, May 22, 2014

When we discussed radix sort, we saw that it works well for keys varying over a bigger range. If the keys are in a smaller range, then bucket sort could do. Bucket sort works something like this:
We make an array of r buckets.
We make one pass through the records, inserting the object with key k into bucket k.
We make a pass through the array of buckets and read the items off in order.
The idea behind bucket sort is that we split the range of values into several buckets so that the items are spread roughly evenly across them. With auxiliary storage to hold the items in a linked list for each bucket, we rely on the fact that no one bucket will have a large number of elements, so sorting each bucket is inexpensive and we keep the overall cost linear.
The algorithm, as a runnable Python sketch:
def bucket_sort(a, r):
    # an array of r buckets; keys are assumed to lie in the range 0..r-1
    buckets = [[] for _ in range(r)]
    # one pass through the records: insert the item with key k into bucket k
    for x in a:
        buckets[x].append(x)
    # sort each bucket (cheap when buckets stay small) and read them off in order
    out = []
    for b in buckets:
        out.extend(sorted(b))
    return out

print(bucket_sort([3, 1, 4, 1, 5, 2], r=6))  # [1, 1, 2, 3, 4, 5]
The complexity for the bucket sort is O(n + r)
In bucket sort we make an r-way decision at each step with the use of indirect addressing, which is a significant improvement over the comparison model.
Radix sort can work with a bigger radix r, i.e., on the order of 10 to 256.
In radix sort the complexity is slightly different. If the strings are of length L, and we spend O(n) time at each level, then the complexity is O(nL). It could end a lot earlier due to empty buckets.
In most-significant digit or character first radix sort, we distinguish the strings by just their first few characters. If we ask how small L can be while still distinguishing the strings, we only need an r-way split instead of a two-way split. The lower bound is then log_r(n), and if all the strings are roughly this length, the complexity is O(n log_r n).
In the least-significant first radix sort, the complexity is again different from most-significant first.
Here the running time is O(L*(r+n)) because we concatenate L times and at each stage the sorting is that of a bucket sort, which is O(r+n), or simply O(n) when r is small, giving the overall complexity O(nL).
The advantage of implementing the least-significant-first version is that it doesn't need any recursion.
At implementation time we might want to consider padding all strings to the right. This makes the strings comparable, and the padding does not affect the comparison value of a string since the space character has a lower value than any letter. Also, space-optimized tries may require slightly more processing to navigate the pointers in the keys.
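A small Python sketch of the most-significant-character-first idea on strings: bucket by the character at position d, then recurse into each bucket with d + 1. The sample words are made up.
def msd_radix_sort(strings, d=0):
    if len(strings) <= 1:
        return strings
    done, buckets = [], {}
    for s in strings:
        if len(s) == d:
            done.append(s)                 # exhausted strings sort first
        else:
            buckets.setdefault(s[d], []).append(s)
    for ch in sorted(buckets):
        done.extend(msd_radix_sort(buckets[ch], d + 1))
    return done

print(msd_radix_sort(["cab", "ca", "ab", "abc", "b"]))
# ['ab', 'abc', 'b', 'ca', 'cab']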

Wednesday, May 21, 2014

Tonight we will look at a data structure called trie.
A trie is an ordered tree data structure that can store a dynamic set or an associative array where the keys are usually strings. We search by prefixes, and hence it is referred to as a prefix tree or radix tree. In a binary search tree, a node keeps the keys associated with that node; in the trie, this is not the case. What we use is the position of the node in the tree, and this position is used to construct the prefix. All the descendants of a node have a common prefix of the string associated with that node, and the root has the empty string. Values are normally not associated with every node, only with leaves and some inner nodes. This lets the trie be stored in a space-optimized manner, referred to as a compact prefix tree.
The term trie comes from retrieval. The use of prefixes implies that we can use the trie as a deterministic finite automaton without loops; the transitions from one node to another are forward only.
While characters can be used to represent the keys, this is not always necessary. Keys can instead be stored as bits, in which case we have a bitwise trie. A fixed number of bits, such as an integer or a memory address, is convenient for storage.
A trie is very often used as a replacement for hash tables. There are several advantages. First, lookup is fast because it is prefix based, which means we traverse down from the root in O(m) time, where m is the length of the key. Insertions and deletions are cheap.
There is no chaining and consequently no collisions.
This comes in very useful for applications such as autocomplete and spell checking.
Tries can provide automatic ordering of keys which comes in useful for auto complete or predictive text. They also come in useful for approximation algorithms.
We could take the example of radix sort: tries relate to radix sort much as binary search trees relate to quicksort.
In particular, each node has an array of size r of child pointers.  When we store words in a tree, we keep the letters on the edges and walk down the tree reading the words.
When we insert, we terminate each word with a '$', since some strings are prefixes of others. To do a lookup, we just walk down the tree letter by letter and then check whether the node has the ending '$' for the word. If we reach a null pointer, we know the key is not there.
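A minimal Python sketch of the insert and lookup just described, using a dict per node and '$' as the end-of-word marker so that keys which are prefixes of other keys are still found.
def insert(trie, word):
    node = trie
    for ch in word:
        node = node.setdefault(ch, {})   # walk or create the child for this letter
    node['$'] = True                     # end-of-word marker

def lookup(trie, word):
    node = trie
    for ch in word:
        if ch not in node:
            return False                 # fell off the tree: key absent
        node = node[ch]
    return '$' in node

trie = {}
for w in ["car", "cart", "cat"]:
    insert(trie, w)
print(lookup(trie, "car"), lookup(trie, "ca"))  # True False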
There are a couple of ways of making it more practical. In particular, we can:
compress paths that don't branch into a single edge
and only split when needed. These techniques optimize storage.
Radix sort comes in different flavors. For example, there's the most significant digit/character first and the least significant digit/character first radix sort.
In the most-significant-first radix sort, we sort by the most significant digit/character to get the buckets. Then we recursively sort each bucket, starting with the next most significant digit/character, and so on.
In the least-significant-first radix sort, we sort by the least significant digit/character to get the buckets and then concatenate them. This method is trickier to follow but easy to implement. The reason we can go from the least significant digit onwards is that each stage of the sorting is stable, so it preserves the order established by the previous pass among keys that tie on the current digit; correctness then follows by induction on the number of passes.
def digit(x, d):
    # the d-th decimal digit of x, counting from 1 at the least significant end
    return (x // 10 ** (d - 1)) % 10

def insertion_sort_by_digit(a, d):
    # stable insertion sort keyed only on digit d
    for i in range(1, len(a)):
        x = a[i]
        j = i - 1
        while j >= 0 and digit(a[j], d) > digit(x, d):
            a[j + 1] = a[j]
            j = j - 1
        a[j + 1] = x

def radix_sort(a, digits):
    # least significant digit first; each stable pass preserves the previous order
    for d in range(1, digits + 1):
        insertion_sort_by_digit(a, d)
    return a

print(radix_sort([170, 45, 75, 90, 802, 24, 2, 66], digits=3))  # [2, 24, 45, 66, 75, 90, 170, 802]

Let us discuss another Splunk application. We referred earlier to a Splunk app that integrates Log Parser; we will discuss it a little more today.
Log Parser search queries, as we know, are SQL queries, whereas Splunk takes Unix-like search operators. Moreover, Splunk exposes a query language for its indexed data, whereas Log Parser can work with a variety of data sources. They say data is sticky, meaning that users like to keep using the same applications they have been using for their data unless there is a compelling advantage or the cost and risk of moving the data is negligible. In order for Splunk to work with data that belonged to Log Parser, let us look into query translation to Log Parser so that we can avoid the data import.
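As a toy sketch of the idea, and not either product's actual API, here is what translating a simple key=value Splunk-style search into a Log Parser style SQL query might look like in Python; the field names, the log file path and the mapping rules are invented for illustration only.
def to_logparser_sql(splunk_search, log_path="ex*.log"):
    # keep only key=value terms and turn them into a WHERE clause
    terms = [t for t in splunk_search.split() if "=" in t]
    where = " AND ".join("{} = '{}'".format(*t.split("=", 1)) for t in terms)
    return "SELECT * FROM {} WHERE {}".format(log_path, where)

print(to_logparser_sql("sourcetype=iis sc-status=404"))
# SELECT * FROM ex*.log WHERE sourcetype = 'iis' AND sc-status = '404'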

I will return to this shortly, but I will digress to discuss an interesting data structure called the trie.