Saturday, May 31, 2014

We read a paper on Social text normalization using Contextual Graph Random Walks by Hassan and Menezes. This paper describes text normalization, which comes in useful for keyword extraction and summarization as well. Text normalization does away with noise, typos, abbreviations, phonetic substitutions and slang. This means the cleaned data will be uniform and more meaningful. Their approach uses Random Walks on a contextual similarity bipartite graph constructed from n-gram sequences over a large unlabeled text corpus.
Natural language processing applications often have to deal with noisy text. Dirty text causes a lot of problems: it can break the machine learning algorithms, skew the results, and may not be representative of the content. Text results are often measured by parameters such as precision and recall. We discussed these in our earlier posts; they are generally meant to determine the relevancy of the results. We saw that these measures are directly affected by the count of successful matches. If matches are not possible because dirty text gets discounted, skipped, or, even worse, included where it should not be, then the results are no longer accurate.
Furthermore, we saw that we had some elementary steps to mitigate these problems. For example, we considered stemming and lemmatization to discover root words and conform different representations to the same word. Such techniques, however, can be overly general. If the modifications are also supposed to differentiate the words, then such a technique alone does not help.
There are other techniques, such as a lookup table relevant to the domain of the subject. Such lookup tables simplify the language a great deal in that we now have fewer and more consistent words.
Lookup tables are an interesting idea: we can associate one or more variations of a word with the same normalized term. This lets us add new variations, such as slang, into the table for lookup. Every time we see a new term, we can include it in the table for later use.
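As a concrete illustration, here is a minimal sketch of such a lookup-table normalizer; the table entries are made-up examples, not from the paper.

// A minimal lookup-table normalizer: every known variant (typo, slang,
// abbreviation) maps to a single normalized term. Entries are illustrative only.
#include <iostream>
#include <string>
#include <unordered_map>

std::string normalize(const std::string& token,
                      const std::unordered_map<std::string, std::string>& table)
{
    auto it = table.find(token);
    return it != table.end() ? it->second : token;   // unknown tokens pass through
}

int main()
{
    std::unordered_map<std::string, std::string> table = {
        {"2moro", "tomorrow"}, {"gr8", "great"}, {"u", "you"}
    };
    table["thx"] = "thanks";                          // new variations can be added as seen
    std::cout << normalize("gr8", table) << "\n";     // prints "great"
}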
While lookup tables are great for associations between terms and their normalized word, the normalized word is neither the lemma nor necessarily the meaning of the word.
To improve on that, we could extend the idea to use a thesaurus or an ontology. These organize words by their related semantics. An ontology goes further in describing a hierarchy or even categorizing terms. As you can see, these refinements are meant to bring the dirty text into the mainstream for counting towards the measures we have for language processing applications.


I will continue on my post on text graphs paper later today.

Friday, May 30, 2014

In today's post we look at the paper on a graph-based approach to skill extraction from text by Kivimaki, Panchenko, Dessy and Verdegem. This paper presents a system that outputs, for a given text, a list of professional skills as defined in the LinkedIn social network. The text is compared for similarity with the texts of Wikipedia pages, and a Spreading Activation algorithm is then run on the Wikipedia hyperlink graph in order to associate the input document with skills. We look at just this algorithm. The method consists of two phases: first, the query document is associated with Wikipedia articles using a vector space model, in what is described as the text2wiki phase. Then, with these pages and the hyperlink graph of Wikipedia, articles corresponding to the skills are found and related using the algorithm, in what is described as the wiki2skill phase. One caveat with this method is that it has to avoid overly general topics or skills, and this is done by biasing against hubs.
The text2wiki module relies on the Gensim library for text similarity functions based on traditional vector space models. Each text is represented as a vector in a space of the 300,000 most frequent terms in the corpus. This is then input to the wiki2skill module, where the vector with initial activations a(0) is iterated upon and the activation is spread into the neighbouring nodes. Activation refers to the initial work done to identify the nodes of interest. At the end of each iteration, we propagate the activations using the formula:
a(t) = decay_factor · a(t-1) + Lambda · W^pulses · a(t-1) + c(t)
where decay_factor controls the conservation of activation over time,
Lambda is the friction factor, which controls the amount of activation that nodes can spread to their neighbours,
and W is the weighted adjacency matrix.
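A rough sketch of one propagation step over a dense weighted adjacency matrix could look like this (my own simplification, with the pulses fixed at one; not the authors' implementation):

// One spreading-activation iteration:
// a(t) = decay * a(t-1) + lambda * W * a(t-1) + c(t)
// W is the weighted adjacency matrix, a the activation vector, c external input.
#include <vector>

std::vector<double> spreadOnce(const std::vector<std::vector<double>>& W,
                               const std::vector<double>& a,
                               const std::vector<double>& c,
                               double decay, double lambda)
{
    std::vector<double> next(a.size(), 0.0);
    for (size_t i = 0; i < a.size(); ++i) {
        double neighbours = 0.0;
        for (size_t j = 0; j < a.size(); ++j)
            neighbours += W[i][j] * a[j];          // activation arriving from neighbours
        next[i] = decay * a[i] + lambda * neighbours + c[i];
    }
    return next;
}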

Thursday, May 29, 2014

In today's post I will talk about assigning weights to keywords in a text. We will use the same strategy as in the paper described earlier. Specifically, we will use the number of categories together with the words as subgraphs.
In the paper on discovering interesting messages spread across Twitter using link analysis, by Yang, Lee and Rim, the retweets of messages are leveraged as implicit relationships between Twitter users.
However, they look at more than just the sheer number of retweets and use link analysis.
The retweet count has been used by Hong et al. as a measure of popularity and to build classifiers for predicting whether and how often new tweets will be retweeted in the future. Alonso et al. used the presence of a URL link as a single highly effective feature for distinguishing interesting tweets with more than eighty percent accuracy. However, this paper attempts to go beyond that by modeling Twitter as a graph consisting of user and tweet nodes implicitly connected by retweet links, formed when one user retweets what another user retweeted. They use a variation of the HITS algorithm that exploits the retweet link structure as an indicator of how interesting an individual tweet is. What is particularly interesting is that this paper does not treat all retweet links as equal; some have more importance than others. They demonstrate their study on real Twitter data. They score the interesting tweets with a ranking.
They model the Twitter structure as a directed graph G with nodes N and directional edges E.
The graph G has two subgraphs: one based only on the user nodes and another based only on the tweet nodes. Instead of running HITS on the tweet subgraph, they run it on the user subgraph and let tweets inherit the scores of their publishers.
They first take the sum of the weighted hub scores corresponding to each user, with the weights based on user counts, and this sum is treated as the authority score. Then they update the hub scores with the weights, as before, but against the authority scores for a given user. They do this iteratively, and with each iteration the scores tend to converge. After each iteration, the scores are normalized between 0 and 1 by dividing each of them by the square root of the sum of squares of all authority/hub values. At the end we have each user's authority score and hub score. This mechanism dampens the influence of users who devote most of their retweet activity to very few other users and increases the weights of users who retweet many more. The weights are the ratio of all other users that a user retweeted to all retweet outlinks from the user.
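A condensed sketch of that iteration on the user subgraph might look like this (my own simplification; the weight matrix stands in for whatever the paper actually computes from retweet counts):

// One iteration of the weighted HITS-style update described above:
// authority(u) = sum over retweeters v of weight(v,u) * hub(v)
// hub(v)       = sum over retweeted users u of weight(v,u) * authority(u)
// followed by L2 normalization of both score vectors.
#include <cmath>
#include <vector>

void hitsIteration(const std::vector<std::vector<double>>& weight, // weight[v][u]
                   std::vector<double>& hub,
                   std::vector<double>& authority)
{
    const size_t n = hub.size();
    std::vector<double> newAuth(n, 0.0), newHub(n, 0.0);
    for (size_t u = 0; u < n; ++u)
        for (size_t v = 0; v < n; ++v)
            newAuth[u] += weight[v][u] * hub[v];      // retweeters confer authority
    for (size_t v = 0; v < n; ++v)
        for (size_t u = 0; u < n; ++u)
            newHub[v] += weight[v][u] * newAuth[u];   // retweeting good users makes a good hub
    auto normalize = [](std::vector<double>& s) {
        double norm = 0.0;
        for (double x : s) norm += x * x;
        norm = std::sqrt(norm);
        if (norm > 0) for (double& x : s) x /= norm;  // scale scores into [0, 1]
    };
    normalize(newAuth);
    normalize(newHub);
    authority = newAuth;
    hub = newHub;
}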

Wednesday, May 28, 2014

Today I will write about a few more trivia from graph theory. There is always an even number of vertices of odd degree in a simple graph. A large number of operations can be defined on collections of graphs. For example, graph sums, differences, powers and even graph eigenvalues can be calculated on these collections. We wanted to use these collections in text mining.
We referred to finding the Laplacian and the eigenvalues. The eigenvalues of a graph are the eigenvalues of its adjacency matrix. The set of eigenvalues of a graph is called the graph spectrum. The eigenvalues of a graph are a special set of scalars associated with a matrix equation that decomposes a square matrix into an equivalent set. Each eigenvalue is associated with a so-called eigenvector.
If we have a vector X, then a square matrix A satisfies
A X = Lambda X
or, more succinctly,
(A - Lambda I) X = 0
and the scalar Lambda is called an eigenvalue of A for the corresponding right eigenvector X.
Given a diagonalizable square matrix A, it can be represented as A = P D (P^-1)
where P is the matrix whose columns are the eigenvectors of A,
D is the diagonal matrix constructed from the eigenvalues,
and (P^-1) is the inverse of P.
This form of decomposition is called eigendecomposition (or eigenvalue decomposition).

What this helps us with is the canonicalization of a system to its simplest form where we reduce the number of parameters from n×n to n for the diagonal matrix.
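As an aside, if we only need the dominant eigenvalue of a graph's adjacency matrix rather than the full decomposition, power iteration is a simple way to approximate it. A minimal sketch (my own, assuming a dense adjacency matrix):

// Power iteration: repeatedly apply A to a vector and normalize; the Rayleigh
// quotient converges to the dominant eigenvalue of the adjacency matrix A.
#include <cmath>
#include <vector>

double dominantEigenvalue(const std::vector<std::vector<double>>& A, int iterations = 100)
{
    const size_t n = A.size();
    std::vector<double> x(n, 1.0);                    // arbitrary starting vector
    double lambda = 0.0;
    for (int it = 0; it < iterations; ++it) {
        std::vector<double> y(n, 0.0);
        for (size_t i = 0; i < n; ++i)
            for (size_t j = 0; j < n; ++j)
                y[i] += A[i][j] * x[j];               // y = A x
        double norm = 0.0;
        for (double v : y) norm += v * v;
        norm = std::sqrt(norm);
        if (norm == 0.0) return 0.0;                  // A x vanished; nothing to estimate
        for (double& v : y) v /= norm;
        lambda = 0.0;                                 // Rayleigh quotient y^T A y
        for (size_t i = 0; i < n; ++i)
            for (size_t j = 0; j < n; ++j)
                lambda += y[i] * A[i][j] * y[j];
        x = y;
    }
    return lambda;
}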

Tuesday, May 27, 2014

This post covers some more interesting trivia about graph theory.
A graph diameter is the longest shortest path between any two graph vertices.
If we take a connected graph or network with a high graph diameter  and we add a very small number of edges randomly, the diameter tends to drop drastically. This is known as the small world phenomenon. It is also referred to as the six degrees of separation since any person in a social network of the world turns out to be linked to any other person by roughly six connections.
An acyclic digraph is a directed graph containing no directed cycles, also known as a directed acyclic graph or a DAG. The number of DAGs on n = 1, 2, 3, ... vertices is 1, 2, 6, 31, ...
In the use of DAGs for communication networks, we frequently encounter problems of information dissemination described for a group of individuals. Gossiping and broadcasting are two such common problems. In gossiping, every person in the network knows a unique item of information and needs to communicate it to everyone else. In broadcasting, one individual has an item of information which needs to be communicated to everyone else.
To illustrate gossiping, consider that there are n people, each of whom knows a specific scandal that's not known to the others. They communicate by telephone, and whenever two people place a call, they pass on as many scandals as they know. The question is how many calls are needed before everyone gets to know every scandal. We could try this with a value of n = 4 where the participants are labeled A, B, C and D. Then the scandals could spread in this manner: {A,B}, {C,D}, {A,C}, {B,D}. The n = 4 solution can then be generalized by adding another participant E to the beginning and end of the previous solution, resulting in {A,E}, {A,B}, ..., {B,D}, {A,E} (a small sketch of this construction follows the call counts below).
We thus see that gossiping takes an increasing number of calls as participants are added. Gossiping is also called total exchange and all-to-all communication. It has applications in communications and distributed systems. For example, a large class of parallel computing problems are addressed with gossiping, including Fourier transforms and sorting. If f(n) denotes the minimum number of calls necessary to complete gossiping among n people, where any pair of people can call each other, we see that for
n = 1, f(n) = 0
n = 2, f(n) = 1
n = 3, f(n) = 3
n = 4, f(n) = 4
n = 5, f(n) = 6
and in general f(n) = 2n - 4 for n >= 4
For one way communication where the graph is considered a DAG, the minimum number of calls = f(n) = 2n - 2 for n >= 4
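Here is a minimal sketch (my own illustration, with participants numbered from 0) of the two-way construction above, which achieves 2n - 4 calls for n >= 4 by having each new participant call participant 0 before and after the smaller schedule:

#include <iostream>
#include <utility>
#include <vector>

std::vector<std::pair<int,int>> gossipSchedule(int n)
{
    // Base schedule for four participants 0..3: {0,1},{2,3},{0,2},{1,3}
    std::vector<std::pair<int,int>> calls = {{0,1},{2,3},{0,2},{1,3}};
    for (int k = 4; k < n; ++k) {
        calls.insert(calls.begin(), {0, k});   // new participant first tells 0 its scandal
        calls.push_back({0, k});               // and calls 0 again at the end to learn the rest
    }
    return calls;                              // size is 2n - 4 for n >= 4
}

int main()
{
    for (auto [a, b] : gossipSchedule(5))
        std::cout << a << " <-> " << b << "\n";   // 6 calls for n = 5
}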

Monday, May 26, 2014

In this post, I want to talk about a coloring problem from graph theory. Let us take the problem of coloring a grid. No grid cell may touch another cell of the same color. Each color used has a cost. The goal is to color the grid as cheaply as possible. If we take the cells to be represented by vertices, no adjacent vertices can have the same color. Such a coloring is known as a minimum vertex coloring. A graph that can be vertex colored with k or fewer colors is said to be k-colorable. The chromatic number is the minimum number of colors needed. The four color theorem states that any planar graph has a chromatic number of at most 4. This was first conjectured in 1852 but is difficult to prove manually. Microsoft Research folks announced in 2004 that they had verified a proof by Robertson et al. by formulating the problem in equational logic and confirming the validity of each step.
Some more interesting facts about simple planar graphs:
The number of edges is at most 3V - 6 (for V >= 3).
There is always at least one vertex of degree at most 5.
The set E of the edges of a graph without any loops represents the adjacency relation on V. By definition such a relation is irreflexive (no vertex is adjacent to itself) and symmetric.
Edges can be traversed in paths, cycles, trails and circuits.
A walk is an alternating sequence of vertices and connecting edges.
A path is a walk that does not include any vertex twice, except that the first vertex might also be the last.
A cycle is a path that begins and ends on the same vertex.
A trail is a walk that does not pass over the same edge twice though it can cross the same vertex again.
A circuit is a trail that begins and ends on the same vertex.
Along with spanning trees, other graph traversal methods include Eulerian and Hamiltonian cycles, network flows from source to sink, and graph colorings where edges or vertices are colored. The network flow problem can be solved in O(n^3) time.
An Eulerian cycle or circuit is a closed trail that starts and ends at the same vertex and uses each graph edge exactly once.
Euler showed that a connected graph has an Eulerian cycle iff it has no graph vertices of odd degree.
A Hamiltonian cycle or circuit is a graph cycle that visits each node exactly once. A connected graph with two vertices does not have a Hamiltonian cycle. The complete graph on n vertices has (n-1)!/2 distinct Hamiltonian cycles: one for three vertices and three for four vertices.
The problem of finding a Hamiltonian cycle is NP-complete. If a general graph is given, then it is exhaustively searched to find a Hamiltonian cycle. Rubin's method describes an efficient search procedure that reduces backtracking and guesswork in finding some or all Hamiltonian paths. All Platonic solids, also called regular solids, which are composed of congruent convex regular polygonal faces, have Hamiltonian paths.
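Euler's condition mentioned above is easy to check in code. A small sketch (my own, assuming an adjacency-list representation):

// A connected graph has an Eulerian cycle iff every vertex has even degree.
// This checks the degree condition plus connectivity (ignoring isolated vertices).
#include <vector>

bool hasEulerianCycle(const std::vector<std::vector<int>>& adj)
{
    const int n = static_cast<int>(adj.size());
    for (const auto& neighbours : adj)
        if (neighbours.size() % 2 != 0)
            return false;                              // odd-degree vertex: no Eulerian cycle
    std::vector<bool> seen(n, false);
    int start = -1;
    for (int v = 0; v < n; ++v)
        if (!adj[v].empty()) { start = v; break; }
    if (start < 0) return true;                        // no edges at all
    std::vector<int> stack = {start};                  // depth-first search from 'start'
    seen[start] = true;
    while (!stack.empty()) {
        int v = stack.back(); stack.pop_back();
        for (int w : adj[v])
            if (!seen[w]) { seen[w] = true; stack.push_back(w); }
    }
    for (int v = 0; v < n; ++v)
        if (!adj[v].empty() && !seen[v])
            return false;                              // edges in a separate component
    return true;
}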
With the above description of graphs, the coloring of the vertices and the different ways to traverse the graphs, we wanted to find the keywords in a text and to list them such that we can summarize the text.
However, the paths between the vertices and the traversal seem to be less useful than finding the keywords themselves, whether by coloring or by some other way of costing. When we used an ontology of words to find similarity measures, it considered only pairwise words along their semantic relationship, but it did not consider the co-occurrences as they are available in the given text. If we find that the words are similar because they appear in a thesaurus or can be categorized in a hierarchical classification, that is less useful than when they appear with other words as the author intended. If the text is not short, then these co-occurrences matter more and could be considered more reliable than the graph method. So at the very least, we should not consider the graph-based traversal of words a substitute for the higher dimension based clustering methods. That said, graph-based traversals need not be with ontologies only.

Saturday, May 24, 2014

In this post, we will cover some graph theory problems from the book by Douglas B. West and another by Reinhard Diestel.
A graph consists of a vertex set V(G), an Edge set E(G) and a relation that associates with each edge two vertices that are its endpoints.
If we have a vertex set with five vertices 1, 2, 3, 4 and 5, then there can be many edges associated with these vertices, some redundant and some forming loops. A graph is considered complete when every pair of vertices is joined by an edge. A simple graph is one that contains no loops and has no multiple edges. A simple graph may be either connected or disconnected. A connected graph is one in which there is a path from any one point to another point. A graph that is not connected is a disconnected graph.
Graphs have an interesting problem set where vertices can be colored. Together these sets of problems that involve associating a color to the vertex, are called coloring problems.
If two colors suffice, the chromatic number is 2. Why is coloring important? Because it lets us discern some vertices from others. We can also arrange or treat these vertices differently. For example, a bipartite graph, also called a bigraph, is one whose set of graph vertices can be decomposed into two disjoint sets such that no two graph vertices within the same set are adjacent. By adjacent, we mean vertices joined by an edge.
In any case, returning to our discussion on why we wanted to visit graph theory for keyword extraction or text summarization: we started with the assumption that every keyword contributes a meaning to the text while holding some relation with the others. This relation can be described in the form of competition, weighted competition or cooperation, or the keywords may be isolated and independent of each other. In other words, there are linkages between the keywords that we want to exploit with the help of graphs. Moreover, the graph in the given text can be partitioned based on the eigenvalues and eigenvectors of its adjacency matrix. To compute the eigenvalues we start with the weighted Laplacian matrix. A Laplacian matrix is one where the diagonal elements are the degrees of each node. For example, if node 1 has 4 edges then the degree of node 1 is four. When node i is adjacent to node j, the corresponding entry is -1, otherwise the entries are zero. We could also normalize the Laplacian matrix. This matrix can serve as input to an algorithm that partitions the graph. These ideas come from similar studies in biodiversity.
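A small sketch of that construction (my own, assuming an unweighted 0/1 adjacency matrix):

// Build the combinatorial Laplacian L = D - A from a 0/1 adjacency matrix:
// diagonal entries hold each node's degree, off-diagonal entries are -1 for
// adjacent pairs and 0 otherwise.
#include <vector>

std::vector<std::vector<double>> laplacian(const std::vector<std::vector<int>>& adj)
{
    const size_t n = adj.size();
    std::vector<std::vector<double>> L(n, std::vector<double>(n, 0.0));
    for (size_t i = 0; i < n; ++i) {
        int degree = 0;
        for (size_t j = 0; j < n; ++j) {
            if (i != j && adj[i][j]) {
                L[i][j] = -1.0;       // adjacent vertices
                ++degree;
            }
        }
        L[i][i] = degree;             // degree of node i on the diagonal
    }
    return L;
}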
We will also attempt to explore directed graphs in the text.
For the undirected graphs, what I wanted to do was to establish relations between terms in a text using similarity or distance measures, such as with a thesaurus or WordNet. These become the edges in the graph, and the higher the degree of a vertex, the stronger its candidacy as a keyword.
Once we establish the set of keywords we want to work with, we could then work with that graph or partition it further.





I'm taking a break for this long weekend.

Friday, May 23, 2014

We explore pairwise relationships between words in text indexing and summarization using graph theory. Documents are often treated as a bag of words and represented by term vectors with limited dimensions. They are then clustered to extract the co-occurrences or the semantic relationships between terms in a document, thus helping with keyword extraction, summarization or classification. Such techniques have relied on latent and probabilistic latent semantic analysis. However, words in a text not only represent the content of a document but also bear some ranking and relationship with each other that can be represented with graphs. Graphs enable the discovery of connected salient words in a bag. While similar in nature to clustering and occasionally used together with it, graphs can be used to model both syntax and semantics, such as typed features and hierarchical ontology. However, the focus of this article is the use of graphs to connect words within a document and to retrieve the salient words based on the weights and the minimum spanning trees. The assignment of weights enables keyword extraction and the spanning of those keywords enables summarization.
First we lay out some premises on which we will model our approach. We state that text can be either verbose or terse but can have the same content in terms of its import. The same text can carry more than one nuance, thereby requiring disambiguation either with context or with categorization. Text need not have meaning and can be a clutter of jargon with no equivalent synonyms. That is why, no matter how we represent the semantics of the text, the model should be anchored to the terms in the text. In other words, we take the keywords from within the documents even if we represent their semantic model with alternate words. We focus exclusively on a given language and leave the variations that arise in syntax or expressions out of the scope of the samples we investigate. We take simple samples at first - say blog posts regarding a technical topic. We do pay some attention to the format of the text; that is, we look at some syntax if it helps our model, but we don't rely on the distribution or clustering. We continue to treat the document as a collection of terms without emphasis on where the text is tiled, segmented or paragraphed. We ignore the layout of the text and focus instead on the weights of the terms, the densities and the divergence of topics. Initially we begin with a single graph for our document while we recognize the possibility of several sub-graphs in a document. We will attempt to find a model and we will have metrics for our graph. For example, if we have two sets of keywords constituting two different graphs for the same text, we will keep the one that has the smaller minimum spanning tree. We know there are uses for our graph, but first we establish the criteria for coming up with one.
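To make the minimum-spanning-tree criterion concrete, here is a sketch of Prim's algorithm over a matrix of pairwise term distances; the distance measure itself (for example, one derived from WordNet similarities) is assumed to be supplied elsewhere.

// Prim's algorithm: total weight of the minimum spanning tree of a complete
// graph whose vertices are keywords and whose edge weights are pairwise
// distances between terms (smaller = more related).
#include <limits>
#include <vector>

double mstWeight(const std::vector<std::vector<double>>& dist)
{
    const size_t n = dist.size();
    if (n == 0) return 0.0;
    const double INF = std::numeric_limits<double>::infinity();
    std::vector<bool> inTree(n, false);
    std::vector<double> best(n, INF);   // cheapest edge connecting each vertex to the tree
    best[0] = 0.0;
    double total = 0.0;
    for (size_t k = 0; k < n; ++k) {
        size_t u = n;
        for (size_t v = 0; v < n; ++v)          // pick the closest vertex not yet in the tree
            if (!inTree[v] && (u == n || best[v] < best[u]))
                u = v;
        inTree[u] = true;
        total += best[u];
        for (size_t v = 0; v < n; ++v)          // relax edges out of the new tree vertex
            if (!inTree[v] && dist[u][v] < best[v])
                best[v] = dist[u][v];
    }
    return total;
}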

Thursday, May 22, 2014

When we discussed radix sort, we saw that it works well for keys varying over a bigger range. If the keys are in a smaller range, then bucket sort could do. Bucket sort works something like this:
we make an array of r buckets;
we make one pass through the records and insert the object with key k into bucket k;
we make a pass through the array of buckets and read the records off in order as we go.
The idea behind bucket sort is that we split the interval of the range of values into several buckets so that the records are randomly distributed across the buckets. With auxiliary storage to hold the items in a linked list for each bucket, we rely on the fact that no one bucket will have a large number of elements, so sorting each bucket is inexpensive and we keep the overall cost linear.
The algorithm is as follows:
#include <algorithm>
#include <vector>

// Bucket sort for integer keys in the range [0, r): distribute the records,
// sort each bucket, and concatenate the buckets back into A.
void bucketSort(std::vector<int>& A, int r)
{
    std::vector<std::vector<int>> B(r);
    for (int key : A)
        B[key].push_back(key);                 // insert A[i] into bucket A[i]
    int k = 0;
    for (int j = 0; j < r; j++) {
        std::sort(B[j].begin(), B[j].end());   // insertion sort also works; buckets are small
        for (int key : B[j])
            A[k++] = key;                      // concatenate B back into A
    }
}
The complexity for the bucket sort is O(n + r)
In bucket sort we are making an r-way decision at each step with the use of indirect addressing, which is a significant optimization over the comparison model.
Radix sort can work with a bigger range of r, i.e. it can be of the order of 10 to 256.
In radix sort the complexity is slightly different. If the strings are of length L and we spend O(n) time at each level, then the complexity is O(nL). It could end a lot earlier due to empty buckets.
In most-significant digit or character first radix sort, we distinguish the strings from just their first few characters. If we now try to find out how small L can be, given that we distinguish only with the first few characters, then we just have to use the r-way split instead of the two-way split. The lower bound is then log_r(n), and if all the strings are roughly this length, then the complexity is O(n log_r(n)).
In the least-significant first radix sort, the complexity is again different from most-significant first.
Here the running time is O(L(r+n)) because we concatenate L times and at each stage the sorting is a bucket sort, which is O(r+n), or simply O(n) when r is small, giving the overall complexity as O(nL).
The advantage of implementing least-significant-first radix sort is that it doesn't need any recursion.
At implementation time we might want to consider padding all strings to the right. This would make the strings comparable, and the padding will not affect the comparison value of the string since the white space character has a lower value than the first literal 'a'. Also, space optimized tries may require slightly more processing to navigate the pointers in the keys.

Wednesday, May 21, 2014

Tonight we will look at a data structure called trie.
A trie is an ordered tree data structure that can store a dynamic set or an associative array where the keys are usually strings. We search by prefixes, and hence tries are also referred to as prefix trees or radix trees. In a binary search tree, a node keeps the keys associated with that node. In a trie, this is not the case: what we use is the position of the node in the tree, and this position is used to construct the prefix. All the descendants of a node have a common prefix of the string associated with that node, and the root has an empty string. Values are normally not associated with every node, only with leaves and some inner nodes. This lets the trie be stored in a space-optimized manner, in which case it is referred to as a compact prefix tree.
The term trie comes from retrieval. The use of prefixes implies that we can view the trie as a deterministic finite automaton without loops. The transitions from one node to the next are forward only.
While characters can be used to represent the keys, this is not always necessary. Instead they can be stored as bits in which case we have a bitwise trie. Fixed size of bits such as an integer number or memory address is convenient for storage.
A trie is very often used as a replacement for hash tables. There are several advantages. First, the lookup is prefix based, which means we traverse down from the root node in O(m) time, where m is the length of the key. Insertions and deletions are cheap.
There is no chaining and consequently no collisions.
This comes in very useful for applications such as auto complete and Spellchecker.
Tries can provide automatic ordering of keys which comes in useful for auto complete or predictive text. They also come in useful for approximation algorithms.
We could take an example of radix sort.
Tries relate to radix sort in the way that binary search trees relate to quicksort.
In particular, each node has an array of size r of child pointers. When we store words in a tree, we keep the letters on the edges and walk down the tree reading off the words.
When we insert, we end each word with a $ marker because in some cases strings are substrings (prefixes) of others. To do a lookup, we just walk down the tree letter by letter and then see if the node has the ending $ for the word. If we get to a null pointer we know that the key is not there.
There are a couple of ways of making it more practical. In particular, some things we can do are:
compress paths that don't branch into a single edge,
and only split when needed. These techniques optimize storage.
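As an illustration, here is a compact sketch of such a trie with insert and lookup over lowercase words (r = 26 child pointers per node, with a boolean flag standing in for the '$' marker):

// A simple r-way trie over lowercase words. Each node has an array of 26
// child pointers; a boolean flag plays the role of the '$' end-of-word marker.
#include <array>
#include <memory>
#include <string>

struct TrieNode {
    std::array<std::unique_ptr<TrieNode>, 26> child;
    bool isWord = false;
};

void insert(TrieNode& root, const std::string& word)
{
    TrieNode* node = &root;
    for (char ch : word) {
        int c = ch - 'a';
        if (!node->child[c])
            node->child[c] = std::make_unique<TrieNode>();
        node = node->child[c].get();        // walk down, creating nodes as needed
    }
    node->isWord = true;                    // mark the end of the word
}

bool lookup(const TrieNode& root, const std::string& word)
{
    const TrieNode* node = &root;
    for (char ch : word) {
        node = node->child[ch - 'a'].get();
        if (!node) return false;            // ran into a null pointer: key not present
    }
    return node->isWord;                    // present only if marked as a complete word
}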
Radix sort comes in different flavors. For example, there's the most significant digit/character first and the least significant digit/character first radix sort.
In the most significant first radix sort, we sort by the most significant to get the buckets. Then we recursively sort each bucket, starting with the next most significant digit/character and so on.
In the least-significant first radix sort, we sort by the least significant digit/character to get the buckets. Then we concatenate them. This method is trickier to follow but it's easy to implement. The reason we can go from the least significant digit onwards is that at each stage (significance) of the sorting, we preserve the relative order of equal keys from the previous iteration (the sort is stable). We can prove the correctness of the algorithm by mathematical induction on the digit position.
// LSD radix sort over non-negative integers: sort by each decimal digit,
// least significant first, using a stable insertion sort keyed on that digit.
static int digitAt(int x, int d)                   // d = 1 is the least significant digit
{
    while (--d > 0) x /= 10;
    return x % 10;
}

void InsertionSortByDigit(int A[], int n, int d)   // stable sort on digit d only
{
    for (int i = 1; i < n; i++)
    {
        int x = A[i];
        int j = i - 1;
        while (j >= 0 && digitAt(A[j], d) > digitAt(x, d))
        {
            A[j + 1] = A[j];
            j = j - 1;
        }
        A[j + 1] = x;
    }
}

void RadixSort(int A[], int n, int digits)
{
    for (int d = 1; d <= digits; d++)              // least significant digit first
        InsertionSortByDigit(A, n, d);
}

Let us discuss another Splunk application. We referred to a Splunk app that integrates log parser. We will discuss it a little bit more today.
Log parser search queries, as we know, are SQL queries, whereas Splunk takes Unix-like search operators. Moreover, Splunk exposes a query language for its indexed data whereas log parser can work with a variety of data sources. They say data is sticky, meaning that users like to use the same applications that they have been using for their data unless there is a compelling advantage or the cost and risk to move the data is negligible. In order that Splunk may work with the data that belonged to log parser, let us look into query translation to log parser so that we can avoid the data import.

I will return to this shortly but I will digress to discuss an interesting data structure called trie

Tuesday, May 20, 2014

We will look at a few more examples of this regular expression search. We looked at how match called matchhere, and matchhere in turn was recursive or called matchstar. We will now look at, say, grep functionality.
grep is implemented this way: we read strings from a file a buffer at a time and null terminate the buffer. We run match over this buffer, increment the match count, print the matching line and continue.
In other words, grep is a continuous search for the match over the entire file, a buffer at a time.
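The corresponding grep driver from Kernighan and Pike looks roughly like this (quoted from memory, so treat it as a sketch; match is the function from the earlier example):

#include <stdio.h>
#include <string.h>

int match(char *re, char *text);   /* from the earlier example */

/* grep: search for regexp in file, print matching lines, return the count */
int grep(char *regexp, FILE *f, char *name)
{
    int n, nmatch = 0;
    char buf[BUFSIZ];

    while (fgets(buf, sizeof buf, f) != NULL) {
        n = strlen(buf);
        if (n > 0 && buf[n-1] == '\n')
            buf[n-1] = '\0';                 /* null terminate the buffer */
        if (match(regexp, buf)) {
            nmatch++;
            if (name != NULL)
                printf("%s:", name);
            printf("%s\n", buf);
        }
    }
    return nmatch;
}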
Now suppose we want matchstar to implement a leftmost-longest match, and suppose we were to implement the ^ and the $ signs as well.
In this case, we will have to watch out for the following cases:
^ - match at the beginning
$ - match at the end
if both ^ and $ are present, then the literal between them must match the entire string, i.e. the beginning and the end coincide.
if both ^ and $ are specified and there is no literal, then even empty strings match.
literals before ^ and after $ are not permitted.
Let us now look at a way to use regex for validating a phone number. Note that phone numbers can come in different forms some with brackets, others with or without hyphens and some missing the country code etc.
Given a ten digit US phone number, where the 3 digit area code can be optional, the regex would look something like this:
^(\(\d{3}\)|\d{3}[-]?)?\d{3}[-]?\d{4}$
Here the leading caret in the regex anchors the match to the start of the string containing the phone number.
The first ( opens a group for the optional area code, while \( matches a literal parenthesis.
\d matches a digit, with the number of occurrences specified in the { } braces.
The | bar indicates that the area code may appear either with parentheses or without.
Then there are characters to close the ones opened above.
? makes the area code group optional, while $ matches the end of the line.
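Here is a small sketch (my own example, not from any reference) that exercises this pattern with std::regex:

#include <iostream>
#include <regex>
#include <string>

int main()
{
    // The phone-number pattern from above, as a raw string literal.
    const std::regex phone(R"(^(\(\d{3}\)|\d{3}[-]?)?\d{3}[-]?\d{4}$)");
    for (const std::string s : {"(425)555-1234", "425-555-1234", "555-1234", "55512"})
        std::cout << s << " -> "
                  << (std::regex_match(s, phone) ? "valid" : "invalid") << "\n";
}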

Monday, May 19, 2014

Regular Expressions provide a compact and expressive notation for describing patterns of text. Their algorithms are interesting and are helpful in a variety of applications. They are accredited with making scripting an alternative to programming. They come in several flavors but almost all use wild cards which are special notations.
Some common wildcards include the following
* to denote zero or more occurrences of the preceding character
. to denote a single character
$ for the end of string
^ for the beginning of a string
^$ matches an empty string
[] to include specific sets of characters
+ for one or more occurrences
? for zero or one occurrence
\ as a prefix to quote a meta character
These building blocks are combined with parenthesis for grouping.

Even if we take just the four elementary regular expression wild cards ^, $, . and *, we can use them for a modest regular expression search function. Kernighan and Pike showed that with functions for match and matchhere this can be simple and easy. match determines whether a string matches a regular expression. matchhere checks whether the text matches at the current position, and match tries it at each position in turn. As soon as we find a match, the search is over. If the expression begins with a ^, the text must begin with a match of the remainder of the pattern. Otherwise the characters are traversed one by one, invoking matchhere at each position. Even empty text would need to be checked because * can match empty strings.
Example 1:
Regular Expressions

by Brian W. Kernighan and Rob Pike
/* match: search for re anywhere in text */
int match(char *re, char *text)
{
   if (re[0] == '^')
      return matchhere(re+1, text);
   do {  /* must look at empty string */
      if (matchhere(re, text))
         return 1;
   } while (*text++ != '\0');
   return 0;
}

The matchhere method searches for the regular expression at the beginning of the text. We first check for a null or empty pattern, then we check for an asterisk following the first pattern character, in which case we try to match that character zero or more times; the repetition stops at a null terminator or at a character that doesn't match, unless the pattern character is '.'. We call this the matchstar method. Next we check for a $ at the end of the pattern against the null terminator in the text. Lastly we check for more text and call matchhere again.

The matchstar method matches zero or more instances by calling matchhere on the text one character at a time, until the prefix character doesn't match (unless the prefix is a dot, which matches any character) or until the end of the text is reached.

The matchhere method calls the matchstar method in one of the four cases that it checks. For the other cases, it either returns a boolean corresponding to the end of the pattern or the end of the text being reached, or it calls itself recursively when the leading character of the regular expression is a period or matches the text literally.
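For reference, the matchhere and matchstar routines from the same article are roughly as follows (quoted from memory, so check against the original):

int matchstar(int c, char *re, char *text);   /* forward declaration */

/* matchhere: search for re at beginning of text */
int matchhere(char *re, char *text)
{
    if (re[0] == '\0')
        return 1;
    if (re[1] == '*')
        return matchstar(re[0], re+2, text);
    if (re[0] == '$' && re[1] == '\0')
        return *text == '\0';
    if (*text != '\0' && (re[0] == '.' || re[0] == *text))
        return matchhere(re+1, text+1);
    return 0;
}

/* matchstar: search for c*re at beginning of text */
int matchstar(int c, char *re, char *text)
{
    do {    /* a * matches zero or more instances */
        if (matchhere(re, text))
            return 1;
    } while (*text != '\0' && (*text++ == c || c == '.'));
    return 0;
}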

We will look at a few more examples of this regular expression search.
In this post, we conclude how we can map raw function pointers to symbols offline. This means we no longer depend on debugging sessions to resolve a stack. When we encounter a stack trace for our executable that is not resolved, we simply pass the raw function pointers to our tool along with the load address of our executable, and each of these function pointers is then looked up based on its RVA. The RVA is the relative address of the function pointer from the load address:
RVA = eip - load address
If the stack trace was already available in the format executable name + offset, the offset translates to the RVA.
The raw function pointers are easier to see in a crash log, and hence the tool takes the eip directly. Note that the eip points to the return address in a debugger; here we refer to it as the function pointer.
 When we reviewed the win32 process startup stages, we saw how the process was initialized and what constitutes its address space. This helped us calculate the offsets.
The PDB is laid out in contiguous memory regions identified by section numbers. Each section has an address range and the addresses are found as offsets from a given section. This is different from the Relative Virtual Address, which indicates a position from the load address. A function pointer in an unresolved stack is specifically an offset equal to the RVA from the load address. We use this to find the symbols. All symbols have type information associated with them. We filter the result to only the type Function because that's what we are after.
Note that the symbols can be found by different methods such as the two lookups we have mentioned above. At the same time we can also iterate over all the symbols to dump the information based on tables. Since the PDB keeps track of all information regarding symbols in tables, we can exhaustively scan the tables for information on all symbols.
Dumping all the information on the symbols helps to get the information in text format which can then be searched for RVA, offset, section number or name of a function.
The latter approach is quite useful when we have heavy text parsing utilities.



Sunday, May 18, 2014

Today I want to try out the dia2dump sample.
        IDiaSymbol* pFunc; // initialized elsewhere
:
        DWORD seg = 0;
        DWORD offset = 0;
        DWORD sect = 0;
        g_pDiaSession->findSymbolByAddr( sect, offset, SymTagFunction, &pFunc );
or

        BSTR bstrName;

        if (pCompiland->get_name(&bstrName) != S_OK) {
            wprintf(L"(???)\n\n");
        }

       else {
            wprintf(L"%s\n\n", bstrName);

            SysFreeString(bstrName);
      }

And we specify the load address for the executable file that corresponds to the symbols in this symbol store.

HRESULT put_loadAddress (
   ULONGLONG retVal
);

It is important to call this method when you get an IDiaSession object and before you start using the object.

A section contrib is defined as  a contiguous block of memory contributed to the image by a compiland.
A segment is a portion of the address space. A section contrib can map to segments.

An offset is the difference between a given raw instruction pointer and the load address of the process.

If you don't know the section and the offset, you can pass the entire address (as an offset from the load address) in the offset parameter and specify the section as zero,
or you can iterate over the section numbers
eg:
/Users/rrajamani/Downloads/dia2dump.txt:Function: [00447B60][0001:00446B60] ServerConfig::getSSLConfig(public: struct ssl_config __cdecl ServerConfig::getSSLConfig(void) __ptr64)

Here the RVA is 00447B60  [ eip - Process Load Address ]
         the Segment is 0001
         the offset is 00446B60





        DWORD64  dwDisplacement = 0;
        DWORD64  dwAddress = _wcstoui64(argv[i], NULL, 16);
        DWORD64  dwRVA = dwAddress - dwLoadAddress;
        long displacement = 0;
        IDiaSymbol* pFunc = 0;
        error = (DWORD)g_pDiaSession->findSymbolByRVAEx(dwRVA, SymTagFunction, &pFunc, &displacement );

        if (!error && pFunc)
        {
            BSTR bstrName;
            if (pFunc->get_name(&bstrName) != S_OK) {
                wprintf(L"(???)\n\n");
            }
            else {
                wprintf(L"%s \n\n", bstrName);
                if (displacement)
                    wprintf(L"+ 0x%x \n\n", displacement);
                else
                    wprintf(L" \n\n");
                SysFreeString(bstrName);
            }
        }

 

Saturday, May 17, 2014

A look at win32 virtual address for a process and its creation.
Let's begin by reviewing the steps involved in creating a process using a call such as CreateProcess().
First the image file (.exe) to be executed is loaded inside the creating process (caller)
Second the executive process object is created.
Third the initial thread (stack, context and executive thread object) is created
Then the Win32 subsystem is notified of the new process so that it can set up for the new process and thread.
Then the calling process can start execution of the initial thread (unless the Create Suspended flag was specified) and return
 In the context of the new process and thread, the initialization of the address space is completed ( such as the load of the required DLLs) and then the execution at the entry point to the image is started. ( main subroutine )
It's probably relevant to mention here that in the first stage of loading the image, we find out what kind of application it is. If it's a Win32 application, it can be executed directly. If it's an OS/2 application, we run os2.exe; if it's POSIX, we run posix.exe; and if it's MS-DOS, we run it under ntvdm.exe.
In the second stage, we create the executive process object.  This involves the following substages:
Setting up the EPROCESS block
Creating the initial process address space
Creating the kernel process block
Concluding the setup of the process address space
Setting up the PEB
and completing the setup of the executive process object.
In the substages above, the one involving the creation of the initial Process Address Space will consist of three pages:
the Page directory
the Hyperspace page and
the Working set list
In the sixth stage, which performs process initialization in the context of the new process, the image begins execution in user mode. This is done by creating a trap frame that specifies the previous mode as user and the address to return to as the main entry point of the image. Thus, when the trap that causes the thread to start execution in kernel mode is dismissed, it begins execution in user mode at the entry point.
Inside the x86 process virtual address space starting from low to high memory, the kernel portion at the high memory consists of HAL usage sections, crash dump information, nonpaged pool expansion, system PTEs, paged pool, system cache, system working set list, etc. The user mode consists of hyperspace and process working set list, page tables and page directory, system PTEs, mapped views and system code.  The kernel and the user mode addresses are separated by unused areas. In Linux, we have a memory mapped region in this space. The kernel space is reserved in the higher memory. The stack grows downwards adjacent to kernel space. The heap grows upwards from the user space.
Note that the compiled instructions are static, so to map an address to its source lines we can do the following:
    if (pFunc->get_addressSection ( &seg ) == S_OK &&
        pFunc->get_addressOffset ( &offset ) == S_OK)
    {
        pFunc->get_length ( &length );
        pSession->findLinesByAddr( seg, offset, static_cast<DWORD>( length ), &pEnum );
    }
IDiaSymbol* pFunc;
pSession->findSymbolByAddr( isect, offset, SymTagFunction, &pFunc ); 
Today we will discuss how to use the DIA interfaces for reading the PDB when compared to our previous post here
MSDN says the following steps:

//initialize
CComPtr<IDiaDataSource> pSource;
hr = CoCreateInstance( CLSID_DiaSource,
                       NULL,
                       CLSCTX_INPROC_SERVER,
                       __uuidof( IDiaDataSource ),
                      (void **) &pSource);

if (FAILED(hr))
{
    Fatal("Could not CoCreate CLSID_DiaSource. Register msdia80.dll." );
}
 
//load
wchar_t wszFilename[ _MAX_PATH ];
mbstowcs( wszFilename, szFilename, sizeof( wszFilename )/sizeof( wszFilename[0] ) );
if ( FAILED( pSource->loadDataFromPdb( wszFilename ) ) )
{
    if ( FAILED( pSource->loadDataForExe( wszFilename, NULL, NULL ) ) )
    {
        Fatal( "loadDataFromPdb/Exe" );
    }
}
 
CComPtr<IDiaSession> psession;
if ( FAILED( pSource->openSession( &psession ) ) ) 
{
    Fatal( "openSession" );
}
 
CComPtr<IDiaSymbol> pglobal;
if ( FAILED( psession->get_globalScope( &pglobal) ) )
{
    Fatal( "get_globalScope" );
}
 
CComPtr<IDiaEnumTables> pTables;
if ( FAILED( psession->getEnumTables( &pTables ) ) )
{
    Fatal( "getEnumTables" );
}
 
CComPtr< IDiaTable > pTable;
while ( SUCCEEDED( hr = pTables->Next( 1, &pTable, &celt ) ) && celt == 1 )
{
     // Do something with each IDiaTable.
}
 
And the following lines indicate how to translate addr to function lines:
void dumpFunctionLines( IDiaSymbol* pSymbol, IDiaSession* pSession )
{
    ULONGLONG length = 0;
    DWORD isect = 0;
    DWORD offset = 0;
    pSymbol->get_addressSection( &isect );
    pSymbol->get_addressOffset( &offset );
    pSymbol->get_length( &length );
    if ( isect != 0 && length > 0 ) {
        CComPtr< IDiaEnumLineNumbers > pLines;
        if ( SUCCEEDED( pSession->findLinesByAddr( isect, offset, static_cast<DWORD>( length ), &pLines ) ) ) {
            CComPtr< IDiaLineNumber > pLine;
 
:
:
 
 
 

Thursday, May 15, 2014

Today I want to talk about auditing in the context of software. Auditing is not only for security but also for compliance and governance. Audit records are typically hashed or signed, and together they form an audit trail. When different accounts access a protected resource, an audit trail can reconstruct the timeline of actors and their operations. Signing is done with the help of two keys: a private key and a public key. The private key encrypts a message so that the message is not transparent. Anyone with the public key, though, can decrypt it. Of course the public key has to be shared in a trustworthy way, and hence a certification authority is involved as an intermediary. A receiver with the public key can then decrypt the message back to its original content.
Unlike other product features, auditing has a special requirement. It is said that incomplete security is worse than no security. The main concern here is how to audit events and sign them so that they cannot be tampered with. Auditing explains 'who' did 'what' and 'when'. As such it must be pristine. If there is tampering, it lays waste to all the information given so far. Moreover, information for a specific event may be critical; if that information is tampered with, there could be a different consequence than the one intended. Even though audit events are generally not actionable except for regulations and compliance, they are another data point and are often demanded from software systems. Hence auditing is as important a feature as any, if not more so. One additional requirement is that security events may also be a consumer of auditing: information about actions, actors and subjects are also events, and as such are met with the same signing and hashing that auditing does.

One of the concerns with the sensitivity of the information is addressed by preventing any changes to audited events, such as updates or deletes. This, we see, helps guarantee that the data has not been tampered with. But can it be enforced? What mechanism can we put in place for this?
Some techniques include hashing. A hashing function such as SHA256 for example will generate the same hash for the same input. Moreover the input may not be deciphered from its hash. Given a hash and the original from different sources, we can now check whether they are the same.
Another technique is to use the public key cryptography mentioned above. A public key and a private key can be used so that only a selected set of receivers can understand the message. The public key is distributed while the private key is secured. When the public key is handed out, it needs to be trusted as coming from the sender, which is why it is wrapped in a certificate.
If we would like to verify that the certificate did come from the sender, we can reach out to the certificate authority. The private key is left to the issuing party to secure. The keys can be generated with tools, but it is left to that party to secure them. That said, if the users of the software do not interact with the keys generally, or are admins themselves, then this is less severe.
In general its good to be a little paranoid about systems when it comes to security. 

Wednesday, May 14, 2014

Tonight I will discuss some more on function pointer to symbol translation using displacement from base, because we found that such translation seems difficult without the SymInitialize method that takes the handle of the target process. We will look at getting that information from dumps. MSDN says the handle passed to SymInitialize must be the same value passed to all other symbol handler functions called by the process. It is the handle that the functions use to identify the caller and locate the correct symbol information. A process that acts like a debugger should not call GetCurrentProcess to provide the handle and should instead provide the handle of the target process. Thus the handle merely seems to be a token for a context. The other parameters to this function are UserSearchPath and fInvadeProcess. UserSearchPath is the series of paths to be searched for a symbol file; it is an optional parameter. fInvadeProcess, when set, enumerates the loaded modules for the process.
When you try out the SymInitialize with a bogus non-zero handle and follow it with a SymFromAddr call, it returns the error indicating that the handle is invalid.
Now we could use the

DWORD64 WINAPI SymLoadModuleEx(
  _In_  HANDLE hProcess,
  _In_  HANDLE hFile,
  _In_  PCTSTR ImageName,
  _In_  PCTSTR ModuleName,
  _In_  DWORD64 BaseOfDll,
  _In_  DWORD DllSize,
  _In_  PMODLOAD_DATA Data,
  _In_  DWORD Flags
);

but that method also doesn't help us specify the handle of a non-existent process.
So the only option left is to try creating a process with CreateProcess and DEBUG_PROCESS flag to specify that the created process is in debugging mode.

BOOL WINAPI CreateProcess(
  _In_opt_     LPCTSTR lpApplicationName,
  _Inout_opt_  LPTSTR lpCommandLine,
  _In_opt_     LPSECURITY_ATTRIBUTES lpProcessAttributes,
  _In_opt_     LPSECURITY_ATTRIBUTES lpThreadAttributes,
  _In_         BOOL bInheritHandles,
  _In_         DWORD dwCreationFlags, <------------------------------------- DEBUG_PROCESS
  _In_opt_     LPVOID lpEnvironment,
  _In_opt_     LPCTSTR lpCurrentDirectory,
  _In_         LPSTARTUPINFO lpStartupInfo,
  _Out_        LPPROCESS_INFORMATION lpProcessInformation
);


So I tried out with :
// Raw2Sym.cpp : Defines the entry point for the console application.
//

#include "stdafx.h"
#include "windows.h"
#include "dbghelp.h"

#pragma comment(lib, "dbghelp.lib")

int _tmain(int argc, _TCHAR* argv[])
{

//HMODULE hDbgModule = LoadLibrary(L"dbghelp.dll");
if (argc < 3) 
{
printf("Usage: raw2sym base_addr func_ptr func_ptr ...\r\n");
return -1;
}

DWORD  error = 0;
HANDLE hProcess = 0;
bool initialized = false;
hProcess = GetCurrentProcess();
printf ("hProcess = %ld\r\n", hProcess);
_TCHAR wszPath[MAX_PATH+1] = {0};
WCHAR  szImageName[MAX_PATH] = TEXT("foo.exe");
DWORD64 dwBaseAddr = _tcstoui64(argv[1], NULL, 16);
printf("Base Address: %p\r\n", dwBaseAddr);
HANDLE hExe = 0;
HANDLE hFile = 0;
if (!GetCurrentDirectory(MAX_PATH, wszPath))
{
error = GetLastError();
printf("GetCurrentDirectory returned error: %d\r\n", error);
goto exit;
}
else
{
wprintf(L"Current directory is %s\r\n", wszPath);
}

STARTUPINFO si;
        PROCESS_INFORMATION pi;

        ZeroMemory( &si, sizeof(si) );
        si.cb = sizeof(si);
        ZeroMemory( &pi, sizeof(pi) );

        if(!CreateProcess(L"foo.exe", NULL
    NULL, NULL, false,DEBUG_PROCESS, NULL, wszPath, &si, &pi))
{
error = GetLastError();
printf("CreateProcess returned error: %d\r\n", error);
goto exit;
}

hExe = pi.hProcess;

if(!hExe)
{
printf("raw2sym must have splunkd exe and pdb in the same folder as the tool.\r\n");
return -1;
}
else
{
printf ("hExe = 0x%x\r\n", hExe);
}
if (!SymInitializeW(hExe, wszPath, FALSE))
{
// SymInitialize failed
error = GetLastError();
printf("SymInitialize returned error : %d\r\n", error);
goto exit;
}
else
{
initialized = true;
}
        SymSetOptions(SYMOPT_UNDNAME | SYMOPT_DEFERRED_LOADS);

hFile = CreateFile(L"foo.pdb", 0, 0, NULL, 0, 0, NULL);
DWORD64 loadedAddress = SymLoadModuleEx(hExe, hFile, "foo.pdb", NULL,dwBaseAddr , 0, NULL, SLMFLAG_VIRTUAL);
if(loadedAddress == 0)
{
// SymLoadModuleEx failed
error = GetLastError();
printf("SymLoadModuleEx returned error : %d\n", error);
goto exit;
}
else
{
printf("Loaded Address is %p: \r\n", loadedAddress);
}

for (int i = 2 ; i < argc; i++)
{
DWORD64  dwDisplacement = 0;
DWORD64  dwAddress = _tcstoui64(argv[i], NULL, 16);

char buffer[sizeof(SYMBOL_INFO) + MAX_SYM_NAME * sizeof(TCHAR)];
PSYMBOL_INFO pSymbol = (PSYMBOL_INFO)buffer;

pSymbol->SizeOfStruct = sizeof(SYMBOL_INFO);
pSymbol->MaxNameLen = MAX_SYM_NAME;

if (!SymSetScopeFromAddr(hProcess, loadedAddress))
{
DWORD error = GetLastError();
printf("SymSetScopeFromAddr for %s returned error : %d\r\n", argv[i], error);
}

if (SymFromAddr(hProcess, dwAddress, &dwDisplacement, pSymbol))
{
printf("%s\r\n", &(pSymbol->Name));
}
else
{
// SymFromAddr failed
DWORD error = GetLastError();
printf("SymFromAddr for %s returned error : %d\r\n", argv[i], error);
}
}

exit:
if (initialized && hExe)
SymCleanup(hExe);
if (hExe)
CloseHandle(hExe);
if (hProcess)
CloseHandle(hProcess);
if (hFile)
CloseHandle(hFile);
return 0;
}


But that returns an error:
c:\downloads\Raw2Sym\x64\Release>Raw2Sym.exe 000007f79b1b0000 000007f79b1bbb90
hProcess = -1
Base Address: 000007F79B1B0000
Current directory is c:\downloads\Raw2Sym\x64\Release
hExe = 0x34
Loaded Address is 000007F79B1B0000:
SymSetScopeFromAddr for 0 returned error : 6
SymFromAddr for 0 returned error : 6
Error code: (Win32) 0x6 (6) - The handle is invalid.


Instead found a helpful resource here : http://www.debuginfo.com/examples/dbghelpexamples.html
and giving it a try. The code mentioned under SymLoadFromPdb.cpp gives some suggestions but doesn't work either.
c:\downloads\Raw2Sym\x64\Release>Raw2Sym.exe c:\downloads\Raw2Sym\x64\Release\splunkd.pdb 00
1b0000 000007f79b1bbb90
Loading symbols for: c:\downloads\Raw2Sym\x64\Release\splunkd.pdb ...
Error: SymLoadModule64() failed. Error code: 0
A better option might be to use DIA sdk instead as mentioned here : http://msdn.microsoft.com/en-us/library/4sy4ebfw.aspx .
Conceptually, I don't know if any technique will allow this method, since the process/dump needs to be available in memory or on disk to associate the base address and the offset. The instruction set is unchanged after the compile, so the pdb and exe will be the same, and resolving the symbols from the instructions doesn't vary.
As the DIA SDK shows in the article on querying the .pdb: http://msdn.microsoft.com/en-us/library/eee38t3h.aspx

Software support folks in the field often encounter crash dumps at the customer's premises. There they have some dumps but they don't have any access to symbols, particularly when the customer's operating system is Windows. What they see in the dumps or crash log reports are stack frames with raw addresses such as 0x7ff34555, and these are not yet resolved as module!class::method. To investigate the issue, they need to resolve the stack frames. To do this sometimes they have to open the crash dump with a debugger. But even the debugger needs the symbols to resolve the stack frames, and the support folks can't take the symbols to the customer's machine. So there is a round trip involved in taking the dumps to the developers, because the symbols and the raw stack trace are not together. Resolving raw function pointers to symbols in an offline manner is therefore desirable.
This can be achieved by using the same module that the debuggers use to translate the stack trace - dbghelp.dll. This module exports methods such as StackWalk and StackWalk64 that give the stack trace when pointed to the dump. But more interestingly it has the methods SymFromAddr and SymFromAddrW that seem to do just what we want. To use this method we call SymInitialize first so that we have a handle to the process. The method takes the following parameters:
- a handle retrieved from the SymInitialize method
- an Address pertaining to a frame in the stack for which the symbol should be located.
- a displacement from the beginning of a symbol or zero and this is optional
- a buffer to receive the SYMBOL_INFO resolved from the address.
If the function succeeds, it returns true. The SYMBOL_INFO struct has a member called MaxNameLen that should be set to the size of the name in characters. The name can be requested to be undecorated for easy readability.
The code from MSDN is something like this:

DWORD64  dwDisplacement = 0;
DWORD64  dwAddress = SOME_ADDRESS;

char buffer[sizeof(SYMBOL_INFO) + MAX_SYM_NAME * sizeof(TCHAR)];
PSYMBOL_INFO pSymbol = (PSYMBOL_INFO)buffer;

pSymbol->SizeOfStruct = sizeof(SYMBOL_INFO);
pSymbol->MaxNameLen = MAX_SYM_NAME;

if (SymFromAddr(hProcess, dwAddress, &dwDisplacement, pSymbol))
{
    // SymFromAddr returned success
   printf("%s\n", buffer);
}
else
{
    // SymFromAddr failed
    DWORD error = GetLastError();
    printf("SymFromAddr returned error : %d\n", error);
}

When we give some address it should be the relative offset from the base address.

Tuesday, May 13, 2014

In this post we will continue to look at logparser and Splunk. Consider a case where Splunk can translate searches to SQL queries. This would mean that all the operators that Splunk enables on the search bar, such as regex, rex, top, stats, multikv, collect etc., all work without seeing any difference between Splunk indexes and SQL data providers. Splunk would seamlessly parse the search operations on data without importing it into its indexes. In such cases there would have to be a translation of Splunk search operators into LINQ or SQL depending on where the data resides. A cursory look at the operators suggests that the predicates be pushed down as close to the data as possible. In this case, Splunk keeps its index and data operators as close as possible. If the operators were to target data on an external source, there would be several copies of the data and translations involved. This would be similar to the pipe operation in Splunk. Splunk exposes several semantics that work well in pipe operations. This is very helpful to the IT world for administration as well as for automation. What Splunk provides for analysis is significantly improved by its search language. While we can generalize the same operators for other data sources, the search language works well for Splunk data because of the field extraction and event indexing.

LINQ on Splunk


Language-Integrated Query (LINQ) is a set of features introduced in Visual Studio 2008 that extends powerful query capabilities for use by applications. LINQ is widely used in web applications and services in a variety of industries such as finance, retail, telecommunication and insurance. LINQ provides two features. It enables constructs in the language that have become very popular with web developers. And it abstracts the underlying data source while enabling the same query logic to operate against any data source. The language syntax for querying are available as standard query operators and they are very much in nature to SQL ANSI standard which have an established foundation in data storage and retrieval. While SQL is ubiquitously used for relational data, the standard query operators are not restricted to work against a database. LINQ can be executed on XML, plaintext, CSV, databases etc. so long as the standard query operators can see a sequence of objects on which to perform their operations such as selection, conversion, merging etc. This enumeration of the items in a sequence is defined by an interface popularly known as IEnumerable.  The behavior demanded from collections implementing this interface is that the items can be extracted via iterations. At the same time, the data providers implement an interface called the IQueryable that can take an expression tree and execute it on the data source. Together these interfaces connect an application to the data source with the help of powerful querying capabilities.
Splunk also provides programmable interfaces to its data source via what are known as Entities and Collections, which provide a common mechanism for a plethora of resources. Most of the SDK makes REST API calls and hence operates on a single resource or a collection. Entity and EntityCollection both derive from common base class Resource.  Note that this pattern exposes the resources per se and not their querying capabilities.  When objects are instantiated they are local snapshots of values as read and copied from the server. Changes made on the server are not available until a refresh is called. Compare this with the object relational mapping (ORM) software that use LINQ patterns. The ORMs are able to interpret that the state of an object has become ‘dirty’ from local updates and needs to save to disk seamlessly.  At the same time, an observer pattern notifies the application of the updates made to the server data.
Splunk does have its own powerful querying capabilities.  Splunk search operators are similar to Unix style piped operators and are also available for use from the same SDK. These are exposed as searches and jobs and they can take a string containing the search query and execute it on the server. The results are then returned to the caller. Searches can be either one shot or real time.  The one-shot is a synchronous API and it returns a stream. Real-time searches return live events as they are indexed, and this type of search continues to arrive. To view the results from a real-time search, we view the preview results.  Some parameters to the Search Job enable this mode. In the normal execution mode, a Search Job is first created with a search string and then the job is polled for completion. A built-in XML parser can then render the stream of results.
Thus we note that there is a mismatch between LINQ and Splunk search in the way items are retrieved from the server to the application. That said, the power of Splunk can be made available through adapters that follow LINQ patterns.