Monday, June 30, 2014

In the review of the book The Power of Habit mentioned earlier, we covered the author's findings on what a habit is and why we have them. We now look at some of the applications discussed in the book. The author gives examples from successful organizations and describes the habits that matter most as keystone habits. He gives the example of Alcoa, an aluminium manufacturing giant whose fortunes had been dwindling when it brought on a new CEO, Paul O'Neill. O'Neill, a former government bureaucrat, wanted employees to focus on worker safety, touting it as a habit of excellence, and he wanted to make it an indicator of progress throughout the institution. His comment was initially perceived as totally out of context and irrelevant to the situation the company was in, but within a year of his speech Alcoa created record profits, and by the book's account O'Neill grew Alcoa's market capitalization by $27 billion; the value of the stock rose to five times what it had been. The key to his success was that he attacked one habit and watched the changes ripple through the organization. O'Neill observed that some habits are more important than others: they have the power to start a chain reaction. These are called keystone habits. The idea behind keystone habits is that success depends on identifying a few key priorities and fashioning them into powerful levers. Organizational behavioral patterns are quite common; in fact, researchers have found institutional habits in almost every organization or company they have scrutinized. What O'Neill found was that decision making was being ceded to a process that occurred without actual thinking. Consequently the workers were not nimble and the quality of the products was poor. When a previous CEO had tried to mandate improvements, effigies of managers were burned. O'Neill wanted something that everybody, including the unions, would agree was important. By declaring safety as the priority, with a metric of zero injuries, O'Neill had brilliantly mandated that Alcoa become more streamlined. He used a simple cue, a worker injury, to institute an automatic routine in which the unit president had to put in place a plan to never let it happen again. And there was a reward: the only people who got promoted were those who embraced the system. The ripple effects this created were far-reaching: unions embraced productivity measurements, workers were given more autonomy, faulty equipment was replaced, and loss of raw material was reduced with upgrades.
The author says that studies have shown such chain reactions from habits even in family life. Families that eat dinner together seem to raise children with better homework skills, higher grades, greater emotional control and more confidence. Habits create new structures that establish cultures where change becomes contagious.
Habits can be complex and old. Yet every habit is malleable, and it is the power of remaking habits that this book advocates.

Sunday, June 29, 2014

The Power of Habit is a book by Charles Duhigg. The author explores why habits exist and how they can be changed. He includes anecdotes about companies and individuals who struggle to change, as well as those who seem to make the change overnight. The author argues that most of the choices we make today feel like well-considered decisions but are in fact just habits. By understanding how habits work, we can rebuild those patterns in whichever way we choose.
Researchers have found that a structure deep inside the brain called the basal ganglia is responsible for forming habits. They studied rats running through a maze to form their conclusions on how rats internalize the maze. Converting a sequence of actions into an automatic routine is called chunking. There are dozens if not hundreds of behavioral chunks, and they arise because the brain is trying to find ways to save effort. The process is a three-step loop involving a cue, a routine and a reward, which when repeated becomes a habit. Some people's habits are obnoxious. If we want to create new habits, we should look at Claude C. Hopkins, who was in the advertising industry and had a set of rules he coined to create new habits among consumers. Among his rules, he first created a craving to power the habit. He illustrated this by getting America to brush with Pepsodent. The cue he used was tooth film, which was universal and impossible to ignore. He was selling a sensation. To understand next how to change a habit, the author illustrates the golden rule with the example of Tony Dungy, who as a coach wanted to change the behavior of his players. Dungy wanted to attack only the middle step in how habits form: the routine. The cue and the reward are kept the same, but the routine is changed. When this rule is applied correctly, even entrenched habits can be reversed.

Saturday, June 28, 2014

Good to Great is a book written by Jim Collins. The following is a summary review of the book. The title positions it against the author's earlier book, Built to Last, which studied enduring greatness; hence the suggestion that many companies that are merely good never get to the next stage of being great. Here he focuses on why more companies don't do what worked for others, or don't do it more often. He picked eleven companies: Abbott, Fannie Mae, Kimberly-Clark, Nucor, Pitney Bowes, Wells Fargo, Circuit City, Gillette, Kroger, Philip Morris and Walgreens. These companies had to meet the criterion that they had fifteen-year cumulative stock returns at or below the general stock market, punctuated by a transition point, and then cumulative returns of at least three times the market over the next fifteen years. They were selected over comparison companies that either did not show such a leap or did not sustain it. From these selections, the author describes the patterns that emerged.
The first topic is leadership. Contrary to the belief that celebrity CEOs influence market performance with their charisma, these companies had leaders who demonstrated what Collins calls Level 5 leadership. The term refers to those who, unlike the celebrities, combine personal humility with intense professional will. Many such leaders set up their successors for success. Their humility came with a steely resolve and an intolerance for mediocrity; Abbott Laboratories, for example, attacked nepotism in the company to make the leap. The levels from 1 to 5 are graded as follows: level 1 comprises highly capable individuals who make productive contributions through talent, knowledge, skills and good work habits; level 2 comprises contributing team members who work well with others; level 3 is the competent manager who organizes people and resources toward objectives; level 4 is the leader who draws commitment to and pursuit of a clear vision; level 5 is the executive who establishes a vision independent of themselves so the company can move to the next plane.
However, the author argues that vision or strategy is not the primary prerequisite to make the leap; instead, it is getting the right people on the bus before deciding where to drive it. As an example, Wells Fargo continuously hired the best to prepare for changes in its industry. The book claims the following truths:
If we focus on who is with us rather than on the mission, change is going to be easier.
If we have the right people, we spend less time managing them.
If we have the wrong people, we cannot make the most of even the right direction.
Essentially it is about people who practice a culture where they don't worry about their positions.
It also implies not hiring too soon or too fast, and putting the existing people not on the biggest problems but on the biggest opportunities.
Another pattern that emerges from these companies is the notion of disciplined thought. Self-assessment by a company is a prerequisite to knowing what transition to make. Confronting the brutal facts establishes where the weaknesses clearly lie. The author highlights that some leaders were open enough to say "I don't know", and that this is welcome. The point is to evaluate and seek honest feedback often, refining the picture until it becomes clear. Clearer facts are easier to execute on, to measure, and to put into a cycle of improvements. Such facts could even be opened up to the public. To do this the book suggests the following:
- Lead with questions, not answers, so that we discover more than we assume we know.
- Engage in dialogue, not debate, sermon, ridicule or, worse, coercion.
- Review past work without blame and with a resolve to do better.
- Keep metrics and thresholds so that they can translate to actionable alerts.
The book also introduces the hedgehog concept, which divides the world into two groups: the hedgehogs, who translate the complex world into a single idea or principle, and the foxes, who take the complex world for what it is and work on different levels, often missing a unified direction. The hedgehogs win in a match with the foxes.
The use of the hedgehog concept is to illustrate that leaders with a simple mission can help drive the company from good to great. To find this mission, the concept talks about identifying the intersection of three dimensions : what we can be best at, what drives the economic engine, and what we are deeply passionate about.
Another pattern mentioned in this book is disciplined action. This involves building a culture around the idea of freedom and responsibility within a framework, and getting passionate people who will take their responsibilities forward. Adherence to the discipline should be anchored in the hedgehog concept rather than enforced like a religion.
Lastly, the book mentions a pattern of using technology as an accelerator and encourages adopting the right technologies as opposed to merely new ones. In fact, it cautions against the zeal to use new technologies.
With the patterns mentioned, the book suggests that there is a flywheel effect that can result in good to great transformations. The opposite of the flywheel is the doom loop which can be seen in the cases where the patterns are not followed.
  

Friday, June 27, 2014

Having looked at the bottom up approaches to finding sub-networks in a graph, I will now read about the top down approaches from the Hanneman online text. The bottom up approaches help us understand the processes by which actors build networks. The top down approaches, on the other hand, help us find holes, vulnerabilities, or weak spots that define lines of division in a larger group. These help to describe the levels of group selection and the constraints under which actors build networks.
We look at components first in this methodology. Components are subgraphs that have high cohesion within and less adhesion without. Components can be based on weak ties, where the direction of the ties doesn't matter, or on strong ties, which respect direction. If we use a metric that establishes a connection between actors only when they participated in something common, then for a very high threshold we may end up with few connections and many independent actors, while for a lower threshold we may group most of them into a single component. Component membership is therefore too strong a definition for finding weak points.
Blocks and cutpoints offer an alternative approach to finding key weak spots in graphs. If removing a node divides the structure into disjoint parts, that node is called a cutpoint, and it acts as a broker among otherwise disconnected groups, which are called blocks. We can find the blocks by locating the cutpoints. While component analysis looks at missing links, this bi-component analysis looks at vulnerable links.
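As a small illustration separate from the Hanneman text, here is a sketch, assuming the networkx library, that finds components, cutpoints and blocks in a toy undirected graph:

# Sketch: components, cutpoints and blocks on a toy graph (assumes networkx).
import networkx as nx

# Two dense groups joined only through node 4, which acts as a broker.
G = nx.Graph()
G.add_edges_from([(1, 2), (2, 3), (1, 3),          # first group
                  (5, 6), (6, 7), (5, 7),          # second group
                  (3, 4), (4, 5),                  # node 4 bridges the groups
                  (8, 9)])                         # a separate component

# Components: maximal connected subgraphs.
print(list(nx.connected_components(G)))            # e.g. [{1, 2, 3, 4, 5, 6, 7}, {8, 9}]

# Cutpoints: nodes whose removal disconnects a component.
print(list(nx.articulation_points(G)))             # 3, 4 and 5, in some order

# Blocks: the bi-connected pieces separated by the cutpoints.
print(list(nx.biconnected_components(G)))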
Lambda sets and bridges are another alternative approach for the same purpose. This method ranks each relationship in the network by importance, evaluating how much of the flow between nodes goes through each link and how badly the flow would be disrupted if the link were removed.
Factions are ideal groupings where the members are closely tied to one another but not to anybody else. This concept helps us assess the degree of factionalization in the population.
With the bottom up and top down approaches, we find sub-networks in a graph. The groupings or cliques - their number, size and connections can tell us about the behavior of the network as a whole. 

Wednesday, June 25, 2014

A graph of the Knoke information data shows strong symmetric ties. It also answers questions such as: how separate are the sub-graphs? How large are the connected sub-graphs? Are there particular actors that appear to play network roles?
We now look at ways to find sub-graphs. One approach works in a bottom up manner. A clique extends the dyad by adding members that are tied to all the members in the group. The strict definition can be relaxed to include nodes that are not quite so tied, as we will see shortly with n-cliques, n-clans and k-plexes. The notion, however, is to build outward to construct the network. The whole network can then be put together by joining cliques and clique-like groupings.
Formally, a clique is the maximum number of actors who have all possible ties among themselves. It can be considered a maximal complete sub-graph. Cliques can be viewed in conjunction with the Knoke information matrix mentioned earlier. We might be interested in the extent to which these sub-structures overlap and in which actors are more central or more isolated within cliques.
We can examine these by evaluating the actor-by-actor clique co-memberships. Then we can do hierarchical clustering of the overlap matrix, which gives us an idea of the closeness of the cliques. Visually, we can see the co-membership and the hierarchical clustering as matrices formed from actors by cliques and levels by cliques respectively.
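As a small illustration, and not tied to the Knoke data, the following sketch (assuming networkx and numpy) enumerates maximal cliques and builds the actor-by-actor clique co-membership matrix that could then be handed to a hierarchical clustering routine:

# Sketch: maximal cliques and actor-by-actor clique co-membership (assumes networkx, numpy).
import networkx as nx
import numpy as np

G = nx.Graph([(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4), (2, 4)])
cliques = list(nx.find_cliques(G))           # maximal complete sub-graphs
print(cliques)                               # e.g. [[0, 1, 2], [1, 2, 3], [2, 3, 4]]

n = G.number_of_nodes()
comembership = np.zeros((n, n), dtype=int)
for clique in cliques:
    for i in clique:
        for j in clique:
            if i != j:
                comembership[i, j] += 1      # number of cliques both actors share
print(comembership)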
 Two major approaches to relaxing the definition for cliques are the N-cliques and N-clan approaches.
With N = 2, we say that an actor is a member of a clique if it is connected to every other member of the clique at a distance no greater than two. A path distance of two corresponds to the actor being a friend of a friend.
The cliques that we saw before have been made more inclusive by this relaxed definition of group membership. 
The N-clique approach tends to find long, stringy groupings instead of the tight, discrete ones of the original definition. It is possible for actors to be connected through others who are themselves not part of the clique.
To overcome this problem, an additional restriction is imposed on the total span or path distance between any two members. This is the N-clan approach, where all ties are forced to occur by way of other members of the n-clique.
If we are not comfortable with the idea of using a friend of the clique member as a member of the clique, we can use the K-plex approach. In this approach we say that a node is a member of a clique of size n if it has direct ties to n-k members of that clique. This tends to find a relatively large number of small groupings. This shifts focus to overlaps and centralization rather than solidarity and reach.
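A minimal, pure-Python sketch of the k-plex test just described, following the definition above (every member must have direct ties to at least n - k of the group's members):

# Sketch: check whether a set of nodes forms a k-plex (pure Python).
def is_kplex(adjacency, members, k):
    """Every member must have direct ties to at least len(members) - k members of the group."""
    n = len(members)
    for node in members:
        ties = sum(1 for other in members
                   if other != node and other in adjacency.get(node, set()))
        if ties < n - k:
            return False
    return True

# Undirected toy graph as an adjacency dict.
adjacency = {
    1: {2, 3, 4},
    2: {1, 3},
    3: {1, 2, 4},
    4: {1, 3},
}
print(is_kplex(adjacency, {1, 2, 3, 4}, k=2))   # True: everyone has at least 2 ties inside
print(is_kplex(adjacency, {1, 2, 3, 4}, k=1))   # False: nodes 2 and 4 have only 2 ties, not 3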
Courtesy: Hanneman notes

Monday, June 23, 2014

In this post, we will cover an interesting topic : cliques and sub-groups.
In graphs and networks, one common interest is the presence of "sub-structures" within the network. Neighborhoods and groups of nodes fall into this category. When we build large networks bottom up from small compact sub-networks, those compact groupings are known as cliques. In a clique there is generally more interaction among the members than with outsiders.
A clique can be considered a closed elite. We can also look for sub-structure from the top down. The idea that some regions of a graph may be less connected than the whole may lead to insights into cleavage and division. Weaker parts in the social fabric also create opportunities for brokerage and less constrained action. Most computer algorithms for locating sub-structures operate on binary symmetric data. We can use the Knoke information exchange data to illustrate these sub-networks with strong ties. The Knoke information exchange can be viewed as binary connectedness values on the adjacency matrix of a directed graph.

I'm taking a short break today as I'm spending my time on another task from work.
The use of matrices to describe social relations is as follows:
The Transform block operation allows us to select a matrix to be blocked, a row and/or column partition, and a method for calculating the entries in the resulting blocks. We first specify the row and column partitions; these are just data sets which we then group to form partitioned data sets. The operation requires a method for summarizing the information within each block and outputs two new matrices. The pre-image data set contains the original scores, but permuted. The reduced image data set contains a new block matrix holding the block densities.
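A rough numpy sketch of the block idea, computing block densities for a given row/column partition of a binary adjacency matrix (the layout here is illustrative, not the tool's own output format):

# Sketch: block densities for a partitioned binary adjacency matrix (assumes numpy).
import numpy as np

A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 1, 0],
              [1, 1, 0, 0, 1],
              [0, 1, 0, 0, 1],
              [0, 0, 1, 1, 0]])

partition = [0, 0, 0, 1, 1]      # actors 0-2 in group 0, actors 3-4 in group 1
groups = sorted(set(partition))
blocks = np.zeros((len(groups), len(groups)))

for gi in groups:
    rows = [i for i, g in enumerate(partition) if g == gi]
    for gj in groups:
        cols = [j for j, g in enumerate(partition) if g == gj]
        block = A[np.ix_(rows, cols)]
        blocks[gi, gj] = block.mean()   # block density = mean of the 0/1 entries
print(blocks)                           # the "reduced image" of the blocked matrix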
The Transform collapse method allows us to combine rows and/or columns by specifying which elements are to be combined and how. Combinations can be maximum, minimum and sum. The result of the combinations is a new matrix with specified operation performed.
The Data -> Permute operation allows us to re-arrange the rows and/or columns and/or matrices. This operation requires us to list the new order needed.
The Data -> Sort operation re-arranges the rows, columns or both of the matrix according to a criterion we select.
The Data-> Transpose re-arranges the data in a way that is very commonly used in matrix algebra and switches the columns with the rows.

Sunday, June 22, 2014

In tonight's blog post, we revert to the discussion on graphs and matrix operations. We talked about matrix multiplication, which for a row vector (a1, a2, ..., an) and a column vector (x1, x2, ..., xn) results in a1x1 + a2x2 + ... + anxn. Note that the matrix product can also be written as A(b1, b2, ..., bn) = (Ab1, Ab2, ..., Abn); that is, the matrix A acts separately on each column of B.
Row reduction is another operation.
This can be done in one of three different ways :
1) Multiplying a row by non-zero scalar.
2) adding a multiple of one row to another
3) swapping the position of two rows.
Each of these steps is reversible, so if you go from one matrix to another you can undo the change. Matrices related by such steps are called row-equivalent.
A matrix is said to be in row-echelon form if any rows made completely of zeroes lie at the bottom of the matrix and the first non-zero entries of the other rows form a staircase pattern: the first non-zero entry of the (k+1)-th row is to the right of that of the k-th row.
If the matrix is in row-echelon form, then the first non-zero entry of each row is called a pivot, and the columns in which pivots appear are called pivot columns. It is always possible to convert a matrix to row-echelon form. The standard algorithm is called Gaussian elimination or row reduction.
 The rank of a matrix is the number of pivots in its reduced row-echelon form.
The solution to Ax = 0 gives the pivot variables in terms of the free variables.
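A small numpy sketch of row reduction to reduced row-echelon form; the pivots found along the way give the rank, as noted above:

# Sketch: Gaussian elimination to reduced row-echelon form (assumes numpy).
import numpy as np

def rref(A, tol=1e-12):
    M = A.astype(float).copy()
    rows, cols = M.shape
    pivot_cols = []
    r = 0
    for c in range(cols):
        if r >= rows:
            break
        # Pick the row with the largest entry in column c as the pivot (partial pivoting).
        pivot = np.argmax(np.abs(M[r:, c])) + r
        if abs(M[pivot, c]) < tol:
            continue                      # no pivot in this column; it stays a free column
        M[[r, pivot]] = M[[pivot, r]]     # swap rows
        M[r] = M[r] / M[r, c]             # scale the pivot row
        for i in range(rows):
            if i != r:
                M[i] = M[i] - M[i, c] * M[r]   # eliminate the column elsewhere
        pivot_cols.append(c)
        r += 1
    return M, pivot_cols

A = np.array([[1, 2, 3],
              [2, 4, 6],
              [1, 1, 1]])
R, pivots = rref(A)
print(R)
print("rank =", len(pivots))              # 2 for this matrix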

Saturday, June 21, 2014


In today's post we take a short break to discuss spatial and temporal control in a storage manager. First we discuss spatial control. In terms of access to and from disk, sequential access is ten to a hundred times faster than random access, and often more. Disk density has been doubling, and bandwidth increases roughly as the square root of density. As a result, storage managers often organize large data so that it can be accessed sequentially. In fact, database management systems have traditionally exercised full control over how the database is partitioned on disk.
The best way for a storage manager such as a DBMS to do that is to lay the data out on the disk itself and avoid the file system. This works especially well when raw disk device addresses correspond to physical proximity. But avoiding the file system has the following drawbacks. First, it requires attention from a DBA and dedicates resources such as an entire disk partition. Second, raw disk access is often OS specific and introduces portability concerns. Third, logical file managers such as RAID and SAN have become popular. Because these interceptors virtualize raw disk addresses, alternatives have to be considered. For example, we can create a very large file on disk and manage offsets within it. This is facilitated by the file system and also by virtualized storage managers. This performs well enough that the degradation from using a single large file on a commercial system was found to be about 6%.
We now look at temporal control, or buffering, which is about when the data gets written rather than where it gets written. If the DBMS wants to control when to write, the OS's own buffering mechanisms can get in the way. Some of the challenges include the difficulty of guaranteeing the ACID transaction promise: the transaction manager cannot guarantee atomic recovery after software and hardware failures without explicitly controlling the timing and ordering of disk writes. For example, writes to the log device must precede the corresponding writes to the database device, as in the write-ahead logging protocol. The second set of concerns with OS buffering is about performance rather than correctness. The file system wants to read ahead speculatively and write behind with delayed batch writes, which are poorly suited to a storage manager. A DBMS, for instance, can predict its own IO decisions based on future read requests, such as when reading ahead to scan the leaves of a B+ tree. Similarly, when writing, it can control when to flush the log tail by making decisions about locks or IO throughput. Finally, there may be double buffering and the CPU overhead of memory copies. Copies degrade performance by contributing to latency, consuming CPU cycles and potentially flooding the CPU cache.
Courtesy: Paper by Hamilton, Krishnamoorthy et al.
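As an aside (a sketch of my own, not from the paper), the write-ordering constraint above, forcing the log record to stable storage before the corresponding database page, might look like this:

# Sketch: write-ahead logging ordering -- flush the log before the data page (illustrative only).
import os

def append_log_record(log_path, record: bytes):
    with open(log_path, "ab") as log:
        log.write(record)
        log.flush()
        os.fsync(log.fileno())        # the log record must reach stable storage first

def write_page(db_path, offset: int, page: bytes):
    with open(db_path, "r+b") as db:
        db.seek(offset)
        db.write(page)
        db.flush()
        os.fsync(db.fileno())

# The ordering is the point: log first, then the database page.
# append_log_record("wal.log", b"UPDATE page 7 ...")
# write_page("data.db", 7 * 4096, b"\x00" * 4096)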

Friday, June 20, 2014

In this post I will review matrix operations on graphs. When we represent graphs as matrices, we can use mathematics and software tools to find patterns. This is easier than visually analyzing a complicated graph. The matrix representations are usually square, but they can also be three dimensional, consisting of rows, columns and levels or slices. Matrices can also be symmetric or asymmetric. In a symmetric matrix the value in row i, column j is the same as in row j, column i. As an example of an asymmetric matrix, consider social relations where Jack may have a liking for Jill but Jill may not have a liking for Jack.
Operations permitted on a matrix include the following:
Permutation: the rows and columns are rearranged, for example in a symmetric matrix, to reveal patterns. These reorderings do not affect the pairwise relationships (see the sketch after this list).
Partitioning is another operation, where the matrix is split into blocks or images.
Embedding is another operation, one that helps find groups of nodes that participate in different patterns.
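Going back to the permutation operation above, a tiny numpy sketch: reorder rows and columns together so the ties themselves are unchanged.

# Sketch: permuting rows and columns of an adjacency matrix together (assumes numpy).
import numpy as np

A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])
order = [2, 0, 1]                 # new ordering of the actors
P = A[np.ix_(order, order)]       # same ties, just relabeled and reordered
print(P)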

Thursday, June 19, 2014

In the previous post, we described the Chinese Whispers clustering algorithm. We now study the algorithm with two different approaches. The first approach is what we discussed earlier using small-world graphs and edge weights. The outcome is similar to the min-cut algorithm in that dense regions are considered clusters, but unlike min-cut there is no hierarchical clustering; it gives a flat partition instead. It does not require any threshold as an input parameter and is more efficient. The second approach is based on matrix operations. The algorithm can be considered a special case of Markov Chain Clustering (MCL). MCL is the parallel simulation of all possible random walks up to a finite length on a graph G. Walks will usually terminate in the same cluster they started from; this is the observation MCL builds on. MCL simulates flow in the graph by repeatedly updating the transition probabilities between all nodes, eventually converging to a transition matrix after k steps. The updates are done in two steps which are then iterated. The first is the expansion step, which involves matrix multiplication of the graph matrix with the current transition matrix. The second is the inflation step, in which the contrast between small and large transition probabilities is increased and the columns are normalized to unity, providing a way to segment the matrix. The inflation step operates column-wise.
The k matrix multiplications of the expansion step carry a significant cost.
In the early stages of the iterations, the matrices operated on are dense. But as the iterations proceed and with the help of a strong inflation operator, matrices in the later steps tend to be sparse. As an optimization the inflation step cost can be lowered by using only a few entries per column and can even be reduced to a single largest value.
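To make the expansion and inflation steps concrete, here is a rough numpy sketch of MCL in a common formulation (expansion by squaring the column-stochastic matrix, inflation by element-wise powering followed by column normalization); an illustration rather than the canonical implementation:

# Rough sketch of Markov Chain Clustering (MCL) expansion and inflation (assumes numpy).
import numpy as np

def normalize_columns(M):
    return M / M.sum(axis=0, keepdims=True)

def mcl(adjacency, inflation=2.0, iterations=20, self_loops=1.0):
    A = adjacency.astype(float) + self_loops * np.eye(len(adjacency))
    T = normalize_columns(A)                     # column-stochastic transition matrix
    for _ in range(iterations):
        T = T @ T                                # expansion: simulate longer random walks
        T = normalize_columns(T ** inflation)    # inflation: sharpen large vs small probabilities
    return T

A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]])
T = mcl(A)
# Rows with non-zero entries act as attractors; columns sharing an attractor form a cluster.
print(np.round(T, 2))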
The Chinese Whispers algorithm can now be described in matrix operations as follows:
Lets start with the identity matrix of dimension n
for each of the t iterations
   we run a maxrow operator on all entries of a row to zero except the largest entry which is set to 1
    then we multiply this resulting matrix with the adjacency matrix of the graph and use the result as the input to the next iteration.
Since the time complexity depends on the number of non-zero entries in the result of the max-row operation, in the worst case, a fully connected graph, it could equal the time complexity of MCL.
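Following the matrix description above, here is a rough numpy sketch; this is my own reading of it, writing the update as D <- maxrow(A D) so that each row of D keeps one class label per node:

# Sketch of the matrix formulation of Chinese Whispers described above (assumes numpy).
import numpy as np

def maxrow(M):
    """Zero every entry of each row except the largest, which is set to 1."""
    out = np.zeros_like(M)
    out[np.arange(M.shape[0]), M.argmax(axis=1)] = 1.0
    return out

def chinese_whispers_matrix(A, iterations=10):
    D = np.eye(len(A))                    # start from the identity: every node its own class
    for _ in range(iterations):
        D = maxrow(A @ D)                 # strongest class among the neighbors wins
    return D.argmax(axis=1)               # column index = class label per node

# Two triangles joined by a single edge.
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)
print(chinese_whispers_matrix(A))         # e.g. [0 0 0 2 2 2]: the two triangles split apart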
We continue our discussion of the Chinese Whispers graph clustering algorithm. It is a bottom up clustering algorithm where the nodes start out with different class labels and coalesce to the strongest class label in their neighborhood. We determine the strongest label as the one with the maximal sum of edge weights to the node. The clustering usually proceeds for a small number of iterations. Clustering here is different from clustering vectors because there is no distance metric and consequently no centroid or average; similarity is measured in terms of edges. Chinese Whispers is one such graph clustering algorithm and is often used with small-world graphs. A small-world graph is a graph in which most nodes are not neighbors of each other and yet most nodes can be reached from every other by a small number of hops. As an example of a small-world graph, consider a set of nodes on the circumference of a circle.
In clustering small-world graphs, a random mutation rate is sometimes introduced that assigns new classes with decreasing probability. This greatly improves convergence in early iterations. If we take a graph with eleven nodes arranged as two connected sub-graphs with a few edges between them to form a small-world graph, this technique converges the class labeling in just two iterations, because the mutation acts as a perturbation that then facilitates the proper labeling.
The algorithm does not cross component boundaries because there are no edges between nodes belonging to different components. This means that nodes that are not connected by any edges are discarded by this algorithm which leaves a portion of the nodes unclustered.
We see that the algorithm does not always converge, as in the case of a node with a tie: it could be assigned one class label or another with the same edge weights. This wouldn't happen if the edges carried real-valued weights. Also, the classes generally don't change much after a few iterations.
Note that this simple technique does not take any parameters, and the class labels are hard. If soft labels are required, we can assign a class distribution. This can be done as a final step by taking a weighted distribution of the hard labels.

Tuesday, June 17, 2014

In the JoBimText project, Gliozzo et al. introduced an interactive visualization component. JoBimText is an open source platform for large scale distributional semantics based on graph representations. A distributional thesaurus is computed from bipartite graphs of words and context features. For every sentence, a context is generated for semantically similar words. Then the capabilities of the conceptualized text are explored in an interactive visualization.
The visualization can be used as a semantic parser as well as disambiguation of word senses that is induced by graph clustering.
The paper takes the view that the meaning of a text can be fully defined by semantic oppositions and relations between words. To get this knowledge, co-occurrences with syntactic contexts are extracted from a very large corpus. The approach does not use a quadratic-time computation of word similarities; instead it uses a MapReduce formulation. The advantage is that the MapReduce algorithm works well on sparse contexts and scales to large corpora.
The result is a graph that connects the most discriminative contexts to terms, with explicit links between the most similar terms. This graph represents a local model of semantic relations for each term; compare this with a model of semantic relations over fixed dimensions. In other words, these are ego networks. The paper describes how to compute a distributional thesaurus and how to contextualize distributional similarity.
The JoBim framework is named after the observed pairs of terms (Jos) and context features (Bims) connected by edges. The operation of splitting observations into JoBim pairs is referred to as the holing operation.
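To make the holing operation concrete, a tiny sketch of my own (not JoBimText code) that splits a dependency observation into (term, context) pairs with the @ placeholder:

# Sketch: the "holing" operation on a dependency observation (illustrative only).
def holing(relation, head, dependent):
    """Split one observation into (term, context) pairs, marking the term's slot with @."""
    return [(head,      (relation, "@", dependent)),
            (dependent, (relation, head, "@"))]

# (nsubj; gave-2; I-1)  ->  <gave-2, (nsubj; @; I-1)> and <I-1, (nsubj; gave-2; @)>
for term, context in holing("nsubj", "gave-2", "I-1"):
    print(term, context)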
The significance of each pair (t,c) is computed and then only the p most significant pairs are kept per term t, resulting in the first order graph. The second order graph is extracted from the similarity between two Jos, based on the number of salient features the two Jos share. The similarity over the Jos defines a distributional thesaurus, and the paper says this can be computed efficiently in a few MapReduce steps and is claimed to be better than other measures. This part can be replaced by any other mechanism as well, as the paper proceeds to discuss contextualization, which as we know depends a lot on smoothing. There are many term-context pairs that are valid yet may not be represented in the corpus. To find similar contexts, they expand the term arguments with similar terms. But the similarity of these terms depends on context, so the paper leverages joint inference to expand terms in context using marginal inference in a conditional random field (CRF). The CRF works something like this: a particular expansion x is defined as a single definite sequence of either original or expanded words. Its weight depends on the degree to which the term-context associations present in this sentence are found in the corpus, as well as on the out-of-context similarity of each expanded term to the corresponding term in the sentence. The proportion of the latter to the former is specified by a tuning parameter.

We will now look at word sense induction, disambiguation and cluster labeling. With the ranking based on contextually similar terms, there is some implicit word sense disambiguation. But this paper addresses it explicitly with word sense induction.

OK so we will cover some more of this in tonight's post.
The authors mention a WSI technique and use information extracted by IS-A patterns to label clusters of terms that pertain to the same taxonomy or domain. The aggregated context features of the clusters help attribute the terms in the distributional thesaurus with word cluster senses, and these are assigned in the context part of the entry.
The clustering algorithm they use is the Chinese Whispers graph clustering algorithm, which finds the number of clusters automatically. The IS-A relationships between terms and their frequencies are found with part-of-speech patterns. This gives us a list of IS-A relationships between terms and their frequencies. Then we find clusters of words that share the same word sense. The aggregate for the IS-A relationships of each cluster is found by summing the frequency of each hypernym and multiplying this sum by the number of words in the cluster that elicited the hypernym. This yields labels for the clusters that are taxonomic and provide an abstraction layer over the terms. For example, jaguar can be clustered into the cat sense or the car sense, and the highest scoring hypernyms provide a concise description of these senses. The occurrences of ambiguous words in context can now be disambiguated to these cluster senses.
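A small sketch of that aggregation rule as I read it (sum the IS-A frequencies over the cluster, multiply by the number of cluster words that elicited the hypernym); the counts here are invented for illustration:

# Sketch: labeling a sense cluster from IS-A (hypernym) counts, as described above.
from collections import defaultdict

def label_cluster(cluster_words, isa_counts):
    """isa_counts maps word -> {hypernym: frequency} harvested from IS-A patterns."""
    total = defaultdict(int)
    support = defaultdict(int)
    for word in cluster_words:
        for hypernym, freq in isa_counts.get(word, {}).items():
            total[hypernym] += freq
            support[hypernym] += 1          # how many cluster words elicited this hypernym
    scores = {h: total[h] * support[h] for h in total}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

isa_counts = {
    "jaguar":  {"cat": 20, "car": 15, "animal": 10},
    "leopard": {"cat": 30, "animal": 12},
    "tiger":   {"cat": 25, "animal": 18},
}
print(label_cluster({"jaguar", "leopard", "tiger"}, isa_counts))
# e.g. [('cat', 225), ('animal', 120), ('car', 15)] -> "cat" labels the cluster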
The visualization for the expansion of the term context pairs uses the Open Domain Model which is trained from newspaper corpora.

We will next talk about the interactive visualization features, but before we do that let us first talk about the Chinese Whispers clustering algorithm.

Chinese Whispers is a randomized graph clustering algorithm (Biemann) that partitions the nodes of weighted, undirected graphs and runs in time linear in the number of edges. The name is derived from a children's game in which players whisper words to each other; the game's goal is to arrive at some funny derivative of the original message by passing it through several noisy channels.
The CW algorithm aims at finding groups of nodes that broadcast the same message to their neighbors.
The algorithm proceeds something like this:
First, assign all the vertices to different classes.
while there are changes,
     for all the vertices, taken in randomized order:
       class(v) = highest ranked class in the neighborhood of v;
The nodes are processed for a small number of iterations and inherit the strongest class in the local neighborhood, that is, the class whose sum of edge weights to the current node is maximal.
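A compact, pure-Python sketch of this procedure on a weighted, undirected graph given as an adjacency dict (a sketch of the idea, not Biemann's reference implementation):

# Sketch: Chinese Whispers label propagation on a weighted, undirected graph.
import random
from collections import defaultdict

def chinese_whispers(adjacency, iterations=20, seed=42):
    random.seed(seed)
    labels = {node: node for node in adjacency}          # every node starts in its own class
    nodes = list(adjacency)
    for _ in range(iterations):
        random.shuffle(nodes)                            # randomized processing order
        for node in nodes:
            weights = defaultdict(float)
            for neighbor, weight in adjacency[node].items():
                weights[labels[neighbor]] += weight      # sum edge weights per neighboring class
            if weights:
                labels[node] = max(weights, key=weights.get)   # inherit the strongest class
    return labels

# Two dense groups joined by a single weak edge.
adjacency = {
    "a": {"b": 1, "c": 1}, "b": {"a": 1, "c": 1}, "c": {"a": 1, "b": 1, "d": 0.1},
    "d": {"c": 0.1, "e": 1, "f": 1}, "e": {"d": 1, "f": 1}, "f": {"d": 1, "e": 1},
}
print(chinese_whispers(adjacency))    # typically two classes: {a, b, c} and {d, e, f}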


Having discussed centrality measures in social graphs, let us look at some more social graph techniques. I'm going to discuss ego networks. An ego network takes the individual as the focus. But before we delve into that, I will review the idea of embedding in a network. Embedding is the extent to which individuals find themselves in dense, strong ties. Such ties are reciprocated and suggest a measure of constraint on individuals arising from the way they are connected to others. This gives us an idea of the population and some of its subtleties, but not so much about the positives and negatives facing the individual. If we look at the context of an individual, then we are looking at a local scope, and this is the study of ego networks. Ego means an individual focal node of the graph; there are as many egos as there are nodes. An ego can be a person, a group, or an organization. The neighborhood of an ego is the collection of all nodes around it to which there is a connection or a path. The connections need not be one step, but they usually are. The boundaries of an ego network are defined in terms of this neighborhood. Ego networks are generally undirected because the connections are symmetric. If the directions differ, then it is possible to define an in-network and an out-network: the ties in the in-network point inward and those in the out-network point outward. The strength and weakness of ties can be expressed as probabilities, and using these we can define thresholds for the ones we want to study. Finding these ties is a technique referred to as data extraction; a second technique is subgraph extraction. The density of each ego network can then be reported as an index for each ego in the dataset. We will review more graph techniques shortly.
Courtesy: Hanneman online text.
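As a quick, separate illustration of the ego-network idea above (not from the Hanneman text), here is a sketch using networkx to extract a one-step ego network and measure its density:

# Sketch: extracting a one-step ego network and measuring its density (assumes networkx).
import networkx as nx

G = nx.Graph()
G.add_edges_from([("ego", "a"), ("ego", "b"), ("ego", "c"),
                  ("a", "b"), ("c", "d"), ("d", "e")])

ego_net = nx.ego_graph(G, "ego", radius=1)     # ego plus its one-step neighborhood
print(ego_net.nodes())                          # ego, a, b, c
print(nx.density(ego_net))                      # ties present divided by ties possible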

Monday, June 16, 2014

The benefit of registering a handler to create tracker events, as described in the previous posts, is that it can be invoked from different sources such as the UI, the CLI (command line interface) and configuration files. The handler adds, deletes and lists the trackers. Trackers are listed based on the pipeline data they are registered for. We start with a global tracker, so we may want to list just one.
In the previous two posts we discussed adding a tracer/marker that carries tracking and operation log information to the event data in Splunk. We mentioned that all events have host and timestamp information, and we covered why we want this special event. The crux of the implementation is the introduction of these special events into the data pipeline. To create the event, we follow the same mechanism as we do for audit events.
Before :
            1       2       3
           __      __      __
          |__|    |__|    |__|               Events --->
------------------------------------

After :
            1       2    Tracker     3
           __      __      __       __
          |__|    |__|    |__|     |__|      Events --->
------------------------------------

When we generate these events we want to inject them locally into each stream. But we could start with a global tracker that traces the route of the data between the various instances in a deployment. The events make their way to the internal index. The event is written by creating an instance of the pipeline data from a pre-specified model. Fields are added to the event to make it conform to a regular event.
These are then sent on their way to a queue. Initially we could send the events directly to the queue serviced by an indexing thread, but in order not to hold up that queue or deadlock ourselves, we could create a separate queue that feeds into it.


Sunday, June 15, 2014

In addition to the previous post, where we described in some detail how trackers work when they are injected into a pipeline for data in Splunk, let us now look at the changes to the indexer to support adding labels or trackers in the data pipeline. When the data flows into an input processor, it is typically initiated by a monitor or a processor in the forwarder. When it arrives at the indexer, it is typically handled by one or more processors that apply the configurations specified by the user. When the raw data is written to the Splunk database, it is received by an Indexed Value Receiver that writes it to the appropriate index. As an example, if we take the signing processor that encrypts the data before it is written to the database, we see that it takes the data from the forwarder and forwards it to the Splunk database. Here the data is changed before it makes its way to the database.
Addition of this tracker is via a keyword that can be added to any existing pipeline data. Adding a keyword does not need to be user visible; it can appear in the transformation or expansion of user queries or inputs. For example, at search time the keyword tracker can be used to display the tracking data that was injected into the data stream. This can be implemented by a corresponding search processor. Similarly, on the input side, a corresponding processor can inject the trackers into the data pipeline. The data might go to different indexes, and it's debatable whether this data should go only into the internal index or into the one specified for the data. The choice of the internal index lets us persist the trackers separately from the data, so that the handlers for the different processors do not affect the data itself, and this requires little code change. On the other hand, if we keep the trackers and the data in the same index, then almost all components reading the data from the index will need to differentiate between the data and the trackers. Associations between the data and the trackers, even if kept separately, are easy to maintain because we keep track of timestamps, and trackers pertaining to an input can show where they are inlined with the data by means of those timestamps.
IndexProcessor has methods to retrieve buckets and events by timestamps. We add a method to IndexProcessor to add the tracker to internal buckets. 

Saturday, June 14, 2014

Why do we discuss a tracer in the event data if all data carries hostname and timestamp fields? We will answer this shortly, but tracers let us generate artificial markers of varying size that test the flow of data in production without affecting, or requiring any user changes to, the existing data or its manipulation. In other words, we add new data with different payloads. As an indicator of the payload, it could journal (yes, I'm using that term) not only which hosts are participating at what time but also record all operations taken on that data flow, so that we can check whether the record of operations matches the final result on the data. If the tracer weren't there, we would have to identify the machines that participated and comb the logs on those machines to reconstruct the timeline, whereas these records are available automatically. This is no replacement for the loggers of existing components, just an addition that lets each processor indicate what it executed.
There are three components to enabling a tracer in the event data for Splunk. First is the forwarder which creates the event.  Second is the indexer which indexes the data. Third is the search-head or peer that searches the data.
To create a tracer in the data, an admin handler is registered. This handler exposes the following methods:
to create/update
to delete
to list
and others such as reload.
The tracer can be triggered via the UI, a conf file and even the CLI.
The intention of the tracer is to show how the data is flowing and can be put in any pipeline.
Therefore it's a processor that can be invoked on different sources and streams. However, we will deal with creating a dedicated data flow for the tracer that can be started from any forwarder.
To add the processor and the admin handler, we just implement the existing interfaces.
Enabling the indexer to add its data is slightly different.
We will cover that in more detail shortly.
First, the data has to be labeled. We can create a custom data record with a header that identifies this type of data uniquely.
The second thing is to create a record for each host as the data makes its way through: each host adds a separate entry with its hostname and timestamp. Further details can be added later as appropriate.
The destination index for this kind of data should always be internal since this is for diagnostics. The destination could be switched to nullQueue if specified but this is not relevant at this time. 
The third thing is to create a mechanism to turn this on and off. This could be done via the controls that the admin handler processes.
Initially, there needs to be only one tracer for a Splunk instance, but this can be expanded to different inputs as desired. The config section entry that specifies it globally for the Splunk instance can also be specified locally for any input.
The presence of an entry in the config for tracer indicates that Splunk will attempt to send a tracer data every 15 minutes through its deployments to its indexes which can be viewed globally.
The tracer data is very much like audit data except for being more public and user friendly. It has information that enables a holistic view of all actively participating Splunk instances.


Friday, June 13, 2014

How do we implement tracer event data in Splunk?
First, we create a modular input that can take a special kind of data. We know the data structures to do this because we have seen how to use them before.
Second, we need a destination or sink where this data will be indexed. We should not be using the null queue since these tracers carry useful information and gathering them over time will help with statistics.
The information is added per node during transit, and different types of nodes, such as search peers, indexers or forwarders, can stamp their node type.
The overall implementation is just a class declaration and definition that registers itself as a processor for all incoming data and kicks into action only when a specific kind of data is encountered.
The data is very simple and has a reserved field used to identify the tracer.
The payload in the tracer consists of a simple structured data involving the timestamp, the node type, the node id, and duration of time spent.
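Purely as an illustration of that payload shape (not Splunk code; the class and field names here are hypothetical), a tracer record that each node stamps as the data passes through might look like:

# Hypothetical sketch of a tracer payload; class and field names are illustrative, not Splunk's.
import json
import time

class TracerRecord:
    def __init__(self, trace_id):
        self.trace_id = trace_id
        self.hops = []

    def stamp(self, node_id, node_type, duration_ms):
        # Each participating node appends its own entry as the data transits it.
        self.hops.append({"timestamp": time.time(),
                          "node_id": node_id,
                          "node_type": node_type,      # forwarder, indexer, search peer, ...
                          "duration_ms": duration_ms})

    def to_event(self):
        return json.dumps({"trace_id": self.trace_id, "hops": self.hops})

record = TracerRecord("trace-001")
record.stamp("fwd-1", "forwarder", 2.5)
record.stamp("idx-1", "indexer", 14.0)
print(record.to_event())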
Also, in any topology, the tracer data will flow from one source to one destination. For multicast, many more copies of the tracer are made; once they are all indexed we can group them. The final destination for the data can be one index or all indexes; in other words, we flood the topology to cover all the indexers.
Where the tracer differs from the existing heartbeat functionality is that it covers the entire route rather than just adjacent source-destination pairs. A tracer is triggered to flow through a path consisting of one or more nodes. It should be triggered by the user or by periodic runs.

Today we look at a little bit more on this feature. We want to see how the parser, processor, and indexer will treat this data.
Let's start with the parser.

Today I will implement this some more.

Thursday, June 12, 2014

Tracers in distributed computing.
When there are many peers participating in a task, sometimes the only way to determine the actions taken by each is to send a tracer through the system. This is the same functionality that is so niftily used in networking to find the participating hosts on a route; I'm referring to the traceroute tool. Traceroute lists the hosts in the order the packets travel, identifying each hop by its IP address. When we are interested only in source-destination reachability, we use the ping tool. The ping tool sends multiple echo requests to the destination server and calculates the packet loss; if the other side is responding properly, we can collect the responses. The traceroute tool, on the other hand, is a way to find what path is taken to reach the destination, discovering it one hop at a time. This comes in useful for, say, DNS path resolution.
In distributed computing however, we use tracers as markers to see how the diffusion propagates through the system. In gossip for example, we can use a dummy message to see the participants involved in sending the message.
The gossip propagates by selecting peers at random. Each peer, when selected and handed a piece of information, can update it. The goal in gossip is to spread the information as quickly as possible, and this can happen even over an unreliable network.
Gossip protocols can also aggregate. To compute the aggregates, the values from the nodes are combined to arrive at a system-wide value. The key idea in aggregation is that the aggregate must be computed through fixed-size pairwise exchanges. The exchanges terminate after a few rounds, usually by the end of an all-to-all information flow. Gossip algorithms can also arrange the nodes in a tree and compute aggregates such as sum or count by gossiping in a pattern that matches the tree.
Typically what we stamp in an information dissemination protocol such as gossip is the node id, a timestamp and the actual message pertaining to the information. In our case the information is a reserved message that instructs the nodes to add their stamp to it and forward it on.
The node ids could be a sorted list and each node adds its entry to the list provided. When a node sees two sorted lists, it merges them.
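A toy sketch of that bookkeeping (illustrative only): each node keeps a sorted list of participant ids and merges the list it receives with its own.

# Sketch: merging the sorted participant lists exchanged during gossip (pure Python).
import heapq

def merge_participants(mine, received):
    """Merge two sorted id lists, dropping duplicates."""
    merged = []
    for node_id in heapq.merge(mine, received):
        if not merged or merged[-1] != node_id:
            merged.append(node_id)
    return merged

print(merge_participants(["n01", "n03", "n07"], ["n02", "n03", "n05"]))
# ['n01', 'n02', 'n03', 'n05', 'n07']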

In today's post I want to take a break and talk about exception translator.
In Windows, we have a method called _set_se_translator, and it provides a convenient way to handle structured exceptions as C++ typed exceptions. First, a C++ exception wrapper class is defined that can be used to attribute a specific class type to a structured exception. Each time a structured exception is raised, the custom translator function that was installed is called by the internal exception handling mechanism, and it translates any such exception into the registered C++ type.
The translator function, however, should not log to a file, because generally speaking we don't want to delay the exception handling.
In a multithreaded environment, translator functions are maintained separately for each thread. This means we can use thread local storage to save state. At the same time, if this is done for a worker pool where the workers participate in different tasks, the translator may need to be toggled for each such task. _set_se_translator works with /EHa only.
The translator function takes an unsigned integer and a pointer to a Win32 _EXCEPTION_POINTERS structure as arguments; these correspond to the return values of the Win32 APIs GetExceptionCode() and GetExceptionInformation().


As an example


// Compile with /EHa so that structured exceptions can be caught as C++ exceptions.
#include <windows.h>
#include <eh.h>
#include <cstdio>

// C++ wrapper class used to give a structured exception a C++ type.
class SE_Exception
{
    unsigned int code;
public:
    explicit SE_Exception(unsigned int c = 0) : code(c) {}
    unsigned int getCode() const { return code; }
};

void translator(unsigned int, EXCEPTION_POINTERS*);

// Raises a structured exception (integer divide by zero) for illustration.
void func()
{
    volatile int zero = 0;
    volatile int result = 5 / zero;
    (void)result;
}

int main()
{
    try
    {
        _set_se_translator(translator);
        func();
    }
    catch (const SE_Exception& e)
    {
        printf("Caught SE exception, code 0x%x\n", e.getCode());
    }
}

void translator(unsigned int u, EXCEPTION_POINTERS* p)
{
    printf("In translator\n");
    throw SE_Exception(u);
}

Tuesday, June 10, 2014

We discuss bootstrapping in natural language processing, reading from the paper by Ehara, Sato, Oiwa, and Nakagawa and from the NLP blogger. Bootstrapping is a technique in which we use a binary classifier in an iterative manner: unlabeled instances are labeled using an initial labeled "seed" set over several iterations. The steps involved are:
build a classifier with rules to select positive answers with high precision,
build a classifier with rules to select negative answers with high precision,
run the classifiers on a large data set to collect a labeled set,
train the classifier on the labeled set,
and run the classifier on the original data.
The choice of the seed set is important, and it is chosen with two considerations in mind: the choice of seed set affects the accuracy, and there is a labeling cost associated with it. The authors present a model called expected model rotation that works well on realistic data. With this model, they propose an "iterative seeding framework". At the end of each iteration, the seeds are assessed for quality and improved in response to the current human labels and characteristics. This involves a criterion that improves the seeds as well as a score, updated with each iteration, that keeps track of the similarity of the seeds.
In bootstrapping, given a dataset, we can consider the first l instances to be labeled and the remaining u instances to be unlabeled. The dataset consists of m-dimensional feature vectors, and the first l have labels that belong to a set C of semantic classes. Suppose we were to pick out animal names from the given instances; then the set C consists of the labels animal and not-animal and is binary. The ranking is given by the difference of the score vectors, y-animal - y-not-animal, obtained from the seed vectors. Simple bootstrapping proceeds by setting the score vector with the first classifier rule for the first label and continues until the rules for each label are identified.
If X is the feature matrix for the given text, the score vector after c iterations is obtained as f = ( (1/m)(1/n) X X-transpose )^c y.
To make f seed dependent, Komachi et al. (2008) suggested writing the score vector as (-L)^c y, where the Laplacian is defined as L = I - D^(-1/2) X X-transpose D^(-1/2). This method is called Laplacian label propagation, and its score vector is given by f = (I + beta L)^(-1) y.

We will shortly see how this Laplacian propagation works. The thing to note here is that bootstrapping straddles the supervised and unsupervised modes, and even though it is said to work well, it doesn't have the same rigor as vector space models or graph based efforts. The reason we are looking into bootstrapping is merely to see the usage of Laplacian operators.

Let us look further into the paper for the Laplacian operator. The authors propose a criteria for iterative seeding. For an unlabeled instance, they denote the goodness of seed as gi and select the instance with the highest goodness of seed as the next seed added in the next iteration. Each seed selection criterion defines each goodness of seed gi and the unlabeled instance influences the model. They show that the Komachi label propagation score can be considered as the margin between each unlabeled data point and the hyperplane obtained by ridge regression. They denote this with a transformed equation. 
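As an aside, here is a small numpy sketch of the Laplacian label propagation score f = (I + beta L)^(-1) y from the earlier formula; this is my own illustration, and for brevity I use a similarity/adjacency matrix in place of X X-transpose.

# Sketch: Laplacian label propagation score f = (I + beta * L)^(-1) y  (assumes numpy).
import numpy as np

def laplacian_propagation(A, y, beta=0.01):
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(A)) - D_inv_sqrt @ A @ D_inv_sqrt        # normalized Laplacian
    return np.linalg.solve(np.eye(len(A)) + beta * L, y)    # scores for every instance

# Tiny similarity graph; seeds: node 0 positive (+1), node 4 negative (-1).
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
y = np.array([1.0, 0.0, 0.0, 0.0, -1.0])
print(laplacian_propagation(A, y))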
Before we discuss the word similarity measures and the OOV term normalization that we wanted to continue from the previous post, I want to take a minute to mention centrality in a graph, which is another useful concept that can help with word similarity. Centrality is best discussed in sociology, and I refer to notes from a course at UCLA by Prof. McFarland and Motoki Watabe.
There are three different centrality measures:
1) degree centrality
2) closeness centrality
3) betweenness centrality

Given a graph with n nodes, the degree centrality of a node refers to the number of edges attached to the node. To normalize the score for each node, we divide the number of edges attached to it by n-1.

First, degree centrality:
If we have a graph A with four vertices that contains only one central node to which all the other nodes attach, then that central node has a degree score of 3 and the others have a score of 1.
The degree centralization of the graph is calculated as [ (3-3) + (3-1) + (3-1) + (3-1) ] divided by the maximum possible value, (n-1)(n-2), which in this case gives 6/6 = 1.
For a graph B with the same four vertices where each node is connected to every other, as in a square with diagonals, each vertex now has a score of 3, and the degree centralization of the overall graph is [ (3-3) + (3-3) + (3-3) + (3-3) ] / (n-1)(n-2) = 0.
The numerator is the sum of the differences between the maximum centrality score and each individual score.

It may appear that the higher the normalized score the higher a node's likely relevance in the graph. However this is not always true. Cook et al 1983 demonstrated that this is not the case. We will look into the applications of each measure as we define it.

To calculate closeness centrality, we find the farness of each node and take its inverse. The farness of a node s is defined as the sum of its distances to all other nodes; the more central a node is, the lower its total distance to all other nodes. Closeness can be taken as a measure of how fast information can spread from a node to all other nodes sequentially.
For network A, the closeness score for the central node is 3/3 = 1 and for the other nodes it is 3/5.
Since the maximum score above is 1,
the closeness centralization for network A is [ (1-1) + (1 - 3/5) + (1 - 3/5) + (1 - 3/5) ] divided by the maximum possible sum, 6/5, which equals 1,
and for network B, the closeness score for each node is uniformly 3/3 = 1,
so the closeness centralization for the overall network B is (0 + 0 + 0 + 0) / (6/5) = 0.
 

To calculate betweenness centrality, we count the number of times a vertex acts as a bridge on the shortest path between two other nodes. It was introduced as a measure of a node's control over the dissemination of information in a network.

For network A, the betweenness score for the most connected node is 3/3 = 1 and for all the other nodes it is 0.
The maximum score above is 3/3 = 1.
The betweenness centralization for network A is [ (1-1) + (1-0) + (1-0) + (1-0) ] / [ (n-1)(n-2)/2 ] = 3/3 = 1,
and
for network B, the betweenness score for each node is the same, namely 0, since every node can be bypassed.
The maximum score is 0,
and the overall score for network B is 0/3 = 0.


Thus we see that in all the three cases, network A is more centralized than network B.
For information flow, network A has high controllability and high efficiency in that there are no redundant edges. At the same time, network A is riskier.
For network B, there is high flexibility and better tolerance to failures. However the cost from redundant edges is higher.
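To make the comparison concrete, the same two four-node networks can be checked with networkx's normalized measures (a quick sketch):

# Sketch: degree, closeness and betweenness for the star (A) and complete (B) networks.
import networkx as nx

A = nx.star_graph(3)        # one central node (0) tied to three others
B = nx.complete_graph(4)    # every node tied to every other

for name, G in [("A", A), ("B", B)]:
    print(name,
          dict(nx.degree_centrality(G)),
          dict(nx.closeness_centrality(G)),
          dict(nx.betweenness_centrality(G)))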

We saw different scores for vertices. To determine the admittance matrix for the vertices, we have to calculate a different kind of score.


Monday, June 9, 2014

In today's post we look at some of the details of graph based word similarity measures. While we looked at a word similarity measure in the Saluja and Navratil 2012 paper, I'm going to focus here on the normalization of words. We mentioned the holing system of Biemann and Riedl. This holing system is only one part of their system, but it describes a way to capture distributional and contextual similarity. The idea is to represent observations as <t,c> pairs, where t is the term (at a certain offset) and c is the context feature; the position of t in c is denoted by the @ hole symbol. The authors give an example where the dependency (nsubj; gave-2; I-1) can be transferred to <gave-2, (nsubj; @; I-1)>. Notice how the holing operator is used to give the position of the term. This in turn can also be represented as <I-1, (nsubj; gave-2; @)>. Note the change of context with the terms and the use of the holing operator: in the second representation the @ symbol gives the position of gave-2, and in the third representation it gives the position of I-1. The last two representations are in <t,c> pair form.
This paper also uses a term-context feature bipartite graph just like the last paper we saw. In addition it calculates a significance score sig for each pair <t,c> using the counts of the terms, the counts of the context features and the counts of their co-occurrence. If the significance score is negative, the corresponding edge is not added. Also, features that occur with more than a thousand words are removed because they are overly general. This is how the distributional similarity is calculated; the contextual similarity is calculated instead from the second order graph.
With the sentence "I caught a nasty cold" the authors showed that the holing system produced <cold5, (amod; @; nasty4)> and <cold5, (dobj; caught2; @)>, and that among the expansions <heat, (dobj; caught; @)> received an LMI score of 42.0 and <weather, (amod; @; nasty)> a score of 139.4. The word disease came highest in rank for the above sentence.

We will continue this discussion.

Saturday, June 7, 2014

I've been looking at some objective functions for graph based similarity between words and I came across "Graph based Unsupervised Learning of Word Similarities using Heterogeneous Feature Types" by Saluja and Navratil. It seems a simple way to express the similarity. The graph they use is bipartite between words and feature sets. Edges connect two words, two feature sets, or a word and a feature set.
To learn the graph, the following objective function is minimized:
Omega(Wv, Wf, Z) = a0 Sum over fp,fq ( W(fp,fq) - W-initial(fp,fq) )^2
                 + a1 Sum over vi Sum over fp ( Z(vi,fp) - Z-initial(vi,fp) )^2
                 + a2 Sum over vi,vj Sum over fp,fq  Z(vi,fp) Z(vj,fq) ( W(vi,vj) - W(fp,fq) )^2
                 + a3 Sum over vi,vj Sum over fp,fq  W(vi,vj) W(fp,fq) ( Z(vi,fp) - Z(vj,fq) )^2
where
W(fp,fq) is the current similarity between features fp and fq
W(vi,vj) is the current similarity between words vi and vj
Z(vi,fp) is the current importance weight of feature fp for word vi
a0 to a3 are constants that add up to 1 and set the relative importance of each term
The first two terms minimize the l2-norm between the initial and the current values of W(fp,fq) and Z(vi,fp)
The third term pulls the word-word similarity W(vi,vj) toward the feature-feature similarity W(fp,fq), weighted by the importance weights Z
The fourth term pulls the importance weights Z(vi,fp) and Z(vj,fq) toward each other, weighted by the similarities W
As you can see, the terms are progressive and enable smoothing over the two graphs.
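As a rough illustration of how the four terms interact, here is a small sketch that evaluates the objective for toy matrices with plain loops; the matrix shapes, variable names and function signature are my assumptions, not the paper's implementation.

import numpy as np

def omega(Wv, Wf, Z, Wf0, Z0, a):
    # Wv: word-word similarities, Wf: feature-feature similarities,
    # Z: word-feature importance weights; Wf0 and Z0 are their initial values
    nv, nf = Z.shape
    t0 = np.sum((Wf - Wf0) ** 2)          # stay close to the initial feature similarities
    t1 = np.sum((Z - Z0) ** 2)            # stay close to the initial importance weights
    t2 = t3 = 0.0
    for i in range(nv):
        for j in range(nv):
            for p in range(nf):
                for q in range(nf):
                    t2 += Z[i, p] * Z[j, q] * (Wv[i, j] - Wf[p, q]) ** 2
                    t3 += Wv[i, j] * Wf[p, q] * (Z[i, p] - Z[j, q]) ** 2
    return a[0] * t0 + a[1] * t1 + a[2] * t2 + a[3] * t3

With equal weights a = (0.25, 0.25, 0.25, 0.25) and small random matrices, this returns the scalar that the learning procedure would then minimize with respect to Wv, Wf and Z.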
But can we express this with a different formulation that enables Laplacian smoothing and matrices?
The Laplacian matrix, sometimes called the admittance matrix or Kirchhoff matrix, of an undirected, unweighted graph G is an n x n symmetric matrix with one row and column for each node, defined by L = D - A, where D is the degree matrix (the diagonal matrix formed from the vertex degrees) and A is the adjacency matrix.
The normalized Laplacian of a graph is defined entrywise by
L(i,j) = 1 if i == j and di != 0
       = -1 / sqrt(di * dj) if i and j are adjacent
       = 0 otherwise,
where di is the degree of vertex i.
The Laplacian measures the extent to which a function on the graph differs at a vertex from its values at nearby vertices.
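A minimal sketch of both the combinatorial and the normalized Laplacian, assuming a small 0/1 adjacency matrix as input; the example matrix is the four-node star from the centrality discussion.

import numpy as np

def laplacians(A):
    # A: symmetric 0/1 adjacency matrix of an undirected, unweighted graph
    d = A.sum(axis=1)                                   # vertex degrees
    L = np.diag(d) - A                                  # combinatorial Laplacian L = D - A
    inv_sqrt = np.where(d > 0, 1.0 / np.sqrt(np.maximum(d, 1)), 0.0)
    L_norm = np.diag(inv_sqrt) @ L @ np.diag(inv_sqrt)  # normalized Laplacian
    return L, L_norm

# the four-node star (network A from the centrality discussion)
A = np.array([[0, 1, 1, 1],
              [1, 0, 0, 0],
              [1, 0, 0, 0],
              [1, 0, 0, 0]])
print(laplacians(A)[1])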
I'm going to take a break to discuss a topic that I've been asked to look into. A Python library is not able to make an SSL connection with an application that uses libssl. The error message received is 'The handshake operation timed out'. This happens because the server is not completing the handshake in some cases.
We first look at the steps involved in the TLS handshake. TLS and SSL are different protocols that do not interoperate; TLS 1.0 is often referred to as SSL 3.1, but the differences don't matter much for our discussion, and TLS 1.0 has a way to fall back to SSL 3.0. The handshake we describe is responsible for the authentication and the key exchange necessary to establish or resume secure sessions. In establishing a secure session, the following steps are involved:
cipher suite negotiation
authentication of the server and optionally the client
session key information exchange
The first step refers to the negotiation of the cipher suite that both the client and the server will use for the duration of the session.
The second step is where the server proves its identity to the client. The client might also need to prove its identity to the server.
The third step refers to the exchange of random numbers and the selection of a special number called the Pre-Master Secret, which helps in creating a Master Secret. The Master Secret in turn helps create the MAC secret, the session key used for hashing, and the write key, the session key used for encryption.
After the initial client hello, server hello and server hello done messages, the pre-master secret is generated by the client and the master secret is derived by both the client and the server. The client then sends the change cipher spec notification to the server.
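To see where the 'handshake operation timed out' error surfaces, here is a minimal sketch using Python's ssl and socket modules; the host name and the deliberately tiny timeout are made up, and a real diagnosis would also capture the handshake in a packet trace.

import socket
import ssl

host, port = "example.com", 443          # hypothetical endpoint
context = ssl.create_default_context()

try:
    with socket.create_connection((host, port), timeout=0.01) as sock:
        # the timeout on the socket also bounds the TLS handshake performed by wrap_socket
        with context.wrap_socket(sock, server_hostname=host) as tls:
            print("negotiated", tls.version())
except (socket.timeout, ssl.SSLError, OSError) as e:
    # a stalled server or network surfaces here, e.g. 'The handshake operation timed out'
    print("connection or handshake failed:", e)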
Cryptographic algorithms can vary. Some fall in the bucket of the Federal Information Processing Standards, also called the FIPS standard. These can be enabled or disabled with FIPS mode, which is often used to troubleshoot SSL issues. There are also dedicated libraries designed to support cross-platform development of security-enabled client and server applications, with optional support for hardware SSL acceleration on the server side and smart cards on the client side. These are called Network Security Services.

Friday, June 6, 2014

The previous post discussed a paper that formalized the usage of graphs for describing reasoning chain relationships. It introduced the TRN graph and described two different types of nodes and three different types of edges. The nodes were either text nodes or section nodes, and the edges were structural, similarity and causal. This graph is constructed from the text with the help of a syntactic parser, which adds the text nodes. Section nodes are attached to the sentence nodes, and all the text nodes are added under these. Similarity measures help add similarity edges between node pairs; only those pairs whose measure is above a threshold are added. Graphs extracted from different sections are combined. Next, causal relations are added with the help of a discourse parser. With the structural, similarity and causal edges added, we now have a complete TRN graph. Then, with this graph, we can choose source-destination pairs and find paths. Paths are combined, post-processed and visualized.
The dataset used for this study was the aviation investigation reports from the Transportation Safety Board of Canada. Each report in this collection is an incident report and has sections titled summary, factual information, analysis and findings. The documents are available for download from the TSB of Canada, and the authors pre-processed them using the GATE NLP platform. The preprocessing steps included tokenization, sentence splitting and part-of-speech tagging. The summary section is a brief description, the factual information section gives the details of the aircraft or circumstance, the analysis section is a discussion of what might have happened, and the findings section describes the causes. Therefore the reasoning chain was chosen as the shortest path from Summary to Causes. The constraints for the reasoning chain path were defined in Backus-Naur Form as a summary-path with an edge to a factual-path with an edge to an analysis-path with an edge to the causes-path, where the intermediaries in this representation are optional. Each of the paths can be recursively defined as one or more nodes with edges. While these edges can be any of the aforementioned edge types, namely partof, contains, similar and cause edges, only the causes-path explicitly requires part-of edges. A post-processing step shrinks the path by eliminating transitives. Then the compressed graphs are laid out for visualization.
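As a rough sketch of the path-extraction step (not the authors' implementation), the TRN could be modeled as a networkx graph with typed edges, with the reasoning chain read off as a shortest path from a Summary node to a Causes node; all node labels and edge types below are made up for illustration.

import networkx as nx

# Toy TRN: nodes are text or section units, edges carry a type attribute
G = nx.Graph()
G.add_edge("summary:S1", "factual:S3", type="similar")
G.add_edge("factual:S3", "analysis:S7", type="similar")
G.add_edge("analysis:S7", "causes:S9", type="cause")
G.add_edge("causes:S9", "causes:NP2", type="partof")

# reasoning chain read off as the shortest path from a Summary node to a Causes node
chain = nx.shortest_path(G, source="summary:S1", target="causes:NP2")
print(" -> ".join(chain))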

We read the paper "Automatic extraction of reasoning chains from textual reports" by Sizov and Ozturk. They extract reasoning chains from documents such as medical patient reports and accident reports. Reasoning chains contain information useful for analyzing the problem at hand. The authors use a graph based approach that makes the relations between textual units explicit. Reasoning chains are a sequence of transitions from one piece of information to another that connect the problem with the resolution, and thus are very valuable in future problem triages. Moreover, they help determine the validity or fallaciousness of the arguments presented.

This paper decomposes a document into text units, discovers the connections between the text units and makes them explicit. The TRN is automatically acquired from text using a syntactic parser and a semantic similarity measure. We now discuss this TRN in detail. A reasoning chain is extracted as a path from the graph-based text representation. The TRN is a graph with two types of nodes, text nodes and section nodes, and three types of edges: structural, similarity and causal. This graph is constructed from the text in the following manner: 1) syntax trees obtained from a syntactic parser are added to the graph, 2) section nodes are attached to the sentence nodes, 3) similarity edges are added between similar text nodes and 4) cause relations identified by a discourse parser are added. When representing text units, graph based approaches have used terms or short phrases as text units, which are too small for this approach. Another popular choice is to use a whole sentence or a few sentences. However, such text units may contain several pieces of information of which only one is relevant for the reasoning chain. In this paper, the authors use the Stanford parser to extract the syntax tree and pick out the S (sentence, clause), NP (noun phrase) and VP (verb phrase) constituents, which are referred to as text nodes. Text nodes are identified by a set of stemmed words. Along with these, structural relations such as Contains and PartOf are also added as edges. Graphs extracted from different sentences in a document are combined into one. To obtain similarity relations, a similarity value is computed for each pair of text nodes of the same category (S, VP, NP) and Similar edges are added to the graph for node pairs where the similarity is above a threshold. Their similarity measure finds a one-to-one alignment of words from two text units. The LCH similarity for these words is computed using the shortest path between the corresponding senses in WordNet. Then a complete bipartite graph is constructed. Nodes in the bipartite graph represent words from the text units, while the edges have weights that correspond to similarities between words. A maximum weighted bipartite matching finds a one-to-one alignment that maximizes the sum of similarities between aligned words. This sum is then normalized and thresholded to add the corresponding Similar edges to the graph. Next, causal relations are added to the graph. Causal relations are found by a discourse parser; in this case, the authors used a PDTB-styled end-to-end discourse parser. Cause relations found by the parser are added to the TRN graph as Cause edges. This is how the TRN graph is built.
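The maximum weighted bipartite matching step can be sketched with scipy's assignment solver; the word-to-word similarity values below are made up, the real system would use the WordNet-based LCH scores, and the simple normalization by the shorter unit's length is my assumption.

import numpy as np
from scipy.optimize import linear_sum_assignment

# rows are the words of text unit 1, columns the words of text unit 2;
# entries are hypothetical word-to-word similarity scores
sim = np.array([[0.9, 0.1, 0.3],
                [0.2, 0.8, 0.4]])

rows, cols = linear_sum_assignment(sim, maximize=True)   # one-to-one alignment
score = sim[rows, cols].sum() / min(sim.shape)           # normalize by the shorter unit
print(list(zip(rows, cols)), score)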
Once the TRN graph is constructed, the reasoning chain is generated based on a three stage process:
1) a report is converted from text into a TRN graph
2) given a start and an end node, several paths are extracted from the graph, and
3) paths are combined, post processed and visualized.

Thursday, June 5, 2014

In the previous post we saw that we could represent distributional similarity and contextual similarity with data structures. The first is represented by three graphs. We first create a term-context feature graph with T, the set of terms, C, the set of context features, and edges based on the counts of the terms and the context features. Based on this graph, a first order graph is created, denoted by T, C and the significance score sig. The significance score sig is calculated for each pair (t,c) using a formula for a metric called the Lexicographer's Mutual Information, LMI(t,c) = count(t,c) * log2( count(t,c) / (count(t) * count(c)) ), described in the paper by Evert, 2004. After computing the significance scores, the edges with LMI < 0 are removed and only the p most significant pairs per term t are retained. Features that are overly general because they occur with too many terms are also removed.
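A minimal sketch of the LMI scoring and the pruning to the p most significant context features per term, following the formula as written above; the container types and function names are assumptions.

import math

def lmi(pair_counts, term_counts, feature_counts):
    # LMI(t,c) = count(t,c) * log2( count(t,c) / (count(t) * count(c)) ) as written above;
    # some formulations additionally scale the ratio by the total number of observations
    return {(t, c): n * math.log2(n / (term_counts[t] * feature_counts[c]))
            for (t, c), n in pair_counts.items()}

def prune(scores, p):
    # drop edges with negative significance and keep the p most significant features per term
    kept = {}
    for (t, c), s in scores.items():
        if s >= 0:
            kept.setdefault(t, []).append((s, c))
    return {t: sorted(feats, reverse=True)[:p] for t, feats in kept.items()}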
The second order similarity graph between terms is defined with the similarity score as the number of salient features the two terms share. This approach does not count how often a feature occurs with a term but instead uses a significance ranking, which makes it more scalable since it does not need the full list of features for a term pair at any time.
The contextual similarity is calculated differently. It is also computed by a ranking method: given the set of candidate expansions from the second order graph, the terms most similar in context should be ranked higher and the non-compatible ones lower. First, the terms and context features are extracted as pairs for the given target word in the document by the holing operator. Then a new graph is defined with the context features CON that belong to the previous graph. For the target word, the second order similarity graph is looked up to find the top n similar terms, and these are added to CON. Then edges between the target word and the context features are labeled with the significance score from the first order graph; edges that are not in the first order graph get no significance score. A ranking score for each term is then calculated with the harmonic mean and a smoothing factor, and the entries are re-ordered based on this ranking score.
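As a rough sketch of that ranking step (the exact smoothing in the paper may differ), each candidate is scored by the harmonic mean of its first order significance scores over the context features of the target occurrence, with a small additive smoothing; the disease scores below are made up, while the heat and weather scores echo the example above.

from statistics import harmonic_mean

def rank_candidates(candidates, context_features, first_order, alpha=1.0):
    # first_order maps (term, context_feature) -> significance score; edges missing from the
    # first order graph contribute only the smoothing mass alpha, so they sink in the ranking
    ranked = []
    for term in candidates:
        scores = [first_order.get((term, c), 0.0) + alpha for c in context_features]
        ranked.append((harmonic_mean(scores), term))
    return [term for _, term in sorted(ranked, reverse=True)]

first_order = {("disease", "(dobj;caught;@)"): 80.0, ("disease", "(amod;@;nasty)"): 120.0,
               ("heat", "(dobj;caught;@)"): 42.0, ("weather", "(amod;@;nasty)"): 139.4}
print(rank_candidates(["disease", "heat", "weather"],
                      ["(dobj;caught;@)", "(amod;@;nasty)"], first_order))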

Wednesday, June 4, 2014


We saw that global and local scopes can matter in finding similarities. We discuss a different paper in this vein: the paper on a graph based contextualization method using distributional thesauri by Biemann and Riedl. This paper discusses a new contextual similarity finding method that generates, for each term occurrence in a text, a ranked list of terms that are semantically similar and compatible with the context. The framework is instantiated by the definition of the term and the context. The dependencies are parsed to get the context. The contextual similarity based approach has trumped the non-contextualized baseline across all parts of speech. Moreover, this is an unsupervised generative method for similarity in context and does not rely on the existence of lexical resources as a source for candidate expansion. The approach makes use of an existing distributional thesaurus (DT) whose entries consist of a ranked list of the globally most similar terms for a target term. It uses the DT in a graph representation and moves from a global notion of similarity to a local one. The prevalent way to represent word meanings has been the vector space model built over corpora. This approach, in contrast, does not require domain knowledge and is unsupervised in the sense that it does not require training.
They introduce an operation called holing, which splits any sort of observation on the dependencies into pairs of terms and context features. These pairs are then used for calculating the distributional similarity and for finding the contextual similarity. The operator is used in a system that requires preprocessing steps, and its notation is the @ symbol. This gives a more unified representation of dependencies.

The distributional similarity is represented using three graphs. The first is a bipartite term-context feature graph, with T the set of terms, C the set of context features, and edges e(t,c,f) where the frequency f = count(t,c) is the co-occurrence count of term t and context feature c. Based on this graph TC, a first order graph e(T,C,sig) is created and the significance score sig calculated. The significance score sig is calculated for each pair (t,c) using the formula defined in Evert 2004 from the counts.
Then all edges whose score sig(t,c) is < 0 are removed, and only the p most significant pairs per term t are kept while the remaining edges are removed. Additionally, overly general features that occur with more than a thousand words are removed.
The second order similarity graph between terms is defined as SO(T,E), for t1, t2 belonging to T, with the similarity score being the number of salient features the two terms share. This graph gives the distributional similarity.
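A small sketch of the second order similarity as an overlap count of salient features, assuming the pruned first order graph is available as a mapping from each term to its set of top p features; the example features are made up.

def second_order_similarity(salient_features):
    # salient_features: term -> set of its p most significant context features
    terms = list(salient_features)
    edges = {}
    for i, t1 in enumerate(terms):
        for t2 in terms[i + 1:]:
            shared = len(salient_features[t1] & salient_features[t2])
            if shared:                      # only connect terms that share salient features
                edges[(t1, t2)] = shared
    return edges

print(second_order_similarity({
    "cold": {"(amod;@;nasty)", "(dobj;caught;@)"},
    "disease": {"(dobj;caught;@)", "(nsubj;spread;@)"},
    "heat": {"(dobj;caught;@)"},
}))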
In the paper titled "Merging word senses" by Bhagwani, Satapathy and Karnick, a semi supervised approach to learning Wordnet synsets using a graph based recursive similarity definition is presented. The synsets help provide input on sense similarities of all word sense pairs. Then this method shows coarse sense inventories at arbitrary granularities
Unlike previous approaches that generated a coarse sense inventory by merging fine grained senses, this approach proposes a framework for learning a synset similarity metric. The problems with the earlier approaches included the following: first, they require a stopping criterion for each word, such as the number of final classes, and this cannot usually be predetermined. Second, inconsistent clusters are obtained because coarse senses are independently generated for each word, and the inconsistent clustering causes transitive closure errors. With the approach discussed in this paper, the coarsening of noun synsets is improved upon. To learn similarities between synset pairs which do not share a word, the authors use a variant of the SimRank framework that gives a non-zero similarity. SimRank is a graph based similarity measure applicable in any domain with object-to-object relationships; it says that two objects are similar if they are related to similar objects. The SimRank equations can be solved to find a similarity between all pairs of objects, as shown in a separate paper by Jeh and Widom 2002. For a graph G(V,E), the solution is reached by iterating to a fixed point. For each iteration, |V|^2 entries are kept, where Sk(a,b) is the estimate of the similarity between a and b at the kth iteration. This works well when we have complete information on the objects.
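A compact sketch of the basic SimRank iteration on an undirected graph, with a decay constant C; the toy graph and node names are made up, and newer versions of networkx also ship a built-in simrank_similarity that could be used instead.

import networkx as nx

def simrank(G, C=0.8, iterations=10):
    # S[a][b] converges to the SimRank similarity; S(a,a) = 1 by definition
    nodes = list(G)
    S = {a: {b: 1.0 if a == b else 0.0 for b in nodes} for a in nodes}
    for _ in range(iterations):
        new = {a: {b: 1.0 if a == b else 0.0 for b in nodes} for a in nodes}
        for a in nodes:
            for b in nodes:
                if a == b:
                    continue
                Na, Nb = list(G[a]), list(G[b])
                if Na and Nb:
                    total = sum(S[x][y] for x in Na for y in Nb)
                    new[a][b] = C * total / (len(Na) * len(Nb))
        S = new
    return S

G = nx.Graph([("sense1", "hypernym1"), ("sense2", "hypernym1"), ("sense3", "hypernym2")])
print(round(simrank(G)["sense1"]["sense2"], 3))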
In many scenarios, this complete information may not be known, so SimRank is personalized: by initializing it with the known similarities of some pairs, the approach fixes those values in the set of equations and lets the rest of the values be learnt automatically by the system. For the supervised labels, the mapping labels of the Omega ontology are used. To coarsen WordNet, an undirected graph is constructed whose vertices are the synsets of WordNet and whose edge set E comprises the edges obtained by thresholding the similarity metric learnt using the personalized SimRank model. On this graph, connected components are found, which gives us a partition over synsets. All the senses of a word occurring in the same component are grouped as a single coarse sense. To make the inventory coarser, denser graphs with fewer connected components are obtained; a smaller number of components translates into coarser senses. The threshold thus provides a way to control the granularity of the coarse senses.
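The coarsening step can be sketched as thresholding a pairwise synset similarity table and reading off connected components; the synset names and scores below are made up.

import networkx as nx

def coarsen(similarities, threshold):
    # similarities: {(synset_a, synset_b): score}; keep only edges at or above the threshold
    G = nx.Graph()
    G.add_nodes_from({s for pair in similarities for s in pair})
    G.add_edges_from(pair for pair, score in similarities.items() if score >= threshold)
    return list(nx.connected_components(G))      # each component becomes one coarse sense

sims = {("bank.n.01", "bank.n.02"): 0.7, ("bank.n.02", "bank.n.09"): 0.3}
print(coarsen(sims, threshold=0.5))   # a lower threshold yields fewer, coarser components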
We now look at the feature engineering. The features are broadly classified into two parts: those derived from the structure of WordNet and those derived from other corpora. The features derived from WordNet are further subdivided into similarity measures and synset or sense based features. Among the WordNet similarity measures, the authors used path based similarity measures. The synset and sense based features include the number of lemmas common to two synsets, the maximum polysemy degree among the lemmas shared by the synsets, whether the two synsets have the same lexicographer file, and the number of common hypernyms. Hypernyms and hyponyms denote super- and sub-ordinate relationships in the structure of WordNet.
Polysemous means having more than one sense in the syntactic category. Features derived from external corpora include a score for a synset with respect to 169 hierarchically organized domain labels, as available from the eXtended WordNet Domains project. BabelNet is another external resource; it provides translations of noun word senses in six languages.

Tuesday, June 3, 2014

We will continue with our discussion on graph based random walks today. We saw how we can establish relationships in a bipartite graph, how to perform random walks, and how to compute costs.
What we will see next is how this translates to keyword extraction. This part is more a conjecture at this point. We want to reuse the concept of contextual similarity, but we want to define the context as not just adjacent words. We don't want to use a local vector space model either, because it would be computationally expensive to do both the random walk and VSM clustering. Is there a way to define collocation as something other than adjacent words and VSM?
One suggestion is to use something like co-occurrence. Collocation is different from co-occurrence: one defines proximity by boundaries, and the other by counts and groupings. If we can come up with different groups of words and we find patterns in them, such as groups that tend to repeat, that's co-occurrence. How we discover different groups and different patterns is closely associated with clustering.

Monday, June 2, 2014

When we discussed the random walk based lexicon building algorithm from the earlier post, we had a few stages of computation.
First we built the bipartite graph (this was mentioned in the post before the last one).
We built it this way:
     We extracted N-grams from the text corpus.
     For each N-gram, denoted by n,
     we checked if the center word is a noisy or a normalized word;
     if it is noisy we add it to the source nodes,
     else we add it to the sink nodes;
     we add the context and
     then we add the edge weight.
Second we perform the random walk. The goal is to identify pairs of noisy and normalized words that can be considered equivalences. A walk starts at a noisy word and ends at a normalized word. We then normalize the results for each noisy node.
Third we calculate the costs and then prune the list to the top few.
When we calculated the cost, we factored in a component for the similarity cost. This cost function was described as the ratio of the longest common subsequence ratio (LCSR) to the edit distance between the two strings.
We also had a cost for the random walk, which was the hitting time between those nodes. Furthermore, we averaged this hitting time between two nodes and normalized it against that of all other nodes linked to that noisy node.
In the algorithm above, we first see that we iterate over the lexicon list. By doing these repeated iterations we choose the most probable paths and refine the results.
This way we now have a list of the top few contextually relevant words.
This method lets us use pairwise contextual similarity to normalize social media text.



Having discussed the paper on normalizing social media text in the previous post, we will conclude reviewing it today. We mentioned that the approach in the paper uses a bipartite graph and performs random walks to connect words to their normalized equivalences. We mentioned briefly how this helps us identify pairwise relationships between words and their contexts. We were also looking to establish contextual similarity in other ways. This requires exploiting several relationship considerations such as semantic relationships, part of speech relationships, and categorization or ontology based groupings. Perhaps we could also consider local vector space models instead of just relying on adjacent words in a sentence. This means we would use a local vector space model to establish contextual similarity and a global graph and random walk model to establish pairwise relationships.
The subject of these discussions is a different topic; however, we will just finish reviewing the paper first. The number of steps taken to traverse between two nodes is called the hitting time. The average hitting time of all walks that connect the source and sink nodes is H(n,m) = Sum over r of hr(n,m) / R, where hr(n,m) is the hitting time between the pair (n,m) on walk r and R is the number of walks.
The relative frequency of the average hitting time of those two nodes, H(n,m), against that of all other normalized nodes linked to the noisy node is then L(n,m) = H(n,m) / Sum over i of H(n,i). This is one cost.
The other cost between a noisy word and a normalized word is the lexical similarity cost SimCost(n,m). This cost function is described as the ratio of the longest common subsequence ratio (LCSR) to the edit distance between the two strings.
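A small sketch of that lexical similarity cost, using straightforward dynamic programming for both the longest common subsequence and the edit distance; the exact definitions and the handling of a zero edit distance in the paper may differ.

def lcs_len(a, b):
    # classic longest common subsequence length
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[-1][-1]

def edit_distance(a, b):
    # Levenshtein distance with a rolling row
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j-1] + 1, prev + (x != y))
    return dp[-1]

def sim_cost(noisy, normalized):
    lcsr = lcs_len(noisy, normalized) / max(len(noisy), len(normalized))   # LCS ratio
    return lcsr / max(edit_distance(noisy, normalized), 1)                 # avoid divide-by-zero

print(sim_cost("tmrw", "tomorrow"))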
The algorithm to induce the lexicons, keeping the helper names from the pseudocode (RandomWalk, Normalize, SimCost, Prune) and writing the loop in Python style, is something like this:

from collections import defaultdict

def induce_lexicon(graph, K):
    lexicon = {}                              # the lexicon is first initialized
    for n in graph.nodes():
        if not is_noisy(n):                   # only noisy words get normalization candidates
            continue
        R = defaultdict(float)                # counts of normalized words reached from n
        for _ in range(K):                    # K random walks starting at n
            m = RandomWalk(graph, n)          # each walk ends at a normalized word
            R[m] += 1.0
        L = Normalize(R)                      # relative frequencies of the hits
        L = SimCost(n, L)                     # fold in the lexical similarity cost
        L = Prune(L)                          # keep only the top few candidates
        lexicon[n] = L                        # add the pruned entries to the lexicon
    return lexicon
In the algorithm above we see that we first initialize the lexicon and add to it by performing iterations. As we do the random walks, we pick the more probable paths more often and select the normalization equivalence candidates. Then we calculate the average hits and normalize. We find the similarity costs and, together with the average hits, construct the lexicon entries with the final costs, pruning them to keep the top few.

Sunday, June 1, 2014

Today we will cover some more from the paper we mentioned yesterday. The paper notes that social media text is noisy and contains typos, abbreviations, phonetic substitutions and slang.
The paper proposed a text normalization approach that can adapt to such variations automatically. The technique can adapt to different sources and any genre of social media. This is different from listing out-of-vocabulary words, because such a list does not suit social media text: many words and named entities that do not exist in a given vocabulary should not be considered for normalization.
An OOV word may have many appropriate normalizations depending on the context and the domain. Also, text normalization, being a pre-processing step, should have high accuracy.
Given these challenges, the proposed technique addresses them by finding the best normalization sequence according to an n-gram language model. The normalization candidates are automatically picked from unlabeled text data in an unsupervised manner. The approach is scalable, accurate and adaptive to any domain and language. It begins by constructing a lattice from the normalization candidates; only one-to-one word mappings are considered. The candidates are populated with two generators: one uses dictionary based spelling correction and the second is based on trie based approximate string matching. As an aside, we mentioned the trie based string approximation used in mobile devices in an earlier post. A trie is used for prefix or suffix based representation and organization of strings, and it enables lengthening, letter substitution, letter-number substitution and other such approximations. Together with the trie based approximation and the spell checker, the words can be normalized.
Dictionary based normalization methods prove inadequate for social media text for many reasons. First, they make overly general corrections. Second, they are context-less, in the sense that they don't consider the nature of the word, only the minimum edit distance to some vocabulary word or named entity. Third, social media text is dynamic, with new slang introduced on a daily basis.
We now describe the graph based random walks mentioned in this paper. Here the authors use the intuition that normalization equivalences share similar context. By similar context, they mean a pattern in, say, n-grams where the words to the left and to the right are the same between equivalences. If we take a five-gram sequence of words, then the pattern looks like word1 word2 * word4 word5, where * stands for the center word. This pattern can now be represented by a bipartite graph where the first partite set represents the words and the second represents the n-gram contexts shared by the words. A word node is marked as either normalized or noisy, and this decision limits which words become candidates for normalization.
The selection of candidates for normalization (noisy words) is something like this: there is a vocabulary constructed from a large clean corpus, and any word that does not appear in the vocabulary more than a predefined threshold (e.g., 10 times) is a candidate.
The bipartite graph is composed of W which includes all nodes representing normalized words and noisy words, C that represents the shared context and E that represents the edges connecting word nodes and context nodes. The weight on the edges is the number of occurrences of a word in a context.
The algorithm to construct the graph is based on something like this (sketched in Python, where ngrams is an iterable of word tuples and vocabulary is the set of known normalized words):

from collections import Counter

def build_bipartite_graph(ngrams, vocabulary):
    source, sink, edges = set(), set(), Counter()
    for ngram in ngrams:                                 # extract N-grams from the text corpus
        center = ngram[len(ngram) // 2]                  # check the center word
        context = ngram[:len(ngram) // 2] + ngram[len(ngram) // 2 + 1:]
        if center not in vocabulary:                     # a noisy word goes to the source side
            source.add(center)
        else:                                            # a normalized word goes to the sink side
            sink.add(center)
        edges[(context, center)] += 1                    # edge weight (context, word, count)
    return source, sink, edges
Once the bipartite graph is constructed, we can generate lexicons using a Markov random walk. The goal is to identify pairs of noisy and normalized words that can be considered equivalences. A random walk starts at a noisy word and ends at a normalized word. The walker starts from a source node Ni and moves to a connected node Mj with probability Pij, which is the normalized co-occurrence count of the word and the corresponding context. Random walks are repeated until either a normalized form is found or k attempts are exhausted. There can be many paths between the same source and the same sink, and paths are generally selected based on transition probability. In nature, this is similar to moving between different forms of words via edit distances.
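A minimal sketch of one such walk over the bipartite adjacency built earlier, choosing the next node in proportion to edge counts; the adjacency format, the step bound and the give-up behavior are my assumptions.

import random

def random_walk(adjacency, start, sink, max_steps=20):
    # adjacency: node -> {neighbor: co-occurrence count}; sink: the set of normalized words
    node = start
    for _ in range(max_steps):
        neighbors = adjacency.get(node)
        if not neighbors:
            break
        nodes, counts = zip(*neighbors.items())
        node = random.choices(nodes, weights=counts, k=1)[0]   # Pij from normalized counts
        if node in sink:
            return node                                        # reached a normalized word
    return None                                                # the walk gave up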
We note that this graph based method uses similar context. In natural language processing applications, the context is often described as a set of words to the left and right of the candidate.
Is there a better way to describe context, not just based on adjacency in a sentence? This is where the vector space model and clustering helped. However, can we define a graph based on something other than word adjacency for the same context similarity? Perhaps we could consider that words can map to different parts of speech, different thesauri or different domain ontologies. Sometimes one solution does not work for all; this is probably where we could consider multiple measures at the same time. While the vector space model attempts to do this without explicit consideration of the factors contributing to the word pair relationships, we could attempt to have graph based explicit consideration for each.