Saturday, April 27, 2013

A developer’s guide to topic analysis of documents


Why Topic Analysis?

Topic analysis of a document has many different applications. In my search, I have not found a commercial product that does all of the text processing outlined here. For example, one can construct an index of a text based on topic analysis. Topics can be scoped to whole documents or to individual text blocks. Main topics and sub-topics can be indicated in the index, and one can get a sense of the contents by looking at the topic structure. The methods mentioned here have many applications, such as text mining, text summarization, information extraction, question answering, and other text processing.

Why Developers?

The applications mentioned above have potential for use by millions of users. Text summarization alone has had sensational news coverage for a product. This document attempts to spell out the implementation details that are otherwise buried in the complex equations and models of the research papers. That said, it derives most or all of its ideas from the papers listed in the references section.

Besides, research focuses on finding metrics and measures to evaluate a document for keyword extraction, topic extraction, or clustering, whereas developers treat these as factors or algorithms that can be interchanged. Further, developers focus on fast and efficient implementations, such as for clustering. While any text mining document references clustering, a popular technique for information retrieval from documents, this document introduces clustering techniques to developers while putting them to use in the context of topic analysis.

Moreover, developers will find that many different implementations are possible by simple interchange of algorithms and methods. Different techniques may need to be tried out, and none of them yields a perfect solution; research characterizes this domain as a non-convex optimization problem.

What are the steps involved?

The steps in the text processing involve:

1)      Keyword detection. There are several well-defined methods for keyword extraction.

One of these uses a measure called Term Frequency - Inverse Document Frequency (also known as TF-IDF). A TF-IDF value is calculated for each word in a text, and the words with the largest TF-IDF values are chosen as keywords. Another way to select words is to compute their Shannon information, which is proportional to the term frequency and the probability of the word's occurrence in the corpus data. The corpus is a collection of documents that have already been parsed and tagged and that represents a comprehensive collection of topics.
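As a sketch, TF-IDF can be computed directly from token lists (the toy corpus and the exact IDF smoothing below are illustrative assumptions, not a fixed standard):

```python
import math

def tf_idf(term, doc, corpus):
    # term frequency in this document, discounted by how many
    # documents in the corpus contain the term at all
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / (1 + df))
    return tf * idf

# hypothetical toy corpus of tokenized documents
corpus = [["the", "cat", "sat"], ["the", "dog", "ran"], ["cat", "and", "dog"]]
doc = corpus[0]
scores = {w: tf_idf(w, doc, corpus) for w in set(doc)}
keywords = sorted(scores, key=scores.get, reverse=True)
```

Words that appear in every document score near zero, so the ranking surfaces the terms distinctive to this document.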

There are other similarity and distance measures as well. For example, similarity can be based on the statistical distribution of words in a corpus of texts. These also lead to a weighted sum of terms, with the crucial difference that the weights are non-negative and sum to 1, which gives them a natural interpretation as conditional probabilities.

Python’s NLTK framework lets us experiment with these probabilities; the following sample, adapted from the NLTK online documentation, trains a Naive Bayes classifier on document feature sets.

featuresets = [(document_features(d), c) for (d, c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

 

However, our approach will take as input the occurrence and co-occurrence data of terms, and produce as output a set of keywords in a cluster that indicates the topic of the document. We will get this data from a tagged corpus that ships with a natural language processing kit.

 

2)      Topic analysis by Clustering:

Clustering involves grouping data points together based on their co-occurrences or on their similarity measures. If a cluster with seed word w contains a co-occurring word w1, and another cluster with seed word w1 contains w, we can merge the two clusters. When we cluster based on similarity measures, we look at the similarity of a word to the cluster centers. We could start with a finite set of clusters and assign data points to them, adjusting their centers as points are added. Alternatively, we can create a tree of clusters that is traversed for each data point to find the appropriate cluster in which to insert it. When building a hierarchical clustering, we can either treat all the data points as one cluster and split it into sub-clusters repeatedly in a top-down manner, or start from one data point per cluster and merge them repeatedly in a bottom-up manner.
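A minimal sketch of the seed-word merging rule above (the dictionary shape, mapping each seed word to its set of co-occurring words, is an assumption for illustration):

```python
def merge_mutual(clusters):
    # clusters maps a seed word to the set of words co-occurring with it;
    # merge two clusters when each one contains the other's seed word
    merged = {s: set(ws) for s, ws in clusters.items()}
    changed = True
    while changed:
        changed = False
        for a in list(merged):
            for b in list(merged):
                if (a != b and a in merged and b in merged
                        and b in merged[a] and a in merged[b]):
                    merged[a] |= merged[b] | {b}
                    del merged[b]
                    changed = True
    return merged
```

The loop repeats until no mutually co-occurring pair of seed words remains, so chains of merges are handled as well.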

 

There is a fast and efficient clustering algorithm called BIRCH, which stands for Balanced Iterative Reducing and Clustering using Hierarchies. It works in the agglomerative style, where the cluster centers are not decided initially; only the maximum number of cluster summaries and the threshold for any cluster are decided. These two factors are chosen because we want to keep the cluster summaries in memory even for arbitrarily large data sets.

 

The algorithm reads each of the data points sequentially and proceeds in the following manner:

Step 1: Compute the distance between record r and each of the cluster centers. Let i be the index of the cluster whose center is closest to r.

Step 2: For that i-th cluster, recompute the radius as if record r were inserted into it. If the cluster threshold is not exceeded, insert r and proceed to the next record. Otherwise, start a new cluster containing only record r.

Step 3: Check that step 2 does not exceed the maximum number of cluster summaries. If it does, increase the threshold so that existing clusters can accommodate more records or be merged into fewer cluster summaries.
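Under simplifying assumptions (one-dimensional points and a flat list of CF summaries rather than a CF-tree), the three steps can be sketched as:

```python
import math

def birch_pass(points, threshold, max_clusters):
    # one sequential pass; each cluster is summarized by a CF triple
    # (N, LS, SS) so only the summaries need to stay in memory
    cfs = []
    for r in points:
        if cfs:
            # step 1: nearest cluster center (centroid = LS / N)
            i = min(range(len(cfs)), key=lambda j: abs(r - cfs[j][1] / cfs[j][0]))
            n, ls, ss = cfs[i]
            n2, ls2, ss2 = n + 1, ls + r, ss + r * r
            # step 2: radius of cluster i as if r were inserted into it
            radius = math.sqrt(max(ss2 / n2 - (ls2 / n2) ** 2, 0.0))
            if radius <= threshold:
                cfs[i] = (n2, ls2, ss2)
                continue
        cfs.append((1, r, r * r))
        # step 3: too many summaries -> raise threshold, merge closest pair
        if len(cfs) > max_clusters:
            threshold *= 2
            _, a, b = min((abs(cfs[a][1] / cfs[a][0] - cfs[b][1] / cfs[b][0]), a, b)
                          for a in range(len(cfs)) for b in range(a + 1, len(cfs)))
            na, la, sa = cfs[a]
            nb, lb, sb = cfs[b]
            cfs[a] = (na + nb, la + lb, sa + sb)  # CF entries are additive
            del cfs[b]
    return cfs
```

The threshold-doubling and closest-pair merge in step 3 are one simple policy; the paper's rebuild is more careful but follows the same idea.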

 

The data structure used in BIRCH is a clustering feature tree (CF-tree). Each entry in the CF-tree represents a cluster of objects and is characterized by a 3-tuple (N, LS, SS), where N is the number of data points, LS is the linear sum of the N data points, and SS is the sum of squares of the N data points. CF entries are additive, so clusters can be merged incrementally and consistently.

 

In a CF-tree, each non-leaf node has at most B entries where B stands for branching factor. Each leaf node has at most L CF entries, each of which satisfies threshold T.  Node size for both leaf and non-leaf is determined by input page size P. A leaf node is a collection of CF triplets and links to next and previous leaf nodes. Operations include CF-tree insertion and rebuilding.

 

A CF tree insertion proceeds like this:

First, traverse down from the root recursively to find the appropriate leaf. Next, modify the leaf: if the closest entry can absorb the new point, update it; otherwise make a new CF entry. If there is no room for the new entry, split the leaf node, and proceed up the tree splitting parents as necessary. Lastly, traverse back up and update the CFs on the path, or split nodes. To split a node, take the two farthest CFs and create two new nodes from them, then put each of the remaining CFs (including the new ones that caused the split) into the closer node.
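The split step alone can be sketched like this (1-D CF triples again, with centroids standing in for the paper's distance metric):

```python
def split_entries(cfs):
    # seed two groups with the two farthest CF entries (by centroid),
    # then put each remaining entry into the closer group
    def cent(cf):
        return cf[1] / cf[0]  # centroid = LS / N

    _, a, b = max((abs(cent(cfs[i]) - cent(cfs[j])), i, j)
                  for i in range(len(cfs)) for j in range(i + 1, len(cfs)))
    left, right = [cfs[a]], [cfs[b]]
    for i, cf in enumerate(cfs):
        if i not in (a, b):
            if abs(cent(cf) - cent(cfs[a])) <= abs(cent(cf) - cent(cfs[b])):
                left.append(cf)
            else:
                right.append(cf)
    return left, right
```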

 

A CF-tree rebuild addresses step 3 mentioned above. If the maximum number of cluster summaries is exceeded, increase the threshold, push CFs over into a new, smaller tree, and thereby reduce the number of clusters. Since rebuilding operates on the height h of the tree, at most h extra pages of memory are required.

 

 

3)      Text Segmentation: A segmentation method called TextTiling segments a text into blocks (paragraphs) in accord with topic changes within the text. Another is a cluster-based approach: we could start with relatively short candidate texts of, say, two sentences each, then merge candidates to form clusters so as to maximize the distance between the probability distributions of adjacent segments.
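A rough sketch in the spirit of TextTiling (the bag-of-words blocks and the simple depth test below are simplifying assumptions; the real algorithm uses token sequences, smoothing, and a tuned cutoff): score each gap by the cosine similarity of the adjacent blocks and cut where the score dips well below its flanking peaks.

```python
from collections import Counter
import math

def cosine(a, b):
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def boundaries(blocks, depth=0.1):
    # similarity across each gap between consecutive blocks
    sims = [cosine(blocks[i], blocks[i + 1]) for i in range(len(blocks) - 1)]
    cuts = []
    for i, s in enumerate(sims):
        # depth of the dip relative to the highest peaks on either side
        drop = (max(sims[:i + 1]) - s) + (max(sims[i:]) - s)
        if drop > depth:
            cuts.append(i + 1)  # a topic boundary before block i+1
    return cuts
```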

 

All three steps mentioned above are largely independent of one another and admit variations in the algorithms and measures involved.

 

Conclusion:

                There are several methods for topic analysis, often involving the steps indicated in this document. Implementing these methods yields a product with applications in index generation, topic analysis, summarization, and more.

 

 

References:

  • H. Li and K. Yamanishi. Topic analysis using a finite mixture model. Information Processing and Management, 39(4):521–541, 2003.
  • Christian Wartena and Rogier Brussee. Topic Detection by Clustering Keywords. http://MultimediaN.NL
  • T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: A New Data Clustering Algorithm and Its Applications. Data Mining and Knowledge Discovery. A presentation on the same by Zhao Li was also referenced.
  • Martin Warin. Using WordNet and Semantic Similarity to Disambiguate an Ontology.
  • Annette Hulth. Improved Automatic Keyword Extraction Given More Linguistic Knowledge.
  • S. A. Hojjatoleslami and J. Kittler. Region Growing: A New Approach. IEEE Transactions on Image Processing.

Friday, April 26, 2013

WPF routed events and commands


When controls are added to a form, their events get hooked up in the XAML markup and the handlers are added to the code-behind; routed events are wired up in much the same way. When you right-click the InitializeComponent method and go to its definition, the editor displays the generated code file, where you can see the routed event handlers being added.
Each of the elements in the form is declared in a hierarchy of elements called the logical tree. Most WPF controls are either ItemsControls or ContentControls, and both can have child elements.
Routed events are routed primarily over the visual tree. They support a routing strategy of Bubble, Tunnel, or Direct. Bubble is the most common: events propagate up the visual tree. Tunnel goes in the other direction, starting at the root and going down to the leaf. Direct events behave like normal events and are handled only by the delegate attached to the event.
To walk through how a button click event is processed, consider the following steps. A user initiates a click with a MouseLeftButtonDown event that happens inside the Image element. That event tunnels down from the root to the Image. If no handlers set the Handled flag to true for the preview event, the event then bubbles up from the Image to the button. The button handles the event, sets the Handled flag to true, and raises its own Click event. Even when writing custom composite controls, the button handles these events to which the template is attached. To enable an element to handle events that are declared in a different element, you can attach events by specifying the element and the event in the markup.
Routed commands give a specific mechanism for hooking up UI controls such as toolbar buttons and menu items to handlers without introducing a lot of tight coupling or repetitive code in the application. Among the advantages of routed commands over regular event handlers: first, the source elements are decoupled from the command targets and need no direct references to one another; second, the commands automatically enable or disable all associated UI controls when the handler indicates the command is disabled; lastly, commands let you associate keyboard shortcuts and other forms of input gestures to invoke the command.

Commands are specified with the Command property on the control, which is supported by a number of controls such as MenuItem, Button, RadioButton, CheckBox, and Hyperlink. A CommandBinding is specified on an element that acts as the event handler. The CanExecute and Executed properties of a command binding point to methods in the code-behind. The main point here is that who calls whom is not important.
The CanExecute is called to determine whether or not the command should be enabled. When the executed method is set and the CanExecute is missing, the default behaviour is that the command is enabled.

To make this more concrete and to quickly see the benefits of routed commands, visualize a toolbar with controls that cut and paste selected text between two text boxes. The conventional approach would be to define a click handler that references the two text boxes and calls the appropriate clipboard actions, and this becomes harder and messier when the control is deep inside a custom control. Instead, we could just specify a built-in command to do the same. After you select some text in one of the text boxes, the toolbar button becomes enabled, and it works for any text box anywhere in the UI. Under the covers, these routed commands use routed events to route messages between the command invokers and the command handlers. Once a single command handler is invoked, no other handlers will be called.

There are five built-in command classes in WPF: ApplicationCommands, NavigationCommands, EditingCommands, MediaCommands, and ComponentCommands. Routed commands implement the ICommand interface.

Courtesy: MSDN magazine article by Brian Noyes.

Wednesday, April 24, 2013

R-Tree and clustering to find geographical locations

We discussed the R-Tree previously in this post here. Today we want to use it together with clustering to search restaurants across the nation and find the areas with the highest number of restaurants with the best customer satisfaction ratings for a particular cuisine. Here we will use the R-Tree to store the geographic co-ordinates of the restaurants and to make use of the minimum bounding rectangles that come with it. In addition, we will use agglomerative clustering to group these restaurants into a hierarchy.
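As a toy sketch (the coordinates and helper names are made up, and a real R-Tree library would maintain the bounding rectangles in its nodes rather than recomputing them), the grouping step might look like:

```python
import math

def mbr(points):
    # minimum bounding rectangle of (lat, lon) points,
    # as an R-Tree node would keep for its subtree
    lats = [p[0] for p in points]
    lons = [p[1] for p in points]
    return (min(lats), min(lons), max(lats), max(lons))

def centroid(cluster):
    return (sum(p[0] for p in cluster) / len(cluster),
            sum(p[1] for p in cluster) / len(cluster))

def agglomerate(points, k):
    # bottom-up: start with one cluster per restaurant location and
    # repeatedly merge the two clusters with the closest centroids
    clusters = [[p] for p in points]
    while len(clusters) > k:
        _, a, b = min((math.dist(centroid(clusters[i]), centroid(clusters[j])), i, j)
                      for i in range(len(clusters)) for j in range(i + 1, len(clusters)))
        clusters[a] += clusters[b]
        del clusters[b]
    return clusters
```

Each resulting cluster can then be summarized by its MBR and queried against the R-Tree for the ratings roll-up.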

Tuesday, April 23, 2013

Implementation of WPF event handlers

WPF event handlers can be scoped to the objects that are changing. For example, if there is a collection of items that changes and an event handler is registered for OnChange, then the event handler is invoked for any updates to the items of the collection. However, if the collection reference itself is replaced via new, this event handler subscription is lost. In such a case, it is better to add an event handler to the parent holding the collection, say OnRefreshed. This handler should unsubscribe the old OnChange handler, update the reference to the new collection, and subscribe a new OnChange handler.

Monday, April 22, 2013

Li and Yamanishi stochastic model

Li and Yamanishi propose a stochastic topic model (STM) in which w denotes a word and K a set of topics, with a probability distribution of words P(w|k). The value of P(w|k) is zero if w is not included in topic k, and the probability of w is Sum over k of P(k)P(w|k). One such STM is defined for each block. Word clusters are created from a large data corpus: all words in the vocabulary are treated as seed words, and for each seed word s, the words that co-occur with it significantly frequently are found in the corpus and grouped into a cluster. A match requires that a value deltaSC exceed a threshold gamma and that P(w|s) > P(w). Since the (s, w) pairs are built from the large corpus, they do not have to be rebuilt for every candidate in a given document. From the clustering results, select any cluster whose seed word is among the selected keywords, and merge two clusters ki and kj when the seed word of ki is included in kj and vice versa, merging only if the words belong to different word sets. Lastly, topic segmentation is performed on the n candidates using similarity measures: find a local minimum S(i) between the similarity measures of consecutive candidates, then find the two most distant peaks P1 and P2 adjacent to i on either side such that P1 - S(i) > theta and P2 - S(i) > theta, and segment there.
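A minimal sketch of the final segmentation rule, assuming the similarity scores between consecutive candidates have already been computed (the function name and sample values are illustrative):

```python
def segment_points(sims, theta):
    # sims[i] is the similarity between consecutive candidates i and i+1;
    # cut at local minima whose flanking peaks both rise more than theta above them
    cuts = []
    for i in range(1, len(sims) - 1):
        if sims[i] <= sims[i - 1] and sims[i] <= sims[i + 1]:
            p1 = max(sims[:i])
            p2 = max(sims[i + 1:])
            if p1 - sims[i] > theta and p2 - sims[i] > theta:
                cuts.append(i)
    return cuts
```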

Sunday, April 21, 2013

Birch clustering


BIRCH clustering works in the agglomerative style where the cluster centers are not decided initially. Only the maximum number of cluster summaries and the threshold for any cluster is decided. These two factors are chosen because we want to keep the cluster summaries in memory even for arbitrarily large data sets.
The algorithm reads each of the data points sequentially and proceeds in the following manner:
Step 1: Compute the distance between record r and each of the cluster centers. Let i be the cluster index such that the distance between r and Center is the smallest.
Step 2: For that i'th cluster, recompute the radius as if the record r is inserted into it. If the cluster threshold is not exceeded, we can proceed to the next record. If not, we start a new cluster with only record r
Step 3: Check that the step 2 does not exceed the maximum cluster summaries. If so, increase threshold such that existing clusters can accommodate more records or they can be merged to fewer cluster summaries.

Friday, April 19, 2013

Building a search page for a web application.

Let us assume that we have a three-tier stack for the search page of a web application. Next, let us abstract away the store for the data and focus on the controller and views instead. We will require a view model and some WPF controls. We could consider an editable control that filters the list of entities from the data store on each character input. This control merely has a textbox for search. The view models handle the bindings to the Name attribute of the entities. An autocomplete feature could be added subsequently as a separate handler on each character input. This is possible because the controls can check the IsEditing and IsEditingAllowed flags to decide what to do.
Changes will be made not only to the view model but also to the views. Initially, this search could appear on a separate page, but it is something that could be shared across different views. Moreover, this view could merely hold a single custom control, since the search control and its results can serve as a template for different domain entities.
The control for displaying the search text box will be static, and it can show default hint text. The text control will have a DependencyProperty set up for bi-directional binding. Each time the text changes in the control, the handlers evaluate it against the entities enumerated from the data store. An execute button is not required in this case.
Initially, the search results can all be similar and dependent on the type of the results to be returned.  This is simpler and allows for users to reuse the search across different domain entities.
However, the search results need not be homogeneous and can have hierarchical data or grouping. The results view is independent of the search performed and can display a list of items in a tree format based on some pre-determined layout.
If the data store provider were expanded to include full-text search that returns results across domain entities, then the entities could be displayed in a grid, but they would not be double-clickable as in the simpler version above. The list items shown in the type-based search mentioned earlier each map to a data store item and can be double-clicked to open a details view.
Many of the steps mentioned above may already be implemented in the web application for various other purposes. Since the search results filter the enumeration of data store items by their names against the text input, the functionality is simple; moreover, the search pane and results can follow the same layout as the rest of the application.