Saturday, April 27, 2013

A developer’s guide to topic analysis of documents


Why Topic Analysis?

Topic analysis of a document has many different applications. In my search, I have not found any commercial product that does all of the text processing outlined here. For example, one can construct an index of a text based on topic analysis. Topics can be scoped to whole documents or to individual text blocks. Main topics and sub-topics can be indicated in the index, and a reader can get a sense of the contents by looking at the topic structure. The methods mentioned here have many other applications as well, such as text mining, text summarization, information extraction, and question answering.

Why Developers?

The applications mentioned above have potential for use by millions of users. Text summarization alone has drawn sensational news coverage for at least one product. This document attempts to spell out the implementation details that are otherwise easy to miss among the complex equations and models in the research papers. That said, it derives most or all of its ideas from the papers listed in the references section.

Besides, research focuses on finding metrics and measures to evaluate a document for keyword extraction, topic extraction, or clustering, whereas developers treat these as interchangeable factors and algorithms. Developers also focus on fast and efficient implementations, for example of clustering. While any text mining document references clustering, a popular technique for information retrieval from documents, this document introduces clustering techniques to developers while putting them to use for the goal of topic analysis.

Moreover, developers will find that many different implementations are possible by simply interchanging algorithms and methods. Different techniques may need to be tried out, and none of them yields a perfect solution; the research literature describes the underlying problem as a non-convex optimization.

What are the steps involved?

The text processing involves the following steps:

1)      Keyword detection. There are several well-defined methods for keyword extraction.

One of these uses a measure called Term Frequency–Inverse Document Frequency (TF-IDF). A TF-IDF value is calculated for each word in a text, and the words with the largest TF-IDF values are chosen as keywords. Another way to select words is by their Shannon information, which grows with the term frequency and with the rarity (the negative log-probability) of the word in the corpus data. The corpus is a collection of documents that have already been parsed and tagged and that represents a comprehensive range of topics.
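As a concrete illustration, TF-IDF can be computed directly from token counts. Here is a minimal sketch; the toy docs list is made up, and the particular TF and IDF variants are just one common choice among many:

import math
from collections import Counter

def tf_idf(docs):
    """docs: list of tokenized documents. Returns, per document,
    a dict of word -> TF-IDF score."""
    n = len(docs)
    df = Counter()                 # number of documents containing each word
    for doc in docs:
        df.update(set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({w: (tf[w] / len(doc)) * math.log(n / df[w]) for w in tf})
    return scores

docs = [["topic", "analysis", "of", "text"],
        ["clustering", "of", "keywords"],
        ["keyword", "extraction", "from", "text"]]
for doc_scores in tf_idf(docs):
    print(sorted(doc_scores.items(), key=lambda kv: -kv[1])[:2])

Words that appear in every document get an IDF of zero, so they are never picked as keywords.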

There are other similarity or distance measures as well. For example, similarity can be based on the statistical distribution of words in a corpus of texts. These measures also lead to a weighted sum of terms, with the crucial difference that the weights are non-negative and sum to 1, which gives them a natural interpretation as conditional probabilities.

Python's NLTK framework lets us experiment with such probabilistic classification; the following sample code is adapted from its online documentation.

import nltk

# documents is a list of (tokenized_text, category) pairs, and
# document_features(d) maps a document to a feature dictionary,
# both as defined in NLTK's document classification example.
featuresets = [(document_features(d), c) for (d, c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

 

However, our approach will take as input the occurrence and co-occurrence data of terms, and produce as output a set of keyword clusters that indicate the topics of the document. We can obtain this data from a corpus that ships with a natural language processing toolkit, as in the sketch below.
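For instance, with NLTK the occurrence and co-occurrence counts can be collected from the Brown corpus. This is a minimal sketch under the assumption that co-occurrence means appearing in the same sentence; other window sizes work too:

from collections import Counter
from itertools import combinations
from nltk.corpus import brown          # requires: nltk.download('brown')

occurrence = Counter()
cooccurrence = Counter()
for sent in brown.sents(categories='news'):
    words = set(w.lower() for w in sent if w.isalpha())
    occurrence.update(words)                            # per-word counts
    cooccurrence.update(combinations(sorted(words), 2)) # per-pair counts

print(occurrence.most_common(5))
print(cooccurrence.most_common(5))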

 

2)      Topic analysis by Clustering:

Clustering involves grouping data points together based on their co-occurrences or their similarity measures. If a cluster with a seed word w contains a co-occurring word w1, and another cluster with the seed word w1 contains the word w, we can merge the two clusters (a sketch of this merge rule follows below). When we cluster based on similarity measures, we look for the similarity of a word to the cluster centers. We could start with a finite set of clusters and assign data points to them, adjusting their centers as points are added. Alternatively, we can create a tree of clusters that is traversed for each data point to find the appropriate cluster to insert it into. When building a hierarchical clustering, we can either treat all data points as one cluster and split it into sub-clusters repeatedly in a top-down manner, or start with one data point per cluster and merge clusters repeatedly in a bottom-up manner.
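Here is a minimal sketch of the seed-word merge rule, assuming clusters are kept as a dict from seed word to its set of co-occurring words (the names and the toy data are hypothetical):

def merge_mutual_clusters(clusters):
    # Merge cluster(w1) into cluster(w) whenever each cluster
    # contains the other's seed word.
    merged = dict(clusters)
    for w in list(merged):
        for w1 in list(merged.get(w, ())):
            if w1 in merged and w in merged[w1] and w != w1:
                merged[w] = (merged[w] | merged.pop(w1) | {w1}) - {w}
    return merged

clusters = {"bank": {"money", "loan"}, "money": {"bank", "cash"}}
print(merge_mutual_clusters(clusters))   # {'bank': {'money', 'loan', 'cash'}}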

 

There is a fast and efficient clustering algorithm called BIRCH, which stands for Balanced Iterative Reducing and Clustering using Hierarchies. It works in the agglomerative style, where the cluster centers are not decided up front; only the maximum number of cluster summaries and a threshold for any single cluster are fixed. These two parameters are chosen because we want to keep the cluster summaries in memory even for arbitrarily large data sets.

 

The algorithm reads the data points sequentially and proceeds in the following manner:

Step 1: Compute the distance between record r and each of the cluster centers. Let i be the index of the cluster whose center is closest to r.

Step 2: Recompute the radius of the i-th cluster as if record r were inserted into it. If the cluster's threshold is not exceeded, insert r and proceed to the next record. Otherwise, start a new cluster containing only record r.

Step 3: Check that step 2 has not pushed the number of cluster summaries past the maximum. If it has, increase the threshold so that existing clusters can absorb more records or can be merged into fewer cluster summaries.
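Below is a minimal sketch of this loop for numeric vectors. Cluster centers are tracked as running means, the distance to the center stands in for the recomputed radius, and step 3 is simplified to merging the two closest clusters instead of raising the threshold; the parameter values are illustrative:

import math

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cluster(records, threshold=1.0, max_clusters=10):
    centers, counts = [], []                 # running means and cluster sizes
    for r in records:
        if centers:
            i = min(range(len(centers)), key=lambda j: distance(r, centers[j]))
            if distance(r, centers[i]) <= threshold:
                # Step 2: absorb r into cluster i, updating its running mean.
                counts[i] += 1
                centers[i] = [m + (x - m) / counts[i]
                              for m, x in zip(centers[i], r)]
                continue
        centers.append(list(r))              # start a new cluster with only r
        counts.append(1)
        if len(centers) > max_clusters:
            # Step 3 (simplified): merge the two closest clusters.
            i, j = min(((i, j) for i in range(len(centers))
                        for j in range(i + 1, len(centers))),
                       key=lambda p: distance(centers[p[0]], centers[p[1]]))
            n = counts[i] + counts[j]
            centers[i] = [(counts[i] * x + counts[j] * y) / n
                          for x, y in zip(centers[i], centers[j])]
            counts[i] = n
            del centers[j], counts[j]
    return centers, counts

print(cluster([[0.0], [0.2], [5.0], [5.1]], threshold=0.5, max_clusters=2))
# roughly ([[0.1], [5.05]], [2, 2])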

 

The data structure used in BIRCH is the clustering feature tree (CF-tree). Each entry in the CF-tree represents a cluster of objects and is characterized by a 3-tuple (N, LS, SS), where N is the number of data points, LS is the linear sum of the N data points, and SS is the sum of squares of the N data points. CF entries are additive, so clusters can be merged incrementally and consistently.
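A minimal sketch of a CF entry and its additive merge, with LS kept as a vector and SS as a scalar sum of squared components; the radius here is the average distance of points from the centroid, derived purely from (N, LS, SS):

import math

class CF:
    """A clustering feature (N, LS, SS) for a set of points."""
    def __init__(self, point):
        self.n = 1
        self.ls = list(point)                  # linear sum of the points
        self.ss = sum(x * x for x in point)    # sum of squares

    def merge(self, other):
        # Additivity: the CF of a merged cluster is the component-wise sum.
        self.n += other.n
        self.ls = [a + b for a, b in zip(self.ls, other.ls)]
        self.ss += other.ss

    def centroid(self):
        return [x / self.n for x in self.ls]

    def radius(self):
        c = self.centroid()
        return math.sqrt(max(self.ss / self.n - sum(x * x for x in c), 0.0))

a, b = CF([1.0, 2.0]), CF([3.0, 4.0])
a.merge(b)
print(a.n, a.centroid(), a.radius())   # 2 [2.0, 3.0] ~1.414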

 

In a CF-tree, each non-leaf node has at most B entries, where B is the branching factor. Each leaf node has at most L CF entries, each of which satisfies the threshold T. The node size for both leaf and non-leaf nodes is determined by the input page size P. A leaf node is a collection of CF triples together with links to the next and previous leaf nodes. The operations on the tree are CF-tree insertion and rebuilding.

 

A CF-tree insertion proceeds like this:

First, traverse down from the root recursively to find the appropriate leaf. Next, modify the leaf: if its closest CF entry can absorb the new point within the threshold, update that entry; otherwise make a new CF entry. If there is no room in the leaf for the new entry, split the leaf and proceed up the tree, splitting parent nodes as necessary. Lastly, traverse back up and update the CF entries on the path to any split nodes. To split a node, take the two farthest CF entries as seeds of two new nodes and put each remaining entry (including the new ones that caused the split) into the closer node, as in the sketch below.
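A minimal sketch of the split step, given the entries of an overfull node and a distance function between CF entries (for example, the distance between their centroids):

def split(entries, dist):
    # Seed two new nodes with the two farthest CF entries, then
    # assign each remaining entry to the closer seed.
    a, b = max(((a, b) for i, a in enumerate(entries)
                for b in entries[i + 1:]),
               key=lambda p: dist(p[0], p[1]))
    left, right = [a], [b]
    for e in entries:
        if e is not a and e is not b:
            (left if dist(e, a) <= dist(e, b) else right).append(e)
    return left, right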

 

A CF-tree rebuild addresses step 3 above: if the maximum number of cluster summaries is exceeded, the threshold is increased and CF entries are pushed into coarser groups, reducing the number of clusters in the tree. Since rebuilding operates along the height h of the tree, at most h extra pages of memory are required.

 

 

3)      Text Segmentation: A segmentation method called TextTiling segments a text into blocks (paragraph-like units) in accord with topic changes within the text. Another option is a cluster-based approach: we could start with short candidates of, say, one or two sentences each, then merge candidates into clusters so as to maximize the distance between the probability distributions of adjacent segments. A sketch using NLTK's TextTiling implementation follows below.
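NLTK ships an implementation of TextTiling, so a first experiment with segmentation can be quite short. This minimal sketch is adapted from the demo bundled with NLTK; the stopwords and brown corpora must be downloaded first, and the input text needs blank-line paragraph breaks:

from nltk.corpus import brown                   # requires: nltk.download('brown')
from nltk.tokenize import TextTilingTokenizer   # uses nltk.download('stopwords')

tt = TextTilingTokenizer()
segments = tt.tokenize(brown.raw()[:10000])
for i, segment in enumerate(segments):
    print(i, segment[:60].replace("\n", " "))   # preview each topical segment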

 

All three steps mentioned above are almost independent of one another, and each admits variations in the algorithms and measures involved.

 

Conclusion:

There are several methods for topic analysis, often involving the steps indicated in this document. Implementing these methods yields a product with applications in index generation, topic analysis, summarization, and more.

 

 

References:

  • H. Li and K. Yamanishi. Topic analysis using a finite mixture model. Information Processing and Management, 39(4):521–541, 2003.
  • Christian Wartena and Rogier Brussee. Topic Detection by Clustering Keywords. http://MultimediaN.NL
  • T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: A New Data Clustering Algorithm and Its Applications. Data Mining and Knowledge Discovery. A presentation on the same paper by Zhao Li was also referenced.
  • Martin Warin. Using WordNet and Semantic Similarity to Disambiguate an Ontology.
  • Annette Hulth. Improved Automatic Keyword Extraction Given More Linguistic Knowledge.
  • S. A. Hojjatoleslami and J. Kittler. Region Growing: A New Approach. IEEE Transactions on Image Processing.
