Why Topic Analysis?
Topic analysis of a document has many different
applications. In my search, I have not found any commercial product that does all
of the text processing outlined here. For example, one can construct an
index of a text based on topic analysis. Topics can be scoped to whole
documents or to individual text blocks. Main topics and sub-topics can be indicated in
the index, and one can get a sense of the contents by looking at the topic
structure. The methods mentioned here have many applications, such as text
mining, text summarization, information extraction, question answering, and
other text processing.
Why Developers?
The applications mentioned above have potential for use by
millions of users. Text summarization alone has earned sensational news coverage for
a product. This document attempts to spell out the implementations involved, which are
otherwise easy to miss among the complex equations and models in the
research papers. That said, this document derives most or all of its ideas from
the research papers mentioned in the references section.
Besides, research focuses on finding metrics and measures to
evaluate a document for keyword extraction, topic extraction, or clustering, whereas
developers treat them as factors or algorithms that can be interchanged.
Developers also focus on fast and efficient implementations, such as for
clustering. While any text mining document references clustering, which
is a popular technique for information retrieval from documents, this document introduces
clustering techniques to developers while making use of them in the
context of the goal of topic analysis.
Moreover, developers will find that many different
implementations are possible by simply interchanging algorithms and methods. Different
techniques may need to be tried out, and none of them yields a
perfect solution; research describes this domain as a non-convex optimization problem.
What are the steps involved?
The text processing involves the following steps:
1) Keyword detection: There are several well-defined methods for
keyword extraction.
One of these uses a measure called Term
Frequency–Inverse Document Frequency (also known as TF-IDF). A TF-IDF value
is calculated for each word in a text, and the words with the largest TF-IDF
values are chosen as keywords. Another way to select words is to compute
their Shannon information, which is proportional to the term frequency and the
probability of the occurrence of the word in the corpus data. The corpus data
is a collection of documents that have already been parsed and tagged and
represents a comprehensive collection of topics.
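To make the first method concrete, here is a minimal TF-IDF sketch in plain Python; the toy documents and the choice of the top three words are assumptions made only for the example.

import math
from collections import Counter

def tfidf_keywords(documents, top_n=3):
    # documents is a list of token lists; returns the top_n
    # highest-scoring words of each document as keyword candidates
    n_docs = len(documents)
    df = Counter()  # document frequency: how many documents contain each word
    for doc in documents:
        df.update(set(doc))
    keywords = []
    for doc in documents:
        tf = Counter(doc)
        scores = {w: (tf[w] / len(doc)) * math.log(n_docs / df[w]) for w in tf}
        keywords.append(sorted(scores, key=scores.get, reverse=True)[:top_n])
    return keywords

docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "clustering groups similar data points together".split(),
]
print(tfidf_keywords(docs))

Words that occur in every document receive an inverse document frequency of zero, so ubiquitous words naturally drop out of the keyword list.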
There are other similarity or distance
measures as well. For example, similarity can be based on the statistical
distribution of words in a corpus of texts. These measures lead to a weighted sum of
terms, with the crucial difference that non-negative weights summing to 1 are
allowed, which have a natural interpretation as conditional probabilities.
Python's NLTK framework allows us to
experiment with these conditional probabilities, as in the following sample code
adapted from its online documentation [1]:

import nltk

# documents is a list of (tokenized_text, category) pairs and
# document_features maps a text to a feature dictionary, as in the
# document-classification example of the NLTK book
featuresets = [(document_features(d), c) for (d, c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
However, our approach will be based on an
input of occurrence and co-occurrence data for terms, and an output of a
set of keywords in a cluster that indicates the topic of the document. We will
get this data from a corpus of text that ships with a natural language
processing toolkit.
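As a sketch of how such occurrence and co-occurrence data could be gathered, the following counts word pairs within a sliding window over NLTK's Brown corpus; the window size of 5 and the 5,000-word slice are arbitrary assumptions made for the example.

from collections import Counter
from itertools import combinations
from nltk.corpus import brown  # run nltk.download('brown') once beforehand

window = 5  # assumed co-occurrence window size
words = [w.lower() for w in brown.words()[:5000] if w.isalpha()]
occurrence = Counter(words)
cooccurrence = Counter()
for i in range(len(words) - window + 1):
    span = set(words[i:i + window])
    cooccurrence.update(combinations(sorted(span), 2))

print(occurrence.most_common(3))
print(cooccurrence.most_common(3))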
2) Topic analysis by clustering:
Clustering involves grouping data points
together based on their co-occurrences or on their similarity measures. If a
cluster with a seed word w contains a co-occurring word w1, and another cluster
with the seed word w1 contains w, we can merge the two clusters, as in the
sketch below. When we cluster based on similarity measures,
we look for the similarity of a word to the cluster centers. We could start with a finite set of clusters
and group data points into them, adjusting their centers as data points
are added. Alternatively, we can create a tree of clusters that is
traversed for each data point to find the appropriate cluster into which the data
point can be inserted. When building a hierarchical clustering, we can treat all the
data points as one cluster and split it into sub-clusters repeatedly in a top-down
manner, or merge clusters repeatedly from one data point per cluster in a
bottom-up manner.
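Here is a minimal sketch of the seed-word merge rule described above, assuming each cluster is represented as a mapping from its seed word to the set of words it contains; the representation and the toy clusters are assumptions made for the example.

def merge_mutual_clusters(clusters):
    # merge cluster(w1) into cluster(w) when each seed word
    # appears in the other seed's cluster
    for w in list(clusters):
        for w1 in list(clusters):
            if w != w1 and w1 in clusters.get(w, set()) and w in clusters.get(w1, set()):
                clusters[w] |= clusters.pop(w1)
    return clusters

clusters = {
    "music": {"music", "song", "band"},
    "song": {"song", "music", "lyrics"},
    "data": {"data", "cluster"},
}
print(merge_mutual_clusters(clusters))
# the "music" and "song" clusters merge; "data" stays separate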
A fast and efficient clustering algorithm is
BIRCH, which stands for Balanced Iterative Reducing and Clustering
using Hierarchies. This clustering works in an agglomerative style where the
cluster centers are not decided initially; only the maximum number of cluster
summaries and the threshold for any cluster are decided. These two factors are
chosen because we want to keep the cluster summaries in memory even for
arbitrarily large data sets.
The algorithm reads the data points
sequentially and proceeds in the following manner:
Step 1: Compute the distance between record
r and each of the cluster centers. Let i be the index of the cluster whose
center is closest to r.
Step 2: For that i-th cluster, re-compute
the radius as if record r were inserted into it. If the cluster threshold is
not exceeded, insert r and proceed to the next record. Otherwise, start a new cluster
with only record r.
Step 3: Check that step 2 does not
exceed the maximum number of cluster summaries. If it does, increase the threshold so that
existing clusters can accommodate more records or can be merged into fewer
cluster summaries.
The data structure used in BIRCH is the clustering
feature tree (CF-tree). Each entry in the CF-tree represents a cluster of
objects and is characterized by a 3-tuple (N, LS, SS), where N is the number
of data points, LS is the linear sum of the N data points, and SS is the sum of
squares of the N data points. CF entries are additive, so clusters can be
merged incrementally and consistently.
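A minimal sketch of a CF entry and its additivity, treating SS as a scalar sum of squared norms as in the BIRCH paper; the class and method names are assumptions made for the example.

import math

class CF:
    # clustering feature (N, LS, SS) summarizing a cluster
    def __init__(self, point):
        self.n = 1                           # N: number of points
        self.ls = list(point)                # LS: linear sum (a vector)
        self.ss = sum(x * x for x in point)  # SS: sum of squared norms

    def merge(self, other):
        # additivity: CF1 + CF2 summarizes the merged cluster
        self.n += other.n
        self.ls = [a + b for a, b in zip(self.ls, other.ls)]
        self.ss += other.ss

    def radius(self):
        # average distance of the member points from the centroid LS/N
        centroid_sq = sum(x * x for x in self.ls) / (self.n * self.n)
        return math.sqrt(max(self.ss / self.n - centroid_sq, 0.0))

a, b = CF((1.0, 2.0)), CF((3.0, 4.0))
a.merge(b)
print(a.n, a.ls, a.ss, round(a.radius(), 3))  # 2 [4.0, 6.0] 30.0 1.414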
In a CF-tree, each non-leaf node has at
most B entries, where B stands for the branching factor. Each leaf node has at most
L CF entries, each of which satisfies threshold T. The node size for both leaf and non-leaf nodes is
determined by the input page size P. A leaf node is a collection of CF triples
together with links to the next and previous leaf nodes. Operations include CF-tree insertion and
rebuilding.
A CF-tree insertion proceeds like this:
first, traverse down from the root recursively to
find the appropriate leaf. Next, modify the leaf: if the closest CF entry can
absorb the new point within the threshold, update it; otherwise make a new CF entry.
If there is no room for the new entry, split the leaf node, and proceed up the
tree splitting parent nodes as necessary. Lastly, traverse back up and
update the CFs on the path or split nodes. To split a node, take the two
farthest CFs as seeds for two new nodes and put each remaining CF (including the
new one that caused the split) into the closer node.
CF-tree rebuilding addresses Step 3
mentioned above. If the maximum number of clusters is exceeded, increase the threshold and
push the CFs over into a new, smaller tree, reducing the number of tree clusters. Since rebuilding
operates on the height h of the tree, at most h extra pages of memory are
required.
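To tie Steps 1 to 3 together, here is a deliberately simplified sketch that keeps a flat list of CF entries (reusing the CF class above) rather than a full CF-tree, and merges the newest summary into its nearest neighbor instead of rebuilding; the 1.5 threshold growth factor is an arbitrary assumption.

def birch_pass(points, threshold, max_summaries):
    def dist_to(cf, p):
        c = [x / cf.n for x in cf.ls]  # centroid LS/N
        return sum((a - b) ** 2 for a, b in zip(p, c))

    clusters = []
    for p in points:
        if clusters:
            # Step 1: nearest cluster center to record p
            nearest = min(clusters, key=lambda cf: dist_to(cf, p))
            # Step 2: insert only if the radius stays under threshold
            trial = CF(p)
            trial.merge(nearest)
            if trial.radius() <= threshold:
                nearest.merge(CF(p))
                continue
        clusters.append(CF(p))
        # Step 3: over budget -> raise the threshold and merge the
        # newest summary into its nearest neighbor (the paper
        # rebuilds the CF-tree here instead)
        if len(clusters) > max_summaries:
            threshold *= 1.5
            new = clusters.pop()
            c = [x / new.n for x in new.ls]
            min(clusters, key=lambda cf: dist_to(cf, c)).merge(new)
    return clusters

points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9), (0.2, 0.1)]
print(len(birch_pass(points, threshold=1.0, max_summaries=3)))  # 2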
3) Text segmentation: A segmentation
method called Text Tiling generally segments a text into blocks (paragraphs) in
accord with topic changes within the text. Another is a cluster-based approach: we could
start with candidate segments of relatively short text, say two sentences each,
and then merge candidates to form clusters so as to maximize the distance
between the probability distributions of adjacent segments.
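As one possible sketch (far simpler than full Text Tiling), the following places a segment boundary wherever the lexical overlap between adjacent sentences drops below a cutoff; the cosine measure and the 0.1 cutoff are assumptions made for the example.

import math
from collections import Counter

def segment(sentences, cutoff=0.1):
    # sentences is a list of token lists; a boundary is placed
    # where adjacent sentences share too little vocabulary
    def cosine(a, b):
        ca, cb = Counter(a), Counter(b)
        dot = sum(ca[w] * cb[w] for w in ca)
        norm = math.sqrt(sum(v * v for v in ca.values()))
        norm *= math.sqrt(sum(v * v for v in cb.values()))
        return dot / norm if norm else 0.0

    bounds = [0]
    for i in range(1, len(sentences)):
        if cosine(sentences[i - 1], sentences[i]) < cutoff:
            bounds.append(i)
    return [sentences[s:e] for s, e in zip(bounds, bounds[1:] + [len(sentences)])]

sentences = [
    "the cat chased the mouse".split(),
    "the mouse hid from the cat".split(),
    "stock prices fell sharply today".split(),
    "investors sold shares as prices fell".split(),
]
print(len(segment(sentences)))  # 2 segments: animals, then finance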
All three steps mentioned above are
almost independent of each other and admit variations in the
algorithms and measures involved.
Conclusion:
There are several methods for
topic analysis, often involving the steps indicated in this document.
Implementing these methods prepares a product with applications in index
generation, topic analysis, summarization, and more.
References:
- H. Li and K. Yamanishi. Topic Analysis Using a Finite Mixture Model. Information Processing & Management, 39(4):521–541, 2003.
- Christian Wartena and Rogier Brussee. Topic Detection by Clustering Keywords. http://MultimediaN.NL
- T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: A New Data Clustering Algorithm and Its Applications. Data Mining and Knowledge Discovery. A presentation on the same by Zhao Li was also referenced.
- Martin Warin. Using WordNet and Semantic Similarity to Disambiguate an Ontology.
- Annette Hulth. Improved Automatic Keyword Extraction Given More Linguistic Knowledge.
- S. A. Hojjatoleslami and J. Kittler. Region Growing: A New Approach. IEEE Transactions on Image Processing.