Tuesday, May 21, 2013

NLTK classifier modules
The NLTK decision tree module is a classifier model. A decision tree comprises non-terminal nodes that test conditions on feature values and terminal nodes that hold the labels. Given a token, the tree evaluates to a label.
The module requires feature names and labeled feature sets. Thresholds such as the depth cutoff, entropy cutoff and support cutoff can also be specified. Entropy measures the degree of randomness or variation in the labels, while support refers to the number of feature sets a node is based on.
The term feature refers to some property of an unlabeled token. Typically a token is a word from a text that we have not seen before. If the text has been seen before and is already labeled, it is a training set. A training set trains our model so that we can better pick labels for the tokens we encounter. As an example, proper nouns for names may be labeled male or female. We start with a large collection of already tagged names, which we call the training data. We build a model that says, for instance, that if a name ends with a certain set of suffixes, it is a male name. We then run the model on the training data to see how accurate it is and adjust it to improve accuracy. Finally we run the model on test data: if a name from the test data is labeled male by this model, we know how likely that label is to be correct.
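A rough sketch of this workflow, using the NLTK names corpus and suffix features; the feature names and cutoff values here are just assumptions for the example, not something prescribed above:

    import random
    import nltk
    from nltk.classify import DecisionTreeClassifier
    from nltk.corpus import names   # may require nltk.download('names')

    def gender_features(name):
        # input-features for one token: its final one- and two-letter suffixes
        return {'suffix1': name[-1:], 'suffix2': name[-2:]}

    labeled_names = ([(n, 'male') for n in names.words('male.txt')] +
                     [(n, 'female') for n in names.words('female.txt')])
    random.shuffle(labeled_names)

    featuresets = [(gender_features(n), g) for (n, g) in labeled_names]
    train_set, test_set = featuresets[500:], featuresets[:500]

    # entropy_cutoff, depth_cutoff and support_cutoff are the thresholds
    # mentioned above; these particular values are arbitrary.
    classifier = DecisionTreeClassifier.train(train_set, entropy_cutoff=0.05,
                                              depth_cutoff=20, support_cutoff=10)

    print(classifier.classify(gender_features('Neo')))
    print(nltk.classify.accuracy(classifier, test_set))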
Properties of labeled tokens are also helpful. We call these joint-features and distinguish them from the features we just talked about, which we refer to as input-features. Joint-features belong to the training data while input-features belong to the test data. For some classifiers, such as the maxent classifier, these are called features and contexts respectively. Maxent stands for the maximum entropy model, in which joint-features are required to have numeric values and input-features are mapped to a set of joint-features.
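A small sketch of that mapping, using toy data (the feature name and values here are invented for illustration):

    from nltk.classify import MaxentClassifier

    # Each item pairs an input-feature dict with a label. Internally the trainer
    # turns each (feature name, value, label) combination into a numeric
    # joint-feature and learns a weight for it.
    train = [({'last_letter': 'a'}, 'female'),
             ({'last_letter': 'k'}, 'male'),
             ({'last_letter': 'a'}, 'female'),
             ({'last_letter': 'o'}, 'male')]

    maxent = MaxentClassifier.train(train, algorithm='iis', trace=0, max_iter=10)
    print(maxent.classify({'last_letter': 'a'}))   # expected: 'female'
    maxent.show_most_informative_features(4)       # learned joint-feature weights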
There are other types of classifiers as well. For example, the mallet module wraps the external Mallet machine learning package, and the megam module uses the external megam maxent optimization package. The naive Bayes module assigns a probability to each label: P(label|features) is computed as P(label) * P(features|label) / P(features). The 'naive' assumption is that all features are independent. The positivenaivebayes module is a variant of the naive Bayes classifier that performs binary classification over two complementary classes when we have labeled examples for only one of them. Then there are classifiers built around the corpus they are trained on. The rte_classify module is a simple classifier for the RTE corpus; it calculates the overlap in words and named entities between the text and the hypothesis. NLTK also provides a scikitlearn module that wraps classifiers from the scikit-learn machine learning library.
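A minimal sketch of the naive Bayes computation above, with toy sentiment-style features (the feature names and data are made up for the example):

    from nltk.classify import NaiveBayesClassifier

    # The classifier estimates P(label) and P(feature|label) from counts,
    # then applies Bayes' rule under the independence assumption.
    train = [({'contains_good': True,  'contains_bad': False}, 'pos'),
             ({'contains_good': True,  'contains_bad': False}, 'pos'),
             ({'contains_good': False, 'contains_bad': True},  'neg'),
             ({'contains_good': False, 'contains_bad': True},  'neg')]

    nb = NaiveBayesClassifier.train(train)
    dist = nb.prob_classify({'contains_good': True, 'contains_bad': False})
    for label in dist.samples():
        print(label, round(dist.prob(label), 3))   # posterior P(label | features)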

NLTK cluster package
This module contains a number of basic clustering algorithms. Clustering is unsupervised machine learning that groups similar items within a large collection. The module includes k-means clustering, E-M clustering and group average agglomerative clustering (GAAC).

K-means clustering starts with k arbitrarily chosen means and assigns each vector to the cluster with the closest mean. The centroids are then recalculated as the means of the vectors assigned to each cluster, and the process repeats until the clusters stabilize. This may converge to a local maximum, so the method is repeated with other random initial means and the most commonly occurring output is chosen.

Gaussian EM clustering starts with k arbitrarily chosen means, prior probabilities and covariance matrices, which form the parameters of the Gaussian sources. The membership probability of each vector in each cluster is then calculated; this is the E-step. The parameters are then updated in the M-step using maximum likelihood estimates derived from the cluster membership probabilities. The process continues until the likelihood of the data no longer increases significantly.

GAAC clustering starts with each of the N vectors as a singleton cluster and iteratively merges the pair of clusters with the closest centroids until only one cluster remains. The order of merges is useful for recovering a clustering with a given number of clusters, because earlier merges sit at lower depths in the resulting tree.
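A minimal k-means sketch with the nltk.cluster package, using made-up 2-D vectors; in real use these would be document or word feature vectors:

    import numpy
    from nltk.cluster import KMeansClusterer, euclidean_distance

    vectors = [numpy.array(v) for v in
               [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]]

    # repeats=10 reruns k-means from different random initial means,
    # reducing sensitivity to a poor initial choice.
    clusterer = KMeansClusterer(2, euclidean_distance, repeats=10)
    assignments = clusterer.cluster(vectors, assign_clusters=True)

    print(assignments)        # cluster index for each input vector
    print(clusterer.means())  # final centroids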
