Wednesday, June 26, 2013

Comparison of decision tree versus vector space model for document databases
A decision tree works well on structured data where each tuple is defined by a fixed set of attribute-value pairs. Tree induction determines which features are the most discriminating and builds the tree top-down, splitting on those features first. Once the tree is constructed, it is applied to records one by one. Alternatively, the rules can be extracted and ordered in a flat list by their discriminative power and occurrence frequency. Either way, both the attribute-value pairs and the order of evaluation are fixed before any new data is evaluated. This makes classification efficient for large data sets, since each data object is simply walked down the tree to find the label to assign.
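As a minimal sketch of this workflow, assuming scikit-learn is available (the feature names and data below are made up for illustration):

from sklearn.tree import DecisionTreeClassifier, export_text

# each tuple is a fixed set of attribute-value pairs: (age, income, owns_home)
X = [[25, 40000, 0], [35, 60000, 1], [45, 80000, 1], [22, 20000, 0]]
y = ["deny", "approve", "approve", "deny"]  # one label per tuple

# induce the tree top-down; the most discriminating feature splits first
tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)

# the structure is fixed before new data arrives, so it can be printed
# out, inspected, and then applied to each record one by one
print(export_text(tree, feature_names=["age", "income", "owns_home"]))
print(tree.predict([[30, 55000, 1]]))  # classify a new record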
But for document databases, this is not the case. The set of attributes or dimensions is not fixed, and some attributes may be more relevant than others. The vector space model works well in such a high-dimensional space because it maps the term space to quantitative vectors, for example with tf-idf weights. However, the vector space model may assign improper weights to rare terms, since it does not take the distribution characteristics of other data points into account. This can be addressed with a preprocessing step called feature selection, which removes the terms in the training documents that are statistically uncorrelated with the class labels.
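Both steps can be sketched with scikit-learn's tf-idf vectorizer and chi-square feature selection (a hedged example; the documents, labels, and value of k are invented for illustration):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["the quarterly earnings beat analyst estimates",
        "the team won the championship game",
        "shares fell sharply after the earnings call",
        "the coach praised the winning team"]
labels = ["finance", "sports", "finance", "sports"]

# map variable-length documents into fixed-length tf-idf vectors
vectors = TfidfVectorizer(stop_words="english").fit_transform(docs)

# feature selection: keep only the terms most correlated with the
# class labels, dropping rare or uninformative predictors
selected = SelectKBest(chi2, k=4).fit_transform(vectors, labels)
print(selected.shape)  # 4 documents x 4 surviving terms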
To recap, here are the advantages and disadvantages of using decision trees versus the vector space model.
 
Decision Tree

Advantages:
· Useful for structured data
· Applicable to hierarchical categorical distinctions
· Automated tree induction is straightforward
· Accuracy at each level can be determined before growing the tree
· The tree can be printed out and interpreted
· The amount of training data shrinks as we descend the decision stumps

Disadvantages:
· Forces features to be checked in a specific order, even when they act relatively independently
· Does not work well with weak predictors of the correct label

Vector space model

Advantages:
· Works with variable-length sets of attribute-value pairs
· Allows a quantitative representation of data points
· Allows all features to act in parallel
· Requires no ordering of features
· Overcomes weak predictors with feature selection

Disadvantages:
· Disregards class distributions
· May assign large weights to rare items
· Can be extremely slow

For example, the following code attempts to find the topic of a given text as a single keyword.

import nltk
import nltk.corpus
from nltk.text import TextCollection
from nltk.cluster import KMeansClusterer, euclidean_distance
from numpy import array

# method to get the topic of a given text
# (requires the NLTK punkt tokenizer and Brown corpus, via nltk.download)
def getTopic(text):
    # clean input: lowercase, tokenize, drop stopwords and stray punctuation
    stop = set(open('stopwords.txt').read().split())
    src = [w.strip(" .,?!") for w in nltk.word_tokenize(text.lower())
           if w not in stop]

    # candidate keywords: the ten most frequent words longer than three letters
    fdist = nltk.FreqDist(w for w in src if len(w) > 3)
    candidates = [w for w, _ in fdist.most_common(10)]

    # initialize vectors: score each candidate by tf-idf, taking term
    # frequency from the input text and document frequency from Brown
    brown = TextCollection(nltk.corpus.brown)
    scores = [brown.tf_idf(w, src) for w in candidates]
    vectors = [array([s]) for s in scores]

    # initialize the clusterer and assign each candidate to one of 3 clusters
    clusterer = KMeansClusterer(3, euclidean_distance, avoid_empty_clusters=True)
    labels = clusterer.cluster(vectors, True)

    # pick the candidate closest to the center of the largest cluster
    largest = max(set(labels), key=labels.count)
    center = clusterer.means()[largest]
    members = [i for i, c in enumerate(labels) if c == largest]
    best = min(members, key=lambda i: abs(vectors[i][0] - center[0]))
    print(candidates[best])
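For instance, assuming the NLTK data above has been downloaded and that stopwords.txt holds one stopword per line:

getTopic("The spacecraft entered orbit around Mars after a seven month "
         "journey and began returning telemetry to mission controllers.")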
 
