Let's say we want an implementation for detecting keywords in text using a semantic lookup and clustering-based data mining approach. The data structures and the steps of the control flow are explained below.
In general for any such implementation, the steps are as follows:
The raw data undergoes a data selection step based on stemming and part-of-speech tagging.
The data is cleaned to remove stop words (a sketch of these first two steps follows below).
In the data mining step, we extract the keywords by evaluating each word, building the CF tree and/or running a decision tree classifier.
In the evaluation step, we filter and present the results.
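As a minimal sketch of the selection and cleaning steps, we could use NLTK's Porter stemmer, its default POS tagger, and its English stop word list; these particular components and the helper name select_and_clean are our assumptions for illustration, and any equivalent components would do:
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Requires the 'punkt', 'averaged_perceptron_tagger' and 'stopwords' NLTK data packages.
def select_and_clean(text):
    stemmer = PorterStemmer()
    stop_words = set(stopwords.words('english'))
    tagged = nltk.pos_tag(nltk.word_tokenize(text))   # data selection: POS tagging
    # Data cleaning: drop stop words and non-alphabetic tokens, then stem what remains.
    return [(stemmer.stem(word.lower()), tag)
            for word, tag in tagged
            if word.isalpha() and word.lower() not in stop_words]

select_and_clean("Clustering finds the keywords hidden in unstructured documents")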
To begin with, we view an unstructured document as a bag of words and attempt to find hierarchical clusters. We use a clustering-feature (CF) tree data structure, as in the BIRCH system (Zhang et al.), to build a data set that our algorithms (and others) can run over; BIRCH is scalable and efficient for the task at hand.
The CF tree represents hierarchical clusters of keywords, which we insert into the tree one by one. For each keyword, we find the cluster it belongs to using a distance function that measures the keyword's distance from the cluster centers. The distance functions are interchangeable.
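As a purely illustrative sketch of this incremental CF-tree clustering (not the BIRCH implementation itself), scikit-learn's Birch estimator works in the same spirit; the TF-IDF document vectors and the threshold value below are assumptions chosen for the example:
from sklearn.cluster import Birch
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "clustering of keywords in unstructured text",
    "hierarchical clustering with a CF tree",
    "decision tree classifiers for part of speech tagging",
]
# Bag-of-words TF-IDF vectors for each document.
vectors = TfidfVectorizer().fit_transform(documents).toarray()
# Birch inserts points one by one, assigning each to the nearest CF subcluster
# (Euclidean distance to subcluster centroids); threshold controls the cluster radius.
birch = Birch(threshold=0.9, n_clusters=None)
birch.fit(vectors)
print(birch.labels_)                    # cluster label per document vector
print(birch.subcluster_centers_.shape)  # one centroid per CF subcluster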
We could tag parts of speech this way:
import nltk
from nltk.corpus import brown

# Count the most frequent one-, two- and three-character word endings in Brown.
suffix_fdist = nltk.FreqDist()
for word in brown.words():
    word = word.lower()
    suffix_fdist[word[-1:]] += 1
    suffix_fdist[word[-2:]] += 1
    suffix_fdist[word[-3:]] += 1
common_suffixes = [suffix for suffix, count in suffix_fdist.most_common(100)]

def pos_features(word):
    # One boolean feature per common suffix.
    features = {}
    for suffix in common_suffixes:
        features['endswith(%s)' % suffix] = word.lower().endswith(suffix)
    return features

tagged_words = brown.tagged_words(categories='news')
featuresets = [(pos_features(n), g) for (n, g) in tagged_words]
size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
classifier = nltk.DecisionTreeClassifier.train(train_set)
classifier.classify(pos_features('cats'))
'NNS'