Sunday, November 3, 2013

Today I'm going to describe a few more steps toward the implementation of Fuzzy SKWIC. First I want to talk about the data to be used with this implementation. I'm comparing the NLTK corpus samples with texts from Project Gutenberg. What we need is a set of documents from which we can build a conditional frequency distribution, instead of computing IDF (Inverse Document Frequency), so that we can rank terms and take the top few keywords as our vocabulary list. We will also try populating a sample vocabulary list from test data with the Python NLTK library. It is also possible to implement Fuzzy SKWIC itself in Python; the NLTK preprocessing might look like this:
import nltk

# html holds the raw HTML of the source document
raw = nltk.clean_html(html)        # strip markup, keep text (NLTK 2.x API)
tokens = nltk.word_tokenize(raw)   # tokenize into words
vocab = sorted(set(tokens))        # unique tokens as a candidate vocabulary

porter = nltk.PorterStemmer()
stemmed = [porter.stem(t) for t in tokens]    # reduce each token to its stem

wnl = nltk.WordNetLemmatizer()
lemmas = [wnl.lemmatize(t) for t in tokens]   # or lemmatize with WordNet
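For the document collection itself, one quick option is the Gutenberg sample that ships with NLTK (a small subset of the full Project Gutenberg archive). A rough sketch of pulling token lists for a few of its texts; the variable name documents is just a placeholder used in the sketches further down:

from nltk.corpus import gutenberg

# Each fileid is one Project Gutenberg text bundled with NLTK,
# e.g. 'austen-emma.txt', 'bible-kjv.txt', ...
print(gutenberg.fileids())
documents = [list(gutenberg.words(f)) for f in gutenberg.fileids()[:4]]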
I take back the use of conditional frequency distributions and will switch to using IDF. The conditional frequency distribution is useful for Wartena's PLSI-based implementation, but we will stick with IDF here since we are only interested in a limited vocabulary and want to focus on Fuzzy SKWIC. IDF can be computed from the frequency distribution of the words in each text of a collection of documents. The frequency distribution is readily available from the NLTK package, so if a word is present in a text we simply increment the document frequency for that word.

We will also stick to four categories of document samples, just as in the implementation by the authors of Fuzzy SKWIC, and slip the test document from which we want to extract keywords into one of those categories. We assume that this test document is similar to one or more of the categories we have chosen.

We fix the C clusters. For each cluster we maintain an array of feature weights indexed from 1 to n, where n is the number of words in our chosen vocabulary; the order of the feature weights must correspond to the order of the words in the vocabulary. We also maintain, for each term, the weighted aggregated sum of the cosine-based distances along the individual dimensions, computed for the pairwise matching of terms with the ith cluster center vector. Both the cosine distances and the feature weights can be stored in a data structure for each cell of a C x n matrix. We will also need another data structure for a C x N fuzzy partition matrix, which we will use to keep the computed values of the fuzzy membership labels.
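A minimal sketch of this bookkeeping, assuming documents is the list of token lists from above and vocabulary is the ranked keyword list (both names are placeholders, not anything taken from the Fuzzy SKWIC paper):

import math
from collections import defaultdict
import numpy as np
import nltk

def compute_idf(documents):
    # Document frequency: the number of texts a word appears in at least once.
    N = len(documents)
    doc_freq = defaultdict(int)
    for doc in documents:
        fdist = nltk.FreqDist(doc)     # per-text frequency distribution
        for word in fdist:             # word is present in this text,
            doc_freq[word] += 1        # so bump its document frequency
    return {w: math.log(N / df) for w, df in doc_freq.items()}

# C clusters, N documents, n vocabulary terms
C, N, n = 4, len(documents), len(vocabulary)
feature_weights = np.full((C, n), 1.0 / n)   # v[i][k], one weight per cluster and term
partition = np.zeros((C, N))                 # u[i][j], the fuzzy membership labels

Initializing the feature weights uniformly at 1/n is just one reasonable starting point.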
We will also keep some data structures between iterations that let us reuse the values of tau computed in the previous iteration. The cosine-based distance is probably the only quantity that requires both individual and aggregated values to be tracked within an iteration as well as across iterations. The individual cosine-based dissimilarity is computed once for each 1 <= i <= C, 1 <= j <= N, and 1 <= k <= n, so it represents a 3D matrix of values, which we can then aggregate over k = 1 to n and store in a separate C x N lookup matrix.
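A sketch of how those per-dimension values could be held and aggregated with NumPy, assuming unit-normalized document vectors X of shape (N, n) and cluster centers of shape (C, n), and taking the per-dimension cosine-based term to be 1/n - x_jk * c_ik (worth double-checking against the exact definition in the Fuzzy SKWIC paper):

import numpy as np

def cosine_distance_terms(X, centers):
    # X: (N, n) unit-normalized document vectors; centers: (C, n) cluster centers.
    # Returns D of shape (C, N, n) with D[i, j, k] = 1/n - x[j, k] * c[i, k].
    n = centers.shape[1]
    return (1.0 / n) - centers[:, None, :] * X[None, :, :]

def aggregate_distances(D, feature_weights):
    # Dwc[i, j] = sum_k v[i, k] * D[i, j, k], stored in a C x N lookup matrix.
    return np.einsum('ik,ijk->ij', feature_weights, D)

Keeping the 3D matrix D around between iterations makes it cheap to recompute the aggregated C x N matrix whenever the feature weights change.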
