Thursday, January 23, 2014

In continuation of our discussion on keywords and documents:
We looked at a few key metrics for selecting keywords based on content, frequency, and distribution. However, keywords can also be selected based on the co-occurrence of terms in the same context, i.e., words with similar meanings tend to occur with similar neighbors.

This model proposes that the distribution of a word together with other words and phrases is highly indicative of its meaning. It represents the meaning of a word with a vector where each feature corresponds to a context and its value to the frequency with which the word occurs in that context.
This vector is referred to as the term profile. The profile P(t) of a term t is the set of terms from the list T that co-occur in sentences together with term t, that is, P(t) = {t' : t' belongs to s(t)}, where s(t) denotes the set of terms occurring in sentences that contain t.
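As a rough sketch of how such profiles could be built (the function and variable names here are just illustrative, not from any particular paper), given sentences as lists of tokens:

from collections import defaultdict, Counter

def build_term_profiles(sentences, terms):
    # For each term t in `terms`, collect the other listed terms that
    # co-occur with t in a sentence, along with co-occurrence counts.
    term_set = set(terms)
    profiles = defaultdict(Counter)
    for sentence in sentences:
        present = term_set.intersection(sentence)
        for t in present:
            for u in present:
                if u != t:
                    profiles[t][u] += 1
    return profiles

sentences = [["latent", "semantic", "indexing"],
             ["semantic", "similarity", "clustering"],
             ["clustering", "of", "documents"]]
terms = ["latent", "semantic", "indexing", "similarity", "clustering", "documents"]
profiles = build_term_profiles(sentences, terms)
print(profiles["semantic"])

Each profile is a sparse vector: the keys are the co-occurring terms (the contexts) and the values are the frequencies, as described above.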

Once such a representation is built, machine learning techniques can be used to automatically classify or cluster words according to their meaning. However, these representations are often noisy, largely due to synonymy and polysemy. Synonymy refers to the existence of several terms that can express the same idea or object, as happens in most languages. Polysemy refers to the fact that some words have multiple unrelated meanings. If we ignore synonymy, clustering produces many small disjoint clusters, some of which could have been merged. If we ignore polysemy, unrelated documents may be clustered together.
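To make the clustering step concrete: a common starting point (one choice among many, not prescribed by the papers above) is to compare profiles with cosine similarity and group terms whose similarity exceeds a threshold:

import math

def cosine(p, q):
    # Cosine similarity between two sparse profiles (dicts: term -> count).
    dot = sum(c * q.get(t, 0) for t, c in p.items())
    norm_p = math.sqrt(sum(c * c for c in p.values()))
    norm_q = math.sqrt(sum(c * c for c in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0

Two synonyms rarely co-occur with each other, but they do co-occur with the same neighbors, so their profiles score high under this measure even when the terms themselves never appear together.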

Another challenge in applying this technique to massive databases is the efficient and accurate computation of the few largest singular values and their corresponding singular vectors of a highly sparse term-document matrix. In practice, scaling beyond a few hundred thousand documents is difficult.
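A minimal sketch of what this computation looks like, using SciPy's iterative sparse SVD on a randomly generated stand-in for a real term-document matrix:

import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

# Illustrative stand-in: 10,000 terms x 2,000 documents, ~0.1% non-zero.
A = sparse_random(10000, 2000, density=0.001, format="csr", random_state=0)

# Compute only the k largest singular values/vectors rather than a full SVD;
# iterative methods like this are what makes LSI tractable on sparse data.
k = 50
u, s, vt = svds(A, k=k)
print(s[::-1][:5])  # svds returns singular values in ascending order

Even with such methods, the cost grows with both the matrix size and k, which is why scalability becomes the bottleneck at very large document counts.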

Note that the technique above inherently tries to find keywords based on the latent semantics and hence is referred to as latent semantic indexing. The analogue of clustering in this approach is to find conditional probability distributions such that the observed word distributions can be decomposed into a few latent distributions plus a noisy remainder. The latent components are expected to be the same for co-occurring words.
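Probabilistic latent semantic analysis (PLSA) is the well-known instantiation of this decomposition. As a toy illustration, and keeping in mind this is a stand-in rather than the exact method, NMF with generalized Kullback-Leibler divergence (known to be closely related to PLSA) can factor a term-document count matrix into a few latent components:

import numpy as np
from sklearn.decomposition import NMF

# Toy term-document count matrix (rows = terms, columns = documents).
X = np.array([[3, 0, 1, 0],
              [2, 0, 0, 1],
              [0, 4, 0, 2],
              [0, 3, 1, 3]], dtype=float)

# X is approximated by W @ H: the columns of W act as latent word
# distributions, and H mixes them per document.
model = NMF(n_components=2, beta_loss="kullback-leibler",
            solver="mu", init="nndsvda", random_state=0)
W = model.fit_transform(X)
H = model.components_

Terms that co-occur end up loading on the same columns of W, which matches the expectation that co-occurring words share latent components.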

Next, we will look at a popular way to represent documents in a collection: the vector space model.

Material read: Survey of Text Mining; papers by Pinto; papers by Wartena.
