Tuesday, June 25, 2013

Keyword-based association analysis: This kind of analysis collects sets of keywords or terms that frequently occur together and then finds correlations among them. Each document is treated as a transaction, and the set of keywords in that document is treated as the set of items in the transaction. The association mining process can detect compound associations, such as domain-dependent terms or phrases, and non-compound associations, such as units of measure. Mining at this level is called term-level association mining, as opposed to mining on individual words, and the terms involved are typically bigrams and trigrams. Because terms and phrases are tagged automatically, the number of meaningless results is greatly reduced. Once terms and phrases have been recognized, standard association mining or max-pattern mining can be invoked.
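To make the document-as-transaction idea concrete, here is a minimal sketch in Python. It only mines rules between pairs of keywords, and the function name mine_pair_rules, the thresholds, and the sample documents are all assumptions made up for illustration, not part of any particular mining library.

```python
from collections import Counter
from itertools import combinations


def mine_pair_rules(docs, min_support=0.5, min_confidence=0.7):
    """Return (antecedent, consequent, support, confidence) for keyword pairs.

    Each document is one transaction; its keywords are the items.
    """
    n = len(docs)
    item_counts = Counter()   # number of documents containing each keyword
    pair_counts = Counter()   # number of documents containing each keyword pair
    for keywords in docs:
        items = sorted(set(keywords))
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        # Try the rule in both directions: a -> b and b -> a.
        for x, y in ((a, b), (b, a)):
            confidence = count / item_counts[x]
            if confidence >= min_confidence:
                rules.append((x, y, support, confidence))
    return sorted(rules)


# Hypothetical keyword sets, one per document.
docs = [
    {"data", "mining", "association"},
    {"data", "mining", "text"},
    {"data", "warehouse"},
    {"text", "mining"},
]
rules = mine_pair_rules(docs)
for antecedent, consequent, support, confidence in rules:
    print(f"{antecedent} -> {consequent} (support={support}, confidence={confidence})")
```

A real term-level miner would first merge bigrams and trigrams into single items and would look at itemsets larger than pairs; this sketch skips both steps to stay short.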
Document classification analysis: Automated document classification assigns labels to documents for easy lookup and retrieval. It is used for tagging topics, constructing topic directories, identifying document writing styles, and grouping the hyperlinks associated with a set of documents. Document classification is automated in the following way. First, a set of pre-classified documents is used as the training set. The training set is then analyzed to derive a classification scheme. Such a scheme often needs to be refined with a testing process before it can be used to classify other documents. This differs from classifying relational data, which is structured: each tuple in relational data is defined by a set of attribute-value pairs, and the classification analysis decides which attribute-value pair is most discriminating. Document databases, on the other hand, contain a set of keywords per document and have no fixed set of attributes, and treating each keyword as a dimension results in a very large dimension set. For this reason, decision tree analysis is generally not well suited to document classification.
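The dimensionality problem is easy to see in a small sketch. Assuming each distinct keyword becomes one dimension, every document turns into a vector as wide as the whole vocabulary, with most entries zero. The helper names build_vocabulary and to_vector and the sample documents are hypothetical.

```python
def build_vocabulary(docs):
    """Collect every distinct keyword; each one becomes a dimension."""
    return sorted({keyword for doc in docs for keyword in doc})


def to_vector(doc, vocabulary):
    """Binary vector: 1 if the keyword occurs in the document, else 0."""
    present = set(doc)
    return [1 if term in present else 0 for term in vocabulary]


# Hypothetical keyword lists, one per document.
docs = [
    ["stock", "market", "shares"],
    ["football", "league", "goal"],
    ["market", "inflation", "rates"],
]
vocabulary = build_vocabulary(docs)
vectors = [to_vector(doc, vocabulary) for doc in docs]

print("dimensions:", len(vocabulary))
print("first document:", vectors[0])
```

Three short documents already need eight dimensions, and the count keeps growing with the vocabulary of the collection rather than with any fixed schema.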
Techniques commonly used for document classification include nearest neighbor classification, feature selection methods, Bayesian classification, support vector machines, and association-based classification.
The k-nearest neighbor classifier assumes that similar documents have similar document vectors and should therefore be assigned the same class label. We can simply index all of the training documents, each associated with its corresponding class label. For a given test document, treated as a query, we retrieve the k most similar training documents and use their labels. To refine the label selection, we can tune k or use weights associated with the retrieved documents. Furthermore, a feature selection process can be used to remove terms in the training documents that are statistically uncorrelated with the class labels.
Bayesian classification is another popular technique; it models the statistical distribution of documents in specific classes. The classifier first trains the model by estimating, for each class c, a distribution over documents d, and then tests which class is most likely to have generated the test document.
Another classification method is the support vector machine. Here the classes are represented by numbers, and a direct mapping function from the term space to the class variable is constructed. The least-squares linear regression method can be used as the discriminative technique for this kind of classification.
Association-based classification classifies documents based on a set of associated, frequently occurring text patterns. However, very frequent terms are usually poor discriminators, and the less frequent ones tend to have better discriminative power. Such an association-based classification method proceeds as follows. First, we extract keywords and terms using information retrieval and the simple association analysis techniques mentioned above. Second, we use a concept hierarchy of term classes, such as WordNet, or rely on expert knowledge or keyword classification systems. Documents in the training set can also be classified into class hierarchies. Term association mining can then be applied to discover sets of associated terms that maximally distinguish one class of documents from another. This yields a set of association rules for each document class, which can be ordered by their discriminative power and occurrence frequency and then used to classify new documents.
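The k-nearest neighbor procedure described above can be sketched as follows. This is only an illustration under stated assumptions: documents are raw term-count vectors, similarity is cosine similarity, and each of the k neighbors votes with a weight equal to its similarity. The names cosine and knn_classify and the toy training set are made up for the example.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_classify(training, query_tokens, k=3):
    """Label a query by a similarity-weighted vote of its k nearest documents."""
    query = Counter(query_tokens)
    scored = sorted(
        ((cosine(Counter(tokens), query), label) for tokens, label in training),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter()
    for similarity, label in scored[:k]:
        votes[label] += similarity  # weight each neighbor by its similarity
    return votes.most_common(1)[0][0]


# Hypothetical pre-classified training documents.
training = [
    ("stock market shares rise".split(), "finance"),
    ("bank raises interest rates".split(), "finance"),
    ("team wins football match".split(), "sports"),
    ("football league final goal".split(), "sports"),
]
print(knn_classify(training, "football team scores goal".split()))
print(knn_classify(training, "interest rates and shares".split()))
```

Tuning k and the weighting scheme corresponds to the refinement step mentioned in the text; a feature selection pass would simply drop uninformative terms before the vectors are built.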
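For the Bayesian approach, the text does not say which distribution is used, so the sketch below assumes a multinomial naive Bayes model with add-one smoothing, which is one common choice. It estimates a term distribution per class during training and then picks the class most likely to have generated the test document. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(training):
    """Estimate per-class document counts and term counts."""
    class_docs = Counter()              # number of training documents per class
    term_counts = defaultdict(Counter)  # term frequencies per class
    vocabulary = set()
    for tokens, label in training:
        class_docs[label] += 1
        term_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_docs, term_counts, vocabulary


def classify(model, tokens):
    """Return the class with the highest log-probability for the document."""
    class_docs, term_counts, vocabulary = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_docs):
        total_terms = sum(term_counts[label].values())
        score = math.log(class_docs[label] / total_docs)  # class prior
        for term in tokens:
            if term not in vocabulary:
                continue  # assumption: ignore terms never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            likelihood = (term_counts[label][term] + 1) / (total_terms + len(vocabulary))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical pre-classified training documents.
training = [
    ("stock market shares rise".split(), "finance"),
    ("bank raises interest rates".split(), "finance"),
    ("team wins football match".split(), "sports"),
    ("football league final goal".split(), "sports"),
]
model = train_naive_bayes(training)
print(classify(model, "football goal".split()))
print(classify(model, "bank rates".split()))
```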
