Steps for processing text:
Text cleaning: clean the data, remove stop words, resolve inconsistencies, and stem the terms.
Keywords are extracted based on their conditional probability distributions.
Text is usually analyzed for topics by clustering the documents with K-means, using a cosine-based distance between their term vectors.
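The cleaning and clustering steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop word list, the crude suffix-stripping stemmer (a stand-in for a real stemmer such as Porter's), and all function names are hypothetical, and the K-means initialization simply takes the first k documents as starting centroids.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop word list.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}

def crude_stem(word):
    """Strip a few common suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def clean_text(text):
    """Lowercase, tokenize, drop stop words, and stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

def normalize(vector):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine_kmeans(vectors, k, iterations=10):
    """K-means on unit vectors, assigning each point to the centroid
    with the highest cosine similarity."""
    points = [normalize(v) for v in vectors]
    centroids = [points[i] for i in range(k)]  # naive init: first k points
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [max(range(k), key=lambda c: dot(p, centroids[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[c] = normalize(mean)
    return labels

def cluster_texts(texts, k):
    """Clean each text, build term-count vectors, and cluster them."""
    docs = [clean_text(t) for t in texts]
    vocabulary = sorted({term for doc in docs for term in doc})
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[term] for term in vocabulary])
    return cosine_kmeans(vectors, k)
```

For example, cluster_texts(["Cats and dogs are pets", "Stocks and bonds are investments", "The dog is a pet", "Investing in stocks"], 2) groups the two pet sentences together and the two finance sentences together. Taking the first k documents as centroids is the weakest part of the sketch; a random or k-means++ style initialization would be more robust.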
Another data mining technique is clustering for high-dimensional data. Text documents, for example, may contain thousands of terms or keywords as features. Document classification differs from topic analysis in that each document is treated as a high-dimensional object with many keywords as features. Text documents are clustered on the frequent terms they contain. Terms are extracted and stemmed, which reduces each term to its basic stem and each document to a bag of words. If each term is treated as a dimension, the document becomes high-dimensional, with as many features as keywords.
With frequent term-based analysis, a well-selected subset of all the frequent term sets can be considered a clustering. A frequent term-based cluster is the set of documents that contain all of the terms of a frequent term set. A good subset of the frequent term sets is one whose clusters cover all of the documents. Cluster overlap is measured by the distribution of the documents supporting a cluster over the remaining cluster candidates. Each cluster is automatically described by its frequent term set.
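A rough sketch of frequent term-based clustering follows. It makes several simplifying assumptions that are not in the text: documents are already reduced to sets of stemmed terms, term sets are enumerated by brute force up to a small size, and a plain greedy set cover stands in for the overlap-based selection described above. The function names and the min_support and max_size parameters are hypothetical.

```python
from itertools import combinations

def frequent_term_sets(docs, min_support, max_size=2):
    """Map each frequent term set to its cover: the ids of the
    documents that contain all of its terms."""
    vocabulary = sorted(set().union(*docs))
    covers = {}
    for size in range(1, max_size + 1):
        for terms in combinations(vocabulary, size):
            cover = frozenset(i for i, doc in enumerate(docs)
                              if set(terms) <= doc)
            if len(cover) >= min_support:
                covers[terms] = cover
    return covers

def greedy_clustering(docs, min_support):
    """Pick term sets until every coverable document is assigned.
    Greedy set cover: prefer the term set covering the most unassigned
    documents, breaking ties in favor of longer descriptions."""
    covers = frequent_term_sets(docs, min_support)
    uncovered = set(range(len(docs)))
    clusters = {}
    while uncovered:
        best = max(covers,
                   key=lambda t: (len(covers[t] & uncovered), len(t)),
                   default=None)
        if best is None or not (covers[best] & uncovered):
            break  # remaining documents contain no frequent term set
        clusters[best] = sorted(covers[best] & uncovered)
        uncovered -= covers[best]
    return clusters
```

Each key of the returned dictionary is the frequent term set that describes the cluster, and each value lists the documents assigned to it. Assigning only still-uncovered documents keeps the clusters disjoint, which is a simplification of the overlap measure mentioned above.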
Information retrieval is concerned with the organization and retrieval of information from a large number of documents, usually with measures such as term frequency-inverse document frequency (TF-IDF) and Shannon information. Two documents are considered similar if the cosine measure between their term vectors is high. Unlike database systems, which address concurrency control, recovery, transaction management, and updates, information retrieval is concerned with keywords and the notion of relevance.
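As a minimal illustration of TF-IDF weighting and the cosine measure, the sketch below uses one common variant (relative term frequency times the natural log of N over document frequency); other weighting schemes exist, and the function names are hypothetical. Documents are assumed to be lists of already cleaned terms.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse {term: weight} vector per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        term_freq = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return vectors

def cosine(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

Two documents that share weighted terms get a cosine above zero, while documents with no terms in common score exactly zero. Note that with this variant a term occurring in every document receives weight zero, since log(N/N) = 0.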
Popular text indexing techniques include inverted indices and signature files. An inverted index comprises two indexed tables (hash or B+-tree): document_table and term_table. The document table holds, for each document, the document_id and a list of the terms it contains, also known as the posting_list. The term table holds, for each term, the term_id and the list of identifiers of the documents in which the term appears. The list of terms can be quite large, so a hashing technique and a superimposed coding technique are used to encode a list of terms into a bit representation.
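The two tables of an inverted index can be sketched with plain dictionaries. This is a toy in-memory version, assuming Python dicts in place of the hash or B+-tree indexes and using the term string itself as the term_id; the function names are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs maps document_id -> list of terms.
    Returns (document_table, term_table)."""
    document_table = {}            # document_id -> sorted unique terms
    term_table = defaultdict(list)  # term -> list of document_ids
    for doc_id, terms in docs.items():
        unique_terms = sorted(set(terms))
        document_table[doc_id] = unique_terms
        for term in unique_terms:
            term_table[term].append(doc_id)
    return document_table, dict(term_table)

def search_all(term_table, query_terms):
    """Return ids of documents containing every query term."""
    postings = [set(term_table.get(term, [])) for term in query_terms]
    return sorted(set.intersection(*postings)) if postings else []
```

A conjunctive keyword query then reduces to intersecting the document lists of the query terms, which is why the term table is the structure most queries touch.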
A signature file stores a signature record for each document in the database. Each signature is a fixed b bits long and initialized to zero; a bit is set to 1 if the term it represents appears in the document. To reduce the high dimensionality of documents, reduction techniques such as latent semantic indexing, probabilistic latent semantic indexing, and locality preserving indexing can be used.
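The signature file idea described above can be sketched as follows, assuming each term is hashed to a bit position in a b-bit signature and the per-term bit patterns are OR-ed together (superimposed coding). The choice of SHA-256 as the hash, the default b = 64, and the function names are all assumptions made for illustration.

```python
import hashlib

def term_bits(term, b=64, bits_per_term=1):
    """Hash a term to a b-bit mask with up to bits_per_term bits set."""
    digest = hashlib.sha256(term.encode("utf-8")).digest()
    mask = 0
    for i in range(bits_per_term):
        chunk = digest[4 * i: 4 * i + 4]
        position = int.from_bytes(chunk, "big") % b
        mask |= 1 << position
    return mask

def signature(terms, b=64):
    """Superimpose (bitwise OR) the masks of all terms in a document."""
    sig = 0  # all bits start at zero
    for term in terms:
        sig |= term_bits(term, b)
    return sig

def may_contain(sig, term, b=64):
    """True if every bit of the term's mask is set in the signature.
    False positives are possible; false negatives are not."""
    mask = term_bits(term, b)
    return (sig & mask) == mask
```

Because different terms can hash to the same bit, a match only says that the document may contain the term, so candidates found through signatures still have to be checked against the actual text.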
Latent semantic indexing (LSI) decomposes the term-document matrix using singular value decomposition (SVD); that is, it extracts the most representative features while minimizing the reconstruction error. If the rank of the term-document matrix X is r, then LSI decomposes X as $X = U \Sigma V^T$, where $\Sigma$ is the diagonal matrix of the singular values of X. LSI uses the first k singular vectors to embed the documents in a k-dimensional subspace.
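Written out, the decomposition and the rank-k truncation look roughly as follows. The sketch assumes X is an m-by-n matrix whose columns are documents and whose rows are terms; the symbols $U_k$, $\Sigma_k$, $V_k$, $d$, and $\hat{d}$ are introduced here for illustration.

```latex
\begin{align*}
X &= U \Sigma V^{T},
  \qquad \Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_r),
  \quad \sigma_1 \ge \dots \ge \sigma_r > 0 \\
X_k &= U_k \Sigma_k V_k^{T}
  = \operatorname*{arg\,min}_{\operatorname{rank}(Y) \le k} \lVert X - Y \rVert_F,
  \qquad k \le r \\
\hat{d} &= U_k^{T} d \in \mathbb{R}^{k}
\end{align*}
```

Here $U_k$ and $V_k$ keep the first k columns of U and V, and $\Sigma_k$ keeps the k largest singular values. The second line is the sense in which the truncation minimizes reconstruction error, and the third line embeds a document column d into the k-dimensional subspace.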
Locality preserving indexing (LPI) extracts the most discriminative features by preserving the locality of documents in the reduced-dimensionality space. Locality here implies semantic closeness. LPI is formulated as a minimization problem with constraints.
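One common way to write such a constrained minimization, borrowed from the locality preserving projection literature, is shown below. Whether this is the exact form intended here is an assumption; the symbols $a$ (projection vector), $x_i$ (document vectors, the columns of $X$), $S$ (a similarity matrix, for example built from nearest neighbors), and $D$ are introduced for illustration.

```latex
\begin{align*}
\min_{a} \; & \sum_{i,j} \left( a^{T} x_i - a^{T} x_j \right)^{2} S_{ij} \\
\text{subject to} \; & a^{T} X D X^{T} a = 1,
  \qquad D_{ii} = \sum_{j} S_{ij}
\end{align*}
```

The objective penalizes projections that pull apart documents with high similarity $S_{ij}$, which is what preserving locality means in this setting; the constraint fixes the scale of the projection.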
While LSI seeks to uncover the most representative features, LPI aims to discover the geometrical structure of the document space.
Probabilistic latent semantic indexing (PLSI) is similar to LSI but achieves the reduction through a probabilistic mixture model. Specifically, it assumes that there are k latent common themes in the document collection, each characterized by a multinomial word distribution.
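The mixture model can be written as below, where $\theta_1, \dots, \theta_k$ denote the latent themes; the notation is an assumption made for illustration.

```latex
\begin{align*}
P(w \mid d) &= \sum_{j=1}^{k} P(w \mid \theta_j)\, P(\theta_j \mid d),
  \qquad \sum_{w} P(w \mid \theta_j) = 1,
  \quad \sum_{j=1}^{k} P(\theta_j \mid d) = 1
\end{align*}
```

Each document is thus summarized by its k mixing weights $P(\theta_j \mid d)$, which play the role of the reduced-dimensional representation.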
Besides the keyword-based approach, there are other approaches to text mining: 1) the tagging approach and 2) the information extraction approach. Tagging may be manual or done by an automated categorization algorithm where the tag set is small and predefined. Information extraction goes deeper in that it requires semantic analysis of text by natural language understanding and machine learning methods. Other text mining tasks include document clustering, classification, information extraction, association analysis, and trend analysis.