Thursday, November 7, 2013

We talked about cluster retrieval, which is based on summarization and on the premise known as the cluster hypothesis. The summarization data can include the naturally occurring overlap of clusters, which prevents the omission of clusters that overlap significantly with others.
LSI and COV can be used to find major document clusters. During the dimension reduction step, the minor clusters are often mistaken for noise and discarded. There is an algorithm by Ando to identify small clusters in limited contexts. Here the major themes in the documents are prevented from dominating the basis-vector selection for the subspace projection step in LSI. This is done by introducing an unfavorable weight, also known as a filter or negative bias, to documents that belong to clusters already well represented by the basis vectors selected so far. The weight is computed from the residual of each document vector, which is the portion of the document vector that cannot be represented by the basis vectors selected up to that point. Each document vector is then rescaled by the magnitude of its residual raised to a power q.
This process is repeated for each basis vector: the rows of the residual matrix R hold the residuals r1, r2, ..., rm of the document vectors, and R is set to A initially. At each step the rescaled residual matrix Rs is formed, bi is taken as the first eigenvector of Rs^T Rs, and the residual matrix is updated as R = R - R bi bi^T so that the residuals take the new basis vector into account. After the kth basis vector is computed, each document vector is mapped to its counterpart in the k-dimensional subspace. This is followed by the relevance ranking.
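A minimal sketch of this iterative residual rescaling, assuming NumPy; the function name, the default exponent q, and the interfaces are illustrative choices, not taken from the paper.
import numpy as np

def select_basis_vectors(A, k, q=2.0):
    # A: m x n document-attribute matrix (rows are document vectors)
    # Returns a k x n matrix of basis vectors built by residual rescaling.
    R = A.astype(float).copy()              # residual matrix, initially A
    basis = []
    for _ in range(k):
        norms = np.linalg.norm(R, axis=1)   # magnitude of each residual
        Rs = (norms[:, None] ** q) * R      # rescale each row by |r_i|^q
        _, _, Vt = np.linalg.svd(Rs, full_matrices=False)
        b = Vt[0]                           # first eigenvector of Rs^T Rs
        basis.append(b)
        R = R - np.outer(R @ b, b)          # R = R - R b b^T
    return np.array(basis)

# documents can then be mapped to the k-dimensional subspace as A @ B.T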
However, this algorithm does not identify all major and minor clusters. Finding the eigenvector may be unstable when q is large. The basis vectors may not always be orthogonal. And if the number of documents in the database is even moderately large, the intermediate data may not fit into main memory. These shortcomings arise because information is lost when the document vectors are rescaled after each computation. Also, the document-attribute matrices are usually highly sparse, but after a few iterations the residual matrix may become very dense. COV-based algorithms can overcome these problems.

Wednesday, November 6, 2013

In the previous post, we discussed the article written by Kobayashi and Aono, in which LSI and COV both attempt to reduce the problem to a k-dimensional subspace. LSI reduces the rank of the matrix to k, while COV chooses the k largest eigenvalues and their corresponding eigenvectors. On the plane where these clusters are visualized, LSI keeps the origin of the document vectors outside the plane, while COV places the origin inside the plane and thus has better discriminating power between the document vectors.
There are several advantages to the COV method. The first is better discrimination between document vectors.
The second is the scalability of the computation to larger databases.
The third advantage is that the computation can be distributed, in the sense that the subspace can be computed locally at the different places where the data resides, as Qu et al have shown. When the data is distributed across different locations, the principal components are computed locally. These are then sent to a centralized location and used to compute the global principal components. Further, this reduces the data transmission and is thus better than a parallel processing approach. However, if the dominant components do not provide a good representation, it is suggested to include up to 33% more components to bring the accuracy toward that of the centralized counterpart implementation. The interesting thing here is that the size of the local data or of the overall data does not seem to matter as long as the dominant components provide a good representation.
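A rough sketch of this distributed idea, assuming NumPy. The two-step split (local components sent to a central site, then combined) follows the description above, but the specific weighting by singular values is my own assumption rather than the exact method of Qu et al.
import numpy as np

def local_components(X, k):
    # top-k principal directions of one local partition (rows are documents)
    Xc = X - X.mean(axis=0)
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return s[:k], Vt[:k]

def global_components(local_results, k):
    # stack the locally computed directions, scaled by their singular values,
    # and extract the dominant global directions from the combined matrix
    stacked = np.vstack([s[:, None] * V for s, V in local_results])
    _, _, Vt = np.linalg.svd(stacked, full_matrices=False)
    return Vt[:k]

# usage: locals_ = [local_components(X_site, k) for X_site in partitions]
#        V_global = global_components(locals_, k)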
Another way to optimize this further is to identify clusters (sets of documents that cover similar topics) during the pre-processing stage itself, so that they can be retrieved together to reduce the query response time, or summarized for smaller storage when processing large data. This approach is called cluster retrieval and is based on the assumption that closely related documents often satisfy the same requests. Incidentally, this summarization and other data structures from pre-processing can help alleviate a variety of problems with large data sets. One more interesting feature of LSI and COV is that both methods recognize and preserve the naturally occurring overlap of clusters. These overlaps are interesting because the documents in them have multiple meanings and are therefore useful for analyzing database content from different viewpoints. In the hard partitioning case, the cluster overlaps are removed, losing this rich information. In the soft partitioning case, the overlap is preserved and improves results. Neither algorithm is as successful at discriminating similar documents or minor document clusters. These are often mistaken for noise and discarded. On the other hand, these minor clusters and outliers carry significant information and can add valuable insights, so they should be included as much as possible.

Tuesday, November 5, 2013

In the previous post I mentioned Latent Semantic Indexing, and in this post I want to introduce both LSI and Covariance Matrix Analysis (COV) so that we can compare the two.
In LSI, we are given a database with m documents and n distinguishing attributes for relevancy ranking. A is the corresponding m x n document-attribute matrix. The idea is to reduce the dimensions to k, where k is much smaller than m or n, by computing a rank-k approximation of A.
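A minimal LSI sketch, assuming NumPy; the truncated SVD and the query folding step are the standard formulation, though the function names here are just illustrative.
import numpy as np

def lsi_project(A, k):
    # rank-k truncated SVD of the m x n document-attribute matrix A
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k]
    docs_k = Uk * sk                    # document coordinates in the k-dim subspace
    def fold_query(q):
        # map an n-dimensional query vector into the same subspace
        return (q @ Vtk.T) / sk
    return docs_k, fold_query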
In the COV algorithm, document and query vectors are projected onto the subspace spanned by the k eigenvectors corresponding to the k largest eigenvalues of the covariance matrix of the document vectors. Thus COV is based on selecting a subset of k vectors. The covariance is computed by taking the averages along the individual components or dimensions. Note that taking averages along the individual dimensions from 1 to n is something we find in Fuzzy SKWIC too, so it is particularly helpful to pay attention to the fundamentals here. We define the component-wise average vector as d' = (a1', a2', ..., an'), where the ith document, for i ranging from 1 to m, has components ai1, ai2, ..., ain, and the average for each attribute is computed across the m documents along that attribute. In a matrix with m documents as rows and n attributes as columns, this component-wise average translates to averaging the individual columns. The covariance matrix C can be decomposed as C = V E V^T, where V is the orthogonal matrix that diagonalizes C, E is the diagonal matrix whose diagonal values appear in decreasing order from the top left to the bottom right, and V^T is the transpose of V. The eigenvectors v1, v2, ..., vk corresponding to the k largest diagonal values are used to reduce the dimensions from n to k; all the document vectors and the query vectors are then projected into the subspace spanned by these k eigenvectors. The difference between COV and LSI is that in LSI we reduce the rank of the document-attribute matrix to k, while in COV we transform the document and query vectors into a subspace spanned by a subset of k eigenvectors. By merely reducing the dimensions, the vectors continue to have the same origin, and there is less distinction between the documents in the document-attribute matrix. In covariance analysis, the origin is the center of the document vectors in the subspace, so the document vectors are more evenly spaced. This enables finer detection of differences between document vectors in different clusters. Both COV and LSI project the query and document vectors into the k-dimensional subspace and then measure similarity between documents.
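A small sketch of the COV projection described above, assuming NumPy; the helper name and return shape are my own choices.
import numpy as np

def cov_projector(A, k):
    # A: m x n document-attribute matrix (rows are documents)
    d_bar = A.mean(axis=0)                         # component-wise averages a1', ..., an'
    C = (A - d_bar).T @ (A - d_bar) / A.shape[0]   # n x n covariance matrix
    evals, evecs = np.linalg.eigh(C)               # eigenvalues in ascending order
    Vk = evecs[:, ::-1][:, :k]                     # eigenvectors of the k largest eigenvalues
    def project(x):
        # center a document or query vector and project it into the k-dim subspace
        return (x - d_bar) @ Vk
    return project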
Another advantage of COV is that the computations for the k eigenvectors and eigenvalues are independent of the individual document and query vectors, so they need not be repeated for each document and query vector. Since this is done only once, the document and query vectors can be processed in near real-time. This is a significant advantage compared with LSI, which did not scale as the size of the database grew. However, the cost of computing the eigenvectors is still significant in comparison to the cost of processing the individual vectors.

Monday, November 4, 2013

I will continue today's discussion on the Fuzzy SKWIC implementation. We are getting closer to covering the implementation details. The computations for the feature weights, the fuzzy membership labels and tau involve components that we could be better prepared for. Although we have discussed the sequence of steps, we are now looking at it from the point of view of storage and data structures. To pick up the thread where we left off in the previous post, we have not discussed whether the sample data collections are sufficient for our case, since the test document could be from any origin. The sample we collect was part of a predefined finite set of categories of documents. Hence, the given test document could end up as noise in a cluster. So we have to avoid that as much as possible. One way would have been to use conditional probability, but we did not proceed with it because we would have to use PLSI techniques, such as the probability that a randomly selected occurrence of term t has source document d and that a randomly selected term occurrence from d is an instance of term t, as Wartena put it.
There is a standard approach to building a Vector Space Model for information retrieval. One of the advantages of the method is that it enables relevance ranking of documents of heterogeneous format (e.g. text, multilingual text) with respect to user input queries, as long as the attributes are well-defined characteristics of the documents. If the attribute is present, the corresponding coordinate of the document vector is unity, and when the attribute is absent, the coordinate is zero. This is called the binary vector model. Term weighting is another model that is a refinement over the Boolean model. When the term weight is term frequency - inverse document frequency (TF-IDF), the weight of the ith term in the jth document is defined as weight(i, j) = (1 + tf_ij) log(n / df_i) when the term frequency is >= 1, and 0 when the term frequency is zero, where df_i is the number of documents, out of n, in which the term appears. Given a single query term q, two documents can be ranked for similarity based on the difference in the angles their vectors make with respect to q, and hence the use of the cosine. The cosine is simpler to use than computing the Euclidean distance, since we only want a relative comparison. Every query is modeled as a vector using the same attribute space as the documents. Even this similarity measure is too time-consuming to compute in real time for large databases, which is a serious concern for massive databases. In addition, users consistently expect the most relevant results from IR engines. Scalability of such techniques suffers simply because of the large number of dimensions involved in the computations.
One approach is to reduce the number of dimensions by transforming the problem to a subspace with sufficiently small dimensions to enable a fast response time. The subspace should still be able to discriminate the contents of individual documents. This is where Latent Semantic Indexing helps, and when the model is probabilistic, with weights that add up to unity, we call it Probabilistic Latent Semantic Indexing. For our implementation, we are first concerned with getting a working implementation before using LSI or PLSI and other performance optimization techniques.
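A small sketch in Python of the term weighting and cosine ranking described earlier in this post; the helper names are illustrative and the weight formula simply mirrors the one stated above.
import math

def tfidf_weight(tf, df, n):
    # weight(i, j) = (1 + tf) * log(n / df) when tf >= 1, else 0
    return (1 + tf) * math.log(n / df) if tf >= 1 else 0.0

def cosine_similarity(u, v):
    # cosine of the angle between two vectors given as {term: weight} dicts
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0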

Sunday, November 3, 2013

Today I'm going to describe a few steps again for the implementation of Fuzzy SKWIC. First I want to talk about the data to be used with this implementation. I'm comparing the nltk corpus to the Gutenberg project. What we need is a set of documents from which we can use a conditional frequency distribution, instead of computing IDF (Inverse Document Frequency), to rank keywords and take the top few as our vocabulary list. We will also try out a sample vocabulary list and test data population with the Python nltk library. It's also possible to implement Fuzzy SKWIC in Python. For example, to pre-process raw text with nltk:
import nltk

# html is assumed to hold the raw HTML of a downloaded page
raw = nltk.clean_html(html)                    # strip markup to plain text (nltk 2.x API)
tokens = nltk.word_tokenize(raw)               # split the text into word tokens
vocab = sorted(set(tokens))                    # unique, sorted vocabulary
porter = nltk.PorterStemmer()
stems = [porter.stem(t) for t in tokens]       # stemmed tokens
wnl = nltk.WordNetLemmatizer()
lemmas = [wnl.lemmatize(t) for t in tokens]    # lemmatized tokens
I take back the use of conditional frequency distributions and will switch to using IDF. The conditional frequency distribution is useful for Wartena's PLSI-based implementation. We will stick to using IDF for this implementation since we are only interested in a limited vocabulary and want to focus on Fuzzy SKWIC. IDF can be computed from the frequency distribution of the words in each text of a collection of documents. The frequency distribution is readily available from the nltk package, so if a word is present in a text, we can increment the document frequency for that word, as sketched below. We will also stick to using four categories of document samples, just like the implementation by the authors of Fuzzy SKWIC, and slip the test document from which we would like to extract keywords into one of the categories. We assume that the given test document is similar to one or more of the categories we have chosen. We fix the C clusters. For each cluster we will maintain an array of feature weights indexed from 1 to n, the number of words in our chosen vocabulary. The order of the feature weights must correspond to the order of the words in the vocabulary. We also maintain, for each term, a variable for the weighted aggregated sum of the cosine-based distances along the individual dimensions. These will be computed for different pairwise matchings of terms and the ith cluster center vector. Both the cosine components and the feature weights could be stored in a data structure for each cell of a C x n matrix. We will also need another data structure for a C x N fuzzy partition matrix, which we will use to keep the computed values of the fuzzy membership labels.
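A minimal sketch of that document-frequency counting with nltk, assuming the collection is given as a list of token lists; the function name is just for illustration.
import math
from collections import defaultdict
import nltk

def idf_from_texts(texts):
    # texts: a list of token lists, one per document in the collection
    n = len(texts)
    df = defaultdict(int)
    for tokens in texts:
        fd = nltk.FreqDist(tokens)
        # increment the document frequency once for each word present in this text
        for word in fd:
            df[word] += 1
    return {word: math.log(n / count) for word, count in df.items()}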
We will also keep some data structures between iterations that let us reuse the values computed for tau in the previous iteration. The cosine-based distance is probably the only quantity whose individual as well as aggregated values need to be tracked both within an iteration and between iterations. The individual cosine-based dissimilarity is computed once for each 1 <= i <= C, 1 <= j <= N and 1 <= k <= n, so it represents a 3D matrix of values, which we can aggregate over k from 1 to n and store in separate lookup variables in a C x N matrix. A sketch of these structures follows.
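A sketch of the storage laid out above, assuming NumPy; the sizes here are hypothetical placeholders.
import numpy as np

C = 4      # number of clusters (four document categories), hypothetical
N = 100    # number of documents, hypothetical
n = 500    # number of vocabulary terms, hypothetical

# C x n feature-weight matrix, one row per cluster, initialized uniformly
V = np.full((C, n), 1.0 / n)

# C x N x n individual cosine-based dissimilarities along each dimension
D = np.zeros((C, N, n))

# aggregate over the n dimensions into a C x N lookup matrix
D_agg = D.sum(axis=2)

# C x N fuzzy partition matrix holding the membership labels
U = np.full((C, N), 1.0 / C)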
This is a review of an interesting article written by Daniel Phelps, Leader of the Information Mining Group at Eastman Kodak Company, several years back. He talks about Emerging Trend Detection (ETD) in text data mining, and his suggestions for how an ETD tool would work in an ideal scenario include:
1. Raw data processed into useful information.
2. Sufficient information processed to meet current need.
3. No irrelevant information presented.
4. All information available immediately when needed.
5. Information prioritized for the current need.
6. Information presented in a format that is intuitively easy to understand.
7. Information can be viewed at different levels of detail.
8. Information can be viewed from multiple perspectives.
He mentions that the objective of ETD is to provide an automated alert when new developments are happening in a specific area of interest.
An event may be due to an underlying development. Whether the development is important is determined by the situation and the particular information needs of the person evaluating the data. The speed with which they detect developments and enable a response is what sets ETD tools apart. ETD tools generally have a user interface to navigate the trends, data and summaries. They are usually customized from client to client as well. ETD tools involve knowledge management and text mining. Primarily, they include what is necessary for a decision maker to look at developments and make a call.
 

Saturday, November 2, 2013

We will continue our discussion on trend analysis of web searching. In the sample discussed, the number of queries per month showed a similar pattern. In a plot of the natural log of frequency versus the natural log of rank, one line represents the ranks of all words ordered by their descending frequencies, and another line represents the ranks of the unique frequencies. The first line was not straight but had a fairly consistent slope. The second line overlapped it considerably and dropped off at the higher ranks. The first did not account for words that shared the same frequency. The second did not reflect the size of the vocabulary.
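A small sketch of how such log-log rank/frequency points could be produced in Python; the helper name and the use of Counter are my own choices.
import math
from collections import Counter

def zipf_points(terms):
    # frequencies of each term, ordered by descending frequency
    freqs = sorted(Counter(terms).values(), reverse=True)
    # (ln rank, ln frequency) pairs, with rank assigned in order of descending frequency
    return [(math.log(rank), math.log(freq)) for rank, freq in enumerate(freqs, start=1)]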
A few more graphs were made to study whether they showed a similar pattern. Many terms had the same frequencies, and the low frequencies cluster many words. Different time periods were chosen for the additional graphs, and they all showed similar trends: there would be overlap for the high-frequency words and a divergence at the lower frequencies. Consequently, two equations were formed, one for each half. The lower frequency and higher rank half was represented by a line, while the higher frequency and lower rank half was represented by a polynomial.
The size of the vocabulary increased much more slowly than the number of queries. Investigations into the overlapping vocabulary across years showed that the overlapping terms had higher frequencies each year. Some of these words were site specific and others were more seasonal in nature. The matrix of vocabulary terms to term pairs is sparse, with only a small portion of the vocabulary words co-occurring. In terms of improvements to the search engine, the following observations were made. First, the zero hits due to the use of stop-words could be addressed either by automatically excluding the stop-words or by providing context-sensitive information. Second, the search engine could interpret term pairs and queries, although users may have to be informed about this advanced query construction. Third, the word associations measured by the frequencies of word pairs may be predictable from word frequencies alone. Finally, collecting the user vocabulary could help with the generation of content-based metadata.