Saturday, November 9, 2013

Both the search log and case log mentioned in the previous post are required to perform the search. The search log keeps track of the search strings that customers formulate, and the case log maintains the history of actions, events and dialogues while a case is open. The implementation of the approach involves mining both of these logs. The search logs often contain noise that comes with the nature of web browsing, so they are subjected to both a pre-processing and a post-filtering step. Further, the same content is viewed from two perspectives: the search view and the content view of a hot topic are compared to identify the extraneous documents. This identification of extraneous documents is not only beneficial for obtaining higher-quality topics but also helps pinpoint documents that are being returned as noise for certain queries.
The case logs are different from the search log and hence their processing is somewhat indirect. Excerpts are generated from the case log and then mined. Note that case documents are in general very long, because they capture all the information on the actions taken on a case. In addition, there may be input from more than one party, so there is a lot of noise to deal with. In the author's approach there is both a pre-processing step and a clean-up of the actions involved. Here the noise filtering is done by normalizing typos, misspellings and abbreviations. Words are also normalized to a known jargon, with the help of a thesaurus. Excerpt generation and summarization is composed of a variety of techniques, ranging from dealing with the characteristics of the text to making use of domain knowledge. Sentences are identified regardless of the tables and cryptic text that they wrap around. Sentences are ranked primarily on their technical content rather than the logistics. Techniques can be enabled or disabled independently.
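As a rough illustration of this kind of clean-up and ranking (not the HotMiner implementation itself), the sketch below normalizes dirty case-log text against a small hand-made thesaurus and ranks sentences by how much technical, as opposed to logistic, content they carry. The thesaurus entries and the list of logistics terms are invented for the example.

```python
import re

# Hypothetical thesaurus mapping typos, abbreviations, and synonyms to a
# canonical jargon term; in HotMiner this role is played by a thesaurus assistant.
THESAURUS = {
    "cust": "customer",
    "pwd": "password",
    "passwd": "password",
}

# Hypothetical "logistics" words that signal administrative rather than
# technical content; sentences dominated by them are ranked lower.
LOGISTICS_TERMS = {"called", "emailed", "ticket", "escalated", "assigned"}

def normalize(text: str) -> str:
    """Lowercase, strip punctuation noise, and map terms through the thesaurus."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(THESAURUS.get(t, t) for t in tokens)

def rank_sentences(case_text: str, top_n: int = 3) -> list[str]:
    """Crude excerpt generation: keep the sentences with the most technical content."""
    sentences = re.split(r"(?<=[.!?])\s+", case_text)
    def score(sentence: str) -> float:
        tokens = normalize(sentence).split()
        if not tokens:
            return 0.0
        technical = [t for t in tokens if t not in LOGISTICS_TERMS]
        return len(technical) / len(tokens)
    return sorted(sentences, key=score, reverse=True)[:top_n]

if __name__ == "__main__":
    case = ("Cust called about login failure. Reset the pwd and cleared the cache. "
            "Ticket escalated and assigned to tier 2.")
    print(rank_sentences(case, top_n=2))
```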
The author proposes a novel approach to mine hot topics from the search logs. Here the search view and the content view are combined to obtain high-quality topics that better match the users' perspective.
There is a chapter in the book Survey of Text Mining called HotMiner, by Malu Castellanos. Here companies would like to find topics that are of interest to their customers, i.e. hot problems, and make them available on their website along with links to the corresponding solution documents. The customer support centers maintain logs of their customer interactions, which become the source for discovering hot topics. The author uses these case logs to extract relevant sentences from cases to form case excerpts. In addition, this approach deals with dirty text containing typos, ad hoc abbreviations, incorrect grammar, cryptic tables, and ambiguous or missing punctuation. The terminology is normalized with a thesaurus assistant and a sentence identifier.
The suggestion here is that the logs, rather than the documents, provide information on the topics that are of most interest to the customers. These are called hot topics. This works well because document categorization and classification, whether manual or automatic, is not sufficient to detect the topics of interest to the customers. Instead, by identifying the topics of these hot problems and providing self-help solution documents for them, organizations can better organize their site, reduce customer support costs and improve customer satisfaction.
The author's approach to mine hot topics from the logs of customer support centers involves the use of two kinds of logs - a search log and a case log. The search log keeps track of the search strings that customers formulate, and the case log keeps track of the cases opened by the customers along with the history of actions, events and dialogues followed while a case is open.
These two logs are complementary and together provide information on all the problems encountered by the customers.

Friday, November 8, 2013

We discussed the dynamic rescaling of LSI. Let's discuss the dynamic rescaling of COV. Here too, we follow a similar approach to the one we took with LSI and LSI-rescaling. In COV-rescale, the residual of the covariance matrix is computed. The covariance matrix is given by C = V E Vt, where V is the orthogonal matrix that diagonalizes C and the diagonal entries of E are in monotone decreasing order going from top to bottom. The documents come from the m x n document-attribute matrix, whose m row vectors represent documents of the form [ai1, ai2, ai3, ... ain]. V and Vt act as the left and right factors in this eigendecomposition of the covariance matrix of the document vectors.
So the algorithm repeats the following steps for each of the k dimensions :
First, the maximum length of the residual vectors r1, r2, r3, ... rm for the documents is calculated.
Second, the value of the rescale factor is determined as a function of this value obtained from the step above.
Third, the rescaled residual matrix is computed by combining each residual vector with its magnitude raised to the rescale factor. This is expressed as [(|r1|^q)r1, (|r2|^q)r2, (|r3|^q)r3, ... (|rm|^q)rm].
Fourth, we generate the covariance matrix based on the rescaled residual matrix above. The covariance matrix function is given above.
Fifth, we perform the singular value decomposition of the covariance matrix, a factorization of the real symmetric matrix we obtained in the step above. The SVD decomposes the covariance matrix into three simple transformations: a rotation Vt, followed by a scaling with E, and another rotation V.
Then we take the first row vector of Vt and perform a Gram-Schmidt orthogonalization on it against the basis vectors chosen so far.
We then recompute the residual matrix as R - R b bt.
Then we repeat the iteration with the next k.
Generally speaking the COV-rescaling performs better than LSI, COV and LSI-Rescaling.
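A minimal numpy sketch of the COV-rescale iteration described above; the simple rule for the rescale factor q and the Gram-Schmidt step against the earlier basis vectors are assumptions drawn from the prose rather than the exact published algorithm.

```python
import numpy as np

def cov_rescale_basis(A: np.ndarray, k: int) -> np.ndarray:
    """Sketch of COV-rescale: pick k basis vectors, rescaling the residual
    document vectors before each covariance computation. A is the m x n
    document-attribute matrix."""
    m, n = A.shape
    R = A.astype(float).copy()                     # residual matrix, initially A itself
    basis = []
    for _ in range(k):
        norms = np.linalg.norm(R, axis=1)          # |r1|, |r2|, ... |rm|
        tmax = norms.max()
        q = 1.0 / tmax if tmax > 1.0 else 1.0 + tmax   # assumed simple rule for the rescale factor
        Rs = (norms ** q)[:, None] * R             # [(|r1|^q) r1, ..., (|rm|^q) rm]
        mean = Rs.mean(axis=0)
        C = (Rs - mean).T @ (Rs - mean) / m        # covariance matrix of the rescaled rows
        _, _, Vt = np.linalg.svd(C)                # C = V E Vt
        b = Vt[0]                                  # first row vector of Vt
        for prev in basis:                         # Gram-Schmidt against earlier basis vectors
            b = b - (b @ prev) * prev
        b = b / np.linalg.norm(b)
        basis.append(b)
        R = R - np.outer(R @ b, b)                 # R <- R - R b bt
    return np.vstack(basis)                        # k x n matrix of basis vectors
```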

Thursday, November 7, 2013

In the previous post we talked about LSI and COV, and we are seeing that these fundamentals improve our understanding of why and how to perform the cosine computations and their aggregation, when and how to reduce the dimensions and what benefits come with it, and finally how to scale the implementation to large data. The last one is particularly interesting since there is no limit to how much data can be clustered and categorized. In this regard, I want to draw attention to how the document vectors are computed. The document vectors are represented as di = [ai1, ai2, ai3, ... ain] in the m-document by n-attribute matrix. The entry ai1 is computed as the relevancy of term 1 among the n terms chosen for the document-attribute matrix. It could be a simple inverse document frequency for that term or it could be a weight of the form (1 + tf)log2(n/df). This is a scalar value. The similarity of two document vectors is computed from their dot product: the cosine of the angle between them is the component-wise products summed up and divided by the product of their individual vector magnitudes.
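As a small illustration of these fundamentals, the sketch below computes attribute weights of the form mentioned above and the cosine similarity between two document vectors; the toy term and document frequencies are invented for the example.

```python
import numpy as np

def attribute_weight(tf: int, df: int, num_docs: int) -> float:
    """Weight of an attribute (term): zero if the term is absent,
    otherwise the (1 + tf) * log2(N / df) form mentioned above."""
    if tf == 0:
        return 0.0
    return (1 + tf) * float(np.log2(num_docs / df))

def cosine(d1: np.ndarray, d2: np.ndarray) -> float:
    """Cosine of the angle between two document vectors: the component-wise
    products summed up, divided by the product of the vector magnitudes."""
    return float(d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2)))

# Two toy document vectors over the same three attributes.
d1 = np.array([attribute_weight(tf, df, 10) for tf, df in [(3, 4), (1, 6), (2, 2)]])
d2 = np.array([attribute_weight(tf, df, 10) for tf, df in [(1, 4), (5, 6), (0, 2)]])
print(cosine(d1, d2))
```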
Now we can discuss the dynamic rescaling algorithms. We saw that plain LSI and COV let the major themes dominate the process of selecting basis vectors for the subspace into which the IR problem is projected; the rescaling algorithms prevent this. This is done with weights that decrease the relative importance of attributes that are already well represented by the basis vectors computed so far.
In the dynamic rescaling of LSI, the residual matrices R and Rs are computed. Initially R is set to the document-term matrix A itself. We follow the same steps as in LSI until the document and query vectors are mapped into the k-dimensional subspace. The steps up to this point include updating R and Rs after each iterative step to take into account the most recently computed basis vector bi. After the kth basis vector is computed, each document vector dj in the original IR problem is mapped to its counterpart in the subspace, and the query vector is mapped into this subspace as well. We rescale the document vectors after the computation of the basis vectors and the residual vectors, because keeping track of the residual vector lengths prevents the loss of information during the reduction of dimension, and we take the maximum residual vector length. The rescaling factor q is adjusted as a function of this maximum residual vector length tmax from the previous iteration, depending on its value. For example, if tmax is greater than 1, we take its inverse as the rescaling factor. If tmax is approximately equal to 1, we add it to 1 to compute the rescaling factor. If tmax is less than 1, we raise tmax^-2 to the order of 10. This way we compute the rescaling factor dynamically after each iteration, since the maximum residual vector length can change. Thus the rescaling factor prevents vectors from being effectively deleted through overweighting or over-reduction.
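The selection of q might look like the following sketch. The first two cases follow the prose directly; the third case ("raise tmax^-2 to the order of 10") is ambiguous, so the cap at 10 and the tolerance eps are assumed readings, not the published rule.

```python
def rescale_factor(tmax: float, eps: float = 0.05) -> float:
    """Rescaling factor q as a function of the maximum residual vector
    length tmax from the previous iteration. The tolerance eps and the
    cap at 10 are assumptions; the third rule is ambiguous in the text."""
    if tmax > 1.0 + eps:
        return 1.0 / tmax                        # tmax greater than 1: take the inverse
    if abs(tmax - 1.0) <= eps:
        return 1.0 + tmax                        # tmax approximately 1: add it to 1
    return min(max(tmax, 1e-6) ** -2, 10.0)      # tmax less than 1: tmax^-2, kept to the order of 10
```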
We talked about cluster retrieval, which is based on summarization and on the premise known as the cluster hypothesis. This summarization data can include the naturally occurring overlap of clusters, which prevents the omission of clusters that overlap significantly with others.
LSI and COV can be used to find major document clusters. During the dimension reduction step, the minor clusters are often mistaken for noise and are discarded. There is an algorithm by Ando to identify small clusters in limited contexts. Here major themes from the documents are prevented from dominating the process of selecting the basis vectors for the subspace projection step in LSI. This is done during the basis vector selection process by introducing an unfavorable weight, also known as a filter or negative bias, for documents that belong to clusters that are already well represented by the basis vectors selected so far. This weight is computed by finding the residual of each document vector, which is the portion of the document vector that cannot be represented by the basis vectors selected up to that point. Each residual vector is then rescaled by its magnitude raised to a power q.
This process is repeated for each basis vector: the rows of the rescaled residual matrix Rs are the rescaled residuals r1, r2, ... rm, and the residual matrix is updated as R - R bi biT, where bi is the first eigenvector of RsT Rs. We set R to A initially. After each iterative step, the residuals are updated to take into account the new basis vector. After the kth basis vector is computed, each document vector is mapped to its counterpart in the k-dimensional subspace. This is followed by relevance ranking.
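A compact numpy sketch of this iterative rescaling, in the spirit of Ando's algorithm as described here; the fixed exponent q and the omission of any re-orthogonalization (noted as a shortcoming below) are simplifications.

```python
import numpy as np

def rescaled_basis(A: np.ndarray, k: int, q: float = 2.0) -> np.ndarray:
    """Iterative residual rescaling: R starts as A, each residual row is
    rescaled by |ri|^q, the next basis vector bi is the first eigenvector
    of Rs^T Rs, and R is deflated to R - R bi bi^T."""
    m, n = A.shape
    R = A.astype(float).copy()
    basis = np.zeros((k, n))
    for i in range(k):
        norms = np.linalg.norm(R, axis=1)
        Rs = (norms ** q)[:, None] * R                 # rescaled residual matrix
        _, _, Vt = np.linalg.svd(Rs, full_matrices=False)
        b = Vt[0]                                      # top eigenvector of Rs^T Rs
        basis[i] = b
        R = R - np.outer(R @ b, b)                     # R <- R - R bi bi^T
    return basis

def to_subspace(D: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Map document (or query) vectors into the k-dimensional subspace."""
    return D @ basis.T
```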
However, this algorithm does not identify all major and minor clusters. Finding the eigenvector may be unstable when q is large. The basis vectors may not always be orthogonal. And if the number of documents in the database is even moderately large, the intermediate data may not fit into main memory. These shortcomings arise because there is loss of information during the rescaling of document vectors after each computation. Also, the document-attribute matrices are usually highly sparse, but after a few iterations the residual matrix may become very dense. COV-based algorithms can overcome these problems.

Wednesday, November 6, 2013

In the previous post, we discussed from the article written by Kobayashi and Aono that LSI and COV both attempt to reduce the problem to a k-dimensional subspace. LSI reduces the rank to k, while COV chooses the k largest eigenvalues and their corresponding eigenvectors. On the plane where these clusters are visualized, LSI has the origin for the document vectors outside the plane, while COV has the origin inside the plane and thus better discriminating power between the document vectors.
There are several advantages to the COV approach. The first advantage is better discrimination between document vectors.
The second advantage is the scalability of the computation with larger databases.
The third advantage is that the computation can be distributed, in the sense that the subspace can be computed locally at the different places where the data is spread, as Qu et al. have shown. When the data is distributed across different locations, the principal components are computed locally. These are then sent to a centralized location and used to compute the global principal components. Further, this reduces the amount of data transmitted and thus is better than a parallel processing approach. However, if the dominant components do not provide a good representation, it is suggested to include up to 33% more components to bring the level of accuracy close to that of the centralized counterpart implementation. The interesting thing here is that the size of the local data or the overall data does not seem to matter, as long as the dominant components provide a good representation.
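A very rough sketch of this distributed idea, assuming each site ships only its sample count, component-wise mean, and top-k local eigenpairs; this is a generic merge of local summaries, not Qu et al.'s exact construction.

```python
import numpy as np

def local_summary(X: np.ndarray, k: int):
    """At each site: sample count, component-wise mean, and the top-k
    eigenpairs of the local covariance matrix. Only this summary is sent."""
    n_i, mu_i = X.shape[0], X.mean(axis=0)
    C_i = np.cov(X, rowvar=False, bias=True)
    w, V = np.linalg.eigh(C_i)
    top = np.argsort(w)[::-1][:k]
    return n_i, mu_i, w[top], V[:, top]

def global_components(summaries, k: int) -> np.ndarray:
    """At the center: rebuild approximate local covariances from the
    transmitted eigenpairs, merge them into a global covariance, and take
    its k leading eigenvectors as the global principal components."""
    n = sum(s[0] for s in summaries)
    mu = sum(s[0] * s[1] for s in summaries) / n
    p = summaries[0][1].size
    C = np.zeros((p, p))
    for n_i, mu_i, w_i, V_i in summaries:
        C_local = (V_i * w_i) @ V_i.T              # low-rank local covariance
        d = (mu_i - mu)[:, None]
        C += n_i * (C_local + d @ d.T)             # within- plus between-site scatter
    C /= n
    w, V = np.linalg.eigh(C)
    return V[:, np.argsort(w)[::-1][:k]]
```

In this setting, including extra local components (the up-to-33% mentioned above) simply means passing a larger k to local_summary than the number of global components that are ultimately kept.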
Another way to optimize further is to identify clusters (sets of documents that cover similar topics) during the pre-processing stage itself, so that they can be retrieved together to reduce query response time, or summarized for smaller storage when processing large data. This approach is called cluster retrieval and is based on the assumption that closely related documents often satisfy the same requests. Incidentally, this summarization and the other data structures built during pre-processing can help alleviate a variety of problems with large data sets. One more interesting feature of both LSI and COV is that both methods recognize and preserve the naturally occurring overlap of clusters. These overlaps are interesting because they have multiple meanings and are therefore useful in database content analysis from different viewpoints. In the hard partitioning case, the cluster overlaps are removed, losing this rich information. In the soft partitioning case, this is preserved and improves results. Both algorithms are not as successful at discriminating similar documents or minor document clusters. These are often mistaken for noise and discarded. On the other hand, these minor clusters and outliers carry significant information and can add valuable insights, so they should be included as much as possible.

Tuesday, November 5, 2013

In the previous post I mentioned Latent Semantic Indexing, and in this post I want to introduce both LSI and covariance matrix analysis (COV) so that we can compare the two.
In LSI, we are given a database with m documents and n distinguishing attributes for relevancy ranking. A is the corresponding m x n document-attribute matrix model. The idea is to reduce the dimensions to k, where k is much smaller than m or n, by replacing A with its rank-k approximation.
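A minimal sketch of the rank-k reduction via the SVD, using numpy; folding the query in via Vt is a common convention and an assumption here.

```python
import numpy as np

def lsi_reduce(A: np.ndarray, k: int):
    """Rank-k LSI: truncate the SVD A = U S Vt, keeping the k largest
    singular values. Documents and queries are then compared in the
    reduced k-dimensional subspace."""
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    Uk, Sk, Vtk = U[:, :k], S[:k], Vt[:k]
    docs_k = Uk * Sk                               # m documents as k-dimensional vectors
    def fold_in(query: np.ndarray) -> np.ndarray:
        return query @ Vtk.T                       # map a query vector into the same subspace
    return docs_k, fold_in
```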
In the COV algorithm, document and query vectors are projected onto the subspace spanned by the k eigenvectors corresponding to the k largest eigenvalues of the covariance matrix of the document vectors. Thus COV is based on selecting a subset of k vectors. The covariance is computed by taking the averages along the individual components or dimensions. Note that taking the averages along the individual dimensions from 1 to n is something we find in Fuzzy SKWIC too, so it is particularly helpful to pay attention to the fundamentals here. We define the component-wise averages as d' = (a1', a2', ... an'), where the ith document, for i ranging from 1 to m, has attributes (components) ai1, ai2, ai3, ... ain. The average for each attribute is then computed across the m documents along that attribute. In a matrix with m documents as rows and n attributes as columns, this component-wise average translates to averaging the individual columns. The covariance matrix can be decomposed into the form V E Vt, where V is the orthogonal matrix that diagonalizes C, E is the diagonal matrix whose diagonal values are in decreasing order going from top to bottom and from left to right, and Vt is the transpose of V. The eigenvectors v1, v2, ... vk corresponding to the k largest diagonal values are used to reduce the dimensions from m, n to k. In other words, all the document vectors and the query vectors are then projected into the subspace defined by these k eigenvectors. The difference between COV and LSI is that in LSI we reduce the rank of the document-attribute matrix to k, while in COV we transform the document and query vectors into a subspace spanned by a subset of k eigenvectors. By reducing the dimensions, the LSI vectors continue to have the same origin and there is less distinction between the documents in the document-attribute matrix. In the covariance analysis, the origin is the center of the basis vectors in the subspace, so the document vectors are more evenly spaced. This enables finer detection of differences between document vectors in different clusters. Both COV and LSI project the query and document vectors into the k-dimensional subspace and then measure similarity between documents.
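For comparison, a minimal sketch of the COV projection as described above: shift the origin to the component-wise mean, eigendecompose the covariance matrix, and project onto the k leading eigenvectors.

```python
import numpy as np

def cov_reduce(A: np.ndarray, k: int):
    """COV: shift the origin to the component-wise mean d', eigendecompose
    the covariance matrix (C = V E Vt), and project onto the eigenvectors
    of the k largest eigenvalues."""
    d_bar = A.mean(axis=0)                         # component-wise averages d' = (a1', ..., an')
    X = A - d_bar
    C = X.T @ X / A.shape[0]                       # n x n covariance matrix
    w, V = np.linalg.eigh(C)
    Vk = V[:, np.argsort(w)[::-1][:k]]             # k eigenvectors with the largest eigenvalues
    docs_k = X @ Vk                                # document vectors in the k-dimensional subspace
    def map_query(q: np.ndarray) -> np.ndarray:
        return (q - d_bar) @ Vk
    return docs_k, map_query
```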
Another advantage of COV is that the computations for the k eigenvectors and eigenvalues are independent of the document and query vectors. So they need not be performed for each document and query vector, and since this is done only once, the document and query vectors can be processed in near real-time. This is a significant advantage compared with LSI, which did not scale as the size of the database grew. However, the cost of computing the eigenvectors is still significant in comparison to the cost of the individual vector processing.