In the previous post, we discussed the article by Kobayashi and Aono, in which both LSI and COV reduce the problem to a k-dimensional subspace. LSI reduces the rank of the term-document matrix to k, while COV chooses the k largest eigenvalues of the covariance matrix and their corresponding eigenvectors. On the plane where the resulting clusters are visualized, LSI leaves the origin of the document vectors outside the plane, whereas COV places the origin inside the plane and thus has better discriminating power between the document vectors.
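The contrast between the two projections can be sketched in a few lines of numpy. This is only an illustrative toy, not the authors' implementation: the 4×4 term-document matrix is made up, and the point is just that LSI projects the raw (uncentered) document vectors onto the top-k singular directions, while COV first centers the documents on their mean before projecting onto the top-k eigenvectors of the covariance matrix, which is why the origin ends up inside the cloud of points.

```python
import numpy as np

# Toy term-document matrix: rows = documents, columns = terms (illustrative).
A = np.array([
    [2.0, 0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 2.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 2.0],
])
k = 2  # target subspace dimension

# LSI: rank-k reduction via truncated SVD; documents are projected onto
# the top-k right singular vectors without centering, so the origin of
# the document vectors lies outside the projected cloud.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
lsi_docs = A @ Vt[:k].T

# COV: eigendecomposition of the covariance matrix; documents are first
# centered on their mean, which places the origin inside the cloud.
mean = A.mean(axis=0)
cov = np.cov(A, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)             # ascending eigenvalues
top_k = eigvecs[:, np.argsort(eigvals)[::-1][:k]]  # k largest
cov_docs = (A - mean) @ top_k

print(lsi_docs.shape, cov_docs.shape)  # both (4, 2)
```

Note that the mean of `cov_docs` is exactly zero by construction, which is the "origin inside the plane" property described above.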
There are several advantages to the COV method. The first advantage is better discrimination between document vectors.
The second advantage is that the computation scales better to larger databases.
The third advantage is that the computation can be distributed, in the sense that the subspace can be computed locally at the different sites over which the data is spread, as Qu et al. have shown. When the data is distributed across different locations, the principal components are computed locally, sent to a centralized location, and used there to compute the global principal components. This reduces the amount of data that must be transmitted and is therefore better than a parallel-processing approach. However, if the dominant components do not provide a good representation, it is suggested to include up to 33% more components to bring the accuracy close to that of the centralized counterpart implementation. The interesting thing here is that the size of the local data or of the overall data does not seem to matter, as long as the dominant components provide a good representation.
Another way to optimize this further has been to identify clusters (sets of documents that cover similar topics) during the pre-processing stage itself, so that they can be retrieved together to reduce query response time, or summarized for smaller storage when processing large data. This approach is called cluster retrieval and is based on the assumption that closely related documents often satisfy the same requests. Incidentally, this summarization and other data structures built during pre-processing can help alleviate a variety of problems with large data sets. Another interesting feature of both LSI and COV is that both methods recognize and preserve the naturally occurring overlap between clusters. These overlapping documents are interesting because they carry multiple meanings and are therefore useful for analyzing database content from different viewpoints. In the hard partitioning case, the cluster overlaps are removed, losing this rich information; in the soft partitioning case, they are preserved, which improves results. Neither algorithm is as successful at discriminating similar documents or minor document clusters. These are often mistaken for noise and discarded, yet such minor clusters and outliers carry significant information and can add valuable insights; they should be included as much as possible.
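The cluster-retrieval idea can be sketched as follows. Everything here is illustrative: the documents are already reduced to 2-d topic vectors, and the cluster assignments are made up. The point is the query path: the query is matched against precomputed cluster centroids rather than against every document, and the whole best-matching cluster is returned together, on the assumption that closely related documents satisfy the same requests.

```python
import numpy as np

# Toy pre-processing output: documents reduced to 2-d topic vectors and
# grouped into clusters of similar topics (assignments are illustrative).
docs = np.array([
    [0.9, 0.1], [0.8, 0.2],   # cluster 0: mostly topic A
    [0.1, 0.9], [0.2, 0.8],   # cluster 1: mostly topic B
    [0.55, 0.45],             # overlap document, close to both topics
])
labels = np.array([0, 0, 1, 1, 0])

def centroids(docs, labels):
    """One centroid per cluster, computed once during pre-processing."""
    return np.array([docs[labels == c].mean(axis=0) for c in np.unique(labels)])

def retrieve(query, docs, labels, cents):
    """Cluster retrieval: compare the query against the few centroids,
    then return every document in the best-matching cluster together."""
    sims = cents @ query / (np.linalg.norm(cents, axis=1) * np.linalg.norm(query))
    best = int(np.argmax(sims))
    return np.where(labels == best)[0]

cents = centroids(docs, labels)
print(retrieve(np.array([1.0, 0.0]), docs, labels, cents))  # -> [0 1 4]
```

A query about topic A pulls back the whole topic-A cluster, including the overlap document at index 4. A hard partitioning would have forced that document into exactly one cluster, losing the overlap information discussed above.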