Cluster computing

Wednesday, June 19, 2013

Cluster analysis:
A cluster is a collection of data objects that are similar to one another and dissimilar to the objects in other clusters. The quality of a cluster can be measured by the dissimilarity of its objects, which can be computed for different types of data: interval-scaled, binary, categorical, ordinal and ratio-scaled variables. Similarity between vector objects can be measured with the cosine measure or the Tanimoto coefficient. Cluster analysis can be used as a standalone data mining tool. Clustering methods can be grouped into partitioning methods, hierarchical methods, density-based methods, grid-based methods, model-based methods, methods for high-dimensional data (including frequent-pattern-based methods) and constraint-based methods.
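As a small illustration of the two vector measures mentioned above, here is a minimal sketch in plain Python. The function names and the sample vectors are made up for this example, and the Tanimoto coefficient is written in its common vector form, x·y / (x·x + y·y − x·y). Neither function handles all-zero vectors.

```python
import math

def dot(x, y):
    """Inner product of two equal-length numeric vectors."""
    return sum(a * b for a, b in zip(x, y))

def cosine_similarity(x, y):
    """Cosine of the angle between x and y (undefined for zero vectors)."""
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))

def tanimoto_coefficient(x, y):
    """Shared weight divided by the total weight held by either vector."""
    xy = dot(x, y)
    return xy / (dot(x, x) + dot(y, y) - xy)

# Two hypothetical term-presence vectors for a pair of documents.
doc_a = [1, 0, 1, 1]
doc_b = [1, 1, 0, 1]

cos_ab = cosine_similarity(doc_a, doc_b)
tan_ab = tanimoto_coefficient(doc_a, doc_b)
```

For binary vectors like these, the Tanimoto coefficient reduces to the number of shared attributes divided by the number of attributes present in either object.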
A partitioning method creates an initial set of k partitions and assigns each data point to the nearest cluster. It then iteratively recomputes the center of each cluster and reassigns the data points. Examples are k-means, k-medoids and CLARANS. A hierarchical method groups data points into a hierarchy, built either bottom-up or top-down. Iterative relocation can also be applied to the subclusters.
A density-based approach clusters objects based on a density function or according to the density of neighboring objects. Examples are DBSCAN, DENCLUE and OPTICS. A grid-based method first allocates the objects into a finite number of cells that form a grid structure, and then performs clustering on the grid structure. STING is an example of a grid-based method in which the cells contain statistical information. A model-based method hypothesizes a model for each of the clusters and finds the best fit of the data to that model. The self-organizing feature map is an example of a model-based method.
Clustering high-dimensional data has applications such as text documents. There are three approaches: dimension-growth subspace clustering, dimension-reduction projected clustering and frequent-pattern-based clustering. A constraint-based clustering method groups objects based on application-dependent or user-specified constraints.
Outlier detection and analysis can be useful for applications such as fraud detection. Outliers are detected based on statistical distribution, distance, density or deviation.
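The partitioning loop described above (assign each point to the nearest center, recompute the centers, repeat) can be sketched as a bare-bones k-means in plain Python. This is only an illustration under some assumptions: the function name, the fixed initial centers and the sample points are made up, Euclidean distance is used, and an empty cluster simply keeps its old center. It needs Python 3.8 or later for `math.dist`.

```python
import math

def k_means(points, initial_centers, max_iter=100):
    """Minimal k-means sketch: returns (centers, clusters)."""
    centers = [tuple(c) for c in initial_centers]
    clusters = [[] for _ in centers]
    for _ in range(max_iter):
        # Assignment step: put each point in the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its members.
        new_centers = []
        for center, members in zip(centers, clusters):
            if members:
                new_centers.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members)))
            else:
                new_centers.append(center)  # assumption: keep an empty cluster's center
        if new_centers == centers:
            break  # no center moved, so the assignment is stable
        centers = new_centers
    return centers, clusters

# Hypothetical 2-D data with two visibly separate groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
          (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centers, clusters = k_means(points, initial_centers=[(0.0, 0.0), (10.0, 10.0)])
```

Real implementations usually pick the initial centers at random or with a seeding heuristic and run several restarts, since the result depends on where the centers start.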
