Wednesday, June 19, 2013

Cluster analysis:
A cluster is a collection of data objects that are similar to one another and dissimilar to the objects in other clusters. The quality of a cluster can be measured based on the dissimilarity of its objects, which can be computed for different types of data such as interval-scaled, binary, categorical, ordinal and ratio-scaled variables. The similarity of vector data can be measured with the cosine measure or the Tanimoto coefficient. Cluster analysis can be used as a standalone data mining tool. Clustering methods can be grouped as partitioning methods, hierarchical methods, density-based methods, grid-based methods, model-based methods, methods for high-dimensional data (including frequent-pattern-based methods) and constraint-based methods.
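As a rough illustration of the two vector measures mentioned above, here is a minimal Python sketch. The function names are my own, and it assumes both vectors have the same length and are non-zero (a zero vector would cause a division by zero).

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between x and y: (x . y) / (|x| * |y|)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    # Assumption: neither vector is all zeros.
    return dot / (norm_x * norm_y)


def tanimoto_coefficient(x, y):
    """Tanimoto coefficient: (x . y) / (x . x + y . y - x . y)."""
    dot = sum(a * b for a, b in zip(x, y))
    sq_x = sum(a * a for a in x)
    sq_y = sum(b * b for b in y)
    # Assumption: at least one vector is non-zero.
    return dot / (sq_x + sq_y - dot)
```

For binary vectors the Tanimoto coefficient reduces to the number of shared attributes divided by the number of attributes present in either vector. For example, tanimoto_coefficient([1, 1, 0], [1, 0, 1]) gives 1/3.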
A partitioning method creates an initial set of k partitions and assigns the data points to the nearest clusters. It then iteratively recomputes the center of each cluster and reassigns the data points. Some examples are k-means, k-medoids, and CLARANS. A hierarchical method groups data points into a hierarchy by either bottom-up or top-down construction. Iterative relocation can also be applied to the sub-clusters. A density-based approach clusters objects based on a density function or according to the density of neighboring objects. Some examples are DBSCAN, DENCLUE and OPTICS. A grid-based method first allocates the objects into a finite number of cells that form a grid structure, and then performs clustering on the grid structure. STING is an example of a grid-based method in which the cells store statistical information. A model-based method hypothesizes a model for each of the clusters and finds the best fit of the data to the model. Self-organizing feature maps are an example of a model-based method. Clustering high-dimensional data has applications such as clustering text documents. There are three approaches: dimension-growth subspace clustering, dimension-reduction projected clustering and frequent-pattern-based clustering. A constraint-based clustering method groups objects based on application-dependent or user-specified constraints. Outlier detection and analysis can be useful for applications such as fraud detection. Outliers are detected based on statistical distribution, distance, density, or deviation.
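The iterative relocation used by partitioning methods, as described earlier in this paragraph, can be sketched in a few lines of Python. This is only a minimal k-means style sketch, not a reference implementation: the function name k_means and the max_iter parameter are my own, it uses Euclidean distance, and it assumes the first k points serve as the initial centers (real implementations usually pick them randomly or with a smarter seeding scheme).

```python
import math


def k_means(points, k, max_iter=100):
    """Partition points (tuples of numbers) into k clusters.

    Returns (centers, assignment), where assignment[i] is the index
    of the cluster that points[i] belongs to.
    """
    # Assumption: the first k points are used as the initial centers.
    centers = [tuple(p) for p in points[:k]]
    assignment = None
    for _ in range(max_iter):
        # Assign every point to its nearest center (Euclidean distance).
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Stop once no point changes cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Recompute each center as the mean of its member points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = tuple(
                    sum(col) / len(members) for col in zip(*members)
                )
    return centers, assignment
```

Calling k_means([(0, 0), (0, 1), (10, 10), (10, 11)], 2) separates the two points near the origin from the two points near (10, 10). k-medoids follows the same loop but restricts each center to be an actual data point, which makes it less sensitive to outliers.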
