Friday, November 29, 2013

This is a discussion of SSE, as brought up in the previous post. SSE is the sum of squared error, and it describes the scatter of a cluster: when the SSE is low, the clusters are well formed and cohesive, so SSE gives a measure of the quality of a clustering. A low SSE also means the clustering is well represented by its centroids. As the name suggests, SSE is computed by summing, for every point, the square of its distance from its centroid, where the distance is the Euclidean distance between two points. In symbols, SSE = sum over clusters i of sum over points x in cluster Ci of dist(x, ci)^2.

It can be shown that the centroid which minimizes the SSE of a cluster is the mean of its points. This is valuable because it describes the optimum. In K-means, for example, the points are assigned to their nearest centroid, which minimizes the SSE for the given set of centroids, and the centroids are then recomputed as cluster means, which minimizes the SSE for the given assignment. Notice, however, that these steps of K-means only find a local minimum with respect to the SSE. As we will see shortly, this can lead to sub-optimal clustering. Also, random initialization of the centroids results in a different SSE on each run, which causes variation in the end result. Typically the run with the lowest SSE is chosen as the best result.

Consider the following two issues: 1) Poor initial centroids - even when the initial centroids are reasonably well distributed, we can obtain a sub-optimal clustering with higher squared error, and 2) Limits of random initialization - one way to overcome the first problem is to perform multiple runs with different initial points and then choose the result with the least SSE; however, every one of those runs can still end up in a local minimum. A small sketch of this multiple-run approach follows below.
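As a rough sketch of these ideas (not code from the original post), here is a minimal NumPy version of K-means with random initialization, an SSE computation, and a best-of-N-runs loop that keeps the run with the lowest SSE. The function names (sse, kmeans, best_of_n_runs), the toy data, and the choice of Lloyd's algorithm with Euclidean distance are illustrative assumptions, not a definitive implementation.

import numpy as np

def sse(points, labels, centroids):
    # Sum of squared Euclidean distances from each point to its assigned centroid.
    diffs = points - centroids[labels]
    return float(np.sum(diffs ** 2))

def kmeans(points, k, n_iters=100, rng=None):
    # One run of Lloyd's algorithm from a random initialization (illustrative sketch).
    rng = np.random.default_rng() if rng is None else rng
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest centroid,
        # which minimizes SSE for the current centroids.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: the mean of each cluster minimizes SSE for the current assignment.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

def best_of_n_runs(points, k, n_runs=10, seed=0):
    # Run K-means several times and keep the result with the lowest SSE.
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_runs):
        labels, centroids = kmeans(points, k, rng=rng)
        err = sse(points, labels, centroids)
        if best is None or err < best[0]:
            best = (err, labels, centroids)
    return best

if __name__ == "__main__":
    # Three well-separated blobs as toy data (hypothetical example).
    rng = np.random.default_rng(42)
    blobs = [rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in ((0, 0), (5, 5), (0, 5))]
    X = np.vstack(blobs)
    err, labels, centroids = best_of_n_runs(X, k=3, n_runs=10)
    print("lowest SSE over 10 runs:", round(err, 2))

Note that even with many restarts, each run only converges to a local minimum of the SSE, so the best-of-N result is still not guaranteed to be the global optimum.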
