Sunday, December 1, 2013

We talked about evaluating clusters in the previous post, where we briefly introduced the various categories of measures. We can now look into a few details. For partitional clustering, we have measures based on cohesion and separation computed from the proximity matrix. For hierarchical clustering, we have the cophenetic correlation coefficient. We can also look into entropy, purity and the Jaccard measure. For the cohesion and separation measures, we take the proximity to be the squared Euclidean distance. For cohesion, we aggregate the distances between the points and their centroid. For separation, we take the between-cluster distances, whether minimum, maximum or average. The relationship between cohesion and separation is written as
TSS = SSE + SSB
where TSS is the total sum of squares,
SSE is the sum of squared error,
and SSB is the between-group sum of squares. The higher the total SSB, the more separated the clusters are.
Another thing to note here is that minimizing SSE (cohesion) automatically results in maximizing SSB (separation). This can be proved as follows. The total sum of squares is the sum of the squared distance of each point x from the overall mean c. The difference x - c can be written as (x - ci) + (ci - c), where ci is the centroid of the cluster containing x. Expanding the square gives (x - ci)^2 + 2(x - ci)(ci - c) + (ci - c)^2, and when we sum over the points of a cluster the cross term vanishes, because ci is the mean of that cluster. What remains is exactly SSE plus SSB. Since TSS depends only on the data and not on the clustering, lowering SSE must raise SSB and vice versa.
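To make the relationship concrete, here is a small sketch over hypothetical one-dimensional data with two fixed clusters. It computes SSE, SSB and TSS directly and shows that they satisfy TSS = SSE + SSB:

using System;
using System.Linq;

class TssCheck
{
    static void Main()
    {
        // Two hypothetical clusters of one-dimensional points.
        double[][] clusters = { new[] { 1.0, 2.0, 3.0 }, new[] { 8.0, 9.0, 10.0 } };
        double[] all = clusters.SelectMany(c => c).ToArray();
        double overallMean = all.Average();

        // SSE: squared distance of each point from its own cluster centroid.
        double sse = clusters.Sum(c => c.Sum(x => Math.Pow(x - c.Average(), 2)));

        // SSB: cluster size times squared distance of its centroid from the overall mean.
        double ssb = clusters.Sum(c => c.Length * Math.Pow(c.Average() - overallMean, 2));

        // TSS: squared distance of every point from the overall mean.
        double tss = all.Sum(x => Math.Pow(x - overallMean, 2));

        Console.WriteLine("SSE = {0}, SSB = {1}, SSE + SSB = {2}, TSS = {3}", sse, ssb, sse + ssb, tss);
    }
}

For this data the program prints SSE + SSB equal to TSS, as expected.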
We have so far talked about using these measures for all of the clusters taken together. In fact, these measures can also be used for individual clusters. For example, a cluster that has a high degree of cohesion may be considered better than one with a lower value. A low value also indicates that the cluster may need to be partitioned further to produce better clusters. Thus clusters can be ranked and processed based on these measures. Similarly, we can rank the objects within a cluster: an object that contributes more to the cluster's cohesion is likely to be near the core of the cluster, while the others are likely to be near the edge.
Lastly, we talk about the silhouette coefficient. This combines both cohesion and separation. It is computed in three steps.
For the ith object, calculate its average distance to all other objects in its cluster and call it ai
For the ith object and any cluster not containing the object, calculate the object's average distance to all the objects in that cluster. Take the minimum such value over those clusters and call it bi
For the ith object, the silhouette coefficient is given by (bi - ai) / max(ai, bi)
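A rough sketch of these three steps, using hypothetical one-dimensional points with fixed cluster labels (every cluster is assumed to have at least two members):

using System;
using System.Linq;

class Silhouette
{
    // points[i] is a one-dimensional point, labels[i] is its cluster id.
    static double Coefficient(double[] points, int[] labels, int i)
    {
        Func<int, double> avgDistTo = cluster =>
            Enumerable.Range(0, points.Length)
                      .Where(j => j != i && labels[j] == cluster)
                      .Select(j => Math.Abs(points[i] - points[j]))
                      .Average();

        double a = avgDistTo(labels[i]);                 // step 1: average distance within own cluster
        double b = labels.Distinct()
                         .Where(c => c != labels[i])
                         .Select(avgDistTo)
                         .Min();                         // step 2: minimum average distance to another cluster
        return (b - a) / Math.Max(a, b);                 // step 3
    }

    static void Main()
    {
        double[] pts = { 1.0, 1.5, 2.0, 8.0, 8.5, 9.0 };
        int[] labels = { 0, 0, 0, 1, 1, 1 };
        for (int i = 0; i < pts.Length; i++)
            Console.WriteLine("point {0}: {1:F3}", pts[i], Coefficient(pts, labels, i));
    }
}

The coefficient lies between -1 and 1: values close to 1 mean the object is much closer to its own cluster than to any other, while values near zero or below suggest it sits between clusters.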

Saturday, November 30, 2013

We can talk about some cluster evaluation or validation techniques. Each clustering algorithm can define its own type of validation. For example, we discussed that K-means can be evaluated based on SSE. Almost any clustering algorithm will find clusters in a dataset, even when the dataset has no natural cluster structure. If we take uniformly distributed data and apply the algorithms we discussed, such as DBSCAN and K-means, each of them will still report clusters (three, in the textbook example). This prompts us to do cluster validation, which involves the following aspects:
1) Detecting whether there is non-random structure in the data
2) Detecting the clustering tendency of a set of data.
3) Determining the correct number of clusters
4) Comparing the results of cluster analysis with an external labeling
5) Comparing two sets of clusters to determine which is better.
Items 1 to 3 do not rely on external data. Items 3 to 5 can be applied to the entire clustering or to individual clusters.
A validation measure may not apply equally well to all types of clusters, and some approaches are only applicable to two- or three-dimensional data. Further, if we obtain a value for a measure, we still have to evaluate whether it indicates a good match. The goodness of a match can be judged by looking at the statistical distribution of the value, i.e., how likely such a value is to occur by chance.
The measures may be classified into one of following three categories:
1) Unsupervised, where there is no external information, e.g., SSE. These are also called internal indices. There are two subcategories here: measures of cluster cohesion and measures of cluster separation.
2) Supervised, which measure the extent to which the clustering structure matches some external classification. These are also called external indices. An example of this type of measure is entropy, which measures how well cluster labels match externally supplied class labels.
3) Relative, which compare different clusterings or clusters. These are not a separate category per se but rather a different use of the other measures: two clusterings can be compared using either an unsupervised measure such as SSE or a supervised one such as entropy.
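As a small illustration of a supervised measure, the sketch below computes the entropy of each cluster against hypothetical, externally supplied class labels; lower entropy means the cluster is purer with respect to the external classes:

using System;
using System.Linq;

class ClusterEntropy
{
    static void Main()
    {
        // Hypothetical assignments: clusterLabels come from the algorithm, classLabels from external ground truth.
        int[] clusterLabels = { 0, 0, 0, 1, 1, 1 };
        string[] classLabels = { "A", "A", "B", "B", "B", "B" };

        foreach (int cluster in clusterLabels.Distinct())
        {
            var members = Enumerable.Range(0, clusterLabels.Length)
                                    .Where(i => clusterLabels[i] == cluster)
                                    .Select(i => classLabels[i])
                                    .ToArray();
            // Entropy = -sum over classes of p * log2(p), where p is the fraction of the
            // cluster's members that carry that class label.
            double entropy = members.GroupBy(c => c)
                                    .Select(g => (double)g.Count() / members.Length)
                                    .Sum(p => -p * Math.Log(p, 2));
            Console.WriteLine("cluster {0}: entropy = {1:F3}", cluster, entropy);
        }
    }
}

Here the first cluster mixes two classes and gets a non-zero entropy, while the second contains a single class and gets an entropy of zero.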
These and some previous posts have been reviewed from the chapter "Cluster Analysis: Basic Concepts and Algorithms" (Tan, Steinbach and Kumar, Introduction to Data Mining).

Bisecting K-means

A bisecting K-means algorithm overcomes some of the limitations of basic K-means with respect to the choice of initial centroids and the tendency to converge to a poor local minimum of the SSE. It is based on a simple idea: to obtain K clusters, split the set of all points into two clusters, select one of these clusters to split, and so on, until K clusters have been produced.
The steps for the bisecting K-means algorithm include the following:
1. Initialize the list of clusters with one that contains all the points
2. repeat
3. Remove a cluster from the list of clusters
4. Perform several trial bisections of the chosen cluster
5.    for i = 1 to number of trials do
6.       Bisect the selected cluster using basic K-means
7.    end for
8. Select the two clusters from the bisection with the lowest total SSE
9. Add these two clusters to the list of clusters
10. until the list of clusters contains K clusters.
There are a number of ways to choose which cluster to split. We can choose the largest cluster at each step, the one with the largest SSE, or use a criterion that combines size and SSE.
Further, we can refine the resulting clusters by using their centroids as the initial centroids for the basic K-means algorithm.
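Here is a rough sketch of the overall loop, under some simplifying assumptions: one-dimensional points, the cluster with the largest SSE is the one chosen for splitting, the bisection runs basic two-means for a fixed number of iterations, and the data values are made up for the example:

using System;
using System.Collections.Generic;
using System.Linq;

class BisectingKMeans
{
    static double Sse(List<double> c)
    {
        if (c.Count == 0) return 0;
        double m = c.Average();
        return c.Sum(x => (x - m) * (x - m));
    }

    // One trial bisection: basic two-means on a single cluster (assumes at least two points).
    static List<double>[] Bisect(List<double> cluster, Random rnd)
    {
        int i1 = rnd.Next(cluster.Count), i2 = rnd.Next(cluster.Count - 1);
        if (i2 >= i1) i2++;                                  // pick two distinct seed points
        double c1 = cluster[i1], c2 = cluster[i2];
        List<double> a = null, b = null;
        for (int iter = 0; iter < 10; iter++)
        {
            a = cluster.Where(x => Math.Abs(x - c1) <= Math.Abs(x - c2)).ToList();
            b = cluster.Where(x => Math.Abs(x - c1) > Math.Abs(x - c2)).ToList();
            if (a.Count > 0) c1 = a.Average();               // recompute the two centroids
            if (b.Count > 0) c2 = b.Average();
        }
        return new[] { a, b };
    }

    static void Main()
    {
        var points = new List<double> { 1, 2, 3, 8, 9, 10, 20, 21, 22 };
        int K = 3, trials = 5;
        var rnd = new Random(0);
        var clusters = new List<List<double>> { points };    // step 1: one cluster with all points
        while (clusters.Count < K)                            // steps 2-10
        {
            var chosen = clusters.OrderByDescending(Sse).First();   // step 3: split the largest-SSE cluster
            clusters.Remove(chosen);
            var best = Enumerable.Range(0, trials)                   // steps 4-8: several trial bisections,
                                 .Select(t => Bisect(chosen, rnd))   // keep the split with the lowest total SSE
                                 .OrderBy(split => Sse(split[0]) + Sse(split[1]))
                                 .First();
            clusters.Add(best[0]);                                   // step 9
            clusters.Add(best[1]);
        }
        foreach (var c in clusters)
            Console.WriteLine(string.Join(", ", c));
    }
}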

Friday, November 29, 2013

A discussion on SSE as brought up in the previous post. SSE is the sum of squared error and it describes the scatter of a cluster. When the SSE is low, the clusters are well formed and cohesive, so SSE gives a measure of the quality of a clustering. Also, when this measure is low, the clustering is better represented by its centroids. As the name suggests, SSE is computed by aggregating the squares of the Euclidean distance of each point from its centroid. It can be shown that the centroid that minimizes the SSE of a cluster is its mean, which is valuable because it tells us what the optimal centroid is. In K-means, for example, points are assigned to their nearest centroid, which minimizes the SSE for the given set of centroids, and the centroids are then recomputed so as to minimize the SSE again. Notice, however, that these steps of K-means only find a local minimum with respect to the SSE, which can lead to sub-optimal clustering. Also, random initialization of the centroids results in a different SSE in each run, causing variation in the end result; typically the run with the lowest SSE is chosen as the best result. Consider the following two issues: 1) poor initial centroids - even when the centroids are better distributed, we can obtain a sub-optimal clustering with higher squared error, and 2) limits of random initialization - one way to overcome the first issue is to perform multiple runs with varying initial points and then choose the result with the least SSE; however, this too may only find local minima.
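As a small illustration of the claim above that the mean minimizes a cluster's SSE, this sketch over hypothetical one-dimensional points compares the SSE obtained with the mean as the centroid against a few other candidate centroids:

using System;
using System.Linq;

class MeanMinimizesSse
{
    static double Sse(double[] cluster, double centroid)
    {
        return cluster.Sum(x => (x - centroid) * (x - centroid));
    }

    static void Main()
    {
        double[] cluster = { 2.0, 4.0, 5.0, 9.0 };
        double mean = cluster.Average();   // 5.0 for this data
        foreach (double candidate in new[] { mean, 4.0, 6.0, 9.0 })
            Console.WriteLine("centroid {0}: SSE = {1}", candidate, Sse(cluster, candidate));
        // The mean gives the smallest SSE; every other candidate gives a larger value.
    }
}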
Since we talked about the DBSCAN clustering method in the previous post, let's quickly revisit agglomerative hierarchical clustering as well. It is an important category of clustering algorithms. Hierarchical clustering can be agglomerative or divisive. Both are often displayed visually using a tree-like diagram called a dendrogram, which shows both the cluster-subcluster relationships and the order in which the clusters were merged or split. For sets of two-dimensional points, a hierarchical clustering can also be illustrated using a nested cluster diagram.
The algorithm proceeds like this:
1. Compute the proximity matrix, if necessary.
2. repeat
3. Merge the closest two clusters
4. Update the proximity matrix to reflect the proximity between new and original clusters.
5. until only one cluster remains.
Step 3 can be done using single link, complete link, or group average methods.
The single link is the minimum distance between clusters as measured between two points - one from each cluster that are nearest. The complete link is the maximum distance between clusters as measured between two points - one from each cluster that are farthest. And the group average takes the average of pairwise distances of all pairs of points from the two different clusters.
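For two small hypothetical clusters, the three definitions can be computed directly, as in this sketch (one-dimensional points, so the Euclidean distance is just the absolute difference):

using System;
using System.Linq;

class Linkage
{
    static void Main()
    {
        double[] clusterA = { 1.0, 2.0, 3.0 };
        double[] clusterB = { 6.0, 8.0, 10.0 };

        // All pairwise distances between a point in A and a point in B.
        double[] pairwise = clusterA.SelectMany(a => clusterB.Select(b => Math.Abs(a - b))).ToArray();

        Console.WriteLine("single link (MIN):   {0}", pairwise.Min());      // nearest pair
        Console.WriteLine("complete link (MAX): {0}", pairwise.Max());      // farthest pair
        Console.WriteLine("group average:       {0}", pairwise.Average());  // mean of all pairs
    }
}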
Proximity can also be measured in terms of the distances between the centroids of the clusters. Ward's method also uses centroids, but it measures the proximity between two clusters in terms of the increase in the sum of squared error (SSE) that results from merging them. SSE gives an indication of cluster quality, and Ward's method attempts to minimize the SSE of points from their cluster centroids.
Due to the use of the proximity matrix, the space complexity is O(m^2). The algorithm performs m-1 iterations, and in each one it searches and updates the proximity matrix, so the time complexity is O(m^3). With data structures such as a heap to keep the proximities organized, the complexity can be reduced to O(m^2 log m).
This kind of clustering does not globally optimize an objective function. Instead it uses the criteria described above to decide, locally at each step, which clusters to merge (or split). Such local decisions are easy to evaluate, but the time and space complexity of the overall run is high.

Thursday, November 28, 2013

Today we talk about the DBSCAN algorithm. It's a simple and effective density-based clustering algorithm that illustrates a number of important concepts. We first consider the notion of density. There are several methods for estimating density, and these are distinct from the similarity measures we discussed earlier. In the center-based approach, density is estimated for a particular point by counting the number of points within a specified radius of that point, including the point itself. This method is simple, but the density measured this way depends on the chosen radius. If the radius is large enough, then all points will have a density of m, the number of points in the dataset; if the radius is too small, then all points will have a density of 1.
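The center-based approach is just a count within a radius, as in this small sketch (hypothetical one-dimensional points and a radius of 0.5):

using System;
using System.Linq;

class CenterBasedDensity
{
    // Density of a point = number of points within the given radius, including the point itself.
    static int Density(double[] points, double point, double radius)
    {
        return points.Count(p => Math.Abs(p - point) <= radius);
    }

    static void Main()
    {
        double[] points = { 1.0, 1.2, 1.4, 5.0, 9.0, 9.1 };
        foreach (double p in points)
            Console.WriteLine("point {0}: density = {1}", p, Density(points, p, 0.5));
    }
}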
By this approach, points will either lie in the interior of a density-based cluster, or they will lie on the border between the neighborhoods of several core points, or they may be noise points that are neither core points nor close enough to any core point to be border points. Thus points in this approach are classified as core points, border points and noise points.
The algorithm proceeds like this. Any two core points that are close enough i.e. within the radius of one another are put in the same cluster. Any border point that is close enough to a cluster is included within the same cluster as a core point. If there are ties between two or more clusters, they are resolved by choosing the one that's closest. Noise points are discarded.
So the steps can be enumerated as follows:
1. Label the points as core, border and noise points
2. Eliminate the noise points
3. Put an edge between all core points that are within the specified distance of one another
4. Make each group of connected core points into a separate cluster
5. Assign each border point to one of the clusters of its associated core points.
For each point we find the points that lie within its neighborhood, so the complexity of this algorithm is O(m^2). However, data structures such as KD-trees allow the points within a given distance of a point to be retrieved efficiently, which can bring the complexity down to O(m log m) for low-dimensional data. The space complexity is linear because we only persist a small amount of metadata for each point, namely its cluster label and its classification as a core, border or noise point.
The parameters for this algorithm are the radius (Eps) and the minimum number of points (MinPts) used to distinguish core points from border and noise points.
This algorithm has trouble when the densities of the clusters vary widely. It also has difficulty with high-dimensional data, but it remains one of the simpler and more effective ways of clustering.
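A rough sketch of these steps on hypothetical one-dimensional points; the Eps and MinPts values are made up, and a real implementation would use a spatial index for the neighborhood queries:

using System;
using System.Collections.Generic;
using System.Linq;

class SimpleDbscan
{
    static void Main()
    {
        double[] pts = { 1.0, 1.1, 1.2, 1.3, 5.0, 9.0, 9.1, 9.2, 9.3 };
        double eps = 0.5;                       // radius (Eps)
        int minPts = 3;                         // minimum neighborhood size for a core point (MinPts)
        int[] labels = new int[pts.Length];     // cluster id per point; 0 means noise/unassigned

        // Neighborhood of a point: all points within eps of it, including itself.
        Func<int, List<int>> neighbors = idx =>
            Enumerable.Range(0, pts.Length).Where(j => Math.Abs(pts[idx] - pts[j]) <= eps).ToList();

        // Step 1: label the core points.
        bool[] core = Enumerable.Range(0, pts.Length).Select(idx => neighbors(idx).Count >= minPts).ToArray();

        // Steps 3-5: grow a cluster from each unassigned core point by following core points that
        // lie within eps of one another; border points are swept into the cluster of an associated
        // core point but are not expanded further.
        int nextCluster = 0;
        for (int i = 0; i < pts.Length; i++)
        {
            if (!core[i] || labels[i] != 0) continue;
            nextCluster++;
            var stack = new Stack<int>();
            stack.Push(i);
            while (stack.Count > 0)
            {
                int p = stack.Pop();
                if (labels[p] != 0) continue;
                labels[p] = nextCluster;
                if (core[p])
                    foreach (int q in neighbors(p))
                        if (labels[q] == 0) stack.Push(q);
            }
        }

        // Step 2, implicitly: anything still unassigned is a noise point.
        for (int i = 0; i < pts.Length; i++)
            Console.WriteLine("{0}: {1}", pts[i],
                labels[i] == 0 ? "noise" : "cluster " + labels[i] + (core[i] ? " (core)" : " (border)"));
    }
}

On this data the two dense groups become two clusters and the isolated point at 5.0 is reported as noise.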
In order for us to work with a plugin for an editor such as MS Word, as mentioned in the previous post, we need to visit the object model exposed by that application. The Word object model comprises the following five top-level objects:
Application object - represents the word application
Document object - represents the document and all of its content
Selection object - represents the area that is currently selected
Range object - represents a contiguous area in a document, defined by a starting and an ending character position. It exists only while the code that defines it is running and, unlike a bookmark, is neither visible in the UI nor saved with the document.
Bookmark object - represents a contiguous area in a document with a start and end position and is persisted.
As you may see, they closely represent the hierarchy seen in the user interface.
At the top of this hierarchy is the application object. The application object contains document, selection, bookmark and range objects. Both the document object and the selection object can contain bookmark and Range objects.
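For instance, from inside the ThisDocument class of a VSTO document-level customization (an assumption for this sketch), the hierarchy can be navigated roughly like this:

Word.Application app = this.Application;          // Application object: the running Word instance
Word.Document doc = app.ActiveDocument;           // Document object: the document and its content
Word.Selection sel = app.Selection;               // Selection object: whatever the user currently has selected
Word.Range para1 = this.Paragraphs[1].Range;      // Range object over the first paragraph (collections are 1-based)
int bookmarkCount = doc.Bookmarks.Count;          // Bookmark objects hang off the Document object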
Word also has content control objects. These let us control the input and the presentation of text and other types of content, such as a date picker or a combo box. They help you do the following:
prevent users from editing or deleting parts of a document, and
bind parts of a document or template to data.
One of the content controls is the rich text content control. A rich text control can contain text or other items such as tables, pictures or other controls.
A rich text content control has a DefaultTextStyle member and a PlaceholderText member. The former gets the name of the character style used to format the text and the latter gets or sets the text that is displayed in the control until the user enters content.
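As a rough sketch (assuming a VSTO document-level project; the control name "summaryControl" and the placeholder string are made up for the example), a rich text content control can be added at the current selection like this:

// Add a rich text content control where the cursor or selection currently is.
Word.Range where = this.Application.Selection.Range;
Microsoft.Office.Tools.Word.RichTextContentControl summary =
    this.Controls.AddRichTextContentControl(where, "summaryControl");
summary.PlaceholderText = "Type a short summary here";   // shown until the user enters content
string style = summary.DefaultTextStyle;                  // character style used to format the text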
A document also has XMLNode controls to directly manipulate the markup and its hierarchy. This lets you define XML content and a schema that can be queried with XPath and displayed with content control objects. This is done by adding a custom XML part containing the data from the XML and then binding the controls to the custom XML part. XMLNodes can have a SmartTag object that represents the smart tag associated with the node.
The Range object can be used to format text in a Word document and to run find-and-replace over a span of text. A sketch, assuming a VSTO document-level customization where this refers to the document:
// Take a Range over the whole document body; this.Paragraphs[1].Range or this.Sentences[1]
// would give a narrower range (Word collections are 1-based).
object start = this.Content.Start;
object end = this.Content.End;
Word.Range rng = this.Range(ref start, ref end);
rng.Font.Size = 14;
rng.Font.Name = "Arial";
rng.ParagraphFormat.Alignment = Word.WdParagraphAlignment.wdAlignParagraphCenter;  // e.g. center alignment
object indentStyle = "Normal Indent";
rng.set_Style(ref indentStyle);
rng.Select();
// Replace every occurrence of "find me" in the range with "Found".
rng.Find.ClearFormatting();
rng.Find.Text = "find me";
rng.Replacement.ClearFormatting();
rng.Replacement.Text = "Found";
object replaceAll = Word.WdReplace.wdReplaceAll;
object missing = System.Type.Missing;
rng.Find.Execute(ref missing, ref missing, ref missing, ref missing, ref missing,
    ref missing, ref missing, ref missing, ref missing, ref missing,
    ref replaceAll, ref missing, ref missing, ref missing, ref missing);