Friday, November 15, 2013

Today let us focus on an implementation topic for text categorization software: how do we design a class for a clusterer? Let us take a specific example and generalize it later. We want a K-means clusterer that assigns text documents to categories, or clusters. Text documents are represented by term vectors. The distance between a document and a cluster is measured by a distance function or metric. The number of documents can vary and can be quite large. The number of clusters is fixed. Each cluster has a center, and for any document we can find the distance to the nearest center. The clusterer has a method called Classify. It reads the input documents, computes the distances, assigns the clusters, adjusts the cluster centers and repeats until the cluster centers stabilize. The mechanism of the Classify method is known. The class needs to be aware of the following:
cluster centers
distance function
document set
The Classify method produces an output in which each document has been assigned a category label, or cluster.
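The class described above can be sketched roughly as follows. This is only a minimal illustration under some assumptions: the names KMeansClusterer, nearest and euclidean are hypothetical, the initial centers are supplied by the caller, documents are dense term vectors of equal length, and "stabilize" is taken to mean that the centers do not change at all between two iterations.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance between two term vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


class KMeansClusterer:
    """Hypothetical sketch: holds the cluster centers and a distance function."""

    def __init__(self, centers: List[Vector],
                 distance: Callable[[Vector, Vector], float] = euclidean,
                 max_iterations: int = 100) -> None:
        self.centers = [list(c) for c in centers]  # fixed number of clusters
        self.distance = distance
        self.max_iterations = max_iterations

    def nearest(self, doc: Vector) -> int:
        """Index of the center closest to the document."""
        return min(range(len(self.centers)),
                   key=lambda i: self.distance(doc, self.centers[i]))

    def classify(self, documents: List[Vector]) -> List[int]:
        """Assign a cluster label to each document, repeating until the centers stop moving."""
        labels: List[int] = []
        for _ in range(self.max_iterations):
            # Assignment step: label each document with its nearest center.
            labels = [self.nearest(d) for d in documents]
            # Update step: move each center to the mean of its members.
            new_centers = []
            for i, old in enumerate(self.centers):
                members = [d for d, l in zip(documents, labels) if l == i]
                if members:
                    new_centers.append([sum(col) / len(members) for col in zip(*members)])
                else:
                    new_centers.append(old)  # keep the center of an empty cluster
            if new_centers == self.centers:
                break  # centers have stabilized
            self.centers = new_centers
        return labels


docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
clusterer = KMeansClusterer(centers=[[1.0, 0.0], [0.0, 1.0]])
labels = clusterer.classify(docs)
```

Here the document set is passed to classify rather than stored on the object; keeping it as a member would work just as well.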
How do we test the classifier? We give it different sets of inputs and check the labels. What can we vary for this classifier? Surely we can vary the sample space: we can include a small number or a large number of documents, and we can check that each document is assigned properly. We can vary the number of clusters. We can check the output for different distance functions. We may have to manually classify the samples before we run the clusterer. But how do we check the correctness of the program? We want an invariant, an initial condition and a final state. How do we check the progress, especially when the Classify method takes a long time? We could add some supportability, for example by reporting progress on each iteration.
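One possible way to make these checks concrete is sketched below. The invariant chosen here is my assumption, not something fixed by the design: with squared Euclidean distance and mean updates, the total distance from documents to their assigned centers should never increase from one iteration to the next. The names kmeans_with_trace, check_invariants and on_progress are hypothetical; the callback stands in for whatever logging or supportability hook is available.

```python
def squared_euclidean(a, b):
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(doc, centers):
    """Index of the center closest to the document."""
    return min(range(len(centers)), key=lambda i: squared_euclidean(doc, centers[i]))


def kmeans_with_trace(documents, centers, max_iterations=100, on_progress=None):
    """K-means that records the total cost per iteration and reports progress."""
    centers = [list(c) for c in centers]  # initial condition: caller-supplied centers
    labels, costs = [], []
    for iteration in range(max_iterations):
        labels = [nearest(d, centers) for d in documents]
        cost = sum(squared_euclidean(d, centers[l]) for d, l in zip(documents, labels))
        costs.append(cost)
        if on_progress is not None:
            on_progress(iteration, cost)  # supportability hook for long runs
        new_centers = []
        for i, old in enumerate(centers):
            members = [d for d, l in zip(documents, labels) if l == i]
            new_centers.append(
                [sum(col) / len(members) for col in zip(*members)] if members else old)
        if new_centers == centers:
            break  # final state: centers no longer move
        centers = new_centers
    return labels, centers, costs


def check_invariants(documents, k, labels, costs, tolerance=1e-9):
    """Every document has one valid label, and the cost never increases."""
    assert len(labels) == len(documents)
    assert all(0 <= l < k for l in labels)
    assert all(later <= earlier + tolerance for earlier, later in zip(costs, costs[1:]))


progress = []
docs = [[0.0], [1.0], [9.0], [10.0]]
labels, centers, costs = kmeans_with_trace(
    docs, centers=[[0.0], [1.0]],
    on_progress=lambda iteration, cost: progress.append((iteration, cost)))
check_invariants(docs, 2, labels, costs)
```

The manually classified samples mentioned above can then be compared against the returned labels, while the invariant check guards the iterations in between.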
Next we could consider some design factors for flexibility.
We could decouple the distance-calculating function from the clusterer. A design pattern such as the strategy pattern would let us switch different distance functions into the class.
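A rough sketch of the strategy pattern for the distance function might look like this. The class names DistanceStrategy, EuclideanDistance, CosineDistance and NearestCenterAssigner are hypothetical, and only the assignment step is shown to keep the example short.

```python
import math
from abc import ABC, abstractmethod


class DistanceStrategy(ABC):
    """Interface that every interchangeable distance function implements."""

    @abstractmethod
    def distance(self, a, b):
        ...


class EuclideanDistance(DistanceStrategy):
    def distance(self, a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class CosineDistance(DistanceStrategy):
    def distance(self, a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        # Treat a zero vector as maximally distant to avoid dividing by zero.
        return 1.0 - dot / norm if norm else 1.0


class NearestCenterAssigner:
    """Uses whichever strategy it is given; it knows nothing about the metric itself."""

    def __init__(self, centers, strategy):
        self.centers = centers
        self.strategy = strategy

    def assign(self, doc):
        return min(range(len(self.centers)),
                   key=lambda i: self.strategy.distance(doc, self.centers[i]))


centers = [[1.0, 0.0], [10.0, 10.0]]
doc = [2.0, 2.0]
by_euclidean = NearestCenterAssigner(centers, EuclideanDistance()).assign(doc)
by_cosine = NearestCenterAssigner(centers, CosineDistance()).assign(doc)
```

The same document lands in different clusters depending on the strategy: it is close to the first center in absolute terms, but points in the same direction as the second.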
We could also parameterize the input and the output so that the clusterer does not need to know what kind of vectors it clusters. We could create a base class for the clusterer and derive others from it.
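One way this parameterization could look is sketched below, assuming a generic base class and a derived K-means class that receives both the distance and the averaging function from the caller. The names Clusterer, GenericKMeans and the mean parameter are hypothetical.

```python
from abc import ABC, abstractmethod
from typing import Callable, Generic, List, TypeVar

T = TypeVar("T")  # the item type: term vectors, numbers, or anything else


class Clusterer(ABC, Generic[T]):
    """Base class: any clusterer maps items to integer cluster labels."""

    @abstractmethod
    def classify(self, items: List[T]) -> List[int]:
        ...


class GenericKMeans(Clusterer[T]):
    """K-means that never inspects its items; the caller supplies distance and mean."""

    def __init__(self, centers: List[T],
                 distance: Callable[[T, T], float],
                 mean: Callable[[List[T]], T],
                 max_iterations: int = 100) -> None:
        self.centers = list(centers)
        self.distance = distance
        self.mean = mean
        self.max_iterations = max_iterations

    def classify(self, items: List[T]) -> List[int]:
        labels: List[int] = []
        for _ in range(self.max_iterations):
            labels = [min(range(len(self.centers)),
                          key=lambda i: self.distance(item, self.centers[i]))
                      for item in items]
            new_centers = []
            for i, old in enumerate(self.centers):
                members = [item for item, l in zip(items, labels) if l == i]
                new_centers.append(self.mean(members) if members else old)
            if new_centers == self.centers:
                break  # centers have stabilized
            self.centers = new_centers
        return labels


# Usage with plain numbers instead of term vectors.
number_clusterer = GenericKMeans(
    centers=[0.0, 10.0],
    distance=lambda a, b: abs(a - b),
    mean=lambda members: sum(members) / len(members))
number_labels = number_clusterer.classify([0.0, 1.0, 9.0, 10.0])
```

Other clustering methods could derive from the same base class and be swapped in wherever a Clusterer is expected.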
