Saturday, April 15, 2017

Two-level based Clustering 
The goal of clustering is typically to improve the intracluster similarity while reducing the inter cluster similarity. Therefore the true quality of a cluster is measured by this so called  
“internal criteria” but it is difficult to capture and hence measured by goodness of fit metrics which are called the “external criteria”.  Examples of the external criteria include the purity, normalized mutual information, the rand index and the F-measure.  
While any clustering algorithm can create clusters, the quality of a cluster determines the choice of the clustering algorithm.  Some algorithms create a noise cluster that grabs all the outliers. Generally there is only one noise cluster formed as all the other instances are classified.  But if we were to introduce a sliding scale of outliers, we could zone in on the salient categories with better accuracy.  
When clusters are big and nebulous, they are not as helpful as when they are all cohesive, small and ranked. On the other hand if we were to form micro-clusters and then introduce a forest of dendograms of the salient topics with the help of a sliding scale on the measures, there is a better possibility of finding the summary of the instances.  In other words, clustering need not be exclusively partitioned and hierarchical but composite.  To prevent misunderstanding of micro-clusters as instances themselves, the metric should articulate the variance of the cluster members. One possibility is to introduce an absolute scale to represent all possible themes in the instances and then use subdivisions of the grades to form microclusters. The second level organization of micro-clusters using a hierarchical algorithm gives the idea of the themes.  This two-level technique is used even in data structures like a B-Tree where keys and data form two different levels. 
The aspiration for such as technique is to improve document classification
#codingexercise
// Find number of ways to increase LCS by 1
The technique here is that the LCS can be increased by one with the addition of a single letter. Therefore this letter is all that we need to work with. This letter may occupy one of the m+1 positions in the first string where m is the length of the first and n is the length of the second. Moreover this letter can vary from a to z. 
In each insertion, the substring before the insertion and after the insertion must contribute individually to the total for the LCS. 
Therefore the pseudo code looks something like this:
// Find number of ways to increase LCS by 1 
// s1 and e1 are start of string x
// s2 and e2 are start of string y
// GetLCS is the well known algorithm to compute the Length of the LCS and described earlier
// The rest is recursion
int GetLCSPlus1Ways(string x, string y, int s1, int e1, int s2, int e2)  
int count = GetLCS(x,y,s1,e1,s2,e2); 
for( int i = s1; i <= e1; i++) 
    for (char c = 'a'; c <= 'z'; c++) 
     { 
           int p = GetPosition(y, c); 
           if (p != -1){ 
                  if ( GetLCSPlus1Ways(s1,i,s2,p-1)  + GetLCSPlus1Ways(i+1, e1, p+1, e2) == GetLCS(x,y,s1,e1,s2,e2))
                       count ++
           } 
     } 

return count; 
}

Note that in the recursion, we are trying to say that the substring prior to the candidate insertion index in the first string and the substring before the matching letter in the second substring must contribute towards the total lcs length as before the insertion and the same holds for the substring on the other side. This is because the subsequence that is common to both the string 

No comments:

Post a Comment