Saturday, April 15, 2017

Two-level Clustering
The goal of clustering is typically to increase intra-cluster similarity while reducing inter-cluster similarity. The true quality of a clustering is therefore measured by this so-called
“internal criterion”, but it is difficult to capture directly, so quality is usually judged against a gold standard with goodness-of-fit metrics called the “external criteria”. Examples of external criteria include purity, normalized mutual information, the Rand index and the F-measure.
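For instance, purity, one of those external criteria, can be sketched as below; representing each cluster as the list of gold-standard labels of its members is just an assumption for the example.
// A minimal sketch of purity: credit each cluster with its most frequent gold label, then normalize.
// (uses System.Collections.Generic and System.Linq)
double Purity(List<List<string>> clusterLabels, int totalInstances)
{
    int credit = 0;
    foreach (var cluster in clusterLabels)
        credit += cluster.GroupBy(label => label).Max(g => g.Count());   // size of the majority label
    return (double)credit / totalInstances;
}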
While any clustering algorithm can create clusters, the quality of the resulting clusters determines the choice of algorithm. Some algorithms create a noise cluster that absorbs all the outliers. Generally only one noise cluster is formed, since all the other instances are classified. But if we were to introduce a sliding scale for what counts as an outlier, we could zero in on the salient categories with better accuracy.
When clusters are big and nebulous, they are not as helpful as when they are cohesive, small and ranked. If, on the other hand, we form micro-clusters and then build a forest of dendrograms over the salient topics with the help of a sliding scale on the measures, there is a better chance of finding a summary of the instances. In other words, clustering need not be exclusively partitional or hierarchical; it can be a composite of the two. To prevent micro-clusters from being mistaken for instances themselves, the metric should articulate the variance of the cluster members. One possibility is to introduce an absolute scale that represents all possible themes in the instances and then use subdivisions of the grades to form micro-clusters. A second-level organization of the micro-clusters with a hierarchical algorithm then surfaces the themes. This two-level technique is used even in data structures such as a B-Tree, where keys and data form two different levels.
The aspiration for such a technique is to improve document classification.
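A minimal sketch of the two-level idea follows, assuming each instance has already been reduced to a single score on an absolute scale in [0, 1); the bucket count, the merge threshold and the greedy merge of adjacent micro-clusters are illustrative choices rather than a prescribed algorithm.
// Level one: micro-clusters from subdivisions of the absolute scale.
// Level two: agglomerate adjacent micro-clusters whose means are close.
// (uses System, System.Collections.Generic and System.Linq)
List<List<double>> TwoLevelCluster(List<double> scores, int buckets, double mergeThreshold)
{
    var micro = new List<double>[buckets];
    for (int b = 0; b < buckets; b++) micro[b] = new List<double>();
    foreach (var s in scores)
        micro[Math.Min(buckets - 1, (int)(s * buckets))].Add(s);      // grade subdivision
    var clusters = new List<List<double>>();
    foreach (var m in micro)
    {
        if (m.Count == 0) continue;
        if (clusters.Count > 0 && Math.Abs(clusters[clusters.Count - 1].Average() - m.Average()) < mergeThreshold)
            clusters[clusters.Count - 1].AddRange(m);                  // second-level agglomeration
        else
            clusters.Add(new List<double>(m));
    }
    return clusters;
}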
#codingexercise
// Find number of ways to increase LCS by 1
The idea is that the LCS can only be increased by one through the insertion of a single letter, so this letter is all we need to work with. The letter may occupy any one of the m+1 positions in the first string, where m is the length of the first string and n is the length of the second, and it can be any letter from a to z.
For each insertion, the substrings before and after the insertion point must each contribute their share so that, together, they still account for the original LCS; the inserted letter, matched in the second string, then raises the total by exactly one.
Therefore the pseudo code looks something like this:
// Find number of ways to increase LCS by 1
// s1..e1 and s2..e2 are the inclusive index ranges into x and y (an empty range contributes 0)
// GetLCS is the well known algorithm to compute the length of the LCS, described earlier
int GetLCSPlus1Ways(string x, string y, int s1, int e1, int s2, int e2)
{
    int lcs = GetLCS(x, y, s1, e1, s2, e2);
    int count = 0;
    for (int i = s1; i <= e1 + 1; i++)              // insertion position: one of the m+1 slots
    {
        for (char c = 'a'; c <= 'z'; c++)           // candidate letter
        {
            // any occurrence of c in y can serve as the match for the inserted letter
            for (int p = s2; p <= e2; p++)
            {
                if (y[p] != c) continue;
                // the parts on either side of the insertion must still add up to the original LCS
                if (GetLCS(x, y, s1, i - 1, s2, p - 1) + GetLCS(x, y, i, e1, p + 1, e2) == lcs)
                {
                    count++;
                    break;                          // count this (position, letter) pair once
                }
            }
        }
    }
    return count;
}

Note that in the check, the substring prior to the candidate insertion index in the first string and the substring before the matching letter in the second string must contribute the same amount towards the total LCS length as before the insertion, and the same holds for the substrings on the other side. This is because the subsequence that is common to both strings remains common after the insertion; the inserted letter simply matches one more position and extends it by exactly one.

Friday, April 14, 2017

Archiving 
As  new assets are added to an inventory and the old ones retired, the inventory grows  to a very large size. The number of active assets within the inventory may only be a fraction. Moreover, this active set of assets may be sprawled all over the table used to list the inventory. Consequently software engineers perform a technique called archiving which moves older and unused records to a secondary table. This technique is robust and involves some interesting considerations. 
For example, the assets are continuously added and retired. Therefore there is no fixed set to work with, and the records may have to be scanned again and again. Fortunately, the assets on which the archiving action needs to be performed do not accumulate forever, since the archival catches up with the rate of retirement.
Also, the retirement policy may depend not just on age but on several other attributes of the assets. Therefore the archival may keep the policy as separate logic against which each asset is evaluated. Since the archival is expected to run over and over again, it is convenient to revisit each asset with this criterion to see if it can now be retired.
Moreover, the archival action may fail, and the source and destination must remain clean without any duplication or partial entries. Consider the case when the asset is removed from the source but not added to the destination: it may be lost forever if the archival fails before the retired asset makes it to the destination table. Similarly, if the asset has already been moved to the destination table, there must not be another entry for the same asset when the archival runs again and finds the original entry lingering in the source table.
This leads to a policy where the selection of the asset, the insertion into the destination and the removal from the source are done in a transaction that guarantees all parts of the operation either succeed or are rolled back to just before the operation started. But this can be relaxed by adding checks to each statement so that every operation can be taken again and again on the same asset, with a forward-only movement of the asset from the source to the destination. This is often referred to as reentrant logic and helps take action on the assets in a failsafe manner without requiring locks and logs around the overall set of select, insert and delete.
Lastly, the set of three actions mentioned above works on only one asset at a time. This is a prudent and mature consideration because the storage requirement and the possible changes to the assets are minimized when we work on one asset at a time instead of several. Consider the case when an asset is faultily marked as ready for retirement and then reverted. If it were part of a batch of, say, ten assets being archived, the other nine might have to be rolled back and the actions repeated without the offending one. On the other hand, if we work with only one asset at a time, the rest of the inventory is untouched.
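To make this concrete, here is a minimal sketch of reentrant, one-asset-at-a-time archival; the Asset type and the Policy, Archive and Source helpers are placeholders assumed for the illustration.
// A sketch only: every step checks before it acts, so the whole routine can be repeated safely after a failure.
void ArchiveOne(Asset asset)
{
    if (!Policy.IsRetirable(asset)) return;      // re-evaluate the retirement criteria on every pass
    if (!Archive.Exists(asset.Id))               // insert only if not already archived
        Archive.Insert(asset);
    if (Source.Exists(asset.Id))                 // remove only if still lingering in the source
        Source.Delete(asset.Id);                 // forward-only: the asset never moves back
}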
Thus the technique for archival has many considerations that can be applied upfront in the design before the policy is implemented and executed.
#codingexercise
Find the length of the longest common subsequence (LCS) of two strings:
// X and Y are the strings; m and n are the lengths of the prefixes under consideration
int GetLCS(String X, String Y, int m, int n)
{
if (m == 0 || n == 0)
     return 0;
if (X[m-1] == Y[n-1])            // last characters match: they extend the LCS by one
   return 1 + GetLCS(X, Y, m-1, n-1);
else                             // otherwise drop the last character of one string or the other
   return Math.Max(GetLCS(X, Y, m, n-1), GetLCS(X, Y, m-1, n));
}
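The recursion above is exponential in the worst case; as a sketch, the same recurrence can be filled in bottom-up in O(mn) time:
// Bottom-up version of the same LCS recurrence
int GetLCSDP(String X, String Y)
{
    int m = X.Length, n = Y.Length;
    var L = new int[m + 1, n + 1];
    for (int i = 1; i <= m; i++)
        for (int j = 1; j <= n; j++)
            L[i, j] = (X[i - 1] == Y[j - 1]) ? L[i - 1, j - 1] + 1
                                             : Math.Max(L[i - 1, j], L[i, j - 1]);
    return L[m, n];
}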

Thursday, April 13, 2017

We have been discussing Inventory with techniques for organization. We discussed storage options with separation of static and dynamic. But we also give some consideration to size. When the inventory is in a flat list, it becomes enormous to manage. Software engineers have a simple and efficient technique to manage the assets. It involves partitioning to divide the assets into manageable chunks but it doesn’t stop at that.  Instead of directly managing the assets, a separate table of consolidators is maintained. Assets in the inventory are managed with the help of consolidators. For example, a region-based consolidator may facilitate the servicing of resources specific to that region. If there were different geographic regions, there would be different consolidators. The inventory points to an entry for a resource but only the consolidator takes actions on the resource. 
Since a consolidator is assigned to a specific set of resources based on a specific criterion, it usually bears a name that suggests this criterion. For example, it could have a name that reflects the region to which the resources belong. Consolidators also have an address to which requests can be directed so that the corresponding actions can be taken on the resources. When new resources have to be added and the existing regions become saturated, a new region is carved out and the new resources are added to it. This helps expand the inventory along different regions or dimensions depending on the criteria for the consolidators. Consolidators can also be hosts; while they may be similar in nature to the resources, they are special resources because they host other resources as tenants. In this case the hosts and resources become separate entities. Generally hosting is nested only up to one level. This keeps the resource distinct from the host; otherwise hosts would not be distinguishable from resources, in which case there would simply be a host attribute in the resource table itself and there would not be any need for consolidators. If actions are performed with the help of consolidators, then the states on the resources may need to be updated after the actions are taken. Hosts need not be of the same kind and may span virtualization platforms, flavors, releases and other varieties, but they all behave as consolidators for their resources.
Consolidators serve to partition the resources and to consolidate actions against a set of resources.
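As a rough illustration of this design, the sketch below routes an action through a region-based consolidator; the Resource and Consolidator types and their members are assumptions for the example, not part of any particular system.
// The inventory only points at resources; the consolidator for a region is the one that acts on them.
// (uses System.Collections.Generic)
class Resource { public string Id; public string Region; public string State; }
class Consolidator
{
    public string Region;                           // the criterion this consolidator is named after
    public string Address;                          // where requests for its resources are directed
    public void Apply(Resource r, string action) { r.State = action; }
}
// route an action on a resource through the consolidator for its region
void Act(Dictionary<string, Consolidator> byRegion, Resource r, string action)
{
    if (byRegion.TryGetValue(r.Region, out var c))
        c.Apply(r, action);
}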
#codingexercise
We were looking at an exercise involving count of all numbers formed by subsequences in  the digits representing a number. We stated that this is in fact represented by  binomial coefficients. Let's take an example and see the pattern:
a0, a1, a2
can be combined as a0 + a1 + a2  + a0a1 + a1a2 + a0a2 + (a0a1a2 ) resulting in 3C1 + 3C2 + 3C3 = 7 subsequences
Since we know the individual count of increasing length subsequences, we can now easily count different kinds of subsequences. We said that the enumeration of the subsequences corresponding to each binomial coefficient k can be done by selecting the combinations with length corresponding to the binomial coefficient.
With the enumeration as follows:
// Enumerate all subsequences (order-preserving combinations) of the digits.
// b holds the current partial combination; start is the next index to consider.
void Combine(List<int> digits, List<int> b, int start, List<List<int>> combinations)
{
    for (int i = start; i < digits.Count; i++)
    {
        b.Add(digits[i]);                           // include digits[i]
        combinations.Add(new List<int>(b));         // record a copy of this combination
        Combine(digits, b, i + 1, combinations);    // extend it with later digits
        b.RemoveAt(b.Count - 1);                    // backtrack
    }
}
We now observe an interesting pattern in this enumeration: each unit-length subsequence heads a block of subsequences that begin with that character, and the counts nest as the binomial-coefficient sums of the successive suffixes. For example, xyz is enumerated as
x, xy, xyz, xz   (everything from here on lies within the 3C1 + 3C2 + 3C3 = 7 subsequences of the suffix xyz)
y, yz            (everything from here on lies within the 2C1 + 2C2 = 3 subsequences of the suffix yz)
z                (the 1C1 = 1 subsequence of the suffix z)

Wednesday, April 12, 2017

We have been discussing inventory with techniques for organization. These techniques were based on storage. However, when business needs change, the organization may have to classify its resources to meet different expectations. If the classifications were all storage based, there would be far too many changes to the storage, since business requirements change very often. On the other hand, many requirements don't translate to assets; they relate better to tiers. Software engineers use an interesting technique of separating the requirements from the resource organization. The former, called groups, is dynamic and is represented by rules and policies, while the latter, called pools, is more static and remains part of the system design.
The application is organized so that the static and the dynamic parts are separated, lending more stability while reusing the existing organization in the face of continual change. As an example, the inventory may serve different servicing groups while the assets move only in and out of pools. To the inventory the groups and policies don't matter; the pools can be partitioned or load balanced accordingly, can scale independently of the groups, and can have their resources throttled. By letting groups migrate between pools, we can change the services for the groups without having to change the pools.
Accounting and performance are both improved by this design, and the assignment of groups helps with both. The assignment is based on logic that can encompass many criteria which are not necessarily part of the asset information in the inventory. For example, an individual customer may get better servicing than others. Both the logic and the pool organization become part of the administrator's arsenal, letting her change the user-friendly groups by changing the criteria on which a group is assigned, while maintenance chores and partitioning of resources continue in the backend.
Assigning assets to service a group also enables group throttling, which is particularly useful when groups are not well behaved. The statistics are collected by the pools, which keep track of request rates, and the same determination is used for load balancing. Automatic load balancing can now be based on the pool partitioning approach and on group-based throttling. This improves multi-tenancy in the environment as well as the handling of peaks in traffic patterns. The algorithm can even be adaptive, choosing appropriate metrics to recognize traffic patterns that are better known and have been handled before.
Moreover traffic patterns can now be better analyzed with this organization and changed  on demand with little disruption.
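As a tiny sketch of that separation (with made-up group and pool names), the group-to-pool assignment is the only dynamic piece; migrating a group or throttling it is a rule change, while the pools and their assets stay put.
// Hypothetical names throughout; the assignment rule is the dynamic part, the pools are static.
// (uses System.Collections.Generic)
var poolOf = new Dictionary<string, string>
{
    ["premium-customers"] = "pool-a",
    ["batch-reports"]     = "pool-b",
};
var requestRate = new Dictionary<string, int>();     // per-group statistics kept by the pools

poolOf["batch-reports"] = "pool-a";                  // migrating a group changes a rule, not a pool

// group-based throttling: a misbehaving group is limited without touching the pools
bool Admit(string group, int limit) =>
    !(requestRate.TryGetValue(group, out var rate) && rate >= limit);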
#codingexercise
We were looking at an exercise involving count of all numbers formed by subsequences in  the digits representing a number. We stated that this is in fact represented by  binomial coefficients. Let's take an example and see the pattern:
a0, a1, a2
can be combined as a0 + a1 + a2  + a0a1 + a1a2 + a0a2 + (a0a1a2 ) resulting in 3C1 + 3C2 + 3C3 = 7 subsequences
Since we know the individual count of increasing length subsequences, we can now easily count different kinds of subsequences. For example, to count odd length subsequences only we do 
// Count the odd-length subsequences of A[start..end]
int GetCount(List<int> A, int start, int end)
{
    if (start > end) return 0;
    int length = end - start + 1;
    int count = 0;
    for (int n = 1; n <= length; n += 2)     // odd length subsequences
        count += GetNChooseKDP((uint)length, (uint)n);
    return count;
}
int GetNChooseKDP(uint n, uint k)
{
if (k == 0 || k == n)
    return 1;
// Pascal's rule: C(n, k) = C(n-1, k-1) + C(n-1, k)
return GetNChooseKDP(n-1, k-1) + GetNChooseKDP(n-1, k);
}
It is also interesting to note that the enumeration of the subsequences corresponding to each binomial coefficient k can be done by selecting the combinations with length corresponding to the binomial coefficient.

Tuesday, April 11, 2017

Today I take a break to discuss naming conventions since this is a frequent consideration. An inventory is a source of great pride. The name for an asset sticks to it  for its lifetime and it is easier to remember and find an asset by name.  This is particularly important when assets look all the same and we want the name to reflect something more than just a serial number to tell them apart.  Names have to be friendly for the customer to use and they should have the consistency that an administrator wants from the types of assets maintained in the inventory.  That is why Information Technology personnel pay great attention to naming the assets in their inventory.  Often they include a prefix in the name of the asset to recognize the location or categorization of the assets. 
It's true that the inventory and database objects have become synonymous, since it is easier to manage the inventory with a table no matter how large it grows. Moreover, several routine operations, including create, update and delete on the assets, become easier to perform on tables. Also, it is easy to add properties of the asset as columns of the inventory table as and when the need arises. Administrators have long used this technique to tag the assets in the inventory. Besides, the identifier of the asset in this table need not just be its name, but that's a longer story. If we were to look at just the names of the assets in the inventory table as the primary means for lookup, wouldn't it be convenient to store and fetch the properties of an asset given its name? Unfortunately, such reads come with the onus of authenticating, authorizing and auditing the access to the inventory.
What if the information about an asset could be available on the asset itself, so that the user and not just the administrator can quickly tell its attributes? For example, the finger protocol was once used to provide the status of a particular computer system. Such information is now accessible easily and securely from APIs over HTTP. Administrators often went a step further and kept this information local to the asset for easy readability. For example, the asset information could be displayed on the desktop wallpaper of Windows or other operating systems. Others keep it in a local file that can be displayed on the terminal when there is no network access.
This provides the convenience of consolidating all readable information about the asset locally in a centralized spot and in a variety of formats such as JSON, XML and even configuration files. To date, many Linux based systems use files to store release, version and configuration information for applications, operating systems and services. These can be queried over ssh or other protocols, and most system center tools allow us to read the information directly from the target so that we can check whether our inventory is out of date as systems get reused or repurposed. The name itself, however, must remain static.
The additional charm of a richer naming system by way of this extended information is that we can now have aliases and multiple names. For example, an administrator can assign an IPv4 as well as an IPv6 name as appropriate. Arguably neither name may be more user friendly than an alias, but this avoids overthinking a single name by overloading it with purpose.
In the end every asset may genuinely get a name that resonates with the user rather than the administrator. Hopefully that name will become just as popular as the robots in the Star Wars movies.

#codingexercise
We were looking at an exercise involving the sum of all numbers formed by subsequences of the digits representing a number, and a similar technique to count the number of repetitions. The same approach counts the number of subsequences as well.
We use the fact that the number of subsequences is the number of combinations, which is given by binomial coefficients. Let's take an example and see the pattern:
a0, a1, a2
can be combined as a0 + a1 + a2 + a0a1 + a1a2 + a0a2 + (a0a1a2), resulting in 7 subsequences,
and this is in fact the sum of the binomial coefficients 3C1 + 3C2 + 3C3.
// Count all subsequences of A[start..end]
int GetCount(List<int> A, int start, int end)
{
    if (start > end) return 0;
    int length = end - start + 1;
    int count = 0;
    for (int n = 1; n <= length; n++)
        count += GetNChooseKDP((uint)length, (uint)n);
    return count;
}
int GetNChooseKDP(uint n, uint k)
{
if (k == 0 || k == n)
    return 1;
return GetNChooseKDP(n-1, k-1) + GetNChooseKDP(n-1,k);
}

Monday, April 10, 2017

We were discussing the choice for learning algorithms, the metrics they use and their fine tuning with feedback from the sample set. There are a few steps followed in sequence to make the right choices.
The first step is collecting the dataset. If an expert is available, then we may know the fields (attributes, features) which are most informative otherwise we try to measure everything in a brute-force method in order to shortlist the attribute, feature pair.
The second step is the data preparation and the data preprocessing. Depending on the circumstances, researchers have a number of methods to choose from to handle missing data. These methods not only handle noise but also cope with large data by working on smaller datasets.
Then the features are pruned: feature subset selection is used to identify and remove as many irrelevant and redundant features as possible. Sometimes features depend on one another and unduly influence the classifier; this is addressed by constructing new features from the basic feature set.
Then the training set is defined. One rule of thumb is to split the data by using two thirds for training and the other third for estimating performance. Another technique is cross validation, where the dataset is divided into mutually exclusive and equal-sized subsets and the classifier is trained on the union of all but one of them. Even the choice of distance metric within a classifier is fine tuned with feedback from the training set, although some algorithms try not to be sensitive to the choice of similarity function. We discussed the Euclidean, Minkowski, Manhattan, Chebyshev and Canberra distances.
#codingexercise
We were looking at an exercise involving the sum of all numbers formed by subsequences of the digits representing a number; a similar technique is also suitable for counting the number of repetitions.
We use the fact that the number of repetitions is given by combinations: a digit appears in as many subsequences as the number of combinations in which it is chosen, so the count of repetitions is a sum of binomial coefficients. Let's take an example and see the pattern:
a0, a1, a2
can be combined as a0 + a1 + a2 + a0a1 + a1a2 + a0a2 + (a0a1a2), resulting in 1 + 2 + 1 = 4 repetitions of each digit.
For a0, a1, a2, a3 we have 1 + 3 + 3 + 1 = 8 repetitions,
and these are in fact the sums of the binomial coefficients.
// Count how many times each digit is repeated across all subsequences of A[start..end]
int GetRepetitionCount(List<int> A, int start, int end)
{
    if (start > end) return 0;
    int length = end - start + 1;
    int count = 0;
    for (int n = 1; n <= length; n++)
        count += GetNChooseKDP((uint)(length - 1), (uint)(n - 1));
    return count;                      // equals 2^(length-1)
}
int GetNChooseKDP(uint n, uint k)
{
if (k == 0 || k == n)
    return 1;
return GetNChooseKDP(n-1, k-1) + GetNChooseKDP(n-1,k);
}

If we were to count the number of subsequences, it would be the sum of the binomial coefficients for choosing k out of n, where k is the length of the subsequence and varies from 1 to the length of the string.
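For example, with n = 3 the sum is 3C1 + 3C2 + 3C3 = 3 + 3 + 1 = 7, and in general it comes to 2^n - 1.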

Saturday, April 8, 2017

We were discussing the content based techniques in document filtering for indexing. The technique that is chosen for the learning is only one of the considerations in the overall design. In order to come up with the right classifier, we need to consider the following steps in sequence:
The first step is collecting the dataset. If an expert is available, then we may know the fields (attributes, features) which are most informative otherwise we try to measure everything in a brute-force method in order to shortlist the attribute, feature pair
The second step is the data preparation and the data preprocessing. Depending on the circumstances, researchers have a number of methods to choose from to handle missing data. These methods not only handle noise but also cope with large data by working on smaller datasets.
Then the features are pruned: feature subset selection is used to identify and remove as many irrelevant and redundant features as possible. Sometimes features depend on one another and unduly influence the classifier; this is addressed by constructing new features from the basic feature set.
Then the training set is defined. One rule of thumb is to split the data by using two thirds for training and the other third for estimating performance. Another technique is cross validation, where the dataset is divided into mutually exclusive and equal-sized subsets and the classifier is trained on the union of all but one of them. Generally we have only one test dataset of size N and all estimates must be obtained from this sole dataset. A common method for comparing supervised ML algorithms is to compare the accuracies of the classifiers statistically. Algorithm selection depends on the feedback from this evaluation, and the feedback from the evaluation of the test set is also used to improve each of the steps described above.
Even the choice of distance metric within a classifier is fine tuned with feedback from the training set, although some algorithms try not to be sensitive to the choice of similarity function. Euclidean distance is the square root of the sum of squared differences between the two instances; Minkowski replaces the power of two with a power of r, where r is the order of the norm. If the power is 1, the distance is Manhattan. Chebyshev goes by the maximum attribute-wise difference between the two instances. Canberra sums, attribute by attribute, the absolute difference divided by the sum of the absolute values. Kendall's rank correlation takes the difference between the number of concordant pairs and the number of discordant pairs, as found from the signs of the differences, as a fraction of the total number of pair combinations. If the instances are distributed with high cohesion, something simpler will suffice; if the marginals need to be included, we may need something more sophisticated.
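A compact sketch of these distances for two equal-length feature vectors follows; r is the order of the Minkowski norm, and the helper names are just for illustration.
// Distance measures over two feature vectors a and b of the same length
// (uses System and System.Linq)
double Minkowski(double[] a, double[] b, double r) =>
    Math.Pow(a.Zip(b, (x, y) => Math.Pow(Math.Abs(x - y), r)).Sum(), 1.0 / r);
double Euclidean(double[] a, double[] b) => Minkowski(a, b, 2);              // r = 2
double Manhattan(double[] a, double[] b) => Minkowski(a, b, 1);              // r = 1
double Chebyshev(double[] a, double[] b) =>
    a.Zip(b, (x, y) => Math.Abs(x - y)).Max();                               // the limit as r grows
double Canberra(double[] a, double[] b) =>
    a.Zip(b, (x, y) => x == 0 && y == 0 ? 0 : Math.Abs(x - y) / (Math.Abs(x) + Math.Abs(y))).Sum();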
#codingexercise
We were looking at an exercise involving the sum of all numbers formed by subsequences of the digits representing a number.
We did this by enumerating the different numbers, but we can also put to use the fact that the count of such subsequences is captured by repetitions: a digit appears in as many subsequences as the number of combinations in which it is chosen, so the sum is the accumulation of the repetition count times the digit value.
Let's take an example and see the pattern:
a0, a1, a2
can be combined as a0 + a1 + a2 + a0a1 + a1a2 + a0a2 + (a0a1a2) = 4(a0 + a1 + a2), where 4 = 1 + 2 + 1.
For a0, a1, a2, a3 we have 8(a0 + a1 + a2 + a3), and 8 = 1 + 3 + 3 + 1,
and these are in fact the sums of the binomial coefficients.
// Sum over all subsequences of A[start..end]: each element appears in 2^(length-1) of them
int GetSumOfAllSubsequences(List<int> A, int start, int end)
{
    if (start > end) return 0;
    if (start == end) return A[start];
    int length = end - start + 1;
    int sum = 0;
    for (int i = start; i <= end; i++) sum += A[i];
    int sumofallsubsequences = 0;
    for (int n = 1; n <= length; n++)
        sumofallsubsequences += sum * GetNChooseKDP((uint)(length - 1), (uint)(n - 1));
    return sumofallsubsequences;
}
Alternatively, we could also calculate this directly as A.Sum() * Math.Pow(2, A.Count - 1).
int GetNChooseKDP(uint n, uint k)
{
if (k == 0 || k == n)
    return 1;
return GetNChooseKDP(n-1, k-1) + GetNChooseKDP(n-1,k);
}