Friday, March 31, 2017

Yesterday we were discussing a survey of statistical techniques in document filtering for indexing. The concept of excluding sensitive documents from indexing is similar to filtering spam from email. These techniques use conditional probabilities and Bayes' theorem. Naïve Bayes exhibits good performance on a small feature set, but the same cannot be said for larger feature sets. Supervised statistical filtering algorithms therefore proceed in two stages: a training stage, in which a set of labeled documents provides the training data, and a classification stage, in which the prediction is made. When the dimensionality of the feature vector grows, features are pruned or selected, which has the side benefit of improving predictions in some cases. Feature selection is usually effective enough to reduce the feature set to orders of magnitude smaller than the original. Feature selection techniques include Document Frequency, Information Gain, and Chi-Square.
It is important to note the difference between spam filtering and document indexing. It is much more severe to misclassify a legitimate mail as spam, so that it goes missing from the inbox, than to allow a spam message to pass the filter and show up in the inbox; the missed mail may have been a personal one, and not letting it through could be a disaster. Filtering for documents, on the other hand, is focused on not letting a sensitive document through to indexing, where it may find its way into public search results. It is far less severe to leave a document unindexed, in which case it can still be found from the file explorer.
Performance of classifiers is measured in terms of precision, recall and F-score. Let S and L stand for sensitive and legitimate documents. NLL and NSS are the numbers of documents correctly classified by the system in the respective categories. The incorrect classifications are denoted by NLS and NSL for legitimate documents classified as sensitive and sensitive documents classified as legitimate, respectively. Precision is then the ratio of successful sensitive classifications to all classifications that result in a sensitive label. Recall is the ratio of successful sensitive classifications to the actual number of sensitive documents. The F-score ranks the classifier with precision and recall taken together, as twice their product divided by their sum. A better classifier has both high precision and high recall, but in this setting the emphasis is on high recall rather than high precision.
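Written out with this notation, precision = NSS / (NSS + NLS), recall = NSS / (NSS + NSL), and F-score = 2 x precision x recall / (precision + recall).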
#codingexercise
Find the largest contiguous subarray sum:
We can solve this iteratively with Kadane's algorithm:
int GetMaxSubArraySum(List<int> A)
{
    int max_so_far = int.MinValue, max_ending_here = 0;

    for (int i = 0; i < A.Count; i++)
    {
        max_ending_here = max_ending_here + A[i];
        if (max_so_far < max_ending_here)
            max_so_far = max_ending_here;

        // a negative running sum can never help a later subarray, so reset it
        if (max_ending_here < 0)
            max_ending_here = 0;
    }
    return max_so_far;
}
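For example, for A = {-2, 1, -3, 4, -1, 2, 1, -5, 4} the method returns 6, the sum of the subarray {4, -1, 2, 1}.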

Thursday, March 30, 2017

A survey of statistical techniques in document filtering for indexing
The concept of excluding sensitive documents from indexing is similar to filtering spam from email. Consequently, the statistical techniques that apply to spam filtering also apply to document filtering. I discuss these techniques here:

The task of index filtering is to rule out sensitive documents automatically from a user's document stream. The statistical filtering techniques compute the probability that a document containing a given word, such as an SSN, is not indexable. The determination proceeds by calculating a conditional probability from the probability that the word appears in the document, the overall probability that any given document is unindexable, the probability that the word appears in sensitive documents, the probability that the given document is indexable, and the probability that the word appears in indexable documents. The Bayes probability is therefore computed as a conditional probability.
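Written out with Bayes' theorem, using the probabilities listed above, for a word w this is: P(unindexable | w) = P(w | unindexable) x P(unindexable) / ( P(w | unindexable) x P(unindexable) + P(w | indexable) x P(indexable) ).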
Naïve Bayes exhibits good performance on a small feature set but the same cannot be said for larger feature sets. Filtering for documents is formally described as follows: given a document set D = {d1, d2, ..., dj, ..., dD} and a category set C = {c1 = sensitive, c2 = legitimate}, where dj is the jth document in D and C is the set of possible labels, the task of Automated Document Filtering is to build a boolean categorization function Phi(dj, ci) on D x C -> {True, False}. When Phi(dj, ci) is True, document dj belongs to category ci; when Phi(dj, ci) is False, dj does not belong to ci. Each document can be assigned to only one of the categories, not both. The supervised statistical filtering algorithms therefore proceed in two stages: the training stage and the classification stage. During training, a set of labeled documents provides the training data. During classification, the prediction is made.
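A minimal sketch of those two stages, assuming a bag-of-words Naïve Bayes with Laplace smoothing; the class names, fields and smoothing constant are illustrative choices, not taken from the survey:

using System;
using System.Collections.Generic;
using System.Linq;

class NaiveBayesFilter
{
    private readonly Dictionary<string, int> sensitiveCounts = new Dictionary<string, int>();
    private readonly Dictionary<string, int> legitimateCounts = new Dictionary<string, int>();
    private int sensitiveDocs = 0, legitimateDocs = 0;
    private int sensitiveWords = 0, legitimateWords = 0;

    // training stage: accumulate word counts from labeled documents
    public void Train(IEnumerable<string> words, bool isSensitive)
    {
        var counts = isSensitive ? sensitiveCounts : legitimateCounts;
        foreach (var w in words)
        {
            counts[w] = counts.ContainsKey(w) ? counts[w] + 1 : 1;
            if (isSensitive) sensitiveWords++; else legitimateWords++;
        }
        if (isSensitive) sensitiveDocs++; else legitimateDocs++;
    }

    // classification stage: compare the log-posteriors of the two categories
    // (assumes both categories have at least one training document)
    public bool IsSensitive(IEnumerable<string> words)
    {
        double logSensitive = Math.Log((double)sensitiveDocs / (sensitiveDocs + legitimateDocs));
        double logLegitimate = Math.Log((double)legitimateDocs / (sensitiveDocs + legitimateDocs));
        int vocabulary = sensitiveCounts.Keys.Union(legitimateCounts.Keys).Count();
        foreach (var w in words)
        {
            int cs = sensitiveCounts.ContainsKey(w) ? sensitiveCounts[w] : 0;
            int cl = legitimateCounts.ContainsKey(w) ? legitimateCounts[w] : 0;
            // Laplace smoothing so an unseen word does not zero out the probability
            logSensitive += Math.Log((cs + 1.0) / (sensitiveWords + vocabulary));
            logLegitimate += Math.Log((cl + 1.0) / (legitimateWords + vocabulary));
        }
        return logSensitive > logLegitimate;
    }
}

Training only accumulates counts; classification then compares the two log-posteriors, which mirrors the conditional-probability computation described above.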

For a classifier, the vector space model helped define the feature space, using tens of thousands of features. Loosely speaking, this is similar to treating a document as a bag of words, where the words are features: a finite but large number of features is selected, and their weights in the document form the vector. It turned out that the large number of features did more harm than good. Therefore features were "pruned" or "selected", which had the added benefit of improving prediction in some cases. Feature selection was usually effective enough to reduce the feature set to orders of magnitude smaller than the original. Feature selection techniques include Document Frequency, Information Gain, and Chi-Square.
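A small sketch of the simplest of these, Document Frequency selection; the set-of-terms representation and the topN cutoff are illustrative choices rather than anything prescribed by the survey:

using System.Collections.Generic;
using System.Linq;

static List<string> SelectByDocumentFrequency(List<HashSet<string>> documents, int topN)
{
    var df = new Dictionary<string, int>(); // term -> number of documents containing it
    foreach (var doc in documents)
        foreach (var term in doc)
            df[term] = df.ContainsKey(term) ? df[term] + 1 : 1;

    // keep only the topN most frequent terms as the selected features
    return df.OrderByDescending(kv => kv.Value)
             .Take(topN)
             .Select(kv => kv.Key)
             .ToList();
}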
Information Gain is often treated the same as mutual information but can be better translated as average mutual information.
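One standard way to write that average (a common formulation, not quoted from the survey) is IG(t) = sum over categories c and over t' in {t present, t absent} of P(t', c) x log( P(t', c) / ( P(t') x P(c) ) ), i.e., the mutual information between the term and the category averaged over the term's presence and absence.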
Performance of classifiers is measured in terms of precision, recall and F-score.

#codingexercise
We were discussing ways to count distinct subsequences. Here's another approach:

int GetCount(List<int> A)
{
    int n = A.Count;
    var dp = new int[n + 1];    // dp[i]: new distinct subsequences contributed by element i-1
    var sum = new int[n + 1];   // sum[i]: distinct subsequences (including the empty one) of the first i elements
    var last = new Hashtable(); // value -> last index at which it was seen (System.Collections)
    dp[0] = 1;
    sum[0] = 1;
    for (int i = 1; i <= n; i++)
    {
        var total_dp_upto_current = 0;
        var total_dp_upto_repetition = 0;
        for (int j = 0; j <= i - 1; j++)
        {
            total_dp_upto_current += dp[j];
        }
        if (last.Contains(A[i - 1]))
        {
            // discount the subsequences already counted at the previous occurrence
            for (int j = 0; j <= (int)last[A[i - 1]]; j++)
            {
                total_dp_upto_repetition += dp[j];
            }
        }
        dp[i] = total_dp_upto_current - total_dp_upto_repetition;
        sum[i] = sum[i - 1] + dp[i];
        last[A[i - 1]] = i - 1;
    }
    return sum[n];
}
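For example, for A = {1, 2, 1} (the pattern of "ABA") the method returns 7, counting the empty subsequence along with A, B, AA, AB, BA and ABA.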

Wednesday, March 29, 2017

We continue with the enumeration of analytical techniques using tags on datasets.
Cluster mining: There are efficient algorithms to mine cluster centers. Given a table R(A1, A2, ..., An), find a cluster with, say, K center-points that minimizes the average distance, with a constraint on the number of clusters. The average distance is computed from the rows of R to their nearest center-points, and the center-points belong to R. A naive algorithm is exhaustive enumeration: if the domains of the attributes are small, 1) enumerate the cross-products of the domains of the attributes, 2) enumerate the size-K subsets of tuples, and 3) choose the size-K subset with the smallest average distance to R. An alternative is the K-means algorithm, which uses a greedy strategy where each iteration reduces the average distance to R: 1. Select K points at random. 2. Assign the tuples of R to the nearest point. 3. Compute the center of each cluster. 4. Repeat steps 2 and 3 until a local minimum is reached. The advantage of this clustering-based approach is that it is faster than exhaustive enumeration, but it is heuristic, i.e. there is no guarantee of solution quality. A sketch of the K-means loop appears below.

Outliers: Outliers are the rows that are most dissimilar. Given a relation R(A1, A2, ..., An) and a similarity function between rows of R, find the rows in R which are dissimilar to most points in R. The objective is to maximize the dissimilarity function, with a constraint on the number of outliers, or on their significance if given. The choices for similarity measures between rows include distance functions such as Euclidean, Manhattan, string-edit and graph distance, as well as L2 metrics. The choices for the aggregate dissimilarity measure are the distance to the K nearest neighbours, the density of the neighborhood outside the expected range, and the attribute differences with nearby neighbours.

Outlier mining: A naive algorithm computes the dissimilarity measure for each row in R against the rest of R, then chooses the top-K rows, or those with scores outside the expected range. An alternative is clustering-based, where the clusters serve as a summary of the data to reduce computational cost. It involves the following steps: 1. cluster R, e.g. via K-means, 2. compute the distance of each tuple in R to its nearest cluster center, and 3. choose the top-K rows, or those with scores outside the expected range. The advantage of cluster-based outliers is that it is faster, but the disadvantages are the same as with clustering, i.e. it is heuristic (there is no guarantee of solution quality) and it may miss outliers.
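A minimal sketch of that K-means loop, assuming rows are represented as numeric vectors and using Euclidean distance; the iteration cap and the random initialization are illustrative choices:

using System;
using System.Collections.Generic;
using System.Linq;

static double[][] KMeans(List<double[]> rows, int k, int maxIterations = 100)
{
    var rnd = new Random();
    // 1. select K points at random as the initial centers
    var centers = rows.OrderBy(_ => rnd.Next()).Take(k).Select(r => (double[])r.Clone()).ToArray();
    var assignment = new int[rows.Count];
    for (int iter = 0; iter < maxIterations; iter++)
    {
        // 2. assign each tuple to the nearest center
        bool changed = false;
        for (int i = 0; i < rows.Count; i++)
        {
            int best = 0;
            double bestDist = double.MaxValue;
            for (int c = 0; c < k; c++)
            {
                double d = Distance(rows[i], centers[c]);
                if (d < bestDist) { bestDist = d; best = c; }
            }
            if (assignment[i] != best) { assignment[i] = best; changed = true; }
        }
        if (!changed) break; // reached a local minimum

        // 3. recompute the center of each cluster
        for (int c = 0; c < k; c++)
        {
            var members = rows.Where((r, i) => assignment[i] == c).ToList();
            if (members.Count == 0) continue;
            for (int dim = 0; dim < centers[c].Length; dim++)
                centers[c][dim] = members.Average(m => m[dim]);
        }
    }
    return centers;
}

static double Distance(double[] a, double[] b)
{
    double sum = 0;
    for (int i = 0; i < a.Length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
    return Math.Sqrt(sum);
}

Choosing the initial centers at random keeps the sketch simple; any of the usual seeding refinements would slot in at step 1.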

#codingexercise
Yesterday we were looking at the count of distinct subsequences, where we computed the sum of the dp values. We can also do it linearly as shown below:
int GetCountAlt(List<int> A)
{
    int n = A.Count;
    var last = new Hashtable(); // value -> last index at which it was seen (System.Collections)
    var dp = new int[n + 1];    // dp[i]: distinct subsequences (including the empty one) of the first i elements
    dp[0] = 1;
    for (int i = 1; i <= n; i++)
    {
        dp[i] = 2 * dp[i - 1];

        // subtract the subsequences double-counted because of an earlier occurrence of A[i-1]
        if (last.Contains(A[i - 1]))
            dp[i] = dp[i] - dp[(int)last[A[i - 1]]];

        last[A[i - 1]] = i - 1;
    }
    return dp[n];
}


We can also write this function recursively without memoization, as follows:
/*

// untested: a recursive variant without memoization; it can be exponential in the worst case
int GetCountH(List<int> A, int k)
{
    // count of distinct subsequences (including the empty one) of the first k elements
    if (k == 0) return 1;
    int count = 2 * GetCountH(A, k - 1);

    // find the previous occurrence of A[k-1], if any, and subtract the double-counted subsequences
    int prev = -1;
    for (int j = k - 2; j >= 0; j--)
    {
        if (A[j] == A[k - 1]) { prev = j; break; }
    }
    if (prev != -1)
        count -= GetCountH(A, prev);

    return count;
}
*/

Tuesday, March 28, 2017

Use of annotations in cloud inventory
Most public clouds have a notion of tags or labels to associate with their inventory. For example, as AWS explains, tags enable us to categorize our cloud resources in different ways: by grouping, color coding or action labeling. This is useful when we have many resources of the same type; we can quickly identify a specific resource based on the tags we have assigned to it. Each tag consists of a key and an optional value that we decide, and we can add as many tags as meet our needs for each resource type. Using a consistent set of tag keys makes it easier to manage our resources, and we can search and filter the resources based on the tags we add.
In a private cloud, we can put these tags to much more use:
First, we can use a variety of data mining techniques to expand our analytics.
Second, we can expose a lot more data for each resource that we can then associate with tags. This data can be gathered and saved over time, so we are not limited to what was known at instance creation.
Tags can become dimensions of intent for a possible match against a term used for search. These intentions can then be used as lines of search.
Tags can generate more tags.
This write-up describes some of these techniques.
Let us first describe the tags. Tags don't have any semantic meaning to the functional aspects of the resource and are interpreted strictly as strings of characters. Also, tags are not automatically assigned to our resources. That said, in our new use cases, we may suggest a few tags out of the box based on some common usages. Tags can easily be authored and managed through the console, the command line interface or the API.

We can assign tags only to resources that already exist. If we add a tag that has the same key as an existing tag on that resource, the new value overwrites the old value. We can edit tag keys and values, and we can remove tags from a resource at any time. We can set a tag's value to the empty string, but we can't set a tag's value to null. We can even control who can see these tags. 
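For example, a virtual machine might carry tags such as project=search-index, owner=alice and environment=test; these particular keys and values are only illustrative, since both parts are free-form strings that we choose.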

Now let's describe the new usages:

Data mining: The idea in data mining is to apply a data-driven, inductive and backward technique to identifying a model. We get a model from the learning data and check it against test data, refining the model if there is a mismatch between prediction and reality. This differs from forward, deductive methods, which build the model first, then deduce conclusions and then match them with the data; if there is a mismatch between the model's prediction and reality, the model is then refined.

The conceptual data model involves tables for items (instance-id, item-type-id, attribute1, attribute2, ...), transactions (xid, subset of item-types, customer-id, timestamp) and item-types (item-type-id, type-name, ...).

With this data model, a family of patterns called associations is established. They involve subsets of related item-types, for example a customer buying item-type IT1 also buys item-type IT2. Another example is sequential associations, where customers buying item-type IT1 will later buy item-type IT2.
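As a rough illustration of mining such associations (the representation of a transaction as a set of item-type names is a simplification of the data model above, and the helper is hypothetical):

using System.Collections.Generic;
using System.Linq;

// counts how often two item-types appear together in the same transaction;
// pairs with high counts suggest associations like "customers who buy IT1 also buy IT2"
static Dictionary<(string, string), int> CoOccurrence(List<HashSet<string>> transactions)
{
    var counts = new Dictionary<(string, string), int>();
    foreach (var t in transactions)
    {
        var items = t.OrderBy(x => x).ToList();
        for (int i = 0; i < items.Count; i++)
            for (int j = i + 1; j < items.Count; j++)
            {
                var key = (items[i], items[j]);
                counts[key] = counts.ContainsKey(key) ? counts[key] + 1 : 1;
            }
    }
    return counts;
}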
  
Clustering 

Clustering is a technique for the categorization and segmentation of tuples. Given a relation R(A1, A2, ..., An) and a similarity function between rows of R, find a set of groups of rows in R with the objective that the groups are cohesive and not coupled: the tuples within a group are similar to each other, and the tuples across groups are dissimilar. The constraints are that the number of clusters may be given and that the clusters should be significant. 

The choices for similarity measures include distance functions such as Euclidean, Manhattan, string edits, graph distance and the like, along with L2 metrics. 

The choices for group representations include finding a center, such as the mean, the medoid, or the mean together with the standard deviation, or using a boolean expression on the attributes, or naming subsets of rows. 

In addition to these usages, tags can generate more tags. Background processing and automation can work with tags to generate more tags. For example, a clustering operation on the existing data, using similarity measures on the existing tags, will generate more tags, as sketched below. 
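A hypothetical sketch of that idea, assuming each resource's existing tags have already been encoded as a numeric vector and cluster centers have been computed (for instance with a K-means routine); the "cluster-" tag value is illustrative, not part of any cloud API:

using System.Collections.Generic;

// assign each resource a new tag value naming the cluster whose center is nearest
static Dictionary<string, string> ClusterTags(List<(string id, double[] vector)> resources, double[][] centers)
{
    var newTags = new Dictionary<string, string>(); // resource id -> value for a "cluster" tag
    foreach (var r in resources)
    {
        int nearest = 0;
        double best = double.MaxValue;
        for (int c = 0; c < centers.Length; c++)
        {
            double d = 0; // squared Euclidean distance to this center
            for (int k = 0; k < centers[c].Length; k++)
                d += (r.vector[k] - centers[c][k]) * (r.vector[k] - centers[c][k]);
            if (d < best) { best = d; nearest = c; }
        }
        newTags[r.id] = "cluster-" + nearest;
    }
    return newTags;
}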

Conclusion: These tags come in very useful in expanding the options for analytics. They can be put to better use than the conventional usages and applied over existing resources without any disruption. 
#codingexercise
We were looking at counting the number of increasing subsequences in an earlier post. We try the same for distinct subsequences:

int GetCount(List<int> A)
{
    int n = A.Count;
    var dp = new int[n + 1];    // dp[i]: new distinct subsequences contributed by element i-1
    var sum = new int[n + 1];   // sum[i]: distinct subsequences (including the empty one) of the first i elements
    var last = new Hashtable(); // value -> last index at which it was seen (System.Collections)
    dp[0] = 1;
    sum[0] = 1;
    for (int i = 1; i <= n; i++)
    {
        var total_dp_upto_current = 0;
        var total_dp_upto_repetition = 0;
        for (int j = 0; j <= i - 1; j++)
        {
            total_dp_upto_current += dp[j];
        }
        if (last.Contains(A[i - 1]))
        {
            // discount the subsequences already counted at the previous occurrence
            for (int j = 0; j <= (int)last[A[i - 1]]; j++)
            {
                total_dp_upto_repetition += dp[j];
            }
        }
        dp[i] = total_dp_upto_current - total_dp_upto_repetition;
        sum[i] = sum[i - 1] + dp[i];
        last[A[i - 1]] = i - 1;
    }
    return sum[n];
}
For example, "ABA" has "", A, B, AA, AB, BA and ABA, for a count of 7.

Monday, March 27, 2017

Virtual Machines are a tremendous inventory for cloud providers to manage. Consequently, they use centralized management tools. Such tools are already available from public cloud providers and help manage not just one public cloud but other public clouds and on-premise inventory as well. They are also a means to provide productivity tools to desktop and virtual machine end users. I discuss one such technique below as an extension of the article on indexing-based search in the cloud described here. 
Systems Manager is a management service that helps to automatically collect software inventory, apply OS patches, create system images, and configure Windows and Linux operating systems. These capabilities help to define and track system configurations, prevent drift, and maintain software compliance of EC2 and on-premises configurations. By providing a management approach that is designed for the scale and agility of the cloud but extends into the on-premises data center, Systems Manager is a convenient tool to seamlessly bridge existing infrastructure with public clouds. 
Backup and file indexing activities on the desktop are regular maintenance and productivity-enhancement operations and are therefore well suited for automation via Systems Manager. I believe that desktop search can be included with these automations. This lets the data to be indexed follow the same route as logs and metrics, with the convenience of central management. Personal document indexing as a productivity tool then merely requires issuing queries against a cloud service. 
Running commands and batches on the target is easy with Systems Manager because it uses an agent on the target. Commands issued from Systems Manager go over an encrypted channel to the agent and are executed locally. Individual commands and batches can be executed seamlessly across agents of different flavors in a platform-agnostic manner. 
By hosting the logic for the bulk of the backup and file-indexing operations in cloud services, the Systems Manager and the agents are relieved of strenuous work. There are several advantages to designs based on cloud services and remote administration of targets via Systems Manager. First, there is a formal separation between function and administration. Second, the service can take on many different data sources and callers. Third, a staging XML or JSON cache can be used to avoid significant changes to data anywhere. Fourth, by hosting the service in the cloud or on a cluster from the cloud, the service can be elastic and expand to as many resources as necessary, with automatic rollover, load balancing and differentiation in terms of components. Fifth, cloud-based services can use a variety of data repositories such as Big Data stores, data warehouses and time series databases rather than just a simple but efficient relational database. Finally, access control can be better exercised by the cloud-based services. 
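As a rough illustration of the staging idea, a record like the following could be what an agent serializes to the JSON cache before the cloud indexing service consumes it; the shape and field names are hypothetical, not part of Systems Manager or any particular service:

using System;
using System.Collections.Generic;

// a hypothetical staging record written by the agent and read by the indexing service
class IndexStagingEntry
{
    public string MachineId { get; set; }      // which desktop or VM the document came from
    public string FilePath { get; set; }       // local path of the document
    public DateTime LastModified { get; set; } // used to detect changes between runs
    public List<string> Terms { get; set; }    // extracted terms to be indexed by the service
}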
Systems Manager is thus a viable method for delivering managed services to end users and for transferring data from their virtual machines.
 #codingexercise
We were looking at counting the number of increasing subsequences in an earlier post, by enumerating the combinations of the elements or by maintaining counts of subsequences that end with each element. We try a dynamic programming approach:
// count of increasing subsequences ending at position index
int GetCountDP(List<int> A, int index)
{
    int result = 1; // the single-element subsequence ending at A[index]
    for (int i = 0; i < index; i++)
    {
        if (A[i] < A[index])
            result += GetCountDP(A, i);
    }
    return result;
}
// the overall count is the sum of GetCountDP(A, index) over all positions
or better yet
int GetCount(List<int> A)
{
    int n = A.Count;
    var dp = new int[n]; // dp[i]: count of increasing subsequences ending at position i
    int total = 0;
    for (int i = 0; i < n; i++)
    {
        dp[i] = 1;
        for (int j = 0; j <= i - 1; j++)
        {
            if (A[j] < A[i])
                dp[i] = dp[i] + dp[j];
        }
        total += dp[i];
    }
    return total; // summing over all ending positions gives the total count
}
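For example, for A = {1, 3, 2} the increasing subsequences are {1}, {3}, {2}, {1,3} and {1,2}, so the count is 5.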

The count of distinct subsequences replaces the logical comparison above with a hashtable lookup.