Sunday, May 14, 2017

We were discussing the MicrosoftML rxFastLinear algorithm, a fast linear model trainer based on the Stochastic Dual Coordinate Ascent (SDCA) method. It combines the capabilities of logistic regression and SVM algorithms. The algorithm implements an optimization associated with the regularized loss minimization of linear predictors. The usual approach for solving SVM is stochastic gradient descent; this method instead uses dual coordinate ascent. Here the objective is framed as a maximization in terms of a dual variable associated with each example in the training set. In each iteration, a single dual variable is optimized while the rest are kept intact. At the optimal solution the primal and dual objectives coincide, so the duality gap helps us put an upper bound on the sub-optimality. The method is stochastic, and at each round of the iteration a dual coordinate is chosen at random. The duality gap is described in the Shalev-Shwartz and Zhang paper. The generic optimization problem was originally stated as a sum of two components, where the first component is the average of the scalar convex loss functions applied to the instance dot linear-predictor products and the second component is the regularization of the predictor. The dual problem replaces the predictor with one dual coordinate per instance, and the dual objective is optimized with respect to a single dual variable while the rest of the variables are kept intact. The regularization component is then expressed through a predictor computed from the dual coordinates; we call this the computed predictor. The optimal solution of the dual problem also yields the optimal solution of the generic problem. At each such computed predictor, the generic objective value must be greater than or equal to the dual objective at the corresponding dual coordinates. This difference is the duality gap, and because the generic objective value at the computed predictor is at least the generic objective value at the optimal solution, the duality gap is an upper bound on the sub-optimality.
The paper provides a rigorous analysis of the convergence of the duality gap for Stochastic Dual Coordinate Ascent (SDCA). Their treatment of the randomization improves upon the analysis performed earlier, and they derived the rate of convergence of SDCA for two types of loss functions: Lipschitz losses and smooth losses.
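To make the two components concrete, here is a minimal sketch of the standard SDCA formulation (the notation below is assumed, following the usual statement of the problem rather than quoting the paper): the generic (primal) objective P(w) averages the scalar convex losses over the instances and adds an L2 regularizer, the dual objective D(alpha) is written in terms of the convex conjugates of those losses, and the predictor computed from the dual variables links the two.

    P(w) = \frac{1}{n} \sum_{i=1}^{n} \phi_i(w^\top x_i) + \frac{\lambda}{2} \|w\|^2
    D(\alpha) = \frac{1}{n} \sum_{i=1}^{n} -\phi_i^*(-\alpha_i) - \frac{\lambda}{2} \Big\| \frac{1}{\lambda n} \sum_{i=1}^{n} \alpha_i x_i \Big\|^2
    w(\alpha) = \frac{1}{\lambda n} \sum_{i=1}^{n} \alpha_i x_i
    P(w(\alpha)) - D(\alpha) \ge P(w(\alpha)) - P(w^*) \ge 0

The leftmost quantity is the duality gap; since it upper-bounds the sub-optimality in the middle, driving the gap below a tolerance certifies that the computed predictor w(alpha) is within that tolerance of the optimum.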

Saturday, May 13, 2017

We were discussing the MicrosoftML rxFastLinear algorithm, a fast linear model trainer based on the Stochastic Dual Coordinate Ascent method. It combines the capabilities of logistic regression and SVM algorithms. The algorithm implements an optimization associated with the regularized loss minimization of linear predictors. The usual approach for solving SVM is stochastic gradient descent. The advantage of this method is that its cost per step does not depend on the size of the training set, so it works very well for large training sets. It reaches an approximate solution pretty quickly, but it lacks a definite stopping criterion and converges slowly when high accuracy is required, which is true for many gradient descent methods. An alternative approach is dual coordinate ascent. Here the objective is framed as a maximization in terms of a dual variable associated with each example in the training set. In each iteration, a single dual variable is optimized while the rest are kept intact. At the optimal solution the primal and dual objectives coincide, so the duality gap helps us put an upper bound on the sub-optimality. The method is stochastic, and at each round of the iteration a dual coordinate is chosen at random.
Today we try to understand the duality gap as described by the Shalev-Shwartz and Zhang paper. The generic optimization problem was originally stated as a sum of two components, where the first component is the average of the scalar convex loss functions applied to the instance dot linear-predictor products and the second component is the regularization of the predictor. The dual problem replaces the predictor with one dual coordinate per instance, and the dual objective is optimized with respect to a single dual variable while the rest of the variables are kept intact. The regularization component is then expressed through a predictor computed from the dual coordinates; we call this the computed predictor. The optimal solution of the dual problem also yields the optimal solution of the generic problem. At each such computed predictor, the generic objective value must be greater than or equal to the dual objective at the corresponding dual coordinates. This difference is the duality gap, and because the generic objective value at the computed predictor is at least the generic objective value at the optimal solution, the duality gap is an upper bound on the sub-optimality.

Friday, May 12, 2017

Today we continue to discuss how Machine Learning is not limited to using NoSQL databases or graph databases. We can now run Python in SQL Server using stored procedures or remote compute contexts. The package used in Python for machine learning purposes is the revoscalepy module, which has a subset of the algorithms and compute contexts in RevoScaleR. The R utilities are made available in SQL Server 2017 and Microsoft R Server. These include supported compute contexts such as RxSpark and RxInSQLServer. By making the models and contexts citizens of the database, we can now control who uses them. Microsoft also publishes the MicrosoftML package, which brings speed, performance and scale to handling a large corpus of text data and high-dimensional categorical data in R models. It provides several algorithms. One of these is the rxFastLinear algorithm, a fast linear model trainer based on the Stochastic Dual Coordinate Ascent method. It combines the capabilities of logistic regression and SVM algorithms. The algorithm implements an optimization associated with the regularized loss minimization of linear predictors.
The usual approach for solving SVM is stochastic gradient descent. The advantage of this method is that its cost per step does not depend on the size of the training set, so it works very well for large training sets. It reaches an approximate solution pretty quickly, but it lacks a definite stopping criterion and converges slowly when high accuracy is required, which is true for many gradient descent methods. An alternative approach is dual coordinate ascent. Here the objective is framed as a maximization in terms of a dual variable associated with each example in the training set. In each iteration, a single dual variable is optimized while the rest are kept intact. At the optimal solution the primal and dual objectives coincide, so the duality gap helps us put an upper bound on the sub-optimality.
The method is stochastic and at each round of the iteration, a dual coordinate is chosen at random.
Courtesy: Shalev-Shwartz and Zhang
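As a rough illustration of the procedure described above, the sketch below runs one SDCA epoch for the SVM (hinge-loss) case: it picks a dual coordinate at random, applies the closed-form update for that single dual variable, and keeps the maintained predictor consistent with the duals. This is a minimal sketch under assumed conventions (labels in {-1, +1}, lambda the regularization parameter, alpha and w starting at zero); it is not the rxFastLinear implementation.

        // Minimal SDCA epoch for the hinge loss. Assumptions: x holds dense feature
        // vectors, y holds labels in {-1, +1}, lambda > 0. Requires System (Random, Math).
        static void SdcaEpoch(double[][] x, int[] y, double lambda, double[] alpha, double[] w, Random rng)
        {
            int n = x.Length;
            for (int t = 0; t < n; t++)
            {
                int i = rng.Next(n);                               // choose a dual coordinate at random
                double margin = 0, norm = 0;
                for (int j = 0; j < w.Length; j++) { margin += w[j] * x[i][j]; norm += x[i][j] * x[i][j]; }
                if (norm == 0) continue;
                // closed-form maximization of the dual objective in the single variable alpha[i]
                double candidate = (1 - y[i] * margin) * (lambda * n) / norm + alpha[i] * y[i];
                double delta = y[i] * Math.Max(0, Math.Min(1, candidate)) - alpha[i];
                alpha[i] += delta;
                for (int j = 0; j < w.Length; j++)                 // keep w = (1 / (lambda * n)) * sum(alpha_i * x_i)
                    w[j] += delta * x[i][j] / (lambda * n);
            }
        }

Because the dual objective only ever increases and the duality gap can be measured at any point, the loop has a natural stopping criterion, which is exactly the advantage over plain stochastic gradient descent noted above.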
#programmingexercise
Find the number of unique BSTs that can be formed from a range of values.
One way to enumerate them is to repeatedly call makeBST with different start and end values of the range: choose each number of the range as the splitting point (the root) and recur on the two sub-ranges. If both the left and right sub-ranges are empty, the root alone forms a tree. If the left sub-range is empty, recur on the right sub-range for each of its values, and do the same on the left if the right is empty. Otherwise, combine each tree obtained from the left sub-range with each tree obtained from the right sub-range. A sketch of such a makeBST appears below, followed by the counting version, which uses the same splitting idea.
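Here is a minimal sketch of what such a makeBST could look like (the post names makeBST but does not show it, so the Node class and the list-of-roots return type here are assumptions made for illustration):

        // Hypothetical makeBST: returns the roots of every structurally unique BST
        // whose keys are exactly start..end. Subtrees are shared between results,
        // which is fine for counting or traversing them. Requires System.Collections.Generic.
        class Node { public int val; public Node left; public Node right; public Node(int v) { val = v; } }

        static List<Node> makeBST(int start, int end)
        {
            var result = new List<Node>();
            if (start > end) { result.Add(null); return result; }    // the empty subtree
            for (int i = start; i <= end; i++)                        // choose i as the splitting point
                foreach (var left in makeBST(start, i - 1))
                    foreach (var right in makeBST(i + 1, end))
                        result.Add(new Node(i) { left = left, right = right });
            return result;
        }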
        // Counts the structurally unique BSTs that can hold n keys.
        static int numUniqueBSTs(int n)
        {
            int count = 0;
            if (n == 0) return 1;            // the empty tree
            for (int i = 1; i <= n; i++)     // i is the root; i - 1 keys go left, n - i go right
                count += numUniqueBSTs(i - 1) * numUniqueBSTs(n - i);
            return count;
        }
        This count is the nth Catalan number, which can also be computed directly:
        // Computes n choose k with Pascal's rule; a memoized or iterative version
        // would avoid recomputing the overlapping subproblems.
        static int GetNChooseKDP(int n, int k)
        {
            if (k == 0 || k == n)
                return 1;
            return GetNChooseKDP(n - 1, k - 1) + GetNChooseKDP(n - 1, k);
        }
        // Catalan(n) = C(2n, n) / (n + 1); note the intermediate binomial overflows int for large n.
        static int GetCatalan(int n)
        {
            int c = GetNChooseKDP(2 * n, n);
            return c / (n + 1); // Catalan number
        }

Thursday, May 11, 2017

Today we continue to discuss how Machine Learning is not limited to using NoSQL databases or graph databases. Recently SQL Server announced Machine Learning Services that are supported in the database. We can now run Python in SQL Server using stored procedures or remote compute contexts. The package used in Python for machine learning purposes is the revoscalepy module, which has a subset of the algorithms and compute contexts in RevoScaleR. The R utilities are made available in SQL Server 2017 and Microsoft R Server. These include supported compute contexts such as RxSpark and RxInSQLServer. By making the models and contexts citizens of the database, we can now control who uses them. Also, users can be assigned the right to install their own packages or share packages with other users. Users who belong to these roles can install and uninstall R packages on the SQL Server computer from a remote development client, without having to go through the database administrator each time. Microsoft also publishes the MicrosoftML package, which brings speed, performance and scale to handling a large corpus of text data and high-dimensional categorical data in R models. This package provides machine learning transform pipelines where we can specify the transforms to be applied to our data for featurization before training or testing. The MicrosoftML package provides several algorithms. One of these is the rxFastLinear algorithm, a fast linear model trainer based on the Stochastic Dual Coordinate Ascent method. It supports three types of loss functions - log loss, hinge loss and smoothed hinge loss - and is used in applications such as mortgage default prediction and email spam filtering. It combines the capabilities of logistic regression and SVM algorithms. The algorithm implements an optimization associated with the regularized loss minimization of linear predictors. The minimization problem can be considered an SVM problem by setting the scalar convex function to be the hinge (linear) step, posed as a regularized logistic regression by setting it to the log-exponential step, or considered a regression problem by setting it to the square of the error. The SVM is usually solved with a stochastic gradient descent method, whose cost per step does not depend on the size of the training set and which therefore works well when the training set is large. However, it does not have a criterion that determines exactly when to stop; it tends to be too aggressive at the beginning of the optimization process and becomes rather slow when precision is important, although it does find a reasonable solution pretty fast. An alternative approach is dual coordinate ascent. Here, instead of taking the scalar convex functions, we take their convex conjugates. By restating the objective in terms of a dual variable per example, a definite stopping criterion is established and accuracy is improved.
Courtesy: Shalev-Shwartz and Zhang
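For reference, the three choices of the scalar convex function mentioned above correspond to these standard losses (the definitions are assumed here rather than quoted, with y the label or target and z = w·x the linear prediction):

    \text{hinge (SVM):} \quad \phi(z) = \max(0,\, 1 - yz)
    \text{log loss (logistic regression):} \quad \phi(z) = \log(1 + e^{-yz})
    \text{squared error (regression):} \quad \phi(z) = (z - y)^2

The smoothed hinge loss supported by rxFastLinear is a variant of the hinge loss with its corner rounded off so that its gradient is defined everywhere.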

Wednesday, May 10, 2017

Yesterday we were discussing how Machine Learning is not limited to using NoSQL databases or graph databases. Recently SQL Server announced Machine Learning Services that are supported in the database. We can now run Python in SQL Server using stored procedures or remote compute contexts. The package used in Python for machine learning purposes is the revoscalepy module, which has a subset of the algorithms and compute contexts in RevoScaleR. The R utilities are made available in SQL Server 2017 and Microsoft R Server. These include supported compute contexts such as RxSpark and RxInSQLServer. By making the models and contexts citizens of the database, we can now control who uses them. Also, users can be assigned the right to install their own packages or share packages with other users. Users who belong to these roles can install and uninstall R packages on the SQL Server computer from a remote development client, without having to go through the database administrator each time. Microsoft also publishes the MicrosoftML package, which brings speed, performance and scale to handling a large corpus of text data and high-dimensional categorical data in R models. This package provides machine learning transform pipelines where we can specify the transforms to be applied to our data for featurization before training or testing.
The MicrosoftML package provides fast and scalable machine learning algorithms for classification, regression and anomaly detection.
The rxFastLinear algorithm is a fast linear model trainer based on the Stochastic Dual Coordinate Ascent method. It combines the capabilities of logistic regression and SVM algorithms. The dual problem performs the ascent by maximizing an objective written in terms of the convex conjugates of the scalar loss functions, adjusted by the regularization of the predictor vector. It supports three types of loss functions: log loss, hinge loss and smoothed hinge loss. It is used in applications such as mortgage default prediction and email spam filtering.
The rxOneClassSVM algorithm is used for anomaly detection, such as credit card fraud detection. It is a simple one-class support vector machine that helps detect outliers which do not belong to a target class, because the training set contains only examples from that target class.
The rxFastTrees algorithm is a fast tree algorithm used for binary classification or regression, for example bankruptcy prediction. It is an implementation of FastRank, a form of the MART gradient boosting algorithm. It builds each regression tree in a step-wise fashion using a predefined loss function; the loss function measures the error in the current step so that the next tree can correct it.
The rxFastForest algorithm is a fast forest algorithm also used for binary classification or regression, for example churn prediction. It builds several decision trees using the regression tree learner in rxFastTrees. An aggregation over the resulting trees then finds a Gaussian distribution closest to the combined distribution of all the trees in the model.
The rxNeuralNet algorithm is a neural network implementation that helps with multi-class classification and regression. It is helpful for applications such as signature prediction, OCR and click prediction. A neural network is a weighted directed graph arranged in layers, where the nodes in one layer are connected by weighted edges to the nodes in the next layer. The algorithm adjusts the weights on the graph edges based on the training data.
The rxLogisticRegression algorithm is a binary and multiclass classifier that can, for example, classify sentiments from feedback. It is a regular regression model in which the variable that determines the category depends on one or more independent variables through the logistic function.
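Concretely, with weights w and bias b, the logistic model scores an example x as (a standard formulation, stated here for illustration rather than taken from the MicrosoftML documentation):

    p(y = 1 \mid x) = \frac{1}{1 + e^{-(w^\top x + b)}}

and predicts the positive class when this probability exceeds a chosen threshold such as 0.5; the multiclass case replaces the logistic function with a softmax over per-class scores.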

Tuesday, May 9, 2017

Machine Learning is not limited to using NoSQL databases or graph databases. Recently SQL Server announced Machine Learning Services that are supported in the database. We can now run Python in SQL Server using stored procedures or remote compute contexts. The package used in Python for machine learning purposes is the revoscalepy module. This module has a subset of the algorithms and compute contexts in RevoScaleR. Since data scientists have to experiment with different models over subsets of very large datasets, parallel creation and execution of different models is now enabled using the new rxExecBy function. This function accepts a dataset containing ungrouped and unordered data and then lets us partition it by a single entity for training and model building. The result is the training of multiple models on the appropriate subsets.
The R utilities are made available in SQL Server 2017 and Microsoft R Server. These include supported compute contexts such as RxSpark and RxInSQLServer.
Not just the models and contexts, but also who uses them, can be controlled by managing the permissions associated with the packages. Users can be assigned the right to install their own packages or share packages with other users. Users who belong to these roles can install and uninstall R packages on the SQL Server computer from a remote development client, without having to go through the database administrator each time. The RevoScaleR package can also be upgraded from a previous SQL Server version. The upgrade is done by merely switching the server to the Modern Lifecycle policy, which takes advantage of the faster release cycle for R and automatically upgrades all R components. A specific R package is also published by Microsoft. This is called the MicrosoftML package and it brings speed, performance and scale to handling a large corpus of text data and high-dimensional categorical data in R models. It also includes five fast, highly accurate learners that are included in Azure Machine Learning. MicrosoftML now also includes new image and text featurization functions as well as support for predictable models with rxExecBy. This package provides machine learning transform pipelines where we can specify the transforms to be applied to our data for featurization before training or testing. These include concat(), categoricalHash(), categorical(), selectFeatures(), featurizeText(), featurizeImage() and getSentiment(). concat() creates a single vector-valued column from multiple columns. categoricalHash() converts a categorical value into an indicator array using hashing, while categorical() does the same using a dictionary. selectFeatures() selects features from the specified variables using one of two modes: count or mutual information. featurizeText() produces a bag of counts of n-grams from a given text after performing language detection, tokenization, stop-word removal, text normalization, feature generation and term weighting. featurizeImage() featurizes an image using the specified pre-trained deep neural network model. getSentiment() returns a sentiment score for the specified natural language text, without the need for any further text processing; a value closer to 0 indicates a negative sentiment, while a value closer to 1 indicates a positive sentiment.
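As a rough illustration of the hashing idea behind categoricalHash() (a sketch of the general technique, not the MicrosoftML implementation; the bucket count and hash function below are arbitrary choices made for the example):

        // Minimal feature-hashing sketch: map a categorical string value to an
        // indicator vector of fixed size without first building a dictionary.
        static double[] HashCategorical(string value, int buckets = 16)
        {
            var indicator = new double[buckets];
            uint h = 2166136261;                      // FNV-1a style hash, used here only for illustration
            foreach (char c in value)
                h = (h ^ c) * 16777619u;              // mix each character into the hash
            indicator[h % (uint)buckets] = 1.0;       // light up the single hashed slot
            return indicator;
        }

Collisions are possible, but with enough buckets they rarely hurt the downstream learner, and unlike the dictionary-based categorical() approach no pass over the data is needed to build a vocabulary first.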
#codingexercise
Find the minimum number of moves in a Snake and Ladder game:
Consider each cell to be a vertex of a graph, where each cell connects to the next six cells based on the roll of a die. We find the minimum number of moves using a Breadth-First Search.
For every cell we keep the known destination cell if it holds a ladder or a snake, or a default value of -1 if neither is present.
int GetMinMoves(List<int> move, int n)
{
    // move[i] is the destination cell when cell i has a snake or ladder, else -1.
    // Returns the minimum number of dice rolls needed to go from cell 0 to cell n - 1.
    var visited = new bool[n];
    var distance = new int[n];
    var queue = new Queue<int>();
    visited[0] = true;
    queue.Enqueue(0);
    while (queue.Count > 0)
    {
        int cell = queue.Dequeue();
        if (cell == n - 1) break;                        // reached the last cell
        for (int roll = 1; roll <= 6 && cell + roll < n; roll++)
        {
            int next = cell + roll;
            if (move[next] != -1) next = move[next];     // follow the snake or the ladder
            if (!visited[next])
            {
                visited[next] = true;
                distance[next] = distance[cell] + 1;     // one more dice roll
                queue.Enqueue(next);
            }
        }
    }
    return distance[n - 1];
}

Monday, May 8, 2017

Building a machine learning platform - some inspirations and rules of thumb
Enterprises are waking up to the need to drive machine learning into their systems. This is made clear by the steeply increasing investments year after year.
Platforms and tools are continually being enhanced to serve a wide variety of applications. Many organizations now expect dedicated ML engineers or teams, just as they exist for other aspects of data science.
However, the applications and use cases should continue to drive these investments in any organization. For example, traditional applications gain improved capabilities with ML scoring and recommendations. ML may even be used where automation is the only way to sift through large data sets.
It's true that most of the ML platforms come from companies that have public clouds or operate at global social-networking scale. They make their platforms open source, widely available and well maintained, which is enticing to application developers. Moreover, these platforms serve accepted versions of popular algorithms. But this has also been a limitation, because developers first try to fit the platform to their application and then enter a consultancy with different teams to satisfy their requirements. If instead work proceeded on a use-case-by-use-case basis, the platform would have evolved from the ground up with rapid delivery for each use case.
When the development efforts are spearheaded by the use case, it becomes clearer how to separate the content efforts from the machine learning algorithm efforts. Although the latter may be referred to as a platform, the emphasis here is not so much on system design or production environment versus lab experimentation, but more on separating the code from the pipeline. There is a universally accepted delay in getting a working solution off the ground with the existing data in the initial phase, but that is arguably the best time to reduce the debt going forward by learning what works. Developers generally like to use the term service for an area of computation such as predictive modeling, and build an integration layer to handle the interaction of consumers with the service. The integration layer is not just about marshaling data from one data source to the service but also about the end-to-end experience. This is served by a pipeline of data, so called because it is a forward-only movement of data from the originating source through candidate generation, feature extraction, scoring and post-processing. The pipeline brings consideration to managing the data in batches or streams where necessary, performing local calculations and graceful culmination. Most people prefer to use big data sources or graph databases as the repository of choice for ML processing, but it is not necessarily so restricted; relational databases and data mining can be just as helpful as the use cases may require. Each stage of the pipeline may have its own considerations or systems. For example, feature extraction may require vectors to be created from the data so that the models can then be built on top of the features. The machine learning platform can have two different paths for data. The first is the model-building path, which is usually taken from the staging data store, whether it is a graph database or an S3 store. The second is the prediction and analysis path, which can work either directly from the staged data or from the models built from it. Once the data is classified or predicted, it makes its way through an upload and a merge with the originating data for the next wave of download to staging, model building and prediction, in repetition cycles on a scheduled basis. Data increases with time and some of the stages in the pipeline are dependent on others, and this is just for one model, which means the data may increase by orders of magnitude for other models. Consequently, the data pipeline is often visualized as a complicated directed graph. Depending on the size of the data, many engineers prefer to use distributed computing such as clusters and containers, along with the choice of debugging tools, monitoring and alerting, and data transfer technologies, as well as tools like Spark, sklearn and Turi.
Testing the pipeline and the integration system is generally focused on relevance, quality and demand. Each of these focus areas comes not only with its own metrics but also with instruments for tuning the system for satisfactory performance.
Courtesy: Preethi Rathi for the discussion of business impact, and the book Programming Collective Intelligence.
#codingexercise
There are n trees in a circle. Each tree has a fruit value associated with it. A bird can sit on a tree for 0.5 seconds, after which it has to move to a neighbouring tree; it also takes the bird 0.5 seconds to move from one tree to another. The bird gets the fruit value when it sits on a tree. We are given the number of seconds the bird has, its current position, and V, the fruit values of the trees. We have to maximise the total fruit value that the bird can gather. The bird can start from any tree.
// n is the number of seconds remaining; m is the index of the current tree
// (indexing wraps around the circle). To let the bird start anywhere, call this
// for each starting index and take the maximum.
int GetFruitValue(List<int> V, double n, int m)
{
    if (n <= 0) return 0;
    // either sit on this tree (0.5s) and then move (0.5s), or just move on (0.5s)
    return Math.Max(V[m % V.Count] + GetFruitValue(V, n - 1, m + 1),
                    GetFruitValue(V, n - 0.5, m + 1));
}