Thursday, May 11, 2017

Today we continue our discussion: machine learning is not limited to NoSQL databases or graph databases. SQL Server recently announced Machine Learning Services, which run in-database. We can now run Python in SQL Server using stored procedures or remote compute contexts. The Python package used for machine learning purposes is the revoscalepy module. This module provides a subset of the algorithms and compute contexts in RevoScaleR, the R utilities made available in SQL Server 2017 and Microsoft R Server. The supported compute contexts include RxSpark and RxInSqlServer. (Minimal sketches of these ideas appear at the end of this post.)

By making the models and contexts citizens of the database, we can now control who uses them. Users can also be assigned roles that grant the right to install their own packages or to share packages with other users. Users who belong to these roles can install and uninstall R packages on the SQL Server computer from a remote development client, without having to go through the database administrator each time.

Microsoft also publishes a package called MicrosoftML, which brings speed, performance, and scale to handling a large corpus of text data and high-dimensional categorical data in R models. The package provides machine learning transform pipelines, in which we specify the transforms to apply to our data for featurization before training or testing. It also provides several algorithms. One of these is rxFastLinear, a fast linear model trainer based on the Stochastic Dual Coordinate Ascent (SDCA) method. It supports three types of loss functions: log loss, hinge loss, and smoothed hinge loss. It combines the capabilities of logistic regression and SVM algorithms, and it is used in applications such as mortgage default prediction and email spam filtering.

The algorithm implements the optimization associated with regularized loss minimization over linear predictors. The minimization problem becomes an SVM problem when the scalar convex loss function is the hinge loss; it is posed as a regularized logistic regression when the scalar convex function is the log loss; and it becomes a regression problem when the scalar convex function is the squared error. The SVM formulation can be solved with stochastic gradient descent (SGD), whose per-iteration cost does not depend on the number of training examples, so it works well when that number is large. However, SGD has no exact stopping criterion: it tends to be too aggressive at the beginning of the optimization and becomes rather slow when precision is important, though it does find a reasonable solution pretty fast. An alternative approach is dual coordinate ascent. Here, instead of working with the scalar convex loss functions directly, we take their convex conjugates; by restating the objective in terms of dual variables, a definite stopping criterion is established and accuracy is improved.
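To make the remote compute context concrete, here is a minimal sketch of training a linear model in-database from a Python client with revoscalepy. The connection string, the Loans table, and its Balance, Income, and Rate columns are hypothetical placeholders I am assuming for illustration:

from revoscalepy import (RxInSqlServer, RxSqlServerData,
                         rx_set_compute_context, rx_lin_mod)

# Hypothetical connection string; adjust for a real server.
conn_str = ("Driver=SQL Server;Server=MYSERVER;"
            "Database=MLDemo;Trusted_Connection=True;")

# Point the compute context at SQL Server so the model trains
# next to the data instead of pulling rows to the client.
cc = RxInSqlServer(connection_string=conn_str)
rx_set_compute_context(cc)

# Data source is a table on the server; 'Loans' is an assumed schema.
loans = RxSqlServerData(table="Loans", connection_string=conn_str)

# Train in-database; only the fitted model object comes back.
model = rx_lin_mod("Balance ~ Income + Rate", data=loans)
print(model)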
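Similarly, here is a sketch of what calling this trainer might look like through the microsoftml Python package (rx_fast_linear); the tiny spam data frame is fabricated purely for illustration, and real email features would come from the featurization transforms mentioned above:

import pandas as pd
from microsoftml import rx_fast_linear, rx_predict, hinge_loss

# Fabricated toy data standing in for featurized emails.
train = pd.DataFrame({
    "is_spam": [True, False, True, False, True, False],
    "exclamations": [5.0, 0.0, 7.0, 1.0, 4.0, 0.0],
    "length": [120.0, 800.0, 90.0, 650.0, 60.0, 720.0]})

# loss_function selects among log loss (the default), hinge loss,
# and smoothed hinge loss; hinge loss yields the SVM-style problem.
model = rx_fast_linear("is_spam ~ exclamations + length",
                       data=train,
                       method="binary",
                       loss_function=hinge_loss())

scores = rx_predict(model, data=train)
print(scores.head())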
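Concretely, in the notation of Shalev-Shwartz and Zhang, with scalar convex loss \phi_i on example x_i and its convex conjugate \phi_i^*, the primal objective and its dual are:

P(w) = \frac{1}{n} \sum_{i=1}^{n} \phi_i(x_i^\top w) + \frac{\lambda}{2} \lVert w \rVert^2

D(\alpha) = \frac{1}{n} \sum_{i=1}^{n} -\phi_i^*(-\alpha_i) - \frac{\lambda}{2} \Big\lVert \frac{1}{\lambda n} \sum_{i=1}^{n} \alpha_i x_i \Big\rVert^2

SDCA ascends D one coordinate \alpha_i at a time, and the duality gap P(w(\alpha)) - D(\alpha), with w(\alpha) = \frac{1}{\lambda n} \sum_i \alpha_i x_i, bounds the distance from optimality; this is the definite stopping criterion referred to above.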
Courtesy: Shalev-Shwartz and Zhang
