Cluster computing

Thursday, November 2, 2023

Applying rxFastTrees

Applying MicrosoftML rxFastTree algorithm to classify claims:  

Logistic regression is a well-known statistical technique for modeling binary outcomes. It can be applied to detect root causes of payment errors. It is highly flexible, takes many kinds of input, and supports different analytical tasks. It dampens the effect of extreme values and evaluates several factors that affect a pair of outcomes.

Logistic regression differs from other regression techniques in the statistical measures it uses. Linear regression estimates a linear relationship between a dependent and an independent variable and then uses that relationship for prediction. Errors in specific categories tend to form elongated scatter plots. Even when errors in the same category carry different error details, they can be plotted to show correlation. This technique is suitable for specific error categories from an account.

One advantage of logistic regression is that the algorithm is highly flexible, taking any kind of input, and supports several different analytical tasks:  

Use demographics to make predictions about outcomes, such as probability of defaulting payments.  

Explore and weigh the factors that contribute to a result. For example, find the factors that influence customers to make a repeat past due payment.  

Classify claims, payments, or other objects that have many attributes.  

rxFastTrees is a fast tree algorithm used for binary classification or regression. It can be used for payment default prediction and for classifying claims. It is an implementation of FastRank, which is a form of the MART gradient boosting algorithm. It builds each regression tree in a stepwise fashion using a predefined loss function. The loss function measures the error in the current step so that the next step can correct it. The term boosting denotes numerical optimization in function space, by analogy with steepest-descent minimization. When the individual additive components are regression trees, this boosting is termed TreeBoost. Gradient boosting of regression trees is said to produce competitive, highly robust, interpretable procedures for both regression and classification.

When the mapping function is restricted to a parameterized class of functions, it can be represented as a weighted sum of individual functions from that class. This is called additive expansion, and it is very helpful for approximation. With gradient boosting, the constraint is applied to the rough solution by fitting the parameterized function to the "pseudoresponses". This replaces the difficult minimization problem with a least-squares minimization followed by a single optimization based on the original criterion.

The gradient boosting algorithm can be described as:

1. Describe the problem as the minimization of a loss function over a parameterized class of functions.

2. For each iteration m from 1 to M, do:

3. Compute the pseudoresponses as the negative gradient of the loss at each training point i = 1 to N.

4. Find the smoothed negative gradient by fitting the parameterized function to the pseudoresponses with a fitting criterion such as least squares.

5. Perform a line search along this constrained negative gradient, as in steepest descent, and take the step size that leads to the minimum.

6. Update the approximation by taking that step along the line of search.
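
The steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the rxFastTrees implementation: the base learner is a one-split regression stump on a single feature, the loss is absolute error, the starting point is the mean of the targets, and the names fit_stump, line_search, gradient_boost and predict are hypothetical helpers introduced here.

```python
def fit_stump(xs, targets):
    """Least-squares stump on one feature: returns (threshold, left_value, right_value).
    Assumes xs contains at least two distinct values."""
    best = None
    for t in sorted(set(xs))[:-1]:  # candidate split points
        left = [g for x, g in zip(xs, targets) if x <= t]
        right = [g for x, g in zip(xs, targets) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((g - lm) ** 2 for g in left) + sum((g - rm) ** 2 for g in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    return best[1:]

def line_search(fn, hi=1.0, iters=60):
    """Minimize a convex function of the step size rho >= 0:
    grow a bracket by doubling, then shrink it by ternary search."""
    while hi < 1e6 and fn(2 * hi) < fn(hi):
        hi *= 2
    lo, hi = 0.0, 2 * hi
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if fn(m1) <= fn(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2

def gradient_boost(xs, ys, loss, neg_gradient, rounds=10):
    f0 = sum(ys) / len(ys)            # initial constant approximation (an assumption)
    preds = [f0] * len(ys)
    stumps = []
    for _ in range(rounds):           # step 2: iterations m = 1..M
        # step 3: pseudoresponses are the negative gradient at each point
        pseudo = [neg_gradient(y, p) for y, p in zip(ys, preds)]
        # step 4: smooth them with a least-squares fit of the base learner
        t, lv, rv = fit_stump(xs, pseudo)
        h = [lv if x <= t else rv for x in xs]
        # step 5: line search on the original loss along the fitted direction
        rho = line_search(
            lambda r: sum(loss(y, p + r * hv) for y, p, hv in zip(ys, preds, h)))
        # step 6: update the approximation
        preds = [p + rho * hv for p, hv in zip(preds, h)]
        stumps.append((rho, t, lv, rv))
    return f0, stumps

def predict(model, x):
    f0, stumps = model
    return f0 + sum(rho * (lv if x <= t else rv) for rho, t, lv, rv in stumps)

# Absolute-error loss and its negative gradient (the sign of the residual).
abs_loss = lambda y, p: abs(y - p)
abs_neg_gradient = lambda y, p: (y > p) - (y < p)

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1, 1, 1, 1, 5, 5, 5, 5]
model = gradient_boost(xs, ys, abs_loss, abs_neg_gradient)
print(predict(model, 2), predict(model, 7))
```

Passing the loss and its negative gradient as arguments keeps the loop itself independent of the criterion being minimized, which is the point of fitting to pseudoresponses.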

Prediction rates can be boosted, and false positives can be reduced using real-time behavioral profiling as well as historical profiling. Big Data, commodity hardware and historical data going as far back as three years help with accuracy. This enables prediction to be almost as early as when it is committed. True real time processing implies stringent response times.

The algorithm for the least squares regression can be written as:   

1. Set the initial approximation    

2. For a set of successive increments or boosts each based on the preceding iterations, do   

3. Calculate the new residuals   

4. Find the line of search by aggregating and minimizing the residuals   

5. Perform the boost along the line of search   

6. Repeat 3,4,5 for each of 2.  

Conjugate gradient descent can be described, given an input matrix A, a vector b, a starting value x, a maximum number of iterations i-max, and an error tolerance epsilon < 1, in this way:

set i to 0
set residual to b - A.x
set search-direction to residual
set delta-new to dot-product(residual-transposed, residual)
set delta-0 to delta-new
while i < i-max and delta-new > epsilon^2 * delta-0 do:
    q = A.search-direction
    alpha = delta-new / dot-product(search-direction-transposed, q)
    x = x + alpha.search-direction
    if i is divisible by 50
        residual = b - A.x
    else
        residual = residual - alpha.q
    delta-old = delta-new
    delta-new = dot-product(residual-transposed, residual)
    beta = delta-new / delta-old
    search-direction = residual + beta.search-direction
    i = i + 1
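
The conjugate gradient steps can be written out as a small runnable sketch. It assumes A is a symmetric positive-definite matrix given as a list of rows, uses only plain Python lists, and the function name conjugate_gradient and the default values for i_max and epsilon are choices made here for illustration.

```python
def conjugate_gradient(A, b, x, i_max=1000, epsilon=1e-10):
    """Approximately solve A x = b for a symmetric positive-definite A."""
    def matvec(M, v):
        return [sum(m * vj for m, vj in zip(row, v)) for row in M]

    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))

    x = list(x)
    r = [bi - axi for bi, axi in zip(b, matvec(A, x))]  # residual
    d = list(r)                                         # search direction
    delta_new = dot(r, r)
    delta_0 = delta_new
    i = 0
    while i < i_max and delta_new > epsilon ** 2 * delta_0:
        q = matvec(A, d)
        alpha = delta_new / dot(d, q)
        x = [xi + alpha * di for xi, di in zip(x, d)]
        if i % 50 == 0:
            # Recompute the exact residual periodically to limit rounding drift.
            r = [bi - axi for bi, axi in zip(b, matvec(A, x))]
        else:
            r = [ri - alpha * qi for ri, qi in zip(r, q)]
        delta_old = delta_new
        delta_new = dot(r, r)
        beta = delta_new / delta_old
        d = [ri + beta * di for ri, di in zip(r, d)]
        i += 1
    return x

A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
solution = conjugate_gradient(A, b, [2.0, 1.0])
print(solution)  # close to [1/11, 7/11]
```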

Sample application:  

#!/usr/bin/env python
import os

import pandas
from sklearn.model_selection import train_test_split
from microsoftml import rx_fast_trees, rx_predict

here = os.path.dirname(__file__) if "__file__" in locals() else "."
data_file = os.path.join(here, "data", "payment_errors", "data.csv")
data = pandas.read_csv(data_file, sep=",")

# y is the last column and the variable we want to predict. It has a boolean value.
data["y"] = data["y"].astype("category")
print(data.head(2))
print(data.shape)

# Encode the label as 1/0 and check the class balance.
data["y"] = data["y"].apply(lambda x: 1 if x == 1 else 0)
print(data[["y", "X1"]].groupby("y").count())

train, test = train_test_split(data)

# Use every column whose name starts with "X" as a feature.
features = [c for c in train.columns if c.startswith("X")]
model = rx_fast_trees("y ~ " + "+".join(features), data=train)

# Score the held-out rows and keep the true label next to the prediction.
pred = rx_predict(model, test, extra_vars_to_write=["y"])
print(pred.head())


Wednesday, November 1, 2023

Applying MicrosoftML rxFastLinear algorithm to Insurance payment default prediction: 

One advantage of logistic regression is that the algorithm is highly flexible, taking any kind of input, and supports several different analytical tasks: 

· Use demographics to make predictions about outcomes, such as probability of defaulting payments. 

· Explore and weigh the factors that contribute to a result. For example, find the factors that influence customers to make a repeat past due payment. 

· Classify claims, payments, or other objects that have many attributes. 

Support vector machines, on the other hand, can detect non-linear and complex patterns with good predictive power. They are sophisticated classifiers that build a predictive model by finding the dividing line between two categories. Many lines may separate the data; the one that lies farthest from the nearest points of both categories is usually chosen as the best. The points closest to the line are the ones that determine it and are called support vectors. Once the line is found, classifying a new point is just a matter of checking which side of the line it falls on.
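
To make the idea of margin and support vectors concrete, here is a toy sketch. It does not train an SVM: it only compares two hand-picked candidate lines on four made-up points, keeps the line with the larger margin, and reports the points that sit closest to it. All data and function names are assumptions for illustration.

```python
import math

def signed_distance(point, w, b):
    """Signed distance from a 2-D point to the line w[0]*x + w[1]*y + b = 0."""
    return (w[0] * point[0] + w[1] * point[1] + b) / math.hypot(w[0], w[1])

def margin(points, labels, w, b):
    """Distance from the line to the nearest point, or None if a point is misclassified."""
    distances = [label * signed_distance(p, w, b) for p, label in zip(points, labels)]
    return min(distances) if min(distances) > 0 else None

def support_vectors(points, labels, w, b):
    """Points that lie exactly at the margin distance from the line."""
    m = margin(points, labels, w, b)
    return [p for p, label in zip(points, labels)
            if math.isclose(label * signed_distance(p, w, b), m)]

points = [(1, 1), (2, 1), (4, 3), (5, 4)]
labels = [-1, -1, 1, 1]

# Two candidate separating lines, each given as (w, b).
candidates = {"x = 3": ((1, 0), -3), "x + y = 5": ((1, 1), -5)}
best = max(candidates, key=lambda name: margin(points, labels, *candidates[name]))

print(best, margin(points, labels, *candidates[best]))
print(support_vectors(points, labels, *candidates[best]))
```

Both candidate lines classify every point correctly, so the comparison comes down to the margin alone.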

The MicrosoftML rxFastLinear algorithm is a fast linear model trainer based on the Stochastic Dual Coordinate Ascent (SDCA) method. It combines the capabilities of logistic regression and SVM algorithms. SDCA works on the dual of the regularized loss minimization problem: it maximizes a dual objective built from scalar convex loss functions and a regularization term on the weight vector. It supports three types of loss functions: log loss, hinge loss, and smoothed hinge loss.
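
The three loss functions can be written down directly. The sketch below uses the textbook forms for a label y in {-1, +1} and a real-valued score; the smoothed hinge follows the form commonly used in the SDCA literature with a smoothing constant gamma. Whether MicrosoftML uses exactly these parameterizations is an assumption.

```python
import math

def log_loss(y, score):
    """Logistic loss log(1 + exp(-y * score)), computed in a numerically stable way."""
    margin = y * score
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

def hinge_loss(y, score):
    """Zero once the margin reaches 1, linear below that."""
    return max(0.0, 1.0 - y * score)

def smoothed_hinge_loss(y, score, gamma=1.0):
    """Hinge loss with the kink at margin 1 replaced by a quadratic piece of width gamma."""
    margin = y * score
    if margin >= 1.0:
        return 0.0
    if margin <= 1.0 - gamma:
        return 1.0 - margin - gamma / 2.0
    return (1.0 - margin) ** 2 / (2.0 * gamma)

for score in (-1.0, 0.5, 2.0):
    print(score, log_loss(1, score), hinge_loss(1, score), smoothed_hinge_loss(1, score))
```

Log loss gives the logistic regression behavior and hinge loss the SVM behavior, which is how a single linear trainer can cover both.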

An application of rxFastLinear algorithm that encapsulates Logistic regression and Support Vector machines for the purpose of payment default prediction would leverage individual oriented scoring instead of broad segment-based scoring of transactions. Default detection rates can be boosted, and false positives can be reduced using real-time behavioral profiling as well as historical profiling. Big Data, commodity hardware and historical data going as far back as three years help with accuracy. This enables payment default detection to be almost as early as when it is committed. True real time processing implies stringent response times.

The algorithm for the least squares regression can be written as:  

1. Set the initial approximation   

2. For a set of successive increments or boosts each based on the preceding iterations, do  

3. Calculate the new residuals  

4. Find the line of search by aggregating and minimizing the residuals  

5. Perform the boost along the line of search  

6. Repeat 3,4,5 for each of 2. 

Conjugate gradient descent can be described, given an input matrix A, a vector b, a starting value x, a maximum number of iterations i-max, and an error tolerance epsilon < 1, in this way:

set i to 0
set residual to b - A.x
set search-direction to residual
set delta-new to dot-product(residual-transposed, residual)
set delta-0 to delta-new
while i < i-max and delta-new > epsilon^2 * delta-0 do:
    q = A.search-direction
    alpha = delta-new / dot-product(search-direction-transposed, q)
    x = x + alpha.search-direction
    if i is divisible by 50
        residual = b - A.x
    else
        residual = residual - alpha.q
    delta-old = delta-new
    delta-new = dot-product(residual-transposed, residual)
    beta = delta-new / delta-old
    search-direction = residual + beta.search-direction
    i = i + 1

Sample application: 

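#!/usr/bin/env python
from microsoftml import rx_fast_linear, rx_predict

# data is assumed to be a pandas DataFrame with a label column "clas"
# and two feature columns "x" and "y".
model = rx_fast_linear("clas ~ x + y", data=data)
pred = rx_predict(model, data, extra_vars_to_write=["x", "y"])
print(pred.head())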

#codingexercise

Print all nodes in a binary tree with k leaves.

// Returns the number of leaves under root and collects every internal node
// whose subtree has exactly k leaves. Leaf nodes themselves are not collected.
int GetNodeWithKLeaves(Node root, int k, ref List<Node> result)
{
    if (root == null) return 0;
    if (root.left == null && root.right == null) return 1;
    int left = GetNodeWithKLeaves(root.left, k, ref result);
    int right = GetNodeWithKLeaves(root.right, k, ref result);
    if (left + right == k)
    {
        result.Add(root);
    }
    return left + right;
}

Tuesday, October 31, 2023

The essence of copywriting:

Copywriting can be considered a content production strategy. In copywriting, however, the goal is to persuade the reader to take a specific action, using triggers to arouse readers’ interest and to generate conversations and sales. Copywriting is also an essential part of a digital marketing strategy, with the potential to increase brand awareness, generate higher-quality leads, and acquire new customers. Good copywriting articulates the brand’s messaging and image while tuning into the target audience.

Kristen Fischer says in her book, “When Talent isn’t enough – business basics for the creatively inclined”, that most creative professionals and scholars can succeed in business on their talent if they know how to create a business blueprint that spells out all their goals. She is a writer and freelance expert herself and recognizes that many creative people are unable to sell their talents as well as they write, paint, draft, design, or program. This is even more important when their business venture is about entrepreneurship, career advancement, or professional growth. Artistic or scholarly capabilities do not suffice. Business know-how is about delivering quality work and superior customer service. A business might be known by a name, a resume, or a website, and dossiers, emails, correspondence, and newsletters are valuable marketing tools. While her book discusses the baseline hourly rate and moonlighting as an excellent way to test the freelance market, its emphasis is on articulating what one can and cannot accept in order to draw lines with clients. It is this deep understanding that also promotes business.

Being good at business is being creative too. Connecting with a prospective client and dotting all the i’s and crossing all the t’s will promote one’s work and make the engagement rewarding. Keeping records helps both oneself and prospective clients understand the work. It is this difference that sets a profession apart from a hobby for creative people. One example of this difference is that a well-written website or business brochure does not by itself produce good leads and targeted marketing. One can also stand out from the competition through speaking, listening, writing, coaching, analyzing, meditating, and networking. Staying in business means always playing to these strengths in full.

A business blueprint is about strategy and the right business model. It outlines the business objectives, the marketing strategies, the legal needs, a profile of ideal clients, the location to target, and how to go about the transactions. Knowing whom to impress is half the battle in a marketing campaign. Always keep contact information handy in any material about the work, such as a portfolio. A bubbly personality and the ability to carry a conversation can help even with cold-calling. Copywriter Julie Cortes says a client could either love or hate your work. That’s their opinion and they are entitled to it. Contracts and customer service often trump talent.

These are some of the ways in which creative people can improve their business skills.

Sunday, October 29, 2023

This is a summary of the book “The Cold Start Problem” written by Andrew Chen and published by Harper Business in 2021. The author, a general partner at Andreessen Horowitz, explores the network effects behind the growth and success of several companies such as Reddit, Microsoft, Uber, YouTube and Craigslist. He presents a detailed study of network effects with examples and insights. A startup in the planning stage will find many useful tips in this book, and Chen also provides guidance for sustaining strong results once a startup reaches “escape velocity”.

Network effects stand in contrast to product feature developments. Traditionally, makers of technology goods have focused on building more and better features based on how users use their products. Instead, networked products focus on user interactions and grow by attracting more users. There are three types of network effects that can drive a product’s success:

1. The acquisition effect, where the user population increases through viral growth, building the company’s economic base.

2. The engagement effect, where users increase their involvement as the network expands. When products scale, re-engaging lapsed users becomes a powerful driver.

3. The economic effect, where growth kicks in as monetization and revenue per user increase.

Ecology may provide a framework for understanding network growth. The “Allee threshold”, or tipping point, is the critical number of members in a social animal group. Below this number, survival prospects wane.

Growing a user base is a different focus than building software. An established competitor can duplicate the features of a startup, but capturing its network is another matter altogether. Network effects impel growth and provide competitive advantage.

Losing initial customers because the new product does not yet have enough users is called the cold start problem. For example, the number of rideshare drivers in a city is critical. If riders must wait half an hour for a ride, the rideshare company is not providing value. Adding more drivers increases customers.

Networked products focus on the experiences that users have with each other, while traditional products focus on how users interact with the software itself. The cold start problem can be overcome by building an “atomic network” before launching a new product. An atomic network is the smallest possible self-sustaining network. Building the first network can be hard for a startup, and its mainstream relevance may not be apparent at first, but it is significant. For example, Tiny Speck was building a game with remote workers who communicated using an archaic Internet Relay Chat technology. Although the game was not successful, the company enhanced and adapted its chat tool and named it Slack. The CEO asked friends at other companies to try Slack, and while most of them were startups themselves, Slack’s client network expanded and the product gained more features. When it made its debut, it gained 8,000 companies. Within a year, it had 135,000 paid subscribers and up to ten thousand daily signups.

Assets also fuel networks. Gaining drivers of competing rideshare companies helped one company gain advantage over another. Even dating apps look for attractive people as assets for match making. When a new product succeeds, considering who uses it and how they differ from category to category is important. Marketplaces such as eBay and Airbnb must have sellers. The supply of goods being sold must precede and sustain buyer demand.

Network effects happen only at scale. Zoom, for example, improved on incumbents by letting people join with a link and by providing high-quality video. Once a few people adopted Zoom, it quickly expanded virally. Strategies for building networks include “invite only”, “come for the tool”, and financial incentives. The invite-only feature fueled LinkedIn’s explosive growth. Financial incentives date back to 1888, when Coca-Cola offered a coupon for a free Coke. The author says hustle and creativity help tip over markets because no two atomic networks are the same. Also, when a product reaches scale, negative forces may impede further expansion. Forces that undermine growth include churn, market saturation, regulatory measures, trolling, spamming and fraud. Networks also suffer from overcrowding. Applying algorithms that optimize for engagement may promote controversial content, and new competitors may try cherry-picking from an incumbent. Finally, growth does not continue forever. But network effects are powerful, and they are undeniable drivers of growth and success.

Saturday, October 28, 2023

This is a continuation of articles on Infrastructure-as-Code, or IaC for short. There’s no denying that IaC helps to create and manage infrastructure and that it can be versioned, reused, and shared, all of which helps to provision resources quickly and manage them consistently throughout their lifecycle. Unlike software product code, which must be general purpose, provide a strong foundation for system architecture, and aspire to be a platform for many use cases, IaC often varies a lot and must be manifested in different combinations depending on environment, purpose, and scale, and it can encompass the complete development process. It can even include CI/CD platforms, DevOps, and testing tools. The DevOps-based approach is critical to rapid software development cycles. As a result, IaC spreads over a variety of forms. The more articulated the IaC, the more predictable and cleaner the deployments.

One challenge that is somewhat unique to IaC is that authors frequently encounter errors in the ‘apply’ stage that were not detected in the ‘plan’ stage. This leads to write-once-and-fix-many-times and appears to be unavoidable. The compiler catches only a limited set of errors, such as when a key is specified instead of an id; whether an id is correct can only be found out by trying it at runtime. A GUID for a principal id is common in role assignments, but whether the GUID is appropriate for a particular role assignment depends on the principal to which it belongs as well as the intended target. One way to overcome this limitation is to have a pre-production environment where the code can be applied in a similar way. Because a non-production environment maintains a separate set of resources from the production environment, even this is sometimes difficult to do. In such cases, some experimentation might be involved, where the IaC is applied once to add and again to remove, leaving behind a clean slate. Both non-production and production environments are secured with DevOps pipelines, so pushing IaC to these environments means raising a request and following it through each time. Fortunately, there is a better way: scope down the problematic or suspicious IaC snippets and try them out in a personal Azure subscription. This approach resolves most doubts and works without the touch points required for pipelines. And since the sandbox is of no concern to the business, organizations often make one available to all employees, and public clouds offer them as free accounts.

Another challenge that routinely requires experimentation is applying permissions to managed identities. Every resource can have its own system-assigned managed identity, but a deployment comprising resources and their dependencies can share a common user-assigned managed identity to govern them. In this case, the identity must be granted permission on all those resources. Several built-in roles, varying per resource, are applicable to the environment, but the principle of least privilege can only be honored by increasing privileges step by step. This calls for trying out a gradation of built-in roles until application deployments succeed.

Similarly, access is also about connectivity, and it might be surprising that an HTTP 404 status code can also imply a network failure when the error is translated from an upstream resource. Some resources have mutual exclusivity between public access and private access. Granting public access restricted to some IP addresses might be a hybrid approach that secures resources sufficiently. It is also important to note that Azure services can bypass general deny rules.

These are some resolutions that can be categorized as miscellaneous under the IaC.

Friday, October 27, 2023

Securing compute for an Azure Machine Learning workspace:

An Azure Machine Learning compute instance is a managed cloud-based workstation dedicated to a single owner, usually for data analysis. It serves as a fully configured and managed development environment or as a compute target for training models and running inference. Models can be built and deployed using integrated notebooks and tools. A compute instance differs from a compute cluster in that it has a single node.

IT administrators prefer this compute for its enterprise readiness capabilities. They leverage IaC or Resource Manager templates to create instances for users. Using advanced settings or security settings, they can further lock down the instance, for example by enabling or disabling SSH or specifying the subnet for the compute instance. They might also need to prevent users from creating compute themselves. In all these cases, some control is necessary.

One option is to list the operations available on the resource and then set up role-based access control that limits some of them. This approach is favored because users can be switched between roles without affecting the resource or its deployment. It also works for groups, and users can be added to or removed from both groups and roles. Listing the operations enumerates the associated permissions, all of which begin with the provider as the prefix. This listing is thorough and covers all aspects of working with the resources. The custom role is described in terms of permitted ‘actions’, ‘data-actions’, and ‘not-actions’, where the first two correspond to control-plane and data-plane operations and the last is the set of operations subtracted from the permitted actions. By selecting the necessary action privileges and listing them under a specific category without the categories overlapping, we create a custom role with just the minimum privileges needed to complete a set of selected tasks.
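
As a rough illustration of how such a role definition resolves, the sketch below models the effective control-plane permissions of one role as its actions minus its not-actions, with wildcard matching. The role name, the scope placeholder, and the helper is_permitted are hypothetical, and the operation strings are illustrative; the actual names should be taken from the provider's operation listing.

```python
from fnmatch import fnmatchcase

def is_permitted(operation, role):
    """Return True if this role alone grants the control-plane operation.
    Operation names are compared case-insensitively; '*' is a wildcard."""
    op = operation.lower()

    def matches(patterns):
        return any(fnmatchcase(op, pattern.lower()) for pattern in patterns)

    return matches(role["Actions"]) and not matches(role["NotActions"])

# Hypothetical custom role: everything on workspaces except creating or deleting compute.
custom_role = {
    "Name": "Workspace User Without Compute Creation (hypothetical)",
    "Actions": ["Microsoft.MachineLearningServices/workspaces/*"],
    "NotActions": [
        "Microsoft.MachineLearningServices/workspaces/computes/write",
        "Microsoft.MachineLearningServices/workspaces/computes/delete",
    ],
    "DataActions": [],
    "NotDataActions": [],
    "AssignableScopes": ["/subscriptions/<subscription-id>"],  # placeholder
}

for op in (
    "Microsoft.MachineLearningServices/workspaces/computes/read",
    "Microsoft.MachineLearningServices/workspaces/computes/write",
    "Microsoft.Storage/storageAccounts/read",
):
    print(op, is_permitted(op, custom_role))
```

The check covers a single role only; a user who holds several roles gets the union of what each one grants.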

Another option is to supply an init script with the associated resource, so that as other users start using it, the init script will set the predefined configuration that they must work with. This allows for some degree of control on sub resources and associated containers necessary for an action to complete so that by virtue of removing those resources, an action even if permitted by a role on a resource type, may not be permitted on a specific resource.

These are some techniques to secure the compute instance for an Azure Machine Learning workspace.

Thursday, October 26, 2023

Potential applications of machine learning

The MicrosoftML package provides fast and scalable machine learning algorithms for classification, regression and anomaly detection.

The rxFastLinear algorithm is a fast linear model trainer based on the Stochastic Dual Coordinate Ascent (SDCA) method. It combines the capabilities of logistic regression and SVM algorithms. SDCA works on the dual of the regularized loss minimization problem: it maximizes a dual objective built from scalar convex loss functions and a regularization term on the weight vector. It supports three types of loss functions: log loss, hinge loss, and smoothed hinge loss. It is used for applications such as payment default prediction and email spam filtering.
The rxOneClassSVM algorithm is used for anomaly detection, such as credit card fraud detection. It is a simple one-class support vector machine that helps detect outliers that do not belong to some target class, because the training set contains only examples from the target class.
The rxFastTrees algorithm is a fast tree algorithm used for binary classification or regression. It can be used for bankruptcy prediction. It is an implementation of FastRank, which is a form of the MART gradient boosting algorithm. It builds each regression tree in a stepwise fashion using a predefined loss function. The loss function measures the error in the current step so that the next step can correct it.
The rxFastForest algorithm is a fast forest algorithm also used for binary classification or regression. It can be used for churn prediction. It builds several decision trees using the regression tree learner in rxFastTrees. An aggregation over the resulting trees then finds a Gaussian distribution closest to the combined distribution for all trees in the model.
The rxNeuralNet algorithm is a neural network implementation that helps with multiclass classification and regression. It is helpful for applications such as signature prediction, OCR, and click prediction. A neural network is a weighted directed graph arranged in layers, where the nodes in one layer are connected by weighted edges to the nodes in another layer. The algorithm adjusts the weights on the graph edges based on the training data.
The rxLogisticRegression algorithm is a binary and multiclass classifier that can, for example, classify sentiment from feedback. This is a regular regression model where the variable that determines the category is dependent on one or more independent variables that have a logistic distribution.