Applying the MicrosoftML rxFastTrees algorithm to classify claims:
Logistic regression is a well-known statistical technique used to model binary outcomes, and it can be applied to detect the root causes of payment errors. It is highly flexible, accepts many kinds of input, and supports several different analytical tasks. Because it models the log-odds of a binary outcome rather than the outcome itself, it dampens the effect of extreme values while weighing the several factors that influence the pair of outcomes.
Logistic regression differs from other regression techniques in the statistical measures it uses. Regression in general is useful for estimating a linear relationship between a dependent variable and one or more independent variables, and then using that relationship for prediction. Payment errors tend to show elongated scatter plots within specific categories, and even when the errors in the same category carry different error details, they can still be plotted and correlated. This makes the technique suitable for analyzing specific error categories from an account.
One advantage of logistic regression is that the algorithm is highly flexible, taking many kinds of input, and supports several different analytical tasks (a short sketch follows this list):
Use demographics to make predictions about outcomes, such as the probability of defaulting on payments.
Explore and weigh the factors that contribute to a result. For example, find the factors that influence customers to make a repeat past-due payment.
Classify claims, payments, or other objects that have many attributes.
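To make the first of these tasks concrete, here is a minimal sketch of fitting a logistic regression to predict payment default with scikit-learn. The file name and the columns (age, income, prior_late_payments, defaulted) are illustrative assumptions and not part of the sample application shown later.
# Minimal sketch: logistic regression for payment-default prediction.
# The CSV path and column names below are hypothetical.
import pandas
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

data = pandas.read_csv("payments.csv")                 # hypothetical file
features = ["age", "income", "prior_late_payments"]    # hypothetical predictors
X_train, X_test, y_train, y_test = train_test_split(
    data[features], data["defaulted"], random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predicted probability of default for a few held-out accounts.
print(model.predict_proba(X_test)[:, 1][:5])
# The coefficients indicate how each factor weighs on the outcome.
print(dict(zip(features, model.coef_[0])))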
rxFastTrees is a fast tree algorithm used for binary classification or regression; it can be applied to payment-default prediction and to classifying claims. It is an implementation of FastRank, a form of the MART gradient boosting algorithm. It builds each regression tree in a stepwise fashion using a predefined loss function: the loss function measures the error at the current step so that the next tree can correct it. The term boosting denotes improving the numerical optimization in function space by relating it to steepest-descent minimization. When the individual additive components are regression trees, this boosting is termed TreeBoost. Gradient boosting of regression trees is said to produce competitive, highly robust, interpretable procedures for both regression and classification.
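As a rough illustration of how this stepwise tree building is exposed, the call below shows rx_fast_trees with a few commonly tuned parameters. The parameter names (num_trees, num_leaves, learning_rate, min_split) reflect my reading of the MicrosoftML Python API and should be verified against the documentation; the formula and training frame are placeholders.
# Hedged sketch: tuning rx_fast_trees (parameter names assumed from the
# MicrosoftML Python documentation; verify before use).
from microsoftml import rx_fast_trees

model = rx_fast_trees(
    "y ~ X1 + X2 + X3",       # placeholder formula
    data=train,                # placeholder training frame
    num_trees=100,             # number of boosted regression trees (assumed default)
    num_leaves=20,             # leaves per tree (assumed default)
    learning_rate=0.2,         # shrinkage applied to each boosting step (assumed default)
    min_split=10)              # minimum examples needed to split a node (assumed default)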
When the mapping function is restricted to be a member of a parameterized class of functions, it can be represented as a weighted summation of the individual functions in the parameterized set. This is called an additive expansion, and it is very helpful for approximation. With gradient boosting, the constraint is applied to the rough solution by fitting the parameterized functions to "pseudo-responses", the negative gradients of the loss at the current approximation. This permits replacing the difficult minimization problem with a least-squares minimization followed by only a single optimization based on the original criterion.
The gradient boosting algorithm can be described as follows (a small sketch follows the steps):
1. Describe the problem as minimizing a loss function over a parameterized class of functions.
2. For each boosting stage m from 1 to M, do:
3. Compute the pseudo-responses as the negative gradient of the loss at each training example i = 1 to N.
4. Fit the mapping function to the pseudo-responses, a smoothed negative gradient, using a fitting criterion such as least squares.
5. Perform a line search along this constrained negative gradient, as in steepest descent, taking the step size that leads to the minimum.
6. Update the approximation by taking a step along the line-search direction.
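The following is a minimal sketch of these steps for binary classification with the logistic loss, using small regression trees as the parameterized functions. The tree depth and the fixed shrinkage standing in for the line search are illustrative assumptions; a library such as rxFastTrees handles these details more carefully.
# Hedged sketch of gradient boosting with regression trees (logistic loss).
# X is a 2D feature array, y a 0/1 label array.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gradient_boost(X, y, n_stages=50, learning_rate=0.1):
    # Step 1: the problem is minimizing logistic loss over an additive tree expansion.
    F = np.zeros(len(y))                  # initial approximation
    trees = []
    for m in range(n_stages):             # Step 2: boosting stages 1..M
        p = 1.0 / (1.0 + np.exp(-F))      # current predicted probability
        pseudo = y - p                    # Step 3: negative gradient of the logistic loss
        tree = DecisionTreeRegressor(max_depth=3)
        tree.fit(X, pseudo)               # Step 4: least-squares fit to the pseudo-responses
        F = F + learning_rate * tree.predict(X)   # Steps 5-6: fixed shrinkage stands in for the line search
        trees.append(tree)
    return trees

def predict_gradient_boost(trees, X, learning_rate=0.1):
    F = np.zeros(X.shape[0])
    for tree in trees:
        F = F + learning_rate * tree.predict(X)
    return (1.0 / (1.0 + np.exp(-F)) > 0.5).astype(int)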
Prediction rates can be improved, and false positives reduced, by using real-time behavioral profiling as well as historical profiling. Big Data platforms, commodity hardware, and historical data going back as far as three years all help with accuracy. This allows an error to be predicted almost as soon as it is committed, although true real-time processing implies stringent response times.
The boosting algorithm for least-squares regression can be written as follows (a brief sketch follows the steps):
1. Set the initial approximation.
2. For a set of successive increments, or boosts, each based on the preceding iterations, do:
3. Calculate the new residuals.
4. Find the line of search by fitting to and minimizing the residuals.
5. Perform the boost along the line of search.
6. Repeat steps 3, 4, and 5 for each iteration of step 2.
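Under squared-error loss the pseudo-responses are simply the residuals, so the loop reduces to repeatedly fitting a tree to whatever the current model gets wrong. The sketch below makes the same illustrative assumptions as the previous example.
# Hedged sketch of least-squares boosting: each stage fits a tree to the residuals.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_ls_boost(X, y, n_stages=50, learning_rate=0.1):
    F = np.full(len(y), y.mean())         # Step 1: initial approximation (the mean)
    trees = []
    for m in range(n_stages):             # Step 2: successive boosts
        residuals = y - F                 # Step 3: new residuals
        tree = DecisionTreeRegressor(max_depth=3)
        tree.fit(X, residuals)            # Step 4: line of search from the least-squares fit
        F = F + learning_rate * tree.predict(X)   # Step 5: boost along that direction
        trees.append(tree)                # Step 6: repeat on the next iteration
    return y.mean(), trees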
Conjugate gradient descent, given an input matrix A, a vector b, a starting value x, a maximum number of iterations i-max, and an error tolerance epsilon < 1, can be described in this way (a numpy implementation follows):
set i to 0
set residual to b - Ax
set search-direction to residual
set delta-new to the dot product residual-transposed . residual
set delta-0 to delta-new
while i < i-max and delta-new > epsilon^2 . delta-0 do:
    q = A . search-direction
    alpha = delta-new / (search-direction-transposed . q)
    x = x + alpha . search-direction
    if i is divisible by 50:
        residual = b - Ax          (recompute to limit accumulated round-off error)
    else:
        residual = residual - alpha . q
    delta-old = delta-new
    delta-new = residual-transposed . residual
    beta = delta-new / delta-old
    search-direction = residual + beta . search-direction
    i = i + 1
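A direct numpy transcription of this pseudocode might look like the following; the matrix, right-hand side, and tolerance in the usage lines are illustrative assumptions.
# Hedged numpy sketch of the conjugate gradient pseudocode above.
import numpy as np

def conjugate_gradient(A, b, x, i_max=1000, epsilon=1e-8):
    i = 0
    r = b - A @ x                       # residual
    d = r.copy()                        # search direction
    delta_new = r @ r
    delta_0 = delta_new
    while i < i_max and delta_new > epsilon ** 2 * delta_0:
        q = A @ d
        alpha = delta_new / (d @ q)
        x = x + alpha * d
        if i % 50 == 0:
            r = b - A @ x               # periodically recompute to control round-off
        else:
            r = r - alpha * q
        delta_old = delta_new
        delta_new = r @ r
        beta = delta_new / delta_old
        d = r + beta * d
        i += 1
    return x

# Illustrative usage on a small symmetric positive-definite system (assumed data).
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(conjugate_gradient(A, b, np.zeros(2)))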
Sample application:
#!/usr/bin/env python
import matplotlib.pyplot as plt
import pandas
import os

# Locate the data file relative to this script.
here = os.path.dirname(__file__) if "__file__" in locals() else "."
data_file = os.path.join(here, "data", "payment_errors", "data.csv")
data = pandas.read_csv(data_file, sep=",")

# y is the last column and the variable we want to predict. It has a boolean value.
data["y"] = data["y"].astype("category")
print(data.head(2))
print(data.shape)

# Recode the label as 0/1 and show the class balance.
data["y"] = data["y"].apply(lambda x: 1 if x == 1 else 0)
print(data[["y", "X1"]].groupby("y").count())

# train_test_split moved packages between scikit-learn versions.
try:
    from sklearn.model_selection import train_test_split
except ImportError:
    from sklearn.cross_validation import train_test_split
train, test = train_test_split(data)

import numpy as np
from microsoftml import rx_fast_trees, rx_predict

# Train a fast-trees model on every feature column, then score the holdout set.
features = [c for c in train.columns if c.startswith("X")]
model = rx_fast_trees("y ~ " + "+".join(features), data=train)
pred = rx_predict(model, test, extra_vars_to_write=["y"])
print(pred.head())
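To close the loop, one might compare the predictions with the held-out labels. The column name PredictedLabel below is my assumption about the rx_predict output for a binary classifier; inspect pred.columns to confirm before relying on it.
# Hedged follow-up: rough accuracy check on the holdout set.
# "PredictedLabel" is an assumed column name in the rx_predict output.
print(pred.columns)
accuracy = (pred["PredictedLabel"].astype(int) == pred["y"].astype(int)).mean()
print("holdout accuracy: {:.3f}".format(accuracy))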