Applying the MicrosoftML rxFastTrees algorithm to classify claims:
Logistic regression is a well-known statistical technique used to model binary outcomes, and it can be applied to detect the root causes of payment errors. It is highly flexible, accepts many kinds of input, and supports several different analytical tasks. Because it models the log-odds of a binary outcome rather than the outcome itself, it dampens the effect of extreme values while weighing the several factors that influence the pair of outcomes.
Logistic regression differs from other regression techniques in the statistical measures it uses. Regression in general is useful for estimating a linear relationship between a dependent variable and one or more independent variables, and then using that relationship for prediction. Payment errors tend to show elongated scatter plots within specific categories, and even when the errors in the same category carry different error details, they can still be plotted and correlated. This makes the technique suitable for analyzing specific error categories from an account.
One advantage of logistic regression is that the algorithm is highly flexible, taking many kinds of input, and supports several different analytical tasks (a short sketch follows this list):
Use demographics to make predictions about outcomes, such as the probability of defaulting on payments.
Explore and weigh the factors that contribute to a result. For example, find the factors that influence customers to make a repeat past-due payment.
Classify claims, payments, or other objects that have many attributes.
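To make the first of these tasks concrete, here is a minimal sketch of fitting a logistic regression to predict payment default with scikit-learn. The file name and the columns (age, income, prior_late_payments, defaulted) are illustrative assumptions and not part of the sample application shown later.
# Minimal sketch: logistic regression for payment-default prediction.
# The CSV path and column names below are hypothetical.
import pandas
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

data = pandas.read_csv("payments.csv")                 # hypothetical file
features = ["age", "income", "prior_late_payments"]    # hypothetical predictors
X_train, X_test, y_train, y_test = train_test_split(
    data[features], data["defaulted"], random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predicted probability of default for a few held-out accounts.
print(model.predict_proba(X_test)[:, 1][:5])
# The coefficients indicate how each factor weighs on the outcome.
print(dict(zip(features, model.coef_[0])))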
rxFastTrees is a fast tree algorithm used for binary classification or regression; it can be applied to payment-default prediction and to classifying claims. It is an implementation of FastRank, a form of the MART gradient boosting algorithm. It builds each regression tree in a stepwise fashion using a predefined loss function: the loss function measures the error at the current step so that the next tree can correct it. The term boosting denotes improving the numerical optimization in function space by relating it to steepest-descent minimization. When the individual additive components are regression trees, this boosting is termed TreeBoost. Gradient boosting of regression trees is said to produce competitive, highly robust, interpretable procedures for both regression and classification.
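As a rough illustration of how this stepwise tree building is exposed, the call below shows rx_fast_trees with a few commonly tuned parameters. The parameter names (num_trees, num_leaves, learning_rate, min_split) reflect my reading of the MicrosoftML Python API and should be verified against the documentation; the formula and training frame are placeholders.
# Hedged sketch: tuning rx_fast_trees (parameter names assumed from the
# MicrosoftML Python documentation; verify before use).
from microsoftml import rx_fast_trees

model = rx_fast_trees(
    "y ~ X1 + X2 + X3",       # placeholder formula
    data=train,                # placeholder training frame
    num_trees=100,             # number of boosted regression trees (assumed default)
    num_leaves=20,             # leaves per tree (assumed default)
    learning_rate=0.2,         # shrinkage applied to each boosting step (assumed default)
    min_split=10)              # minimum examples needed to split a node (assumed default)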
When the mapping function is restricted to be a member of a parameterized class of functions, it can be represented as a weighted summation of the individual functions in the parameterized set. This is called an additive expansion, and it is very helpful for approximation. With gradient boosting, the constraint is applied to the rough solution by fitting the parameterized functions to "pseudo-responses", the negative gradients of the loss at the current approximation. This permits replacing the difficult minimization problem with a least-squares minimization followed by only a single optimization based on the original criterion.
The gradient boosting algorithm can be described as follows (a small sketch follows the steps):
1. Describe the problem as minimizing a loss function over a parameterized class of functions.
2. For each boosting stage m from 1 to M, do:
3. Compute the pseudo-responses as the negative gradient of the loss at each training example i = 1 to N.
4. Fit the mapping function to the pseudo-responses, a smoothed negative gradient, using a fitting criterion such as least squares.
5. Perform a line search along this constrained negative gradient, as in steepest descent, taking the step size that leads to the minimum.
6. Update the approximation by taking a step along the line-search direction.
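The following is a minimal sketch of these steps for binary classification with the logistic loss, using small regression trees as the parameterized functions. The tree depth and the fixed shrinkage standing in for the line search are illustrative assumptions; a library such as rxFastTrees handles these details more carefully.
# Hedged sketch of gradient boosting with regression trees (logistic loss).
# X is a 2D feature array, y a 0/1 label array.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gradient_boost(X, y, n_stages=50, learning_rate=0.1):
    # Step 1: the problem is minimizing logistic loss over an additive tree expansion.
    F = np.zeros(len(y))                  # initial approximation
    trees = []
    for m in range(n_stages):             # Step 2: boosting stages 1..M
        p = 1.0 / (1.0 + np.exp(-F))      # current predicted probability
        pseudo = y - p                    # Step 3: negative gradient of the logistic loss
        tree = DecisionTreeRegressor(max_depth=3)
        tree.fit(X, pseudo)               # Step 4: least-squares fit to the pseudo-responses
        F = F + learning_rate * tree.predict(X)   # Steps 5-6: fixed shrinkage stands in for the line search
        trees.append(tree)
    return trees

def predict_gradient_boost(trees, X, learning_rate=0.1):
    F = np.zeros(X.shape[0])
    for tree in trees:
        F = F + learning_rate * tree.predict(X)
    return (1.0 / (1.0 + np.exp(-F)) > 0.5).astype(int)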
Prediction rates can be improved, and false positives reduced, by using real-time behavioral profiling as well as historical profiling. Big Data platforms, commodity hardware, and historical data going back as far as three years all help with accuracy. This allows an error to be predicted almost as soon as it is committed, although true real-time processing implies stringent response times.
The boosting algorithm for least-squares regression can be written as follows (a brief sketch follows the steps):
1. Set the initial approximation.
2. For a set of successive increments, or boosts, each based on the preceding iterations, do:
3. Calculate the new residuals.
4. Find the line of search by fitting to and minimizing the residuals.
5. Perform the boost along the line of search.
6. Repeat steps 3, 4, and 5 for each iteration of step 2.
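Under squared-error loss the pseudo-responses are simply the residuals, so the loop reduces to repeatedly fitting a tree to whatever the current model gets wrong. The sketch below makes the same illustrative assumptions as the previous example.
# Hedged sketch of least-squares boosting: each stage fits a tree to the residuals.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_ls_boost(X, y, n_stages=50, learning_rate=0.1):
    F = np.full(len(y), y.mean())         # Step 1: initial approximation (the mean)
    trees = []
    for m in range(n_stages):             # Step 2: successive boosts
        residuals = y - F                 # Step 3: new residuals
        tree = DecisionTreeRegressor(max_depth=3)
        tree.fit(X, residuals)            # Step 4: line of search from the least-squares fit
        F = F + learning_rate * tree.predict(X)   # Step 5: boost along that direction
        trees.append(tree)                # Step 6: repeat on the next iteration
    return y.mean(), trees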
Conjugate gradient descent, given an input matrix A, a vector b, a starting value x, a maximum number of iterations i-max, and an error tolerance epsilon < 1, can be described in this way (a numpy implementation follows):
set i to 0
set residual to b - Ax
set search-direction to residual
set delta-new to the dot product residual-transposed . residual
set delta-0 to delta-new
while i < i-max and delta-new > epsilon^2 . delta-0 do:
    q = A . search-direction
    alpha = delta-new / (search-direction-transposed . q)
    x = x + alpha . search-direction
    if i is divisible by 50:
        residual = b - Ax          (recompute to limit accumulated round-off error)
    else:
        residual = residual - alpha . q
    delta-old = delta-new
    delta-new = residual-transposed . residual
    beta = delta-new / delta-old
    search-direction = residual + beta . search-direction
    i = i + 1
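A direct numpy transcription of this pseudocode might look like the following; the matrix, right-hand side, and tolerance in the usage lines are illustrative assumptions.
# Hedged numpy sketch of the conjugate gradient pseudocode above.
import numpy as np

def conjugate_gradient(A, b, x, i_max=1000, epsilon=1e-8):
    i = 0
    r = b - A @ x                       # residual
    d = r.copy()                        # search direction
    delta_new = r @ r
    delta_0 = delta_new
    while i < i_max and delta_new > epsilon ** 2 * delta_0:
        q = A @ d
        alpha = delta_new / (d @ q)
        x = x + alpha * d
        if i % 50 == 0:
            r = b - A @ x               # periodically recompute to control round-off
        else:
            r = r - alpha * q
        delta_old = delta_new
        delta_new = r @ r
        beta = delta_new / delta_old
        d = r + beta * d
        i += 1
    return x

# Illustrative usage on a small symmetric positive-definite system (assumed data).
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(conjugate_gradient(A, b, np.zeros(2)))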
Sample application:
#!/usr/bin/env python
import matplotlib.pyplot as plt
import pandas
import os

# Locate the data file relative to this script.
here = os.path.dirname(__file__) if "__file__" in locals() else "."
data_file = os.path.join(here, "data", "payment_errors", "data.csv")
data = pandas.read_csv(data_file, sep=",")

# y is the last column and the variable we want to predict. It has a boolean value.
data["y"] = data["y"].astype("category")
print(data.head(2))
print(data.shape)

# Recode the label as 0/1 and show the class balance.
data["y"] = data["y"].apply(lambda x: 1 if x == 1 else 0)
print(data[["y", "X1"]].groupby("y").count())

# train_test_split moved packages between scikit-learn versions.
try:
    from sklearn.model_selection import train_test_split
except ImportError:
    from sklearn.cross_validation import train_test_split
train, test = train_test_split(data)

import numpy as np
from microsoftml import rx_fast_trees, rx_predict

# Train a fast-trees model on every feature column, then score the holdout set.
features = [c for c in train.columns if c.startswith("X")]
model = rx_fast_trees("y ~ " + "+".join(features), data=train)
pred = rx_predict(model, test, extra_vars_to_write=["y"])
print(pred.head())
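To close the loop, one might compare the predictions with the held-out labels. The column name PredictedLabel below is my assumption about the rx_predict output for a binary classifier; inspect pred.columns to confirm before relying on it.
# Hedged follow-up: rough accuracy check on the holdout set.
# "PredictedLabel" is an assumed column name in the rx_predict output.
print(pred.columns)
accuracy = (pred["PredictedLabel"].astype(int) == pred["y"].astype(int)).mean()
print("holdout accuracy: {:.3f}".format(accuracy))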