Saturday, November 4, 2023

 

Remove Digit From Number to Maximize Result

You are given a string number representing a positive integer and a character digit.

Return the resulting string after removing exactly one occurrence of digit from number such that the value of the resulting string in decimal form is maximized. The test cases are generated such that digit occurs at least once in number.

 

Example 1:

Input: number = "123", digit = "3"

Output: "12"

Explanation: There is only one '3' in "123". After removing '3', the result is "12".

Example 2:

Input: number = "1231", digit = "1"

Output: "231"

Explanation: We can remove the first '1' to get "231" or remove the second '1' to get "123".

Since 231 > 123, we return "231".

Example 3:

Input: number = "551", digit = "5"

Output: "51"

Explanation: We can remove either the first or second '5' from "551".

Both result in the string "51".

 

Constraints:

· 2 <= number.length <= 100

· number consists of digits from '1' to '9'.

· digit is a digit from '1' to '9'.

· digit occurs at least once in number.

Solution:

import java.util.*;

class Program
{
    public static void main(String[] args)
    {
        System.out.println(getMaxWithOneDigitRemoval("123", '3'));  // 12
        System.out.println(getMaxWithOneDigitRemoval("1231", '1')); // 231
        System.out.println(getMaxWithOneDigitRemoval("551", '5'));  // 51
    }

    private static String getMaxWithOneDigitRemoval(String input, char digit)
    {
        if (input == null || input.isEmpty()) return input;

        // Collect every index at which the digit occurs.
        List<Integer> locations = new ArrayList<>();
        for (int i = 0; i < input.length(); i++)
        {
            if (input.charAt(i) == digit)
            {
                locations.add(i);
            }
        }

        // Build one candidate per occurrence by removing that occurrence.
        // All candidates have the same length, so lexicographic comparison
        // agrees with numeric comparison and avoids integer overflow for
        // inputs of up to 100 digits.
        List<String> candidates = new ArrayList<>();
        locations.forEach(x -> candidates.add(removeAt(input, x)));
        return Collections.max(candidates);
    }

    private static String removeAt(String input, int index)
    {
        return input.substring(0, index) + input.substring(index + 1);
    }
}

Thursday, November 2, 2023

Applying rxFastTrees

 

Applying the MicrosoftML rxFastTrees algorithm to classify claims:

Logistic regression is a well-known statistical technique used to model binary outcomes. It can be applied to detect the root causes of payment errors. It is highly flexible, takes almost any kind of input, and supports several different analytical tasks. It also dampens the effect of extreme values and weighs the several factors that influence a binary outcome.

Logistic regression differs from other regression techniques in its use of statistical measures. Regression in general is useful for calculating a linear relationship between a dependent and an independent variable and then using that relationship for prediction. Payment errors show elongated scatter plots within specific categories, and even when errors in the same category carry different error details, they can be plotted and correlated. This makes the technique suitable for specific error categories from an account.

One advantage of logistic regression is that the algorithm is highly flexible, taking any kind of input, and supports several different analytical tasks:

· Use demographics to make predictions about outcomes, such as the probability of defaulting on payments.

· Explore and weigh the factors that contribute to a result. For example, find the factors that influence customers to make a repeat past-due payment.

· Classify claims, payments, or other objects that have many attributes.
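As a minimal illustration of the first task, a logistic regression can be fit on a handful of hypothetical demographic features. scikit-learn and the column names here are stand-ins, not part of the MicrosoftML pipeline:

import pandas
from sklearn.linear_model import LogisticRegression

# Hypothetical demographic features and a binary payment-default label.
df = pandas.DataFrame({"age": [25, 40, 35, 60, 30, 50],
                       "balance": [500.0, 150.0, 300.0, 80.0, 450.0, 120.0],
                       "default": [0, 1, 0, 1, 0, 1]})
model = LogisticRegression().fit(df[["age", "balance"]], df["default"])
# Predicted probability of default for each row.
print(model.predict_proba(df[["age", "balance"]])[:, 1])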

rxFastTrees is a fast tree algorithm used for binary classification or regression. It can be used for payment default prediction and for classifying claims. It is an implementation of FastRank, a form of the MART gradient boosting algorithm. It builds each regression tree in a stepwise fashion using a predefined loss function; the loss function measures the error at the current step so that the next tree can correct it. The term boosting denotes the improvements in numerical optimization in function space obtained by correlating the steps with steepest-descent minimization. When the individual additive components are regression trees, this boosting is termed TreeBoost. Gradient boosting of regression trees is said to produce competitive, highly robust, interpretable procedures for both regression and classification.

When the mapping function is restricted to be a member of a parameterized class of functions, it can be represented as a weighted summation of the individual functions in the parameterized set. This is called an additive expansion, and the technique is very helpful for approximations. With gradient boosting, the constraint is applied to the rough solution by fitting the parameterized function set to obtain "pseudoresponses". This permits the replacement of the difficult minimization problem with a least-squares minimization, followed by only a single optimization based on the original criterion.

The gradient boosting algorithm can be described as follows (a minimal sketch in code appears after the list):

1. Describe the problem as a minimization over a parameterized class of functions.

2. For each stage m from 1 to M, do steps 3 through 6:

3. Fit the mapping function to the pseudoresponses, calculated as the negative gradient for i = 1 to N.

4. Find the smoothed negative gradient by using a fitting criterion such as least squares.

5. Perform a line search along the constrained negative gradient, as in steepest descent, taking the step that leads to the minimum.

6. Update the approximation by taking a step along the line-search direction.
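A minimal sketch of this loop for least-squares loss, where the pseudoresponse (the negative gradient) at each step is simply the residual. scikit-learn's DecisionTreeRegressor stands in for the regression-tree base learner, and a fixed learning rate stands in for the line search; this is illustrative, not the actual rxFastTrees implementation:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, M=100, learning_rate=0.1, max_depth=3):
    # Step 1: start from a constant initial approximation.
    prediction = np.full(len(y), y.mean())
    trees = []
    for m in range(M):                   # step 2: M boosting stages
        pseudoresponse = y - prediction  # step 3: negative gradient of L2 loss
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, pseudoresponse)      # step 4: least-squares fit
        # Steps 5 and 6: step along the fitted direction; the fixed
        # learning rate replaces the line search.
        prediction += learning_rate * tree.predict(X)
        trees.append(tree)
    return trees

# Usage: trees = gradient_boost(X_train, y_train), where X_train is a 2-D
# feature array and y_train a 1-D numeric target.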

Prediction rates can be boosted, and false positives can be reduced, using real-time behavioral profiling as well as historical profiling. Big data, commodity hardware, and historical data going as far back as three years help with accuracy. This enables a prediction to be made almost as early as the underlying event occurs. True real-time processing implies stringent response times.

 

The algorithm for the least-squares regression can be written as:

1. Set the initial approximation.

2. For a set of successive increments, or boosts, each based on the preceding iterations, do steps 3 through 5:

3. Calculate the new residuals.

4. Find the line of search by aggregating and minimizing the residuals.

5. Perform the boost along the line of search.

 

Conjugate gradient descent can be described, given an input matrix A, a vector b, a starting value x, a maximum number of iterations i-max, and an error tolerance epsilon < 1, in this way:

 

set i to 0
set residual to b - A.x
set search-direction to residual
set delta-new to dot-product(residual, residual)
set delta-0 to delta-new
while i < i-max and delta-new > epsilon^2 * delta-0 do:
    q = A . search-direction
    alpha = delta-new / dot-product(search-direction, q)
    x = x + alpha * search-direction
    if i is divisible by 50:
        residual = b - A.x      (recompute exactly to correct accumulated round-off)
    else:
        residual = residual - alpha * q
    delta-old = delta-new
    delta-new = dot-product(residual, residual)
    beta = delta-new / delta-old
    search-direction = residual + beta * search-direction
    i = i + 1
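A direct NumPy translation of this pseudocode, assuming A is symmetric positive definite as conjugate gradient requires (the function name and the small test system are illustrative):

import numpy as np

def conjugate_gradient(A, b, x, i_max=1000, epsilon=1e-8):
    residual = b - A @ x
    direction = residual.copy()
    delta_new = residual @ residual
    delta_0 = delta_new
    i = 0
    while i < i_max and delta_new > epsilon**2 * delta_0:
        q = A @ direction
        alpha = delta_new / (direction @ q)
        x = x + alpha * direction
        if i % 50 == 0:
            # Recompute the residual exactly every 50 steps to cancel
            # accumulated floating-point error.
            residual = b - A @ x
        else:
            residual = residual - alpha * q
        delta_old = delta_new
        delta_new = residual @ residual
        direction = residual + (delta_new / delta_old) * direction
        i += 1
    return x

# Solve a small symmetric positive definite system; A @ x should be close to b.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(conjugate_gradient(A, b, np.zeros(2)))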

 

   

Sample application:   

#!/usr/bin/env python
import os
import pandas

# Load the claims dataset; the last column y is the binary label to predict.
here = os.path.dirname(__file__) if "__file__" in locals() else "."
data_file = os.path.join(here, "data", "payment_errors", "data.csv")
data = pandas.read_csv(data_file, sep=",")
data["y"] = data["y"].astype("category")
print(data.head(2))
print(data.shape)

# Normalize the label to 0/1 and inspect the class balance.
data["y"] = data["y"].apply(lambda x: 1 if x == 1 else 0)
print(data[["y", "X1"]].groupby("y").count())

# train_test_split moved modules across scikit-learn versions.
try:
    from sklearn.model_selection import train_test_split
except ImportError:
    from sklearn.cross_validation import train_test_split
train, test = train_test_split(data)

from microsoftml import rx_fast_trees, rx_predict

# Train a boosted-tree classifier on every feature column X1..Xn.
features = [c for c in train.columns if c.startswith("X")]
model = rx_fast_trees("y ~ " + "+".join(features), data=train)
pred = rx_predict(model, test, extra_vars_to_write=["y"])
print(pred.head())

 


#codingexercise

CodingExercise-11-03-23.docx

Wednesday, November 1, 2023

 

Applying the MicrosoftML rxFastLinear algorithm to insurance payment default prediction:

Logistic regression is a well-known statistical technique used to model binary outcomes. It can be applied to detect the root causes of payment errors. It is highly flexible, takes almost any kind of input, and supports several different analytical tasks. It also dampens the effect of extreme values and weighs the several factors that influence a binary outcome.

Logistic regression differs from other regression techniques in its use of statistical measures. Regression in general is useful for calculating a linear relationship between a dependent and an independent variable and then using that relationship for prediction. Payment errors show elongated scatter plots within specific categories, and even when errors in the same category carry different error details, they can be plotted and correlated. This makes the technique suitable for specific error categories from an account.

One advantage of logistic regression is that the algorithm is highly flexible, taking any kind of input, and supports several different analytical tasks:

· Use demographics to make predictions about outcomes, such as the probability of defaulting on payments.

· Explore and weigh the factors that contribute to a result. For example, find the factors that influence customers to make a repeat past-due payment.

· Classify claims, payments, or other objects that have many attributes.

Support Vector Machines, on the other hand, can detect non-linear and complex patterns with good predictive power. These are sophisticated classification machines that build a predictive model by finding the dividing line between two categories. Among the candidate dividing lines, the one from which the data is most distant is usually chosen as the best. The points that are closest to the line are the ones that determine it, and they are called support vectors. Once the line is found, classifying new data is just a matter of placing it on the correct side of the line.
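A minimal illustration of the dividing line with scikit-learn's LinearSVC (the toy points are hypothetical, and this is not part of the MicrosoftML pipeline):

import numpy as np
from sklearn.svm import LinearSVC

# Two small clusters; the SVM finds the dividing line between them.
X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])
clf = LinearSVC().fit(X, y)
print(clf.coef_, clf.intercept_)              # parameters of the dividing line
print(clf.predict([[0.1, 0.0], [1.0, 0.9]]))  # classify new points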

The MicrosoftML rxFastLinear algorithm is a fast linear model trainer based on the Stochastic Dual Coordinate Ascent (SDCA) method. It combines the capabilities of logistic regression and SVM algorithms. The method solves the dual of the regularized convex loss-minimization problem by coordinate ascent over the dual variables. It supports three types of loss functions: log loss, hinge loss, and smoothed hinge loss.
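For intuition, the three losses can be written directly. These are the standard textbook forms for a label y in {-1, +1} and a raw model score s, not code taken from the rxFastLinear source:

import numpy as np

def log_loss(y, s):
    return np.log(1.0 + np.exp(-y * s))

def hinge_loss(y, s):
    return np.maximum(0.0, 1.0 - y * s)

def smoothed_hinge_loss(y, s, gamma=1.0):
    # Quadratically smoothed hinge: differentiable everywhere.
    z = y * s
    return np.where(z >= 1.0, 0.0,
           np.where(z <= 1.0 - gamma,
                    1.0 - z - gamma / 2.0,
                    (1.0 - z) ** 2 / (2.0 * gamma)))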

An application of the rxFastLinear algorithm that encapsulates logistic regression and Support Vector Machines for payment default prediction would leverage individual-oriented scoring instead of broad segment-based scoring of transactions. Default detection rates can be boosted, and false positives can be reduced, using real-time behavioral profiling as well as historical profiling. Big data, commodity hardware, and historical data going as far back as three years help with accuracy. This enables a payment default to be detected almost as early as it occurs. True real-time processing implies stringent response times.

 

The algorithm for the least-squares regression can be written as:

1. Set the initial approximation.

2. For a set of successive increments, or boosts, each based on the preceding iterations, do steps 3 through 5:

3. Calculate the new residuals.

4. Find the line of search by aggregating and minimizing the residuals.

5. Perform the boost along the line of search.

 

Conjugate gradient descent can be described, given an input matrix A, a vector b, a starting value x, a maximum number of iterations i-max, and an error tolerance epsilon < 1, in this way:

 

set i to 0
set residual to b - A.x
set search-direction to residual
set delta-new to dot-product(residual, residual)
set delta-0 to delta-new
while i < i-max and delta-new > epsilon^2 * delta-0 do:
    q = A . search-direction
    alpha = delta-new / dot-product(search-direction, q)
    x = x + alpha * search-direction
    if i is divisible by 50:
        residual = b - A.x      (recompute exactly to correct accumulated round-off)
    else:
        residual = residual - alpha * q
    delta-old = delta-new
    delta-new = dot-product(residual, residual)
    beta = delta-new / delta-old
    search-direction = residual + beta * search-direction
    i = i + 1
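In practice a library routine can replace the hand-written loop; SciPy's conjugate gradient solver, for example, handles the same symmetric positive definite systems:

import numpy as np
from scipy.sparse.linalg import cg

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x, info = cg(A, b)   # info == 0 indicates successful convergence
print(x)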

 

 

Sample application:  

#!/usr/bin/env python
import pandas
from microsoftml import rx_fast_linear, rx_predict

# 'data' must hold the binary label column 'clas' and features 'x' and 'y';
# this tiny frame is only a stand-in so the script runs end to end.
data = pandas.DataFrame({"clas": [0, 1, 0, 1], "x": [0.1, 0.9, 0.2, 0.8],
                         "y": [0.3, 0.7, 0.1, 0.9]})
model = rx_fast_linear("clas ~ x + y", data=data)
pred = rx_predict(model, data, extra_vars_to_write=["x", "y"])
print(pred.head())


#codingexercise 

Print all nodes in a binary tree that have exactly k leaves in their subtree.


// Returns the number of leaves in the subtree rooted at 'root', and
// collects every node whose subtree contains exactly k leaves.
int GetNodeWithKLeaves(Node root, int k, ref List<Node> result)
{
    if (root == null) return 0;
    if (root.left == null && root.right == null) return 1; // a leaf counts itself
    int left = GetNodeWithKLeaves(root.left, k, ref result);
    int right = GetNodeWithKLeaves(root.right, k, ref result);
    if (left + right == k)
    {
        result.Add(root);
    }
    return left + right;
}