Cluster computing

Friday, September 1, 2017

We continue reading "Modern data Fraud Prevention at Big Data Scale". Feedzai enables companies to move from broad segment based scoring of transactions to individual oriented scoring with machine learning based techniques. Feedzai claims to use a new technology on a new platform. They claim to have highest fraud detection rates with lowest false positives. Feedzai uses real-time behavioral profiling as well as historical profiling that has been proven to detect 61% more fraud. They have true real time processing. They say they have true machine learning capabilities. Feedzai relies on Big Data and therefore runs on commodity hardware. The historical data goes as far back as three years. In addition, Feedzai processes realtime data in 25 milli seconds against vast amounts of data at 99th percentile. This enables fraud to be detected almost as early as when it is committed.
The Machine learning algorithms used include Random Forests and Support Vector machines. The former is helpful because it can be treated as an ensemble of decision trees which brings more robustness to meet the different kinds of transactions subjected to fraud detection. In addition, they handle noise and outliers better. Microsoft's R-package sets the standard for these types of algorithms.
The rxFastForest in MicrosoftML is a fast forest algorithm also used for binary classification or regression. It can be used for churn prediction. It builds several decision trees built using the regression tree learner in rxFastTrees. An aggregation over the resulting trees then finds a Gaussian distribution closest to the combined distribution for all trees in the model This helps to generalize fraud detection patterns well and is fast and easy to train and score.
Support Vector machines on the other hand are able to detect non-linear and complex patterns with good predictive power. These are sophisticated classification machines. These build a predictive model by finding the dividing line between two categories. In other words, the data is most distant to these lines and one of them is usually chosen as the best. The points that are closest to the line are the ones that determine the line and are called support vectors. Once the line is found, classifying is just a preference for putting the data in the right category.
#codingexercise
QuickSort partition
Partition(A, p, r)
x = A[r]
i = p - 1
for j = p to r-1
if A[j] <= x
i = i + 1
exchange A[i] with A[j]
exchange A[i+1] with A[r]
return i + 1

Cluster computing

Friday, September 1, 2017

No comments:

Post a Comment