Cluster computing

Sunday, September 3, 2017

#codingexercise
Find the next smaller element for every element in an integer array
int[] GetNextSmallerElements(List<int> A)
{
    var result = new int[A.Count];
    for (int i = 0; i < A.Count; i++)
    {
        // -1 means there is no smaller element to the right of A[i]
        int next = -1;
        for (int j = i + 1; j < A.Count; j++)
        {
            if (A[j] < A[i])
            {
                next = A[j];
                break;
            }
        }
        result[i] = next;
    }
    return result;
}
We could also do this with the help of a stack that holds the elements that do not yet have a next smaller element.
We push the first element onto the stack. We then pick the next item in the array. While it is smaller than the element on top of the stack, we print the tuple and pop that element. Once the top of the stack is no longer larger, or the stack is empty, we push the new item so that it can participate in matches going forward. The unmatched elements on the stack never decrease from bottom to top, so if the top is not larger than the new item, nothing below it is either. Each element is pushed once and popped at most once, so this runs in O(N) time rather than the O(N^2) of the nested loops: instead of looking ahead through all the elements, we look back only at the collection of unmatched elements so far. In the worst case, the stack grows to the length of the array. The stack is popped in the reverse of the order in which the array was covered, and whatever remains on it at the end has no next smaller element.
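The stack approach above can be sketched in Python as follows. This is a minimal sketch, not the author's code: the function name `next_smaller_elements` and the choice to keep indices (rather than values) on the stack are assumptions made so that each answer can be written to the right position.

```python
def next_smaller_elements(values):
    """Return, for each element, the next smaller element to its right, or -1."""
    result = [-1] * len(values)
    pending = []  # indices of elements still waiting for a smaller element
    for i, current in enumerate(values):
        # Every pending element larger than the current one has found its answer.
        while pending and values[pending[-1]] > current:
            result[pending.pop()] = current
        pending.append(i)
    # Indices left in pending keep the default of -1.
    return result


print(next_smaller_elements([4, 8, 5, 2, 25]))  # [2, 5, 2, -1, -1]
```

Each index is pushed once and popped at most once, which is where the linear running time comes from.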

Saturday, September 2, 2017

We continue reading "Modern data Fraud Prevention at Big Data Scale". Feedzai enables companies to move from broad, segment-based scoring of transactions to individual-oriented scoring with machine-learning-based techniques. Feedzai claims to use a new technology on a new platform, with the highest fraud detection rates and the lowest false positives. It uses real-time behavioral profiling as well as historical profiling, a combination that it says has been proven to detect 61% more fraud. It also claims true real-time processing and true machine learning capabilities. Feedzai relies on Big Data and therefore runs on commodity hardware. The historical data goes as far back as three years. In addition, Feedzai processes real-time data against vast amounts of data in 25 milliseconds at the 99th percentile. This enables fraud to be detected almost as soon as it is committed.
The machine learning algorithms used include Random Forests and Support Vector Machines. The former is an ensemble of decision trees, which brings more robustness across the different kinds of transactions subjected to fraud detection. The latter is helpful because it can form more sophisticated models.

#codingexercise
Find the next greater element for every element in an integer array
int[] GetNextGreaterElements(List<int> A)
{
    var result = new int[A.Count];
    for (int i = 0; i < A.Count; i++)
    {
        // -1 means there is no greater element to the right of A[i]
        int next = -1;
        for (int j = i + 1; j < A.Count; j++)
        {
            if (A[j] > A[i])
            {
                next = A[j];
                break;
            }
        }
        result[i] = next;
    }
    return result;
}
We could also do this with the help of a stack that holds the elements that do not yet have a next greater element.
We push the first element onto the stack. We then pick the next item in the array. While it is greater than the element on top of the stack, we print the tuple and pop that element. Once the top of the stack is no longer smaller, or the stack is empty, we push the new item so that it can participate in matches going forward. The unmatched elements on the stack never increase from bottom to top, so if the top is not smaller than the new item, nothing below it is either. Each element is pushed once and popped at most once, so this runs in O(N) time rather than the O(N^2) of the nested loops: instead of looking ahead through all the elements, we look back only at the collection of unmatched elements so far. In the worst case, the stack grows to the length of the array. The stack is popped in the reverse of the order in which the array was covered, and whatever remains on it at the end has no next greater element.
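As a variation on the stack idea, the sketch below scans the array from right to left instead of left to right, so the stack holds candidate answers rather than unmatched elements. This is an alternative formulation chosen for illustration, not the procedure described above; the function name `next_greater_elements` is hypothetical.

```python
def next_greater_elements(values):
    """Return, for each element, the next greater element to its right, or -1."""
    result = [-1] * len(values)
    candidates = []  # values to the right of i, nearest one on top
    for i in range(len(values) - 1, -1, -1):
        # Candidates that are not greater can never be the answer for values[i],
        # nor for anything further left, because values[i] shadows them.
        while candidates and candidates[-1] <= values[i]:
            candidates.pop()
        if candidates:
            result[i] = candidates[-1]
        candidates.append(values[i])
    return result


print(next_greater_elements([4, 5, 2, 25]))   # [5, 25, 25, -1]
print(next_greater_elements([13, 7, 6, 12]))  # [-1, 12, 12, -1]
```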

Friday, September 1, 2017

We continue reading "Modern data Fraud Prevention at Big Data Scale".
The machine learning algorithms used include Random Forests and Support Vector Machines. The former is an ensemble of decision trees, which brings more robustness across the different kinds of transactions subjected to fraud detection. In addition, random forests handle noise and outliers better than a single tree. Microsoft's MicrosoftML R package provides implementations of these types of algorithms.
The rxFastForest in MicrosoftML is a fast forest algorithm used for binary classification or regression. It can be used for churn prediction. It builds several decision trees using the regression tree learner in rxFastTrees. An aggregation over the resulting trees then finds a Gaussian distribution closest to the combined distribution for all trees in the model. This helps to generalize fraud detection patterns well and is fast and easy to train and score.
Support Vector Machines, on the other hand, are able to detect non-linear and complex patterns with good predictive power. They are sophisticated classifiers that build a predictive model by finding the dividing line between two categories. Many lines may separate the data, and the one chosen as the best is usually the one that lies farthest from the data on both sides. The points that are closest to the line are the ones that determine it, and they are called support vectors. Once the line is found, classifying a new point is a matter of checking which side of the line it falls on.
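To make the geometry concrete, here is a minimal sketch that assumes the dividing line is already known. It does not train a Support Vector Machine; it only shows how the closest points and the side of the line are computed. The data, the line x = 3, and all function names are hypothetical.

```python
import math


def signed_distance(point, weights, bias):
    """Signed distance from a point to the line weights . x + bias = 0."""
    norm = math.hypot(*weights)
    return (sum(w * x for w, x in zip(weights, point)) + bias) / norm


def classify(point, weights, bias):
    """Return +1 or -1 depending on which side of the line the point falls."""
    return 1 if signed_distance(point, weights, bias) >= 0 else -1


def support_vectors(points, weights, bias):
    """Return the margin and the points that lie closest to the line."""
    distances = [abs(signed_distance(p, weights, bias)) for p in points]
    margin = min(distances)
    closest = [p for p, d in zip(points, distances) if math.isclose(d, margin)]
    return margin, closest


# Hypothetical, hand-picked data: two groups separated by the vertical line x = 3.
points = [(1, 1), (2, 2), (0, 3), (4, 1), (5, 3), (6, 2)]
weights, bias = (1.0, 0.0), -3.0

margin, closest = support_vectors(points, weights, bias)
print(margin, closest)                      # 1.0 [(2, 2), (4, 1)]
print(classify((2.5, 0.0), weights, bias))  # -1
print(classify((3.5, 9.0), weights, bias))  # 1
```

A real SVM would search for the weights and bias that maximize this margin; here they are simply given.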
#codingexercise
QuickSort partition
Partition(A, p, r)
    x = A[r]
    i = p - 1
    for j = p to r - 1
        if A[j] <= x
            i = i + 1
            exchange A[i] with A[j]
    exchange A[i + 1] with A[r]
    return i + 1
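The partition pseudocode above translates almost line for line into Python. This is a sketch that assumes zero-based indices (the pseudocode does not say which base it uses); the `quicksort` driver is added only to show how the partition is used.

```python
def partition(a, p, r):
    """Partition a[p..r] around the pivot a[r] and return the pivot's final index."""
    pivot = a[r]
    i = p - 1  # end of the region holding elements <= pivot
    for j in range(p, r):
        if a[j] <= pivot:
            i += 1
            a[i], a[j] = a[j], a[i]
    a[i + 1], a[r] = a[r], a[i + 1]
    return i + 1


def quicksort(a, p=0, r=None):
    """Sort a[p..r] in place by recursing on both sides of the pivot."""
    if r is None:
        r = len(a) - 1
    if p < r:
        q = partition(a, p, r)
        quicksort(a, p, q - 1)
        quicksort(a, q + 1, r)


data = [2, 8, 7, 1, 3, 5, 6, 4]
print(partition(data, 0, len(data) - 1), data)  # 3 [2, 1, 3, 4, 7, 5, 6, 8]
```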

Thursday, August 31, 2017

We mentioned Feedzai's machine learning capabilities. These include:
In-memory event stream processing, which enables fast responses
Use of NoSQL on commodity servers, which enables it to scale
Continuous learning from the accruing transactions as history builds
Detection of anomalies, no matter how far outside the norm they fall
Reduction in the time to process transactions
Reduction in the overall cost of transaction processing
The challenge that comes with fraud detection is that fraud often mimics genuine customer behavior, so the two are hard to tell apart. The classifiers used by Feedzai have very low false positives. Rules crafted manually over the years had not yielded as low a level of false positives as these algorithms do. Consequently, it is the size and computation that distinguish Feedzai from its competitors.
#codingexercise
Describe merge sort
Merge-Sort(A, p, r)
    if p < r
        q = floor((p + r) / 2)
        Merge-Sort(A, p, q)
        Merge-Sort(A, q + 1, r)
        Merge(A, p, q, r)

Merge(A, p, q, r)
    // Initialize L and R with the left and right partitions of A at boundary q,
    // each with one extra element at the end holding the max integer value (a sentinel)
    i = 1
    j = 1
    for k = p to r
        if L[i] <= R[j]
            A[k] = L[i]
            i = i + 1
        else
            A[k] = R[j]
            j = j + 1
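The merge sort pseudocode above can be sketched in Python as follows. The sketch assumes zero-based indices and uses `math.inf` as the sentinel in place of the max integer value, so it assumes the input holds finite numbers.

```python
import math


def merge(a, p, q, r):
    """Merge the sorted runs a[p..q] and a[q+1..r] back into a[p..r]."""
    left = a[p:q + 1] + [math.inf]       # sentinel ends the left run
    right = a[q + 1:r + 1] + [math.inf]  # sentinel ends the right run
    i = j = 0
    for k in range(p, r + 1):
        if left[i] <= right[j]:
            a[k] = left[i]
            i += 1
        else:
            a[k] = right[j]
            j += 1


def merge_sort(a, p=0, r=None):
    """Sort a[p..r] in place."""
    if r is None:
        r = len(a) - 1
    if p < r:
        q = (p + r) // 2
        merge_sort(a, p, q)
        merge_sort(a, q + 1, r)
        merge(a, p, q, r)


data = [5, 2, 4, 7, 1, 3, 2, 6]
merge_sort(data)
print(data)  # [1, 2, 2, 3, 4, 5, 6, 7]
```

The sentinels remove the need to check whether either run is exhausted inside the loop.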

Wednesday, August 30, 2017

We continue reading "Modern data Fraud Prevention at Big Data Scale".
Feedzai has three primary deployment steps:
1) It evaluates data sets and models
2) It evaluates data sources
3) It connects to case management systems

If we compare Feedzai with Splunk, which has its own connectors, machine learning abilities, and use of Big Data, commodity machines, and clusters for analytics on machine data in a time series database, the primary difference seems to be the customer orientation of the data and analytics. That said, Splunk has immense power in the way it handles machine data. It can collect and tag data from a variety of sources, and it can enable a wide variety of alerts on the data. Machine learning tools are also available, but the logic for fraud detection may need to be customized, whereas Feedzai specializes in fraud detection.

#codingexercise
Find the weighted mean of elements with duplicates in a contiguous sorted sequence
Solution:
1. For each element in the contiguous sequence, insert the element and its count of repetitions in a dictionary.
2. For each key-value pair in the dictionary, add the element times its count to one sum and add the count to another.
3. Divide the first sum by the second to get the weighted mean.
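The steps above can be sketched in Python with `collections.Counter` serving as the dictionary of counts. The function name `weighted_mean` and the decision to raise an error on an empty sequence are assumptions of this sketch.

```python
from collections import Counter


def weighted_mean(sequence):
    """Mean of the distinct elements, each weighted by its number of repetitions."""
    counts = Counter(sequence)  # element -> number of repetitions
    weighted_sum = sum(value * count for value, count in counts.items())
    total_count = sum(counts.values())
    if total_count == 0:
        raise ValueError("cannot take the mean of an empty sequence")
    return weighted_sum / total_count


print(weighted_mean([1, 2, 2, 3, 3, 3]))  # (1*1 + 2*2 + 3*3) / 6
```

Because the sequence is sorted, the counts could also be gathered in a single pass over runs of equal elements without a dictionary.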

#As we read about fraud detection, I'm going to see if delegated identity can help alleviate fraud: https://1drv.ms/w/s!Ashlm-Nw-wnWsE3BHcaes2F7Lsoi

Tuesday, August 29, 2017

We continue reading "Modern data Fraud Prevention at Big Data Scale". The monitoring and alerting components of Feedzai can work independently from its in-flight transactions, so for those purposes Feedzai operates in a non-intrusive manner. It is also deployed quickly as an appliance that can be trained and activated.
Feedzai involves an in-memory analytics engine that can compute multi-dimensional fraud scores based on 250,000 KPIs every second. This provides a new industry standard for real-time fraud protection. It is also useful for augmenting machine learning capabilities. For example, the individual transactions being scored are also used to train the models. Moreover, the scoring and flagging are intuitive, which helps comprehension and reduces manual intervention.
The ability to process 100,000 events per second enables Feedzai to detect risk and fraud patterns that would otherwise have gone undetected. The actions taken by Feedzai are configurable, from merely reporting to blocking. As such, it is a non-intrusive system. Approximately ninety percent of all Feedzai customers connect the solution to message queuing, but it comes with a variety of connectors that can take the feed from other sources. As opposed to a rules-based engine, where the deployment and refinement of rules may take time, Feedzai can install its analytics engine and connectors within a day.
If we compare Feedzai with Splunk, which has its own connectors, machine learning abilities, and use of Big Data, commodity machines, and clusters for analytics on machine data in a time series database, the primary difference seems to be the customer orientation of the data and analytics.

#codingexercise
We discussed an exercise yesterday involving topological sort. Let's revisit it:

Topological-Sort-DFS(V, E)
    for each vertex v in V
        v.color = white
        v.d = nil
        v.f = nil
    time = 0
    sorted = empty list
    for each vertex v in V
        if v.color == white
            DFS-VISIT(V, E, v)
    return sorted

DFS-VISIT(V, E, u)
    time = time + 1
    u.d = time
    u.color = gray
    for each vertex v adjacent to u
        if v.color == white
            DFS-VISIT(V, E, v)
        else if v.color == gray
            // v was discovered before u and has not finished, so (u, v) is a back edge
            throw back edge exception
    u.color = black
    time = time + 1
    u.f = time
    insert u at the front of sorted
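A minimal Python sketch of the DFS-based topological sort follows. It assumes the graph is given as a list of vertices and a list of directed `(u, v)` edges, and it reports a back edge by raising `ValueError`; both choices, along with the function name, are assumptions of this sketch. The discovery and finish times are omitted because the vertex colors are enough to detect a back edge.

```python
def topological_sort(vertices, edges):
    """Return the vertices in an order where every edge (u, v) has u before v."""
    adjacency = {v: [] for v in vertices}
    for u, v in edges:
        adjacency[u].append(v)

    WHITE, GRAY, BLACK = "white", "gray", "black"
    color = {v: WHITE for v in vertices}
    finished = []  # vertices in order of finishing time

    def visit(u):
        color[u] = GRAY
        for v in adjacency[u]:
            if color[v] == WHITE:
                visit(v)
            elif color[v] == GRAY:
                # v is still being explored, so (u, v) closes a cycle.
                raise ValueError(f"back edge {u!r} -> {v!r}: graph has a cycle")
        color[u] = BLACK
        finished.append(u)

    for v in vertices:
        if color[v] == WHITE:
            visit(v)
    # Reversing the finish order is the same as inserting each vertex at the front.
    return finished[::-1]


print(topological_sort([1, 2, 3, 4], [(1, 2), (1, 3), (2, 4), (3, 4)]))  # [1, 3, 2, 4]
```

The recursion depth grows with the longest path, so very deep graphs would need an explicit stack instead.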


Monday, August 28, 2017

We continue reading "Modern data Fraud Prevention at Big Data Scale". Feedzai enables companies to move from broad, segment-based scoring of transactions to individual-oriented scoring with machine-learning-based techniques. Traditional approaches such as those based on SAS suffer from two limitations. First, they have become old and difficult to maintain. Second, they are inflexible and unable to keep up with dynamic requirements. Feedzai claims to use a new technology on a new platform, with the highest fraud detection rates and the lowest false positives. It claims true real-time processing and true machine learning capabilities. It runs on commodity hardware, is non-intrusive, and is easily deployed.
The difference comes from the approach taken by traditional techniques versus Feedzai. The earlier models scored transactions based on global perspectives. Feedzai uses real-time behavioral profiling as well as historical profiling, a combination that it says has been proven to detect 61% more fraud. It also reduced false alarms significantly. The historical data goes as far back as three years. In addition, Feedzai processes real-time data against vast amounts of data in 25 milliseconds at the 99th percentile. This enables fraud to be detected almost as soon as it is committed.
The machine learning techniques are not enumerated here, so we will review them later. The takeaway is that Feedzai relies on Big Data and therefore runs on commodity hardware.
Moreover, the monitoring and alerting components of Feedzai can work independently from its in-flight transactions, so for those purposes Feedzai operates in a non-intrusive manner. It is also deployed quickly as an appliance that can be trained and activated.

#codingexercise
Given a sorted dictionary of an alien language, find the order of characters
The solution includes the following:
1) Create a graph G with one vertex for each distinct character in the alien language.
2) For every pair of adjacent words in the dictionary, find the first mismatching character pair and draw a directed edge in G from the character in the earlier word to the character in the later word.
3) Do a topological sort of the graph and print the characters in the order encountered.
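The three steps above can be sketched in Python as follows. For the topological sort this sketch uses Kahn's algorithm (repeatedly removing characters with no incoming edges), which is one of several valid ways to do step 3. The function name `alien_order` and the sample word list are hypothetical, and the sketch assumes no word is followed by one of its own proper prefixes.

```python
from collections import deque


def alien_order(words):
    """Derive a character order consistent with a sorted list of alien words."""
    # Step 1: one vertex per distinct character.
    successors = {ch: set() for word in words for ch in word}
    indegree = dict.fromkeys(successors, 0)

    # Step 2: the first mismatch between adjacent words gives one directed edge.
    for first, second in zip(words, words[1:]):
        for a, b in zip(first, second):
            if a != b:
                if b not in successors[a]:
                    successors[a].add(b)
                    indegree[b] += 1
                break

    # Step 3: topological sort with Kahn's algorithm.
    ready = deque(ch for ch, degree in indegree.items() if degree == 0)
    order = []
    while ready:
        ch = ready.popleft()
        order.append(ch)
        for nxt in successors[ch]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    if len(order) != len(successors):
        raise ValueError("the word list implies a cycle, so no order exists")
    return "".join(order)


print(alien_order(["baa", "abcd", "abca", "cab", "cad"]))  # bdac
```

When the dictionary does not pin down a unique order, any order this returns is only one of the consistent answers.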

#As we read about fraud detection, I'm going to see if federated identity can help alleviate fraud: https://1drv.ms/w/s!Ashlm-Nw-wnWsE3BHcaes2F7Lsoi