Cluster computing

Friday, September 1, 2017

We continue reading "Modern data Fraud Prevention at Big Data Scale". Feedzai enables companies to move from broad, segment-based scoring of transactions to individual-oriented scoring with machine-learning-based techniques. Feedzai claims to use a new technology on a new platform, and to have the highest fraud detection rates with the lowest false positives. It uses real-time behavioral profiling as well as historical profiling, which is reported to detect 61% more fraud. They say they have true real-time processing and true machine learning capabilities. Feedzai relies on Big Data and therefore runs on commodity hardware. The historical data goes as far back as three years. In addition, Feedzai processes real-time data against vast amounts of data in 25 milliseconds at the 99th percentile. This enables fraud to be detected almost as soon as it is committed.
The machine learning algorithms used include Random Forests and Support Vector Machines. The former is helpful because it is an ensemble of decision trees, which brings more robustness across the different kinds of transactions subjected to fraud detection. In addition, Random Forests handle noise and outliers better. Microsoft's MicrosoftML R package provides implementations of algorithms of this type.
rxFastForest in MicrosoftML is a fast forest algorithm used for binary classification or regression; it can be used, for example, for churn prediction. It builds several decision trees using the regression tree learner in rxFastTrees. An aggregation over the resulting trees then finds a Gaussian distribution closest to the combined distribution for all trees in the model. This helps to generalize fraud detection patterns well, and the model is fast and easy to train and score.
Support Vector Machines, on the other hand, are able to detect non-linear and complex patterns with good predictive power. These are sophisticated classifiers. They build a predictive model by finding the dividing line between two categories. Many lines may separate the data, and the one chosen as the best is the one that lies farthest from the nearest points of either category. The points that are closest to the line are the ones that determine it and are called support vectors. Once the line is found, classifying a new point is just a matter of determining which side of the line it falls on.
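To make the last step concrete, here is a minimal sketch of classifying a point once a dividing line is already known. The weights and bias below are made-up values for illustration, not anything taken from Feedzai or MicrosoftML, and the training step that finds the line is not shown.

```python
def classify(weights, bias, point):
    """Return +1 or -1 depending on which side of the line w.x + b = 0 the point falls."""
    score = sum(w * x for w, x in zip(weights, point)) + bias
    return 1 if score >= 0 else -1


def distance_to_line(weights, bias, point):
    """Perpendicular distance from the point to the line w.x + b = 0."""
    norm = sum(w * w for w in weights) ** 0.5
    score = sum(w * x for w, x in zip(weights, point)) + bias
    return abs(score) / norm


# Hypothetical line x - y = 0 in two dimensions (assumed, not learned from data).
weights, bias = (1.0, -1.0), 0.0
print(classify(weights, bias, (3.0, 1.0)))   # point below the line
print(classify(weights, bias, (1.0, 3.0)))   # point above the line
```

The support vectors would be the training points with the smallest distance_to_line value on each side.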
#codingexercise
QuickSort partition
Partition(A, p, r)
    x = A[r]
    i = p - 1
    for j = p to r-1
        if A[j] <= x
            i = i + 1
            exchange A[i] with A[j]
    exchange A[i+1] with A[r]
    return i + 1
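The partition step above can be sketched in Python as follows. The indices are zero-based here, and the quicksort driver is an addition for illustration that simply recurses on either side of the returned pivot position.

```python
def partition(a, p, r):
    """Partition a[p..r] around the pivot a[r] and return the pivot's final index."""
    x = a[r]
    i = p - 1
    for j in range(p, r):
        if a[j] <= x:
            i += 1
            a[i], a[j] = a[j], a[i]
    a[i + 1], a[r] = a[r], a[i + 1]
    return i + 1


def quicksort(a, p=0, r=None):
    """Sort the list a in place by repeated partitioning."""
    if r is None:
        r = len(a) - 1
    if p < r:
        q = partition(a, p, r)
        quicksort(a, p, q - 1)
        quicksort(a, q + 1, r)


data = [2, 8, 7, 1, 3, 5, 6, 4]
quicksort(data)
print(data)
```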

Thursday, August 31, 2017

We mentioned Feedzai's machine learning capabilities. These include:
In-memory event stream processing, which enables fast response
Use of NoSQL on commodity servers, which enables it to scale
Continuous learning, as history builds and the accruing transactions are used to learn
Detection of anomalies, however extreme the outliers may be
Reduction in the time to process transactions
Reduction in the overall cost of transaction processing
The challenge that comes with fraud detection is that fraud often mimics genuine customer behavior, so the two are hard to tell apart. The classifiers used by Feedzai have very low false positives. The rules learned manually over the years had not yielded such a low level of false positives as these algorithms do. Consequently, it is the size and computation that distinguish Feedzai from its competitors.
#codingexercise
describe merge-sort
Merge-Sort(A, p, r)
    if (p < r)
        then q <- floor((p + r) / 2)
             Merge-Sort(A, p, q)
             Merge-Sort(A, q+1, r)
             Merge(A, p, q, r)

MERGE(A, p, q, r)
    // Initialize L and R arrays with left and right partitions of A at boundary q
    // and to have one more element at the end to have a max integer value
    i = 1
    j = 1
    for k = p to r
        if L[i] <= R[j]
            A[k] = L[i]
            i = i + 1
        else
            A[k] = R[j]
            j = j + 1
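A runnable sketch of the same procedure in Python, with zero-based indices. It uses math.inf as the sentinel in place of the max integer value, which assumes the values being sorted are numbers smaller than infinity.

```python
import math


def merge(a, p, q, r):
    """Merge the sorted runs a[p..q] and a[q+1..r] back into a[p..r]."""
    left = a[p:q + 1] + [math.inf]        # sentinel marks the end of the left run
    right = a[q + 1:r + 1] + [math.inf]   # sentinel marks the end of the right run
    i = j = 0
    for k in range(p, r + 1):
        if left[i] <= right[j]:
            a[k] = left[i]
            i += 1
        else:
            a[k] = right[j]
            j += 1


def merge_sort(a, p=0, r=None):
    """Sort the list a in place."""
    if r is None:
        r = len(a) - 1
    if p < r:
        q = (p + r) // 2
        merge_sort(a, p, q)
        merge_sort(a, q + 1, r)
        merge(a, p, q, r)


data = [5, 2, 4, 7, 1, 3, 2, 6]
merge_sort(data)
print(data)
```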

Wednesday, August 30, 2017

We continue reading "Modern data Fraud Prevention at Big Data Scale".
Feedzai has three primary deployment steps:
1) It evaluates data sets and models
2) It evaluates data sources
3) It connects to case management systems

If we compare Feedzai with Splunk, which has its own connectors, machine learning abilities, and use of Big Data, commodity machines and clusters for analytics on machine data in a time series database, the primary difference seems to be the customer orientation of the data and analytics. That said, Splunk has immense power in the way it handles machine data. It can collect and tag data from a variety of sources, and it can enable a wide variety of alerts on the data. Machine learning tools are available too, but the logic for fraud detection may need to be customized. Feedzai specializes in fraud detection.

#codingexercise
Find the weighted mean of elements with duplicates in a contiguous sorted sequence
Solution:
1. For each element in the contiguous sequence
2. Insert the element and its count of repetitions in a dictionary
3. For each key-value pair in the dictionary
   add the value of the element times its count to one sum
   add the count to a second sum
4. Divide the first sum by the second for the weighted mean.
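The steps above can be sketched in Python as follows, using collections.Counter as the dictionary of repetition counts. The function name is chosen here for illustration.

```python
from collections import Counter


def weighted_mean(sorted_values):
    """Mean of the distinct elements, each weighted by its number of repetitions."""
    counts = Counter(sorted_values)          # element -> count of repetitions
    weighted_sum = 0
    count_sum = 0
    for element, count in counts.items():
        weighted_sum += element * count
        count_sum += count
    if count_sum == 0:
        raise ValueError("cannot take the mean of an empty sequence")
    return weighted_sum / count_sum


print(weighted_mean([1, 1, 2, 3, 3, 3]))
```

Because each weight is the repetition count, the result equals the ordinary mean of the full sequence.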

#As we read about fraud detection, I'm going to see if delegated identity can help alleviate fraud detection: https://1drv.ms/w/s!Ashlm-Nw-wnWsE3BHcaes2F7Lsoi

Tuesday, August 29, 2017

We continue reading "Modern data Fraud Prevention at Big Data Scale". Feedzai enables companies to move from broad, segment-based scoring of transactions to individual-oriented scoring with machine-learning-based techniques. Feedzai claims to use a new technology on a new platform, and to have the highest fraud detection rates with the lowest false positives. It uses real-time behavioral profiling as well as historical profiling, which is reported to detect 61% more fraud. They say they have true real-time processing and true machine learning capabilities. Feedzai relies on Big Data and therefore runs on commodity hardware. The historical data goes as far back as three years. In addition, Feedzai processes real-time data against vast amounts of data in 25 milliseconds at the 99th percentile. This enables fraud to be detected almost as soon as it is committed. Moreover, the monitoring and alerting components of Feedzai can work independently of its in-flight transactions, so for those purposes Feedzai works in a non-intrusive manner. It is also deployed quickly as an appliance that can be trained and activated.
Feedzai involves an in-memory analytics engine which can compute multi-dimensional fraud scores based on 250,000 KPIs every second. This provides a new industry standard for real-time fraud protection. It also helps to augment the machine learning capabilities. For example, the individual transactions being scored are also used to train the models. Moreover, the scoring and flagging are intuitive, which helps comprehension and reduces manual intervention.
The ability to process 100,000 events per second enables them to detect risk and fraud patterns that would otherwise have gone undetected. The actions taken by Feedzai are configurable, from merely reporting to blocking. As such, it is a non-intrusive system. Approximately ninety percent of all Feedzai customers connect the solution to message queuing, but it comes with a variety of connectors that can take the feed from other sources. As opposed to a rules-based engine, where the deployment and refinement of rules may take time, Feedzai can install its analytic engine and connectors within a day.
If we compare Feedzai with Splunk, which has its own connectors, machine learning abilities, and use of Big Data, commodity machines and clusters for analytics on machine data in a time series database, the primary difference seems to be the customer orientation of the data and analytics.

#codingexercise
We discussed an exercise yesterday involving topological sort. Let's revisit it:

Topological-Sort-DFS(V, E)
    for each vertex v in V
        v.color = white
        v.d = nil
    time = 0
    for each vertex v in V
        if v.color == white
            DFS-VISIT(V, E, v)

DFS-VISIT(V, E, u)
    time = time + 1
    u.d = time
    u.color = gray
    for each vertex v adjacent to u
        if v.color == white
            DFS-VISIT(V, E, v)
        else if v.color == gray
            // v is still being explored, so v.d <= u.d < u.f <= v.f will hold: (u, v) is a back edge
            throw back edge exception
    u.color = black
    time = time + 1
    u.f = time
    // listing the vertices in decreasing order of u.f gives the topological order
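A runnable sketch of the DFS-based topological sort in Python. Instead of storing discovery and finish times, it appends each vertex to a list when it finishes and reverses the list at the end, which yields the same decreasing-finish-time order. The vertex and edge representation (a list of names and a list of pairs) is an assumption made for this example.

```python
WHITE, GRAY, BLACK = 0, 1, 2


def topological_sort(vertices, edges):
    """Return the vertices in topological order; raise ValueError on a cycle."""
    adjacency = {v: [] for v in vertices}
    for u, v in edges:
        adjacency[u].append(v)
    color = {v: WHITE for v in vertices}
    finished = []                      # vertices in increasing order of finish time

    def visit(u):
        color[u] = GRAY
        for v in adjacency[u]:
            if color[v] == WHITE:
                visit(v)
            elif color[v] == GRAY:     # v is an ancestor still being explored
                raise ValueError("back edge %r -> %r: graph has a cycle" % (u, v))
        color[u] = BLACK
        finished.append(u)

    for v in vertices:
        if color[v] == WHITE:
            visit(v)
    finished.reverse()
    return finished


print(topological_sort(["a", "b", "c", "d"], [("a", "b"), ("b", "c"), ("a", "d")]))
```

The recursion depth grows with the longest path, so a very deep graph would need an explicit stack instead.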

#As we read about fraud detection, I'm going to see if delegated identity can help alleviate fraud detection: https://1drv.ms/w/s!Ashlm-Nw-wnWsE3BHcaes2F7Lsoi

Monday, August 28, 2017

We continue reading "Modern data Fraud Prevention at Big Data Scale". Feedzai enables companies to move from broad, segment-based scoring of transactions to individual-oriented scoring with machine-learning-based techniques. Traditional approaches, such as those based on SAS, suffer from two limitations. First, they have become old and difficult to maintain. Second, they are inflexible and unable to keep up with dynamic requirements. Feedzai claims to use a new technology on a new platform. They claim to have the highest fraud detection rates with the lowest false positives. They have true real-time processing. They have true machine learning capabilities. They run on commodity hardware. They are non-intrusive and they are easily deployed.
The difference comes from the approach taken by traditional versus Feedzai techniques. The earlier models scored transactions based on global perspectives. Feedzai uses real-time behavioral profiling as well as historical profiling, which is reported to detect 61% more fraud. It also reduced false alarms significantly. The historical data goes as far back as three years. In addition, Feedzai processes real-time data against vast amounts of data in 25 milliseconds at the 99th percentile. This enables fraud to be detected almost as soon as it is committed.
The machine learning techniques are not enumerated at this point, so we will defer reviewing them until later. The takeaway is that Feedzai relies on Big Data and therefore runs on commodity hardware.
Moreover, the monitoring and alerting components of Feedzai can work independently of its in-flight transactions, so for those purposes Feedzai works in a non-intrusive manner. It is also deployed quickly as an appliance that can be trained and activated.

#codingexercise
Given a sorted dictionary of an alien language, find the order of characters
The solution includes the following:
1) Create a graph G with as many vertices as there are distinct characters in the alien language
2) For every pair of adjacent words in the sequence, find the first mismatching character pair between the words and draw a directed edge in G from the character in the first word to the character in the second
3) Do a topological sort of the graph and print the characters in the order encountered.
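The three steps above can be sketched in Python as follows. For the topological sort this sketch uses Kahn's algorithm (repeatedly removing characters with no incoming edges) rather than a DFS; either works for step 3. The function name and the handling of invalid input are assumptions made for this example.

```python
from collections import deque


def alien_order(words):
    """Derive a character order consistent with a sorted list of alien words."""
    # Step 1: one vertex per distinct character.
    graph = {ch: set() for word in words for ch in word}
    indegree = {ch: 0 for ch in graph}

    # Step 2: the first mismatch between adjacent words gives one directed edge.
    for first, second in zip(words, words[1:]):
        for a, b in zip(first, second):
            if a != b:
                if b not in graph[a]:
                    graph[a].add(b)
                    indegree[b] += 1
                break
        else:
            # No mismatch found: a longer word must not precede its own prefix.
            if len(first) > len(second):
                raise ValueError("dictionary is not sorted: %r before %r" % (first, second))

    # Step 3: topological sort (Kahn's algorithm).
    queue = deque(ch for ch in graph if indegree[ch] == 0)
    order = []
    while queue:
        ch = queue.popleft()
        order.append(ch)
        for nxt in graph[ch]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)
    if len(order) != len(graph):
        raise ValueError("contradictory ordering: the graph has a cycle")
    return "".join(order)


print(alien_order(["baa", "abcd", "abca", "cab", "cad"]))
```

When the dictionary does not constrain every pair of characters, more than one order is valid and this sketch returns just one of them.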

#As we read about fraud detection, I'm going to see if federated identity can help alleviate fraud detection: https://1drv.ms/w/s!Ashlm-Nw-wnWsE3BHcaes2F7Lsoi

Sunday, August 27, 2017

Saturday, August 26, 2017

We continue discussing ZooKeeper. It is a coordination service with elements from group messaging, shared registers and distributed lock services. It provides an interface that guarantees the wait-free property and FIFO execution of requests from each client. Write requests across all clients are also linearized.
We were discussing the throughput of ZooKeeper when the system is saturated and with various injected failures. The largest dip in throughput occurred when the leader failed. Failure of followers, on the other hand, is tolerated as long as a quorum remains, and the leader election algorithm helps the system recover when the leader fails.
The latency of requests was also measured. The requests processed per second seemed to increase with the number of workers but decrease with the number of servers. The average request latency was found to be between 1.2 ms and 1.4 ms.
We conclude with a discussion of related work as cited by the authors. They mention Chubby, which also uses a file system interface and an agreement protocol to guarantee the consistency of the replicas, but it is a lock service. Clients using ZooKeeper can choose to implement locks. Also, Chubby only allows clients to connect to the leader and not to any other server. ZooKeeper has better performance and a more relaxed consistency model.
Some systems focus on fault tolerance, such as ISIS, which transforms abstract type specifications into fault-tolerant distributed objects, thus making the fault tolerance mechanism transparent to users. Other systems like Totem guarantee the order of messages in an architecture that exploits hardware broadcasts of local area networks. ZooKeeper also implements the notion of synchronization on a virtual timeline and ordering of requests. ZooKeeper also supports a variety of network topologies.
Some systems utilize state machine replication, for example an implementation of Paxos that combines transaction logging for consensus with write-ahead logging for data recovery. Some replicated state machines are fully Byzantine tolerant. ZooKeeper is not, but it can be made so without modifying the server code. Boxwood uses Paxos to form a distributed lock service, but that is a higher-level primitive, while ZooKeeper does not restrict clients from having different primitives. Sinfonia introduced mini-transactions, a new paradigm for building scalable distributed systems. Sinfonia has been designed to store application data, but ZooKeeper stores application metadata. Moreover, ZooKeeper can add watches whereas Sinfonia cannot. Dynamo allows clients to put data in a distributed key-value store. The key space in Dynamo is not hierarchical, unlike ZooKeeper, which also provides better consistency and durability guarantees.
#codingexercise
Find the number of elements that have the same minimum number of duplicates in a contiguous sorted sequence
Solution:
1. For each element in the contiguous sequence
2. Insert the element and its count of repetitions in a dictionary
3. Find the min count from the values in the dictionary
4. For each key-value pair in the dictionary
   if the value == min count
      print the key and increment the number of matches

This can be improved to avoid the hash table. Since the sequence is sorted, equal elements are adjacent, so it is enough to retain a single key-value pair for the lowest repetition count seen so far, updated whenever a lower count is found. A counter is incremented for every element whose count matches that pair and is reset when the pair changes.
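A minimal sketch of the improved single-pass version in Python. It relies on itertools.groupby to walk the runs of equal adjacent elements, so it assumes the input really is sorted (or at least that duplicates are contiguous). The function name and the returned tuple are choices made for this example.

```python
from itertools import groupby


def count_elements_with_min_repeats(sorted_values):
    """Return (min_count, matches): the lowest repetition count and how many
    distinct elements occur exactly that many times."""
    min_count = None   # lowest repetition count seen so far
    matches = 0        # number of elements whose count equals min_count
    for _, run in groupby(sorted_values):
        count = sum(1 for _ in run)
        if min_count is None or count < min_count:
            min_count = count   # a lower count replaces the retained pair
            matches = 1         # and resets the match counter
        elif count == min_count:
            matches += 1
    return min_count, matches


print(count_elements_with_min_repeats([1, 1, 2, 3, 3, 4, 5, 5, 5]))
```

Only two integers are kept between runs, so the extra space is constant regardless of how many distinct elements appear.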
#Fraud detection service introduction: https://1drv.ms/w/s!Ashlm-Nw-wnWsEv9woJ7ynzJAPpv