Tuesday, May 15, 2018

We return to the discussion on incremental Breadth-First-Search.
In large social network graphs, updates arrive at a rate of about 86,400 operations per second. The store-and-static-compute model worked because updates were batched and graph processing was applied to static snapshots from different points in time. It worked as long as graph modifications were less frequent than the static processing. With dynamic graph processing, we need a new framework. One such framework is GraphIn, which introduces a new programming model called Incremental-Gather-Apply-Scatter.
Incremental-Gather-Apply-Scatter can be applied to sub-graphs. 
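As a rough illustration of what makes BFS amenable to incremental processing (this is a sketch of the idea, not GraphIn's code; the adjacency-list representation is an assumption), consider how a single edge insertion can be absorbed by reprocessing only the affected region:

using System;
using System.Collections.Generic;

static class IncrementalBfs
{
    // depth[v] holds the current BFS depth of v; adj[v] lists v's outgoing neighbors.
    // On inserting edge (u, v), propagate improved depths only through affected vertices.
    public static void OnEdgeInserted(int u, int v, int[] depth, List<int>[] adj)
    {
        adj[u].Add(v);
        if (depth[u] == int.MaxValue || depth[u] + 1 >= depth[v])
            return; // the new edge does not improve any depth

        depth[v] = depth[u] + 1;
        var queue = new Queue<int>();
        queue.Enqueue(v);
        while (queue.Count > 0)           // standard BFS, restricted to the affected region
        {
            int x = queue.Dequeue();
            foreach (int y in adj[x])
            {
                if (depth[x] + 1 < depth[y])
                {
                    depth[y] = depth[x] + 1;
                    queue.Enqueue(y);
                }
            }
        }
    }
}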

Full discussion : https://1drv.ms/w/s!Ashlm-Nw-wnWtkxOVeU-mbfydKxs 

#codingexercise

int GetLongestIncreasingSubsequenceCount(List<int> A)
{
    int n = A.Count;
    if (n == 0) return 0;
    // best[i] = length of the longest increasing subsequence ending at index i
    var best = new List<int>(Enumerable.Repeat(1, n));
    for (int i = 1; i < n; i++)
    {
        for (int j = 0; j < i; j++)
        {
            if (A[i] > A[j] && best[j] + 1 > best[i])
            {
                best[i] = best[j] + 1;
            }
        }
    }
    return best.Max(); // requires using System.Linq;
}

Monday, May 14, 2018

Today I present the summary of a book:
Book summary:
Title: Behaving Badly
         - The New Morality in Politics, Sex, and Business
Author: Eden Collinsworth
About the Author - Ms. Collinsworth has been advising Chinese businesspeople on Western deportment. She is also the author of "I Stand Corrected: How Teaching Western Manners in China Became Its Own Unforgettable Lesson".
She chose this topic because she is convinced that the culture in China resides comfortably in what the West considers moral ambiguity. At the same time, she feels that shame in the West has diminished when it comes to money.
This book is a collection of true anecdotes that explore new and emerging trends in morality and their self-justification.
Part one:
"Wherein I Begin with the Definition of the Word": The author moved from China to England and sold her apartment in US to find a perfect place to start. She decided mailing people for their stories and interviewing them was better than researching in a library.
"According to a Convicted Murderer, It Has to Do with Character":  A convicted murdered talks about learning morality in prison and it has to do with character. He seems to say it comes between impulse and action.
"A Neuroscientist Explains the Evolutionary Origins of Morality":  This chapter tries explaining the role of natural selection where  there are certain genetic traits that win over others.She discusses the "selfish gene theory" and "reciprocal altruism" theory where the motivation to get along with others comes from selfish expectations.
"A Brief History of Mankind's Attempts to Rein in Bad Behavior":  The author makes the observation religions are more convinced of their own version of the moral truth and it becomes particularly noteworthy when they pronounce something incredible and ludicrous.
The next set of chapters describes a scorecard for morality.
"The Editor of the Financial Times Provides a Cost-Benefit Analysis of Principles": Business is rife with examples where cheating improves bottomline. Morality becomes beside the point in the institutional making, spending and the accumulation of money. Moral calculation even involves cost-benefit of the scenario where a perpetrator is caught and the fines are paid but the outlook improves.
"Instructions on How Not To Cheat": The author has heard that you can't impose a moral standard on people who don't wish to be moral. Nash's game thory is explained to her as where the participants will view any premise as a system with a framework of fixed points and variables that can be manipulated to achieve a specific result and they will try to maximize personal benefit and minimize risk.
"Pros and Cons of Doing the Right Thing":  This chapter focuses on whistleblowers and why they do it. She argues that morality sustains itself only as long as we are determined to do the right thing. In particular she brings up an example of a westerner who challenges the unconditional loyalty in Japan that cloaks gross embezzlement.
"The Law: Tools of Control, or Instruments of enlightenment": This section focuses on the difference between law and justice.  She brings up examples where we try to outwit the law and yet we maintain an inner higher standard where even if we don't break the law, we don't engage in it. Ethics on the other hand is what an organization decides to hold itself to.  The author asserts that ethics is local and regional.
"The Political Function of Ethics": This section focuses on why the politics have no relation to morals. She uses mass shootings as an example. She also brings examples of "Make America Great Again" and from international politics.
The next set of chapters discusses Sex as a moral provocateur
"Monogamy (Not So Much Anymore)": These are easy reading examples of where the notion of marriage and till death do us apart has been adapted to suit one's needs. She talks about research in marital infidelity.
"The Screen as a Siren":  Much of the examples the author mentions here go to show that technology has created a hyper-sexualized culture.  She brings up cases where free pornography has proliferated or where certain people were aware of the codes of conduct but could not live up to them.
"Immoral women: Or just those having a better time":  She brings up several stories of international divas and what they uphold or the impact they are having even in the face of the perception they are held to.
 The next section focuses on taking the bother out of morality:
"Celebrities as Standard-bearers":  This is probably the most riveting section of the book where she talks about the matriarch in the Kardashians family taking a sex tape and starting a multi-million dollar business that would have otherwise been forgotten
"Reality Redefined": This section talks about the role of artificial intelligence to go through an indulgence in sense gratification of all sorts. Some examples are in fact revealing in how we take new experiences as something not to be judged for.
"The Web Wonders What's So Great About The Truth":  She talks about extensive research in how new technology has been changing the way we think and process information. She talks about neurological tools that enable us to remain moral. Self-control is one. Empathy is another. The author admits to a guarded reservation in thinking of the internet as something to futher measured intelligence.
"Ethically Sanitized Warfare": This talks about the morality in drone attacks - in particular, the argument that some lives are not as important as others and where as well as how they are decided.
"Immorality's Black Sun": This talks about some of the survivors and perpetrators of the Holocaust. The author returns to New York to some of them.
The next section talks about "The Future, or Something Like It":
"The Moral Vagaries Of Making Babies": This chapter cites some examples of the "edgy variations" on an alternative lifestyle particularly with regard to creating and even raising a family.
"Mapping a post-gay culture": This chapter explores the church's unwillingness to recognize sex, gender, current events and social power leading it to retreat on several fronts.
"Is it Progress If We Barter With Ethics": This is a discussion on population and food. Some examples include FDA condoning of techniques to enable DNA to be altered and edited.
"Programming Morality in Robots": This section focuses on how robots and software can detect moods and how our responses alter and play with moods.
"So Who Exactly Gets To Set The New Rules": This section introduces us to singularity and cosmism. The former is the notion to further human abilities in a man-machine unified way. The latter is the theory of human evolution where concepts of truth, reality, freedom and happiness will be deeply revised.
"WhereIn I Conclude By Looking Forward": This section talks about the next generation's perspective and not just that of the author.
"Epilogue": The book concludes with a take on modern day news, views and events.


Sunday, May 13, 2018

The user-defined SQL queries comparable to the user-defined functions in GraphIn include:
1. meta_computation:
This computation assigns the vertex parent id and vertex degree. The corresponding SQL query for the degree assignment is:
UPDATE n
SET n.Degree = t.Count
FROM Node n
INNER JOIN (SELECT e.Dest AS ID, COUNT(*) AS Count
            FROM Edge e
            WHERE e.Dest = NodeID
            GROUP BY e.Dest) AS t
ON n.ID = t.ID
WHERE n.ID = NodeID

The LevelID is assigned based on the previously visited element. If there are no previously visited elements, these are initialized to special values.
 [Aside: we could use recursive queries here for a levelwise tree]
WITH recursiveNode( level, id, parent_id, degree) AS
(SELECT 1, parent.id, parent.parent_id, parent.degree
 FROM Node parent
WHERE parent.parent_id is NULL
UNION ALL
SELECT parent.level+1, child.id, parent.id, child.degree
FROM recursiveNode parent, Node child
WHERE child.id in (SELECT e.Dest FROM Edge e WHERE e.Src = parent.id))
SELECT level, id,  parent_id, degree
FROM recursiveNode;

2. build_inconsistency_list:
This builds an inconsistency list containing the vertices with incorrect depth values, with their levels maintained in a minimum priority queue.
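A minimal C# sketch of such an inconsistency list, assuming .NET 6's PriorityQueue; the vertex-id/level representation is an assumption for illustration:

using System.Collections.Generic;

// Inconsistency list: vertices whose BFS depth may be stale, ordered by their (old) level
// so that the shallowest inconsistent vertices are fixed first.
class InconsistencyList
{
    private readonly PriorityQueue<int, int> queue = new();   // element: vertex id, priority: level
    private readonly HashSet<int> members = new();            // avoid duplicate entries

    public void Add(int vertex, int level)
    {
        if (members.Add(vertex))
            queue.Enqueue(vertex, level);
    }

    public bool TryTakeShallowest(out int vertex)
    {
        if (queue.TryDequeue(out vertex, out _))
        {
            members.Remove(vertex);
            return true;
        }
        return false;
    }

    public int Count => members.Count;
}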

3. check_property
The BFS depth property is checked in each case. This processing determines whether the current approach of windowed or incremental graph updates can be tolerated, or whether the incremental approach should be abandoned in favor of a global update. When the frontiers are all downstream, a BFS can make progress across two snapshots - current and next - because the visited nodes are not going to be visited again. check_property in GraphIn decides based on a threshold, such as the number of nodes affected in the immediate adjacencies. This is a constant-time lookup versus a check of whether a node has been visited. Suppose we took materialized views: one for nodes and another for edges. For all the nodes visited, their parent_id would be assigned, so a check for whether a node was visited merely involves checking whether the node is the root or has a parent_id.
Should the check_property fail, the views are dropped. If this check passes, the processing is incremental.
Since materialized view updates are costly due to disk access, it helps to make writes progressive as the BFS unfolds. An optimistic frontier lookup may be cheaper than a visited-nodes lookup if we tolerate eventual consistency. The GraphIn approach does not state that check_property evaluates the presence of adjacencies in the inconsistency list. Its goal is to determine whether large portions of the graph are in an inconsistent state, so as to avoid IGAS processing in favor of GAS processing. It allows the user to set a heuristic such as: switch to static BFS if more than 30% of inconsistent vertices have BFS depth < 2 (a sketch of such a check appears after this discussion). The heuristic depends on the data as well as the type of updates being made to the evolving graph. But a heuristic does not guarantee consistency or idempotent operations.
With the help of materialized views and stream processing, we have the option of making this consistent as long as we can progress incrementally from start to finish without failing our check. If our check fails, we begin all over again. To check whether the graph evolved for the visited nodes, we can compare the timestamps of our visit with those of the modifications. Since the updates to our views are expected to be more recent than the original update to the node, we know whether the upstream nodes have changed. During the check we also need to process notifications from the subscription to the inconsistency list that has evolved over the last duration. Making a clone of the entire graph's nodes in our views is a way of separating the evolution of the graph from its analysis.
The logic above can also work by starting from a global snapshot of the graph and receiving the update activity set in batches, each of which helps build the inconsistency list. This is different from receiving the list of nodes that have been updated from the update activity during the analysis activity. In the static-computation case, when the update is handled directly after a static pass, the inconsistency list results in changes to the parent_id and level of the affected nodes relative to the result of the static pass. Each iteration may grow or shrink the inconsistency list. At any point, the inconsistency list may determine whether the update needs to be run statically on a new snapshot.
GAS works per vertex. On the other hand, all SQL queries can work on partitions when the user-defined aggregates permit it. For example, we can partition vertices along with their outbound edges while synchronizing at the next level.
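As a rough illustration of the 30% heuristic mentioned above, a check_property-style test might look like the following C# sketch; the collection shape, field names and default thresholds are assumptions for the example, not GraphIn's API:

using System.Collections.Generic;
using System.Linq;

static class CheckProperty
{
    // Heuristic from the text: switch to static BFS if more than 30% of the
    // inconsistent vertices have BFS depth below 2 (both numbers are tunable).
    public static bool ShouldFallBackToStatic(
        IReadOnlyCollection<(int Vertex, int Depth)> inconsistent,
        int depthCutoff = 2, double fractionThreshold = 0.30)
    {
        if (inconsistent.Count == 0) return false;      // nothing inconsistent: stay incremental
        int shallow = inconsistent.Count(v => v.Depth < depthCutoff);
        return (double)shallow / inconsistent.Count > fractionThreshold;
    }
}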
4. activate_frontier
The frontier is activated with the nodes having inconsistent depth.
5. update_inconsistency_list
The frontier vertices are removed and their inconsistent successors are added to the inconsistency list.
6. merge_state
All insertions and deletions to the graph are merged. We do not exclude either kind, because BFS requires both before the vertex level is recalculated.
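Putting the user-defined functions above together, a per-batch driver might be sketched as follows in C#; the interface and method signatures are assumptions chosen to mirror the list above, not GraphIn's actual implementation (meta_computation is assumed to have run as part of the preceding static pass):

using System.Collections.Generic;

static class IgasBfsDriver
{
    // The interface below is an assumption for illustration; GraphIn's actual API differs.
    public interface IUdf<TGraph, TUpdate, TVertex>
    {
        void MergeState(TGraph graph, IReadOnlyList<TUpdate> batch);                       // 6. merge_state
        void BuildInconsistencyList(TGraph graph, IReadOnlyList<TUpdate> batch);           // 2. build_inconsistency_list
        bool CheckProperty(TGraph graph);                                                   // 3. check_property
        IReadOnlyCollection<TVertex> ActivateFrontier(TGraph graph);                        // 4. activate_frontier
        void UpdateInconsistencyList(TGraph graph, IReadOnlyCollection<TVertex> frontier);  // 5. update_inconsistency_list
        void StaticRecompute(TGraph graph);                                                 // fallback to static GAS
    }

    public static void ProcessBatch<TGraph, TUpdate, TVertex>(
        IUdf<TGraph, TUpdate, TVertex> udf, TGraph graph, IReadOnlyList<TUpdate> batch)
    {
        udf.MergeState(graph, batch);                 // apply all insertions and deletions
        udf.BuildInconsistencyList(graph, batch);     // collect vertices with possibly stale depth
        if (!udf.CheckProperty(graph))                // too much of the graph affected?
        {
            udf.StaticRecompute(graph);               // abandon incremental processing
            return;
        }
        IReadOnlyCollection<TVertex> frontier;
        while ((frontier = udf.ActivateFrontier(graph)).Count > 0)
            udf.UpdateInconsistencyList(graph, frontier);
    }
}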

Friday, May 11, 2018

In large social network graphs, updates arrive at a rate of about 86,400 operations per second. Even if we keep the journal in a cloud database where there is no restriction on storage, fetching and updating state on the graph may take a long while. In such cases, we can take snapshots of the graph and pin them against timestamps for consistency. Next, we separate the snapshot, replay and consistency checks as offline activities. It then becomes important to perform analytics on dynamic graphs instead of static graphs. The store-and-static-compute model worked because updates were batched and graph processing was applied to static snapshots from different points in time. It worked as long as graph modifications were less frequent than the static processing. With dynamic graph processing, we need a new framework. One such framework is GraphIn, which introduces a new programming model called Incremental-Gather-Apply-Scatter. In the Gather phase, incoming messages are processed and combined into one message. In the Apply phase, vertices use the combined message to update their state. In the Scatter phase, vertices can send a message to their neighbors along their edges.
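To make the shape of a vertex program concrete, here is a minimal C# sketch of the three phases using the classic BFS minimum-depth computation as the example; the interface and types are assumptions for illustration rather than GraphIn's actual API:

using System;
using System.Collections.Generic;

// Gather-Apply-Scatter vertex program for BFS depth (sketch).
interface IGatherApplyScatter<TMessage>
{
    TMessage Gather(TMessage left, TMessage right);                        // combine incoming messages
    void Apply(Vertex vertex, TMessage combined);                          // update vertex state
    IEnumerable<(int Neighbor, TMessage Message)> Scatter(Vertex vertex);  // notify neighbors
}

class Vertex
{
    public int Id;
    public int Depth = int.MaxValue;
    public List<int> OutNeighbors = new();
}

class BfsProgram : IGatherApplyScatter<int>
{
    public int Gather(int left, int right) => Math.Min(left, right);       // smallest candidate depth

    public void Apply(Vertex v, int combined)
    {
        if (combined < v.Depth) v.Depth = combined;                        // adopt a shorter depth
    }

    public IEnumerable<(int Neighbor, int Message)> Scatter(Vertex v)
    {
        if (v.Depth == int.MaxValue) yield break;                          // unreached vertices offer nothing
        foreach (int n in v.OutNeighbors)
            yield return (n, v.Depth + 1);                                 // offer depth+1 to neighbors
    }
}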

This framework divides the continuous stream of updates into fixed size batches processed in the order of their arrival.  

If the updates were recorded in the table described by the tuple above, there would be a range of entries over which the updates are pending and would need to be applied to the graph. The completed updates are available for analysis, and they can also be aged and archived. Although there may be billions of rows of entries, we can apply window functions with the OVER clause in SQL queries to work on the equivalent of fixed-size batches, but in a streaming manner. 

For example: 

SELECT u.*,
       COUNT(*) OVER (PARTITION BY u.timestamp / (60*60*24)) AS updates_in_batch
FROM graphupdate u;

Full discussion : https://1drv.ms/w/s!Ashlm-Nw-wnWtkxOVeU-mbfydKxs 

Thursday, May 10, 2018

We were discussing incremental graph transformations:
One way to do this was to use tables for transformation rules. A graph transformation rule consists of a (LHS, RHS, NAC) where  

LHS: left hand side graph 

RHS: right hand side graph 

NAC: Negative application condition 

The matching results from LHS to RHS, where NAC prohibits the presence of certain objects and links. By defining the locations in the graph for the application of the rules, the tables themselves go through very few changes. However, storing transformation rules with the help of metamodels and instance models requires a lot of preprocessing.

The notion of using a database to store the updates for the graph is a very powerful technique. The above strategy is involved in that it requires a lot of pre-processing in the hope that subsequent processing will be incremental and cheap. This is in fact the case for existing large graphs. However, reconstructing the graph is equally feasible, just as we reconstruct even large master databases such as product catalogs.

Consider a table of updates where each row describes a node or edge to add or delete, in the form of a tuple <Type, Operation, Src, Destination, Value, Status, Created, Modified>, where Type = node or edge, Operation = add or remove, Src and Destination are node ids for an edge and null otherwise, and Value is the node data for node additions. The Status indicates the progressive states associated with this entry, such as initialized -> in progress -> completed/error, and the timestamps indicate the positions on the timeline for these entries as they arrive or are acted upon.

Each instance of the tuple corresponds to a node or edge addition or removal and is atomic and independent of the others. The sequence of operations in the list of tuples is the order in which the graph was updated. When the operations are journaled, they give the replay log for reconstructing the graph, and a subset of this table gives the incremental changes between two states.
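A minimal C# sketch of such a journal entry and its replay; the enum values and method names are chosen for illustration and are not from any particular library:

using System;
using System.Collections.Generic;

enum ElementType { Node, Edge }
enum Operation { Add, Remove }
enum UpdateStatus { Initialized, InProgress, Completed, Error }

// One journal entry, mirroring the tuple <Type, Operation, Src, Destination, Value, Status, Created, Modified>.
record GraphUpdate(ElementType Type, Operation Op, long? Src, long? Destination,
                   string? Value, UpdateStatus Status, DateTime Created, DateTime Modified);

static class Journal
{
    // Replaying the journal in arrival order reconstructs the graph;
    // replaying a time-bounded subset yields the incremental changes between two states.
    public static void Replay(IEnumerable<GraphUpdate> entries, Action<GraphUpdate> apply)
    {
        foreach (var entry in entries)
            if (entry.Status == UpdateStatus.Completed)
                apply(entry);
    }
}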

In large social network graphs, updates arrive at a rate of about 86,400 operations per second. Even if we keep the journal in a cloud database where there is no restriction on storage, fetching and updating state on the graph may take a long while. In such cases, we can take snapshots of the graph and pin them against timestamps for consistency. Next, we separate the snapshot, replay and consistency checks as offline activities. 

Wednesday, May 9, 2018

Yesterday we started discussing incremental graph transformations:
Incremental Graph Transformations: 

First, let us review the API for requesting graph transformations.  They generally work in the following manner: The following example was taken from Microsoft Graph: 

Initially a request is made for the resource 

Subsequent pages of the resource are retrieved using a keyword like nextLink 

The final response contains a deltaLink instead of a nextLink, indicating the end of the original collection, which is usually from a snapshot 

The delta query is then issued with the deltaLink keyword that enables additions, deletions or updates to be queried 


The delta query needs to detect the incremental changes in the graph. 
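A minimal C# sketch of the paging loop, assuming a response that carries the @odata.nextLink and @odata.deltaLink properties as Microsoft Graph delta queries do; authentication and error handling are omitted:

using System;
using System.Net.Http;
using System.Text.Json;
using System.Threading.Tasks;

static class DeltaPager
{
    // Follow nextLink pages until a deltaLink appears; return the deltaLink
    // so the next round trip fetches only the incremental changes.
    public static async Task<string?> DrainPagesAsync(HttpClient http, string url, Action<JsonElement> onPage)
    {
        while (true)
        {
            using var doc = JsonDocument.Parse(await http.GetStringAsync(url));
            var root = doc.RootElement;
            onPage(root.GetProperty("value"));                    // the items on this page

            if (root.TryGetProperty("@odata.nextLink", out var next))
            {
                url = next.GetString()!;                          // more pages of the snapshot
                continue;
            }
            return root.TryGetProperty("@odata.deltaLink", out var delta)
                ? delta.GetString()                               // issue this later for changes
                : null;
        }
    }
}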

One way to provide incremental updates is to keep track of graph transformations in materialized views with the help of transformation rules that can be pattern matched with the graph.
A graph transformation rule consists of a (LHS, RHS, NAC) where 
LHS : left hand side graph
RHS: right hand side graph
NAC: Negative application condition
The matching results from LHS to RHS where NAC prohibits the presence of certain objects and links.
The rules are formed using metamodels and instance models:
The metamodel is represented by a type graph where the nodes are called classes. A class may have attributes and demonstrates inheritance and associations.
The instance model describes instances of the metamodel, where the nodes are objects and the edges are links.
A class is mapped to a table with a single column where (class(I)) stores the identifiers of the objects of the specified class
An association has three columns (assoc(I,S,T)) which contains identifiers for the link, source and target objects
An attribute is mapped to a table with two columns (attr(I,V)) for the object identifier and the attribute value
By defining the locations in the graph for the application of the rules, the tables themselves go through very few changes.
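To make the relational encoding concrete, here is a small C# sketch of the three tables and of how matching an LHS pattern reduces to a join over them; the record and method names are my own for illustration:

using System.Collections.Generic;
using System.Linq;

// Relational encoding of the instance model described above (sketch):
// class(I), assoc(I,S,T) and attr(I,V) become three narrow tables.
record ClassRow(long Id);                            // class(I): identifiers of objects of one class
record AssocRow(long Id, long Source, long Target);  // assoc(I,S,T): link, source and target objects
record AttrRow(long Id, string Value);               // attr(I,V): object identifier and attribute value

static class PatternMatch
{
    // Matching an LHS pattern "an object of class A linked to an object of class B"
    // reduces to a join over the tables.
    public static IEnumerable<(long A, long B)> Matches(
        IEnumerable<ClassRow> classA, IEnumerable<ClassRow> classB, IEnumerable<AssocRow> assoc)
    {
        return from a in classA
               join link in assoc on a.Id equals link.Source
               join b in classB on link.Target equals b.Id
               select (a.Id, b.Id);
    }
}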

#codingexercise
Given two sorted arrays of vastly different lengths, what is the lowest-complexity way to find the intersection elements? Find the Math.Max(start1, start2) and Math.Min(end1, end2) of the values in the two arrays. Then, for this sub-range of the smaller array, perform a binary search in the longer array. If the lengths of the two arrays are comparable, sequentially traverse the elements of both arrays in sorted order, printing the elements that appear in both.
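A C# sketch of the binary-search approach described above (the helper name is my own):

using System;
using System.Collections.Generic;

static class SortedIntersection
{
    // Intersect a small sorted array with a much larger sorted one:
    // binary-search each candidate of the smaller array in the larger array.
    // Complexity: O(m log n) for m << n.
    public static List<int> Intersect(int[] small, int[] large)
    {
        var result = new List<int>();
        if (small.Length == 0 || large.Length == 0) return result;
        foreach (int x in small)
        {
            if (x < large[0] || x > large[^1]) continue;   // skip values outside the overlapping range
            if (Array.BinarySearch(large, x) >= 0)
                result.Add(x);
        }
        return result;
    }
}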

Tuesday, May 8, 2018

Today we discuss the benefits of a jQuery plugin or JavaScript SDK to expose analytical capabilities.
We are well familiar with the analytical capabilities that come with say, the Machine Learning packages such as sklearn and Microsoft ML package or complex reporting queries for dashboards. We have traditionally leveraged such analytical capabilities in the backend. 
This works very well for several reasons:
1) The analysis can be run on Big Data. The more the data, the better the analysis. Consequently the backend and particularly the cloud services are better prepared for this task
2) The cloud services are elastic - they can pull in as much resource as needed for the execution of the queries and this works well for map-reduce processing
3) The backend is also suited to do this processing once for every client and application.  Different views and viewmodels can use the same computation
4) Performance increases dramatically when the computations are as close to the data as possible. This has been one of the arguments for pushing the machine learning package into SQL Server
5) Such compute- and data-intensive operations are hardly required on the frontend, where the data may be very limited on a given page. Moreover, optimizations only happen when the compute and storage are broadened to the backend, where they can be studied, cached, and replayed.
6) Complex queries can already be reduced to use a few primitives which are available as query operators in both the backend and the frontend. An application or client that needs to do client-side processing of such queries has the choice of implementing it itself using these primitives

Having reviewed these reasons, we may ask if we have enough support on the client side and if a new plugin is justified. Let us make the purpose of the plugin clear. It provides the convenience to reuse some of the common patterns seen from the analysis queries across charts and dashboards in different domains. Patterns that are well-known such as pivot operations as well as esoteric ones such as from advanced dashboarding tools may be consolidated into a standard plugin or sdk.
The QueryBuilder plugin allows conditions to be specified, which is great for building conditions into a query. If we could also introduce piping of results into operations, that would help as well.
The QueryBuilder can either build a query or allow different queries to work in sequence, where the results of one query are piped to the next. Queries written in the form of shell commands with piped execution are a form of the latter, while complex predicates in SQL queries are an example of the former.
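To illustrate the piping idea independently of any particular plugin, here is a C# sketch in which a query is built from composable predicates and its results are piped through follow-on operations; the data and names are invented for the example:

using System;
using System.Linq;

static class PipedQueries
{
    // Build a query from composable conditions, then pipe its results through
    // follow-on operations, much like shell commands chained with '|'.
    public static void Demo()
    {
        var rows = new (string Region, int Latency)[]
        {
            ("us-east", 120), ("us-west", 45), ("eu-west", 300), ("us-east", 80)
        };

        Func<(string Region, int Latency), bool> inUs = r => r.Region.StartsWith("us-");
        Func<(string Region, int Latency), bool> slow = r => r.Latency > 60;

        var result = rows
            .Where(r => inUs(r) && slow(r))                       // conditions built into the query
            .GroupBy(r => r.Region)                               // piped into an aggregation
            .Select(g => new { Region = g.Key, Avg = g.Average(x => x.Latency) })
            .OrderByDescending(x => x.Avg);                       // piped into an ordering

        foreach (var x in result)
            Console.WriteLine($"{x.Region}: {x.Avg}");
    }
}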
#codingexercise https://ideone.com/kAibv2
#OpenAI has more algorithms than what the ML package and Turi implement
We will discuss incremental graph transformations next  : https://1drv.ms/w/s!Ashlm-Nw-wnWtkxOVeU-mbfydKxs