Cluster computing

The user-defined SQL queries comparable to the user-defined functions in GraphIn include:
1. meta_computation:
This computation assigns vertex parent id and vertex degree. The corresponding SQL query
UPDATE Node
SET n.Degree = t.Count
FROM Node n
INNER JOIN (select e.Dest as ID, count(*) as Count
from Edge e
group by e.Dest
having e.Dest = NodeID) as t
on n.ID = t.ID
WHERE n.ID = NodeID

The LevelID is assigned based on the previously visited element. If there are no previously visited elements these are initialized to special values
[Aside: we could use recursive queries here for a levelwise tree]
WITH recursiveNode( level, id, parent_id, degree) AS
(SELECT 1, parent.id, parent.parent_id, parent.degree
FROM Node parent
WHERE parent.parent_id is NULL
UNION ALL
SELECT parent.level+1, child.id, parent.id, child.degree
FROM recursiveNode parent, Node child
WHERE child.id in (SELECT e.Dest FROM Edge e WHERE e.Src = parent.id))
SELECT level, id, parent_id, degree
FROM recursiveNode;

2. build_inconsistency_list:
This builds an inconsistency list that contains vertices with incorrect depth values with levels maintained in a minimum priority queue

3. check_property
The BFS depth property is checked in each case. This processing determines whether the current approach of windowing or incremental graph updates can be tolerated or whether the incremental approach should be abandoned in favor of a global update. When the frontiers are all downstream, a BFS can make progress across two snapshots – current and next because the visited nodes are not going to be visited again. A check_property in GraphIn determines based on a threshold such as the number of nodes affected in the immediate adjacencies. This is a constant time lookup versus a check of whether a node has been visited. Let us say we took materialized views: one for node and another for edges, for all the nodes visited, their parent_id might be assigned so a check for whether a node was visited merely involves checking if the node is root or has a parent_id.
Should the check_property fail, the views are dropped. If this check passes, the processing is incremental.
Since materialized view updates are costly due to disk access, it helps to make writes progressive as BFS unfolds. Optimistic frontier lookup may be cheaper than visited nodes lookup if we tolerate eventual consistency. The GraphIn approach does not state that the check_property evaluates the presence of adjacencies in the inconsistency list. Its goal is to determine if large portions of the graph are in an inconsistent state so as to avoid IGAS processing in favor of GAS processing. It allows the user to set a heuristic such as switch to static BFS if > 30% of inconsistent vertices have BFS depth < 2. This heuristic is based on the data as well as the type of updates being made to the evolving graph. But a heuristic does not guarantee consistency and idempotent operations.
With the help of materialized views and stream processing, we have the option of making this consistent as long as we can progress incrementally from start to finish without failing our check. If our check fails, we begin all over again. To check whether the graph evolved for the visited nodes, we can compare timestamps of our visit and those of the modified. Since the updates to our views are expected to be more recent than the original update to the node, we know if the upstream nodes have changed During the check we also need to process notifications from subscription to the inconsistency list that has evolved over the last duration. Making a clone of the entire graph nodes in our views is a way of separating evolution of the graph from analysis.
The logic above can also work with a starting point as a global snapshot of the graph and receiving an update activity set in batches each of which helps build the inconsistency list. This is different from receiving list of nodes that have been updated from the update activity during the analysis activity. In the static computation case, when the update is directly handled after a static pass, the inconsistency list results in changes in parent_id, level of affected nodes to the result from the static pass. Each iteration may grow or shrink the inconsistency list. At any point, the inconsistency list may determine if the update needs to be run statically on a new snapshot.
GAS works per vertex. On the other hand, all SQL queries can work on partitions when the user defined aggregates permit it. For example, we can partition vertices along with their outbound edges while synchronizing at the next level.
4. activate_frontier
The frontier is activated with the nodes having inconsistent depth
5. update_inconsistency_list
The frontier vertices are removed and inconsistent successors are added to inconsistency list
6. merge_state
All insertions and deletions to the Graph are merged. We do not exclude any because BFS requires both information before vertex level is recalculated.

Cluster computing

Sunday, May 13, 2018

No comments:

Post a Comment