This article introduces my “skip-graphs” – a data structure that
enables improved performance on traditional graphs in Computer Science. It introduces layering based on
selective nodes. For example, we can choose nodes with centrality in each range
as a sub graph with which to capture the connectivity at this layer. Edges may
need to be "added" between these nodes to represent underlying
connectivity through lower level (lower centrality) nodes. We can consider
these not as skip lists but as skip graphs. The benefit is that the graph
algorithms can continue to operate on a smaller subset of the original graph
just so long as this sub graph can be completed with vertex selections and
addition of edges. Using a metric such as centrality and this technique of skipping
graphs, we can improve performance in almost all applications of the graph
where the size is very large.
In this technique, we have
introduced a hierarchy of centrality ranges. Starting from the lowest
centrality of 1 at the core the graph is spread out spatially from the origin
with increasing radius of higher centrality ranges. At each layer of the sphere the sub-graphs are fully formed with
their own edges. These edges do not interfere with the edges that are
originally present in the graph.
Selection of the nodes does not change existing vertices and edges
either linearly or spatially. It merely rearranges the graphs in the Z plane
where the z-axis is the centrality measure and the xy plane is where the graph
is depicted. By selecting the more connected nodes from the graph and adding
edges between them where there is underlying connectivity through less
connected nodes, we are creating a different sub-graph. Selected nodes may have
many paths between each other. Therefore, there can be multiple edges between
these selected vertices. These edges are maintained physically but bundled in a
single logical edge. The logical edge helps with general graph algorithms on
the subgraph of selected nodes so that we have fewer data points for answering
common questions. The physical edges help with finding reachability through one
or more physical edges. So, this has very limited application in that we use
the selected vertex and the new edges between them only for problems where they
are helpful and not for problems that require graph algorithms over all
vertices.
There is significant cost in building such sub-graphs and
maintaining it. The idea is that for graphs that don’t change but require many
analytical computations, we can do this once and not have to edit it.
The subgraph building operation is offline and does not interfere
with conventional approaches of using the graphs otherwise. By automating and
rendering additional data structures, we can improve performance from existing
computations that require all the vertices. This is a separation of one-time
setup cost versus running costs of computations.
One example of using such graphs is for time series data. The data
is a collection over time and therefore not susceptible to change in its
history. It accumulates more and more new data and the graphs can be
incremental. Such time series data is often found in data warehouses.
There can be many interpretations of skip
graphs. A simple graph with
fewer nodes is a new graph of existing nodes and has no overlap with the
existing original graph. It answers some of the queries directly with the nodes
and edges in the graph using traditional graph algorithms. Different techniques
yield different graphs and it can include a metric like the goodness of fit for
clusters or F-score. A skip list is a tower of increasingly sparse linked lists
sorted by key. The lists at the
higher-level act as express lanes that allow the sequence of nodes to be
traversed quickly. A skip graph, as per
James Aspnes and Gauri Shah, is distinguished from a skip list in that there
may be many lists at a particular level and every node participates in one of
these lists until the nodes are splintered into singletons. A skip graph is
equivalent to a collection of up to n skip lists that happen to share some of
their lower levels. In distributed computing, the graph must tolerate arbitrary
node failures and it should be usable even when a huge fraction of the nodes
fails. Each node that appears in higher level skip lists occurs based on some membership
vector generally using prefix memberships to a word. This form of membership makes a skip graph a
trie of skip lists. A membership vector is a random variable. There is an edge
between x and y whenever x and y are adjacent in some list.
No comments:
Post a Comment