Saturday, May 27, 2023

 

This article introduces my “skip-graphs” – a data structure that enables improved performance on traditional graphs in Computer Science. It introduces layering based on selective nodes. For example, we can choose nodes with centrality in each range as a sub graph with which to capture the connectivity at this layer. Edges may need to be "added" between these nodes to represent underlying connectivity through lower level (lower centrality) nodes. We can consider these not as skip lists but as skip graphs. The benefit is that the graph algorithms can continue to operate on a smaller subset of the original graph just so long as this sub graph can be completed with vertex selections and addition of edges. Using a metric such as centrality and this technique of skipping graphs, we can improve performance in almost all applications of the graph where the size is very large.

In this technique, we have introduced a hierarchy of centrality ranges. Starting from the lowest centrality of 1 at the core the graph is spread out spatially from the origin with increasing radius of higher centrality ranges. At each layer of the sphere the sub-graphs are fully formed with their own edges. These edges do not interfere with the edges that are originally present in the graph. 

Selection of the nodes does not change existing vertices and edges either linearly or spatially. It merely rearranges the graphs in the Z plane where the z-axis is the centrality measure and the xy plane is where the graph is depicted. By selecting the more connected nodes from the graph and adding edges between them where there is underlying connectivity through less connected nodes, we are creating a different sub-graph. Selected nodes may have many paths between each other. Therefore, there can be multiple edges between these selected vertices. These edges are maintained physically but bundled in a single logical edge. The logical edge helps with general graph algorithms on the subgraph of selected nodes so that we have fewer data points for answering common questions. The physical edges help with finding reachability through one or more physical edges. So, this has very limited application in that we use the selected vertex and the new edges between them only for problems where they are helpful and not for problems that require graph algorithms over all vertices.

There is significant cost in building such sub-graphs and maintaining it. The idea is that for graphs that don’t change but require many analytical computations, we can do this once and not have to edit it.

The subgraph building operation is offline and does not interfere with conventional approaches of using the graphs otherwise. By automating and rendering additional data structures, we can improve performance from existing computations that require all the vertices. This is a separation of one-time setup cost versus running costs of computations.

One example of using such graphs is for time series data. The data is a collection over time and therefore not susceptible to change in its history. It accumulates more and more new data and the graphs can be incremental. Such time series data is often found in data warehouses.

There can be many interpretations of skip graphs. A simple graph with fewer nodes is a new graph of existing nodes and has no overlap with the existing original graph. It answers some of the queries directly with the nodes and edges in the graph using traditional graph algorithms. Different techniques yield different graphs and it can include a metric like the goodness of fit for clusters or F-score. A skip list is a tower of increasingly sparse linked lists sorted by key.  The lists at the higher-level act as express lanes that allow the sequence of nodes to be traversed quickly.  A skip graph, as per James Aspnes and Gauri Shah, is distinguished from a skip list in that there may be many lists at a particular level and every node participates in one of these lists until the nodes are splintered into singletons. A skip graph is equivalent to a collection of up to n skip lists that happen to share some of their lower levels. In distributed computing, the graph must tolerate arbitrary node failures and it should be usable even when a huge fraction of the nodes fails. Each node that appears in higher level skip lists occurs based on some membership vector generally using prefix memberships to a word.  This form of membership makes a skip graph a trie of skip lists. A membership vector is a random variable. There is an edge between x and y whenever x and y are adjacent in some list.

No comments:

Post a Comment