Cluster computing

Saturday, April 24, 2021

Space Partition Tree and Graph (SPTAG) algorithm

Well known search algorithms such as page-rank can be easily described by matrix linear equations because they establish ranking in terms of the weights provided by other webpages. Matrices are vector representations for the web pages, and they help with algorithms that help determine whether two pages are similar or who the nearest neighbors are? While tables are useful for establishing relations, graphs are called natural databases because all kinds of relationships overlayed over the nodes as edges. For example, a table can define an ‘is-a’ or ‘has-a’ relationship that generalized by a join while graphs can have edges that are distinct for, say, ‘is-a-friend-of' relationship. Web pages have an inwards and outwards links structure, so the matrix and graph forms of the web are useful representations.

Bing Image search uses SPTAG algorithm. This algorithm redefines the search algorithm in terms of these vectors which can be compared based on some distance metric such as L2 distance or cosine distance with the query vector. It provides two methods: kd-tree and relative neighborhood graph and balanced k-means tree and relative neighborhood graph. These two methods correspond to index builder and searcher. The former reduces index building cost and the latter improves search accuracy even with high dimensional data. The search begins with the search in the search partition trees for finding initial starting points or seeds, which are then used to search in the neighborhood graphs. The searches in both trees and graphs are iteratively conducted.

The Query driven Iterated Neighborhood graph search for large scale indexing is described this way. An initial solution set is created by searching over trees T that were constructed to index the reference data points. This initial solution might not be good candidates, but they have high probabilities to be near the true nearest neighbors. The candidates are always stored in a priority queue Q whether the search is over trees or graphs. Next, with the solution set as seeds in the graph G, a search is conducted with the neighborhood expansions in a best fit manner. Once local solutions are discovered, the search pauses. Then, with these previously identified nearest neighbors from the previous step and with the search history, new seeds are generated from the trees. Then the solution is accepted with the criteria that the distances to the neighbors are smaller than those for the seed. An indication variable for the number of unexpanded promising points is used to track whether the local search has arrived at a solution. There might be some change introduced into the seeds according to the query and search history and this perturbation is helpful to arrive at a local solution. The iterations for perturbation, search and acceptance are repeated until one of the two termination condition is reached – an upper threshold for the number of points accessed is reached or the true nearest neighbors are reached.

The best-first method mentioned earlier is a mere modification of the Breadth-First graph traversal using the same priority queue Q and enqueueing the neighbors of each dequeued item. The difference between them is that there is a now new criteria based on the number of accessed points n, the number of promising points m and the number of accessed points r from trees. This completes the iteration- based neighborhood graph search described here.

Cluster computing

Saturday, April 24, 2021

No comments:

Post a Comment