Friday, April 23, 2021

The PageRank algorithm (used to determine the quality of web search results), explained briefly:

PageRank is used by Google, the web search engine from the company whose name became a verb. At the heart of this algorithm is a technique that uses the structure of inbound and outbound links to compute a rank recursively. A given web page is linked from a set of pages, and each of those origins contributes a share of its own rank to the page. If an origin has N outbound links, the link we are interested in carries a weight of 1/N; if the origin has no outbound links, it contributes nothing, which is equivalent to assigning a score of 0. The scaling factor for a link is therefore inversely proportional to the total number of outbound edges of the originating page. This lets the rank of the given page be computed as a simple sum of the scaled ranks of the pages that link to it. The sum is adjusted by a constant factor to accommodate the fact that some pages have no forward links.
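In the notation of the original PageRank paper, where B_u is the set of pages linking to page u, N_v is the number of outbound links of page v, and c is the normalizing constant mentioned above, the rank just described can be written as:

R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v}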

Some special cases are accommodated by introducing new terms into the technique mentioned here, but the ranking is essentially that simple. One such case, which exposes a flaw that needs redress, is when two pages link only to each other and a third page links to one of them. This loop accumulates rank but never distributes any, because it has no outbound edges. This is called a rank sink, and it is overcome by adding a term to the ranking called the rank source, which is scaled by the same constant that was introduced to handle pages with no outward edges.
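In the same notation, the rank source is a vector E over web pages, and the paper's adjusted formula adds it with the same constant c:

R'(u) = c \sum_{v \in B_u} \frac{R'(v)}{N_v} + c \, E(u)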

Together, the accumulation of rank and the rank source let us compute the rank of the current page. We also know the final state of the ranking distributed over a set of pages, because the scores stabilize to a distribution whose overall ranking is normalized. Even though the calculation is recursive, since the ranks of the originating pages are unknown, this is overcome by choosing a starting set of values and adjusting them on each iteration, progressing toward the desired state. The iterations stop when the error falls below a threshold.
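A minimal sketch of this iteration in Python, assuming the link graph is a dictionary mapping each page to the list of pages it links to. The damping factor d and the uniform term (1 - d)/n stand in for the constant and the rank source described above, and eps is the stopping threshold; the names and default values here are illustrative choices, not part of the original formulation.

def pagerank(links, d=0.85, eps=1.0e-8):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # the starting set of values
    while True:
        new_rank = {}
        for p in pages:
            # accumulate the scaled ranks of the pages that link to p
            incoming = sum(rank[q] / len(links[q]) for q in pages if p in links[q])
            # uniform rank source plus the damped inbound contributions
            new_rank[p] = (1.0 - d) / n + d * incoming
        # stop once the total error drops below the threshold
        if sum(abs(new_rank[p] - rank[p]) for p in pages) < eps:
            return new_rank
        rank = new_rank

For example, on a tiny graph:

links = {"A": ["B"], "B": ["C"], "C": ["A", "B"], "D": ["A"]}
print(pagerank(links))

The returned dictionary holds the stabilized score for each page.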

One remaining issue with this model is dangling links: links that point to pages with no outbound links of their own. There are many of these on the web. Fortunately, they do not affect the calculation of the other page ranks, so the dangling pages are removed before the calculation and added back afterward.
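A hypothetical helper for the pre- and post-processing just described: pages with no outbound links are pruned from the graph before the ranks are computed, then reinstated afterward. The name strip_dangling and the single-pass pruning are illustrative only; pruning can expose new danglers, so the removal would be repeated in practice.

def strip_dangling(links):
    # dangling pages: those with no outbound links of their own
    dangling = {p for p, outs in links.items() if not outs}
    # drop those pages and any links that point to them
    pruned = {p: [q for q in outs if q not in dangling]
              for p, outs in links.items() if p not in dangling}
    return pruned, dangling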

Matching the search terms improves the precision, while PageRank improves the quality of the results.


Comparison between the PageRank algorithm and the Microsoft Bing algorithm:

The following comparison is drawn from industry observations of how the algorithms are used rather than from their technical differences.

Bing prominently utilizes the structure of a page and the metadata it gathers from it, while Google infers the keywords and their relationships.

Bing uses website and content age as an indicator of authority, and its web page index might be rebuilt only every three months. Google surfaces fresh pages when relevance and content authority are otherwise equal.

Google favors backlinks and evaluates both their quality and quantity. Bing might rely on an internal backlink measure built on anchor text and social signals.

Google favors text over images, while Bing favors web pages with images. Certain HTML5 features appear to be ignored by Google, while Bing can recognize technologies such as Flash.

Bing might not read all of a page, but Google crawls through all the links before the ranking is calculated.

Bing might not index every web page, particularly when there are no authoritative backlinks, whereas Google crawls through and includes even pages with dangling links.

Bing leverages ML algorithms for a much larger percentage of search queries than Google does. 

 

 
