Saturday, April 24, 2021

Space Partition Tree and Graph (SPTAG) algorithm

Well-known search algorithms such as PageRank can be described by linear matrix equations because they establish ranking in terms of the weights contributed by other web pages. Matrices hold the vector representations of the web pages, and they support algorithms that determine whether two pages are similar or which pages are the nearest neighbors. While tables are useful for establishing relations, graphs are called natural databases because all kinds of relationships can be overlaid on the nodes as edges. For example, a table can define an ‘is-a’ or ‘has-a’ relationship that is generalized by a join, while graphs can have edges that are distinct for, say, an ‘is-a-friend-of' relationship. Web pages have an inbound and outbound link structure, so the matrix and graph forms of the web are both useful representations.

Bing image search uses the SPTAG algorithm. This algorithm recasts search in terms of vectors that can be compared with the query vector using some distance metric such as L2 distance or cosine distance. It provides two methods: kd-tree with relative neighborhood graph, and balanced k-means tree with relative neighborhood graph. These are two ways of building and searching the index: the former reduces index-building cost, while the latter improves search accuracy even with high-dimensional data. The search begins in the space partition trees to find initial starting points, or seeds, which are then used to search the neighborhood graphs. The searches in both trees and graphs are conducted iteratively.
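
For concreteness, the two distance metrics can be sketched over numpy feature vectors; this toy illustration assumes nothing about SPTAG's internals:

import numpy as np

def l2_distance(a, b):
    # Euclidean (L2) distance between two feature vectors.
    return np.linalg.norm(a - b)

def cosine_distance(a, b):
    # 1 minus cosine similarity; 0 means the vectors point the same way.
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

query = np.array([0.2, 0.8, 0.1])
candidate = np.array([0.3, 0.7, 0.2])
print(l2_distance(query, candidate), cosine_distance(query, candidate))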

The query-driven iterated neighborhood graph search for large-scale indexing works as follows. An initial solution set is created by searching over trees T that were constructed to index the reference data points. These initial solutions might not be good candidates, but they have a high probability of being near the true nearest neighbors. The candidates are always stored in a priority queue Q, whether the search is over trees or graphs. Next, with the solution set as seeds in the graph G, a search is conducted by neighborhood expansion in a best-first manner. Once local solutions are discovered, the search pauses. Then, with the nearest neighbors identified in the previous step and with the search history, new seeds are generated from the trees. A solution is accepted if its distances to the neighbors are smaller than those of the seed. An indicator variable for the number of unexpanded promising points is used to track whether the local search has arrived at a solution. Some change may be introduced into the seeds according to the query and the search history, and this perturbation helps the search move beyond a local solution. The iterations of perturbation, search, and acceptance are repeated until one of two termination conditions is reached: an upper threshold on the number of points accessed, or the true nearest neighbors.

The best-first method mentioned earlier is a modification of breadth-first graph traversal that uses the same priority queue Q and enqueues the neighbors of each dequeued item. The difference is a new acceptance criterion based on the number of accessed points n, the number of promising points m, and the number of points r accessed from the trees. This completes the iteration-based neighborhood graph search described here.
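
A minimal sketch of this best-first expansion, assuming a dict-of-lists neighborhood graph, numpy point vectors, and seed ids produced by the tree search; the bookkeeping for n, m, and r is elided:

import heapq
import numpy as np

def best_first_search(graph, points, query, seeds, max_accessed=1000, k=10):
    # A priority queue Q replaces the FIFO queue of breadth-first traversal,
    # so the unexpanded candidate closest to the query is always expanded next.
    dist = lambda i: float(np.linalg.norm(points[i] - query))
    queue = [(dist(s), s) for s in seeds]        # priority queue Q of (distance, id)
    heapq.heapify(queue)
    visited = {s for _, s in queue}
    results, accessed = [], 0
    while queue and accessed < max_accessed:     # upper threshold on accessed points
        d, u = heapq.heappop(queue)
        results.append((d, u))
        accessed += 1
        for v in graph[u]:                       # enqueue the unvisited neighbors
            if v not in visited:
                visited.add(v)
                heapq.heappush(queue, (dist(v), v))
    return sorted(results)[:k]                   # the k best candidates found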


Friday, April 23, 2021

PageRank algorithm (to determine the quality of web search results) explained briefly:

PageRank has been tested by an entire web search engine, from the company whose name became a verb. At the heart of this algorithm is a technique that uses the structure of inbound and outbound links to calculate the rank recursively. A given web page is linked from a set of pages, and each of those origins is probably unaware of the significance of this page. If an origin has N outbound links, it assigns a score of 1/N to the link pointing at the given page. If there are no links, this is equivalent to assigning a score of 0. The scaling factor for a link is therefore inversely proportional to the total number of outbound edges of the originating page. This allows the rank of the given page to be computed as a simple sum of the scaled ranks of all the pages that link to it. The sum must be adjusted by a constant factor to accommodate pages that have no forward links.

Some special cases are accommodated by introducing new terms into the technique mentioned here, but the ranking is as simple as that. One case that exposes a flaw needing redress is when two pages link only to each other and a third page links to one of them. This loop will accumulate rank but never distribute any, because there are no outbound edges. This is called a rank sink, and it is overcome by adding a term to the ranking called the rank source, which is adjusted by the same constant that was introduced for pages with no outward edges.
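
In the notation of the original PageRank paper, with B_u the set of pages linking to page u, N_v the number of outbound links of page v, and E the rank source, this reads:

$$R'(u) = c \sum_{v \in B_u} \frac{R'(v)}{N_v} + c\,E(u)$$

where the constant c is chosen so that the ranks sum to one.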

Together with the accumulation of rank and the rank source, this calculation gives the rank of the current page. We also know the final state of the distributed ranking over a set of pages, because the ranks stabilize to a distribution in which the overall scores are normalized. Even though the calculation is recursive, since the ranks of the originating pages are unknown, it can be carried out with a starting set of values and an adjustment at each iteration that progresses toward the desired state. The iterations stop when the error falls below a threshold.
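
A minimal power-iteration sketch in plain Python, assuming a dict that maps each page to its outbound links; the damping factor stands in for the constant and the rank-source term, so this is an illustration rather than the production algorithm:

def pagerank(links, damping=0.85, tol=1e-8, max_iter=100):
    pages = sorted(links)
    n = len(pages)
    index = {p: i for i, p in enumerate(pages)}
    rank = [1.0 / n] * n                         # start from a uniform distribution
    for _ in range(max_iter):
        new = [(1.0 - damping) / n] * n          # rank-source term for every page
        for p, outs in links.items():
            if outs:                             # distribute rank over outbound links
                share = damping * rank[index[p]] / len(outs)
                for q in outs:
                    new[index[q]] += share
            else:                                # dangling page: spread its rank evenly
                for i in range(n):
                    new[i] += damping * rank[index[p]] / n
        done = sum(abs(a - b) for a, b in zip(new, rank)) < tol
        rank = new
        if done:                                 # error below threshold: stop iterating
            break
    return dict(zip(pages, rank))

print(pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]}))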

One issue with this model is dangling links, which are links that point to pages with no outbound links. There are many of these. Fortunately, they do not significantly affect the calculation of the page rank, so they can be removed during the calculation and added back afterwards.

The match of the search terms improves the precision, and the page rank improves the quality. 


Comparisons between Page Rank algorithm and Microsoft Bing Algorithm: 

The following comparison is drawn from industry observations of how the algorithms are used rather than from their technical differences.

Bing utilizes the structure of a page and the metadata it gathers from the page prominently. Google infers the keywords and their relationships. 

Bing uses website and content age as an indicator of authority. Its web page index might be rebuilt periodically, roughly every three months. Google surfaces fresh pages when relevance and content authority remain the same.

Google favors backlinks and evaluates their quality and quantity. Bing might be utilizing an internal backlink measure built on anchor text and social usage signals.

Google favors text over images, while Bing favors web pages with images. Certain HTML5 features appear to be ignored by Google, while Bing can recognize technologies like Flash.

Bing might not read the whole page, but Google crawls through all the links before the ranking is calculated.

Bing might not index all web pages, particularly when there are no authoritative backlinks, but Google crawls through and includes pages with dangling links.

Bing leverages ML algorithms for a much larger percentage of search queries than Google does. 

Thursday, April 22, 2021

Utilizing public cloud infrastructure for text summarization service:

Introduction: A text summarization service exists that utilizes a word-embeddings model. The API for the service backs a portal where users can enter text directly or upload files. This article investigates migrating the web service to the public cloud using managed cloud services.

Description: Some planning is involved when migrating such a service to the public cloud. First, the deployment must move from the existing legacy virtual machine to one that is ML-friendly with a GPU, such as the “Standard_NC6” size. There is a BERT notebook that allows the selection of a GPU suitable for NLP processing. Second, a training dataset is required if we plan to use BERT. Other models are available for the NLP service, such as the PMI model, but it is typical to pick one and use it with the web service. BERT helps with transfer learning, where knowledge gained from earlier training is applied to novel cases not encountered during training.

There are specifically three different types of NLP services available from Azure. These are:

Azure HDInsight with Spark and Spark MLlib

Azure Databricks

Microsoft Cognitive Services

Since we want a prebuilt model to use with our web service, we can use Microsoft Cognitive Services. If we were to create a custom model, we would use Azure HDInsight with Spark MLlib and Spark NLP, which also provides low-level tokenization, stemming, lemmatization, TF-IDF, and sentence-boundary detection. Cognitive Services does not support large documents or big data sets.

Cognitive services provide the following APIs:

Linguistic Analysis API - for low-level NLP processing such as tokenization and part-of-speech tagging

Language Understanding Intelligent Service (LUIS) API - for entity/intent identification and extraction

Text Analytics API - for topic detection, sentiment analysis, and language detection

Bing Spell Check API - for spell checking

Out of these, we only need the Text Analytics API: it extracts key phrases, and if we rank them we can perform text summarization. The ranking may not be like those from word embeddings with a softmax classifier, but we don't have to calculate similarity distances between key phrases. Instead, we let key phrase extraction give us the terms and then extract sentences containing those phrases. There is no text summarization web service API in Azure, but there is an Agolo service in the Azure marketplace that provides NLP summarization for enterprises; Agolo services summarize news feeds. Azure Databricks does not have an out-of-box NLP service but provides the infrastructure to create one. The Mphasis DeepInsights text summarizer on the AWS marketplace provides text summarization in three sentences of approximately 30 words for text snippets of up to 512 words.

curl -v -X POST "https://westus2.api.cognitive.microsoft.com/text/analytics/v3.0/keyPhrases?model-version={string}&showStats={boolean}" \
  -H "Content-Type: application/json" \
  -H "Ocp-Apim-Subscription-Key: {subscription key}" \
  --data-ascii "{body}"
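
A sketch of how the portal might call this API and rank sentences by key-phrase hits, using the Python requests library; the period-based sentence splitting and the counting heuristic are deliberate simplifications:

import requests

ENDPOINT = "https://westus2.api.cognitive.microsoft.com/text/analytics/v3.0/keyPhrases"

def summarize(text, subscription_key, max_sentences=3):
    # Ask the Text Analytics API for the key phrases of the document.
    body = {"documents": [{"id": "1", "language": "en", "text": text}]}
    headers = {"Ocp-Apim-Subscription-Key": subscription_key,
               "Content-Type": "application/json"}
    resp = requests.post(ENDPOINT, headers=headers, json=body)
    resp.raise_for_status()
    phrases = resp.json()["documents"][0]["keyPhrases"]
    # Rank sentences by how many key phrases they contain and keep the top few.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    ranked = sorted(sentences,
                    key=lambda s: sum(p.lower() in s.lower() for p in phrases),
                    reverse=True)
    return ". ".join(ranked[:max_sentences]) + "."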


Wednesday, April 21, 2021

Implementing a service for a mobile application using public cloud services:

Introduction: We discussed an automation of the device synchronization service using a queue, a database, and a parallel-task library. This article discusses the public cloud services that provide a rich framework for writing a device synchronization service.

We look at some of the options to write this service using public cloud features. We have the same requirements as last time: a data model, push release, and controllers. If we took the queue, the database, and the parallel-task library, we could find corresponding offerings from the public cloud in the global database, queue storage, service bus, WebJobs, and Functions. But we move up the Azure stack to make use of more customized and integrated services. For example, Azure provides the following services for mobile computing:

API Management, which publishes APIs to developers, partners, and employees securely and at scale

Azure Notification Hubs, which sends push notifications to any platform from any backend

Azure Cognitive Search, a cloud-based search that uses AI models

Azure Cognitive Services, which adds smart API capabilities to enable contextual interactions

Azure Spatial Anchors, which creates multi-user, spatially aware mixed reality experiences

App Service, which creates cloud apps for web and mobile

Azure Maps, which adds location context to data with its APIs

Azure Communication Services, which powers Microsoft Teams

Among these, Notification Hubs, which sends push notifications to any platform from any backend, has several features that target our requirements. It reaches all major platforms, including iOS, Android, Windows, Kindle, and Baidu. It works with any backend, in the cloud or on-premises. It can push to millions of devices with a fast broadcast from a single API call. It can customize push notifications by customer, language, and location, dynamically define and notify customer segments, and scale instantly to millions of mobile devices. It carries a variety of compliance certifications. It can target any audience with dynamic tags. It works with the Apple Push Notification Service, Google Cloud Messaging, and the Microsoft Push Notification Service. It makes localization easier with templates, is designed for massive scale, and can be enhanced with security and backup services.

Usually, multiple backend systems want to send push notifications to the mobile applications. The push delivery system remains consistent across all the backend systems and the notification consumers. It involves a Service Bus for publishing and subscribing to events; the subscribers are mobile backend systems that translate the events into push notifications. It also involves a notification hub where applications register and then receive notifications.

This workflow is described by the following steps:

SendMessage(connectionString)

ReceiveMessageAndSendNotification(connectionString)

InitNotificationAsync(), which is run from a store application that receives notifications from the WebJobs in the backend systems and sends out notifications on a notification channel for applications (a sketch of the first two steps follows).
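
A sketch of the first two steps with the azure-servicebus Python SDK; the topic and subscription names are hypothetical, and notify() is a placeholder for the Notification Hubs call:

from azure.servicebus import ServiceBusClient, ServiceBusMessage

TOPIC = "device-sync-events"                 # hypothetical topic name
SUBSCRIPTION = "mobile-backend"              # hypothetical subscription name

def notify(payload):
    # Placeholder: a real system would call the Notification Hubs API here.
    print("push:", payload)

def send_message(connection_string):
    # Publish a device-sync event to the Service Bus topic.
    with ServiceBusClient.from_connection_string(connection_string) as client:
        with client.get_topic_sender(topic_name=TOPIC) as sender:
            sender.send_messages(ServiceBusMessage("device-data-updated"))

def receive_message_and_send_notification(connection_string):
    # Subscribe to the topic and translate each event into a push notification.
    with ServiceBusClient.from_connection_string(connection_string) as client:
        with client.get_subscription_receiver(
                topic_name=TOPIC, subscription_name=SUBSCRIPTION) as receiver:
            for msg in receiver.receive_messages(max_wait_time=5):
                notify(str(msg))
                receiver.complete_message(msg)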

Tuesday, April 20, 2021

Matrix factorization for recommendation systems.

What is a recommendation system?  

A recommendation system provides choices when a user makes a purchase or a request. For example, this system shows the user what choices others make so that the user can be better informed. Another example is when the system shows the user what other items are like the one being requested.  

Where can I use a recommendation system?  

Recommendations include suggestions for online shopping, suggesting interesting websites, or finding items such as movies or music. Since it can be based on user preferences or item similarity, the choices are not restricted to any domain. This kind of recommendation system when applied to any catalog in any business, performs as a collaborative filter.   
There are certain filtering cases where divulging which users go with which preferences is helpful to the end-user. At other times, it is preferable to use item-based similarity. The similarity scores are computed in both cases. All other considerations being equal, the item-based approach is better for sparse datasets; user-based and item-based approaches perform similarly on dense datasets.

How does it work?   

To make a recommendation, first a group sharing similar tastes is found, and then the preferences of the group are used to make a ranked list of suggestions. This technique is called collaborative filtering. A common data structure for keeping track of people and their preferences is a nested dictionary. This dictionary could use a quantitative ranking, say on a scale of 1 to 5, to denote the preferences of the people in the selected group. To find similar people to form a group, we use some form of similarity score. One way to calculate this score is to plot the items that people have ranked in common and use the rankings as axes in a chart; people who are close together on the chart can then form a group.
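
A minimal sketch of this idea, with a hypothetical nested dictionary of rankings and a similarity score based on inverse Euclidean distance over the items ranked in common:

from math import sqrt

ratings = {
    "Alice": {"Movie A": 4, "Movie B": 3, "Movie C": 5},
    "Bob":   {"Movie A": 5, "Movie B": 1},
    "Carol": {"Movie B": 4, "Movie C": 4},
}

def similarity(prefs, p1, p2):
    # Score from the items two people ranked in common; 1.0 means identical tastes.
    shared = [item for item in prefs[p1] if item in prefs[p2]]
    if not shared:
        return 0.0
    sum_sq = sum((prefs[p1][i] - prefs[p2][i]) ** 2 for i in shared)
    return 1.0 / (1.0 + sqrt(sum_sq))

print(similarity(ratings, "Alice", "Bob"))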

What is matrix factorization?   

User choices are often represented as a matrix. For example, Netflix provided a training dataset of over 100 million ratings from nearly half a million users on over seventeen thousand movies. Given this matrix of ratings between m users and n movies, matrix factorization decomposes it into an m x k user matrix and a k x n movie matrix, yielding a recommendation system that leverages the latent features between the two entities.
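
A minimal numpy sketch of this decomposition, using gradient descent over the observed entries; the toy matrix, the rank k, and the learning rate are all illustrative choices:

import numpy as np

def factorize(R, k=2, steps=5000, lr=0.005, reg=0.02):
    # Decompose an m x n ratings matrix R into U (m x k) and V (k x n) so that
    # U @ V approximates R on the observed (nonzero) entries.
    m, n = R.shape
    rng = np.random.default_rng(0)
    U, V = rng.random((m, k)), rng.random((k, n))
    observed = [(i, j) for i in range(m) for j in range(n) if R[i, j] > 0]
    for _ in range(steps):
        for i, j in observed:
            err = R[i, j] - U[i, :] @ V[:, j]
            u_row = U[i, :].copy()               # keep the old row for the V update
            U[i, :] += lr * (err * V[:, j] - reg * U[i, :])
            V[:, j] += lr * (err * u_row - reg * V[:, j])
    return U, V

R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)       # 0 marks an unrated movie
U, V = factorize(R)
print(np.round(U @ V, 1))                       # the dense product predicts the blanks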

How does Matrix factorization improve recommendation systems?   

Matrix factorization is a technique for solving a system of linear equations. Decomposition helps solve for x in the system Ax = b, where A is the given matrix and b is the set of recommendations from the participants. The benefit of matrix factorization is that it can be applied to sparse matrices, where not all users rate all movies and do so only selectively.

Given a user rating of movies in a matrix, it can be decomposed into a dot product of a user embedding matrix and a movie embedding matrix. Embedding refers to the latent features corresponding to that entity. The user latent features and movie latent features can then be looked up for a given movie-user rating. The embedding matrices also serve as input to other neural net layers, some of which can be non-linear and introduce further refinements to predictions. Other matrix factorization techniques that utilize the initial matrix form include principal component analysis and singular value decomposition. The former brings out the salient contributors in the ratings, while the latter brings out the salient preferences of a user, such as 50% comedy fan + 50% romance fan, where that user is only sensitive to these latent factors.
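
For contrast with the gradient-descent sketch above, a truncated SVD keeps only the k strongest latent factors of a toy ratings matrix:

import numpy as np

R = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [0, 1, 5, 4],
              [1, 0, 4, 5]], dtype=float)
U, s, Vt = np.linalg.svd(R, full_matrices=False)     # R = U diag(s) Vt
k = 2                                                # keep two latent factors
print(np.round(U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :], 1))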

When the model is trained on these ratings by users on movies, it can be used to predict the rating of a new movie by finding how similar it is to the other movies based on the user preferences.  

What are the different types of Matrix Factorization?  

Matrix factorizations differ in the form of the factors, such as lower triangular with upper triangular, or a matrix with its transpose, among others. They also differ in the method by which the factors are computed. One form performs well for a variety of tasks: the Cholesky factorization, whose advantages include the following (a small solver sketch appears after the list):

  • Simpler solving

  • Fewer iterations

  • The transformation from uncorrelated to correlated normal variables

  • More efficient than LU and QR decomposition

  • Best for positive definite matrices

  • Always a good bet for sparse data
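
A small solver sketch with numpy, assuming a symmetric positive definite matrix: factor A = L L^T, then solve two triangular systems by forward and back substitution.

import numpy as np

A = np.array([[4.0, 2.0],
              [2.0, 3.0]])            # symmetric positive definite
b = np.array([1.0, 2.0])
L = np.linalg.cholesky(A)             # lower triangular factor, A = L @ L.T
y = np.linalg.solve(L, b)             # forward substitution: L y = b
x = np.linalg.solve(L.T, y)           # back substitution: L^T x = y
print(x, A @ x)                       # A @ x recovers b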

How is QR decomposition different from LU decomposition?  

QR decomposition is used to solve the least squares problem, which addresses a system of linear equations Xb = Y where the inverse of X cannot be calculated, and instead an approximation is taken such that the sum of squared deviations is minimized. The QR form is one where R is an upper triangular matrix. LU decomposition results in a lower triangular matrix and an upper triangular matrix. Gaussian elimination is a popular way to compute an LU decomposition. It involves adding multiples of one row to another, swapping two rows, and scaling a row to arrive at the lower and upper triangular factors.
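
A least squares sketch with numpy QR, assuming a tall toy system Xb ≈ Y where the intercept is modeled by a column of ones:

import numpy as np

X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])            # column of ones models the intercept
Y = np.array([2.1, 3.9, 6.2, 7.8])
Q, R = np.linalg.qr(X)                # X = Q R with R upper triangular
b = np.linalg.solve(R, Q.T @ Y)       # solve R b = Q^T Y
print(b)                              # coefficients minimizing squared deviations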

Monday, April 19, 2021

Quantum computing:  

Introduction: Certain computational tasks can be executed exponentially faster on quantum computers than on classical ones. The noted physicist Richard Feynman observed that it is exponentially costly to simulate quantum systems with classical computers. He envisioned a few benchmarks. First, the computational space must be large and the error rate small, so that a quantum speedup can be observed. Second, a problem that is hard for classical computers must be easy for quantum computers. Questions framed by such benchmarks can be executed on a superconducting-qubit processor. Algorithms that take full advantage of quantum computing remain a nascent field.


In quantum computing, the basic unit of quantum information is a qubit, a two-state quantum mechanical system. In classical systems, a bit must be 0 or 1. Quantum mechanics allows the qubit to be in a coherent superposition of both states simultaneously, which lets a qubit hold far more information than two discrete states. This property is fundamental to quantum mechanics and quantum computing. For example, the polarization of a single photon can have both vertical and horizontal components, and the spin of an electron can be taken as spin up and spin down. A first step toward quantum data storage was demonstrated with the coherent transfer of a superposition state in an electron spin “processing” qubit to a nuclear spin “memory” qubit. Two classical bits of information, such as 00, 01, 10, or 11, can be transmitted by sending only one qubit from a sender, say Alice, to a receiver, say Bob. This is called superdense coding. It requires Alice to perform one of four quantum gate operations to prearrange the measurement Bob makes. Bob then decodes the information by receiving Alice's qubit, operating on the entangled pair, and measuring both qubits. Since both qubits are required to decode the information, the scheme mitigates spoofing, tampering, repudiation, information disclosure, denial of service, and elevation of privilege, the categories of the STRIDE model of threat analysis. An eavesdropper, Eve, cannot figure out Alice's qubit or Bob's qubit, and any such attempt would alert both Alice and Bob.
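
A self-contained numpy sketch of superdense coding that represents the two-qubit state as a vector of four amplitudes; this is a toy simulation of the protocol, not code for a real quantum device:

import numpy as np

I2 = np.eye(2)
X = np.array([[0, 1], [1, 0]])                  # bit flip (Pauli-X)
Z = np.array([[1, 0], [0, -1]])                 # phase flip (Pauli-Z)
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)    # Hadamard
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]])

def superdense(bits):
    # Prepare the shared Bell pair (|00> + |11>)/sqrt(2).
    state = CNOT @ np.kron(H, I2) @ np.array([1, 0, 0, 0], dtype=complex)
    # Alice encodes her two bits with one of four gates on her qubit alone.
    encode = {"00": I2, "01": X, "10": Z, "11": X @ Z}
    state = np.kron(encode[bits], I2) @ state
    # Bob decodes by undoing the entangling circuit and measuring both qubits.
    state = np.kron(H, I2) @ CNOT @ state
    return format(int(np.argmax(np.abs(state) ** 2)), "02b")

for bits in ["00", "01", "10", "11"]:
    print(bits, "->", superdense(bits))         # each round trip returns the input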


A system of n components requires only n bits in classical physics, whereas in quantum physics it requires 2^n complex numbers, which form what is referred to as the Hilbert space. The quantum state of a qubit can be represented as a linear superposition of its basis vectors, usually called the ket-0 and ket-1 vectors. This transforms the original information on n components into a high-dimensional linear vector in Hilbert space.
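
In ket notation, a single qubit and an n-qubit register are written as follows (standard definitions, shown here for concreteness):

$$|\psi\rangle = \alpha\,|0\rangle + \beta\,|1\rangle, \qquad |\alpha|^2 + |\beta|^2 = 1$$

$$|\psi\rangle = \sum_{x \in \{0,1\}^n} \alpha_x\,|x\rangle, \qquad \sum_{x} |\alpha_x|^2 = 1$$

The n-qubit state thus carries 2^n complex amplitudes, one per basis vector.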

Representing vectors and matrices and factorizing matrices are well established in classical systems, and these techniques can be repurposed in Hilbert space. Newer algorithms for factoring a number, such as Shor's algorithm, are faster than their predecessors, but realizing the full potential of quantum computing for Shor's algorithm still requires overcoming certain obstacles with fault-tolerant logical qubits.