Tuesday, April 27, 2021

Synchronization of state with remote (continued...)


Another mechanism to keep state in sync between local and remote stores is the publisher-subscriber model. This model assumes that there is a master copy of the data, maintained by the publisher, and that updates can flow in both directions, allowing the publisher to update the data for the subscribers and vice versa.

The publisher is responsible for determining which datasets have external access and when they are made available; such datasets are called publications. Different scopes of datasets can be published to different subscribers, and the scope can be specified at runtime with the help of parameters. In such cases, subscribers map to different partitions of the data. If the data overlaps, then the subscribers see a shared state. Near real-time sharing of publications across subscribers on overlapping data is possible with the help of versioning. Conflicts between competing versions of an update can be resolved with a latest-write-wins strategy.
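To make the scoping and versioning concrete, here is a minimal sketch in Python, assuming a hypothetical Publication class with predicate-based scoping and timestamp versions; it only illustrates how scoped publications and a latest-write-wins policy behave and is not any particular product's API.

# Scoped publications with versioned records and a latest-write-wins policy.
# The class, predicate, and timestamp versioning are illustrative assumptions.
import time

class Publication:
    def __init__(self, name, predicate):
        self.name = name
        self.predicate = predicate      # scoping rule, e.g. by region prefix
        self.records = {}               # key -> (version, value)

    def publish(self, key, value, version=None):
        version = version or time.time()
        current = self.records.get(key)
        if current is None or version > current[0]:   # latest write wins
            self.records[key] = (version, value)

    def snapshot_for(self, subscriber_params):
        # a subscriber only sees the rows that fall within its scope
        return {k: v for k, (ver, v) in self.records.items()
                if self.predicate(k, subscriber_params)}

# Usage: two subscribers scoped to different partitions of the same publication
orders = Publication("orders", lambda key, params: key.startswith(params["region"]))
orders.publish("us-1001", {"total": 40})
orders.publish("eu-2001", {"total": 75})
print(orders.snapshot_for({"region": "us"}))   # {'us-1001': {'total': 40}}
print(orders.snapshot_for({"region": "eu"}))   # {'eu-2001': {'total': 75}}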

Common synchronization configurations also vary quite widely. A configuration refers to the arrangement of publisher and subscriber data. The publisher-subscriber model allows both peer-to-peer and hierarchical configurations. Two hierarchical configurations are quite popular: the network (tree) topology and the hub-and-spoke topology. Both are useful when there are many subscribers. Unlike the hierarchical configuration, a peer-to-peer configuration does not have a single authoritative data store, and data updates do not necessarily reach all the subscribers. Peer-to-peer configurations are best suited for a small number of subscribers. Some of the challenges with peer-to-peer configurations include maintaining data integrity, implementing conflict detection and resolution, and programming the synchronization logic. These are generally handled by consensus algorithms such as Paxos, together with some notion of message sequencing such as vector clocks, and by gossip protocols for propagation.
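As an illustration of the sequencing problem in peer-to-peer configurations, the sketch below uses vector clocks to detect when two peers make concurrent updates to the same record; the peer names are hypothetical and the resolution policy is left to the application.

# Vector clocks for detecting conflicting updates between peers.
def increment(clock, node):
    clock = dict(clock)
    clock[node] = clock.get(node, 0) + 1
    return clock

def compare(a, b):
    """Return 'before', 'after', 'equal', or 'concurrent' (a conflict)."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"

# Two peers update the same record independently -> concurrent versions
v0 = {}
v1 = increment(v0, "peerA")   # peerA writes
v2 = increment(v0, "peerB")   # peerB writes without seeing peerA's update
print(compare(v1, v2))        # 'concurrent' -> needs conflict resolution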

Efficiency in data synchronization in these configurations and architectures comes from determining what data changes, how to scope it, and how to reduce the traffic associated with propagating the change. It is customary to have a synchronization layer on the client, a synchronization middleware on the server, and a network connection during the synchronization process that supports bidirectional updates. The basic synchronization process involves the initiation of synchronization, either on demand or on a periodic basis; the preparation of data and its transmission to a server with authentication; the execution of the synchronization logic on the server side to determine the updates and the transformations; the persistence of the changed data through a data adapter to one or more data stores; the detection and resolution of conflicts; and finally the relaying of the results of the synchronization back to the client application.
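A minimal sketch of the server-side portion of this process is shown below; the authentication, data adapter, transformation, and conflict-resolution components are assumed to be supplied by the application, and the names are illustrative rather than any particular middleware's API.

# Server-side synchronization pipeline: authenticate, resolve conflicts,
# transform, persist through a data adapter, and return results plus the
# server-side updates the client has not yet seen.
class SyncMiddleware:
    def __init__(self, authenticate, data_adapter, transform, resolve_conflict):
        self.authenticate = authenticate
        self.data_adapter = data_adapter          # wraps one or more data stores
        self.transform = transform
        self.resolve_conflict = resolve_conflict  # e.g. latest-write-wins

    def synchronize(self, credentials, client_changes, last_sync_version):
        if not self.authenticate(credentials):
            raise PermissionError("authentication failed")
        applied = []
        for change in client_changes:
            server_row = self.data_adapter.get(change["key"])
            if server_row and server_row["version"] > last_sync_version:
                change = self.resolve_conflict(server_row, change)
            self.data_adapter.put(change["key"], self.transform(change))
            applied.append(change["key"])
        server_updates = self.data_adapter.changes_since(last_sync_version)
        return {"applied": applied, "updates": server_updates}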

Monday, April 26, 2021

Synchronization of state with remote


Introduction: Persistent data storage enables users to access enterprise data without being connected to the network, but it is prone to becoming stale. A bidirectional refresh with the master data is required, and one way of achieving it is periodic synchronization, a technique to propagate updates to the data on both the local and the remote stores. In this article, we review the nuances of such synchronization.

Description: The benefits of synchronization over an always-online solution are quite clear: reduced data transfer over the network, reduced load on the enterprise server, faster data access, and increased control over data availability. It is less well understood that there are different types of synchronization depending on the type of data. For example, synchronization may be initiated for personal information management (PIM) data such as email and calendar entries, as opposed to application files. The latter can be considered artifacts that artifact-independent synchronization services can refresh. Several such products are available, and they do not require user involvement for a refresh. This means one or more files and applications can be set up for synchronization on remote devices, although these are usually one-way transfers.

Data synchronization, on the other hand, performs a bidirectional exchange, and sometimes a transformation, between two data stores. This is the focus of this article. The server data store is usually larger because it holds data for more than one user, and the local data store is usually limited by the size of the mobile device. The data transfer occurs over a synchronization middleware or layer: the middleware is set up on the server while the layer is hosted on the client. This is the most common way for smart applications to access corporate data.

Synchronization might be treated as a web service with the usual three tiers comprising the client, the middle tier, and enterprise data. When the data is synchronized between an enterprise server and a persistent data store on the client, a modular layer on the client can provide a simple, easy-to-use client API to control the process with little or no interaction from the client application. This layer may need to be written or rewritten natively for the host depending on whether the client is a mobile phone, a laptop, or some other device. With a simple invocation of the synchronization layer, a client application can expect the data in the local store to be refreshed.
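As a sketch of what such a client layer could look like, the class below assumes a local SQLite store with pending_changes and local_data tables and a single synchronization endpoint; the class name, endpoint, and payload shapes are illustrative.

# Client-side synchronization layer: gather local changes, send them to the
# middleware with credentials, and apply the returned server updates locally.
import json, sqlite3, urllib.request

class SyncLayer:
    def __init__(self, server_url, local_db_path, credentials):
        self.server_url = server_url
        self.db = sqlite3.connect(local_db_path)
        self.credentials = credentials

    def synchronize(self):
        changes = self.db.execute(
            "SELECT key, value, version FROM pending_changes").fetchall()
        payload = json.dumps({"credentials": self.credentials,
                              "changes": [list(c) for c in changes]}).encode()
        req = urllib.request.Request(self.server_url, data=payload,
                                     headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req) as resp:
            updates = json.loads(resp.read())["updates"]
        for key, value, version in updates:
            self.db.execute(
                "REPLACE INTO local_data(key, value, version) VALUES (?, ?, ?)",
                (key, value, version))
        self.db.commit()
        return len(updates)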

The synchronization middleware resides on the server, and this is where the bulk of the synchronization logic is written. There can be more than one data store behind the middleware on the server side, and there can be more than one client on the client side. Some of the typical features of this server-side implementation include data scoping, conflict detection and resolution, data transformation, data compression, and security. These features must be provided while maintaining server performance and scalability. Two common forms of synchronization middleware are a standalone server application and a servlet running in a servlet engine. The standalone server is more tightly coupled to the operating system and provides better performance for large data. J2EE application servers rely on an external servlet engine and are better suited for high-volume, low-payload data changes.

The last part of this synchronization solution is the data backend. While it is typically internal to the synchronization server, it is called out because it might involve more than one data store, technology, and access mechanism, such as object-relational mapping.


Sunday, April 25, 2021

Planning for onboarding an existing text summarization service to Azure public cloud

Problem statement: This article is a continuation of an exploration for the proper commissioning of a text summarization service in the Azure public cloud. While the earlier article focused on the technical options available to expand and implement the text summarization service including the algorithm involved and its evaluation and comparison to a similar service on a different cloud, this article is specifically for onboarding the service to Azure public cloud in the most efficient, reliable, available and cost-effective manner. It follows up on the online training taken towards a certification in Windows Azure Fundamentals.

Article: Onboarding a service such as this one to the Azure public cloud is all about improving its deployment: using the proper subscription, planning for capacity and demand, optimizing the Azure resources, monitoring the service health, setting up management groups, access control, security and privacy for the service, and setting up pricing controls and support options. We look at these in more detail now.

Proper subscription: Many of the rate limits, quotas, and service availability levels are quite sufficient in the very first subscription tier. The Azure management console has a specific set of options to determine the scale required for the service.

Resource and resource groups: The allocation of a resource group, identity, and access control is certainly a requirement for onboarding a service. It is equally important to use the pricing calculator and the TCO calculator in the Azure public cloud to determine the costs. Some back-of-the-envelope calculations in terms of bytes per request, number of requests per second, latency, recovery time and recovery point objectives, MTTR, and MTBF help with determining the requirements and the resource management.
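A back-of-the-envelope estimate of this kind might look like the following; all the numbers are assumptions standing in for measured or projected figures for the text summarization service.

# Rough capacity estimate from assumed traffic figures.
requests_per_second = 50            # assumed peak load
bytes_per_request   = 8 * 1024      # assumed average document payload (8 KB)
latency_seconds     = 0.2           # assumed per-request latency target

bandwidth_bytes_per_sec = requests_per_second * bytes_per_request
concurrent_requests     = requests_per_second * latency_seconds   # Little's law
monthly_requests        = requests_per_second * 3600 * 24 * 30

print(f"bandwidth: {bandwidth_bytes_per_sec / 1024:.0f} KB/s")      # ~400 KB/s
print(f"concurrent requests in flight: {concurrent_requests:.0f}")  # ~10
print(f"requests per month: {monthly_requests:,}")                  # ~129,600,000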

Optimizing the Azure resources: Much of this is automated by the platform. If we are deploying a Python Django application and a Node.js frontend application, then it is important to make use of the API gateway, load balancer, proxy and scalability options, certificates, domain name resources, and so on. The use of resources specific to the service, as well as those that enhance its abilities, should be methodically checked off against a checklist drawn from the Azure management portal.

Monitoring the service health: Metrics specific to the text summarization service, such as the size of the text condensed, the mode of delivery, the number of documents submitted to the system, and the distribution statistics of the load on the service, help determine when the service requires additional resources or when something goes wrong. Alerts can be set up on thresholds so that we can remain passive until an alert fires.
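For illustration, a simple health probe of the kind that could feed such alerts is sketched below; the endpoint and threshold are hypothetical, and in practice these readings would be pushed to Azure Monitor as custom metrics with alert rules on top.

# Minimal availability/latency probe for the summarization service.
import time, urllib.request

SERVICE_URL = "https://example-summarizer.azurewebsites.net/health"  # hypothetical
LATENCY_THRESHOLD_SECONDS = 1.0

def probe():
    start = time.time()
    try:
        with urllib.request.urlopen(SERVICE_URL, timeout=5) as resp:
            healthy = resp.status == 200
    except Exception:
        healthy = False
    latency = time.time() - start
    if not healthy or latency > LATENCY_THRESHOLD_SECONDS:
        print(f"ALERT: healthy={healthy}, latency={latency:.2f}s")
    return healthy, latency

if __name__ == "__main__":
    probe()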

Management group, identity, and access control: Even if there is only one person in charge of the service, setting up a management group along with user and access control formalizes the role and detaches it from the person, so that anyone else can take on the administrator role. This option also helps set up registrations and notifications for that account so that it is easier to pass the responsibility around.

Security and privacy: The text summarization service happens to be a stateless, transparent transformer and transfer-learning service that does not retain any data from the customer, so it does not need any further actions towards security and privacy. TLS set up on the service and the use of proper certificates along with domain names will help keep it independent of the underlying compute resource.

Advisor: Azure has an Advisor capability that recommends possible efficiencies in the deployment after the above-mentioned steps have been taken. This helps streamline operations and reduce cost.

Conclusion: Proper service onboarding is critical to the functioning of the service, both in terms of its cost and its benefits. When the public cloud knowledge center articles are followed meticulously for the use of the Azure management portal in deploying the service, the service is well positioned to improve its return on investment.


Saturday, April 24, 2021

Space Partition Tree and Graph (SPTAG) algorithm

Well-known search algorithms such as PageRank can be described by matrix linear equations because they establish ranking in terms of the weights provided by other web pages. Matrices hold the vector representations of web pages, and they support algorithms that determine whether two pages are similar or which pages are the nearest neighbors. While tables are useful for establishing relations, graphs are called natural databases because all kinds of relationships can be overlaid on the nodes as edges. For example, a table can define an ‘is-a’ or ‘has-a’ relationship that is generalized by a join, while graphs can have edges that are distinct for, say, an ‘is-a-friend-of' relationship. Web pages have an inbound and outbound link structure, so both the matrix and graph forms of the web are useful representations.

Bing image search uses the SPTAG algorithm. SPTAG redefines the search in terms of vectors that can be compared with the query vector based on some distance metric such as L2 distance or cosine distance. It provides two methods: kd-tree with relative neighborhood graph and balanced k-means tree with relative neighborhood graph. The former reduces index-building cost and the latter improves search accuracy even with high-dimensional data. SPTAG itself consists of two modules, an index builder and a searcher. The search begins in the space partition trees to find initial starting points, or seeds, which are then used to search the neighborhood graphs. The searches in the trees and graphs are conducted iteratively.

The query-driven iterated neighborhood graph search for large-scale indexing is described this way. An initial solution set is created by searching over trees T that were constructed to index the reference data points. These initial candidates might not be good, but they have a high probability of being near the true nearest neighbors. The candidates are always stored in a priority queue Q, whether the search is over trees or graphs. Next, with the solution set as seeds in the graph G, a search is conducted with neighborhood expansions in a best-first manner. Once local solutions are discovered, the search pauses. Then, with the nearest neighbors identified in the previous step and with the search history, new seeds are generated from the trees. A solution is accepted when the distances to the neighbors are smaller than those for the seed. An indicator variable for the number of unexpanded promising points is used to track whether the local search has arrived at a solution. Some change may be introduced into the seeds according to the query and the search history, and this perturbation helps the search move beyond a local solution. The iterations of perturbation, search, and acceptance are repeated until one of two termination conditions is reached: an upper threshold on the number of accessed points is reached, or the true nearest neighbors are found.

The best-first method mentioned earlier is a modification of breadth-first graph traversal that uses the same priority queue Q and enqueues the neighbors of each dequeued item. The difference between them is that there is now a new criterion based on the number of accessed points n, the number of promising points m, and the number of points r accessed from the trees. This completes the iteration-based neighborhood graph search described here.
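A simplified sketch of such a best-first neighborhood graph search with a priority queue is shown below; the toy graph, the squared-L2 distance, and the stopping rule are illustrative assumptions rather than the exact SPTAG implementation.

# Best-first search over a neighborhood graph starting from tree-supplied seeds.
import heapq

def best_first_search(graph, points, query, seeds, max_accessed=100, k=3):
    """graph: node -> list of neighbor nodes; points: node -> vector."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    queue = [(dist(points[s], query), s) for s in seeds]
    heapq.heapify(queue)
    visited, results, accessed = set(seeds), [], 0
    while queue and accessed < max_accessed:
        d, node = heapq.heappop(queue)     # expand the closest candidate first
        results.append((d, node))
        accessed += 1
        for nbr in graph[node]:
            if nbr not in visited:
                visited.add(nbr)
                heapq.heappush(queue, (dist(points[nbr], query), nbr))
    return sorted(results)[:k]

# Usage with a tiny hand-built neighborhood graph and a single seed
points = {0: (0, 0), 1: (1, 0), 2: (2, 2), 3: (5, 5)}
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
print(best_first_search(graph, points, query=(1, 1), seeds=[3]))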


Friday, April 23, 2021

 Page Rank Algorithm (to determine the quality of web search results) explained briefly: 

PageRank is the ranking algorithm proven at the scale of an entire web search engine, by the company whose name became a verb for searching. At the heart of this algorithm is a technique that uses the structure of inbound and outbound references to calculate the rank recursively. A given web page is linked from a set of pages, and each of those origins is probably unaware of the significance of this page. If an origin has N outbound links, it assigns a score of 1/N for the link to the given web page. If there were no links, it is equivalent to assigning a score of 0. Therefore, the scaling factor for a link is inversely proportional to the total number of outbound edges of the originating page. This allows the rank of the given page to be computed as a simple sum of all the scaled ranks of the pages that link to it. The sum must be adjusted by a constant factor to accommodate the fact that there are pages that may have no forward links.

There are some special cases that are accommodated by introducing new terms to the technique mentioned here, but the ranking is as simple as that. One such case, which exposes a flaw needing redress, is when two pages link only to each other and a third page links to one of them. This loop will accumulate rank but never distribute any rank, because it has no outbound edges. This is called a rank sink, and it is overcome by adding a term to the ranking called the rank source, which is also adjusted by the same constant that was introduced for pages with no outward edges.

Together with the accumulation of rank and the rank source, the ranking calculation allows us to compute the rank of the current page. We also know the final state of the distributed ranking over a set of pages, because the ranks stabilize to a distribution that normalizes the overall scores. Even though the calculation is recursive, since the ranks of the originating pages are unknown, it can be carried out by starting from an initial set of values and adjusting them during each iteration so that they progress towards the desired state. The iterations stop when the error falls below a threshold.
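A small power-iteration sketch of this calculation is shown below; the damping factor of 0.85 is a conventional choice standing in for the constant factor and rank source discussed above, not a value taken from the original description.

# Iterative PageRank: redistribute rank along outbound links scaled by 1/N,
# mix in a rank source, and stop when the change falls below a threshold.
def pagerank(links, damping=0.85, tol=1e-6, max_iter=100):
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(max_iter):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outbound in links.items():
            if not outbound:                      # page with no forward links
                for p in pages:
                    new_rank[p] += damping * rank[page] / len(pages)
            else:
                for target in outbound:
                    new_rank[target] += damping * rank[page] / len(outbound)
        error = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if error < tol:
            break
    return rank

# Usage: page C is linked from both A and B, so it accumulates the most rank
print(pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]}))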

One issue with this model is dangling links, which are links that point to pages with no outbound links. There are several of these. Fortunately, they do not affect the calculation of the page rank, so they are ignored during the calculation and added back afterwards.

The match of the search terms improves the precision, and the page rank improves the quality. 


Comparisons between Page Rank algorithm and Microsoft Bing Algorithm: 

The following comparison is drawn based on industry observations of the usages of the algorithms rather than their technical differences.  

Bing utilizes the structure of a page and the metadata it gathers from the page prominently. Google infers the keywords and their relationships. 

Bing uses website and content age as an indicator for its authority. The web page index might be built periodically every three months. Google shows fresh pages if relevance and content authority remain the same. 

Google favors backlinks and evaluates their quality and quantity. Bing might be utilizing an internal backlink built on anchor text and social engineering usage. 

Google favors text over images while Bing favors web pages with images. Certain HTML5 elements appear to be ignored by Google, while Bing can recognize technologies like Flash.

Bing might not read the whole page, but Google crawls through all the links before the ranking is calculated.

Bing might not index all the web pages particularly when there are no authoritative backlinks but Google crawls through and includes pages with dangling links. 

Bing leverages ML algorithms for a much larger percentage of search queries than Google does. 


Thursday, April 22, 2021

 Utilizing public cloud infrastructure for text summarization service:

Introduction: A text summarization service exists utilizing a word embeddings model. The API for the service serves a portal where users can enter text directly or upload files. This article investigates the migration of the web service to the public cloud using the cloud services.

Description: There is some planning involved in migrating such a service to the public cloud. First, the deployment of the service will have to move from the existing legacy virtual machine to one that is ML-friendly with a GPU, such as Standard_NC6. There is a BERT notebook that allows the selection of a GPU suitable for the NLP processing. Second, a training dataset is required if we plan to use BERT. There are other models available for the NLP service, such as the PMI model, but it is typical to pick one and use it with the web service. BERT helps with transfer learning, where knowledge gained from earlier training is used with novel cases not encountered during training.

There are specifically three different types of NLP services available from Azure. These are:

Azure HDInsight with Spark and Spark MLlib

Azure Databricks and 

Microsoft Cognitive Services

Since we want a prebuilt model to use with our web service, we can use Microsoft Cognitive Services. If we were to create a custom model, we would use Azure HDInsight with Spark MLlib and Spark NLP, which also provides low-level tokenization, stemming, lemmatization, TF-IDF, and sentence-boundary detection. Cognitive Services does not support large documents and big data sets.

Cognitive services provide the following APIs:

Linguistic Analysis API - for low-level NLP processing such as tokenization and part-of-speech tagging.

Language Understanding Intelligent Service (LUIS) API for entity/Intent identification and extraction.

Text Analytics API for topic detection, sentiment analysis, and language detection

Bing Spell check API for spelling check

Out of these we only need the Text Analytics API to extract key phrases, and if we rank the key phrases we can perform text summarization. The ranking may not be like that obtained from word embeddings with a SoftMax classifier, but we don't have to calculate a similarity distance between key phrases. Instead, we let key phrase extraction give us the terms and then extract the sentences containing those phrases. There is no text summarization web service API in Azure, but there is an Agolo service in the Azure marketplace that provides NLP summarization for enterprises; the Agolo service summarizes news feeds. Azure Databricks does not have an out-of-the-box NLP service but provides the infrastructure to create one. The Mphasis DeepInsights text summarizer on the AWS marketplace provides text summarization in three sentences of approximately 30 words for text snippets of about 512 words.

curl -v -X POST "https://westus2.api.cognitive.microsoft.com/text/analytics/v3.0/keyPhrases?model-version={string}&showStats={boolean}" \
-H "Content-Type: application/json" \
-H "Ocp-Apim-Subscription-Key: {subscription key}" \
--data-ascii "{body}"
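Once the key phrases are returned by this endpoint, the sentence-extraction step could be sketched as follows; the scoring rule, which simply counts key phrases per sentence, is an illustrative assumption.

# Extractive summarization: keep the sentences that contain the most key phrases.
import re

def summarize(text, key_phrases, max_sentences=3):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    scored = []
    for i, sentence in enumerate(sentences):
        score = sum(1 for phrase in key_phrases
                    if phrase.lower() in sentence.lower())
        scored.append((score, i, sentence))
    top = sorted(scored, reverse=True)[:max_sentences]    # highest scores first
    return " ".join(s for _, _, s in sorted(top, key=lambda t: t[1]))

text = ("Azure offers several NLP services. Cognitive Services exposes a key "
        "phrase extraction API. Key phrases can be used to pick summary "
        "sentences. The weather was pleasant yesterday.")
print(summarize(text, ["key phrase", "Cognitive Services", "NLP"], max_sentences=2))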


Wednesday, April 21, 2021

Implementing a service for Mobile Application using public cloud services:

Introduction: We discussed an automation for the device synchronization service using a queue, a database, and a parallel-task library. This article discusses the public cloud services which provide a rich framework to write device synchronization service.

We look at some of the options to write this service using public cloud features. We have the same requirements as last time: the data model, push release, and controllers. If we took the queue, the database, and the parallel-task library, then we could find corresponding offerings from the public cloud such as a global database, queue storage, the service bus, WebJobs, and Functions. But we move up the Azure stack to make use of more customized and integrated services. For example, Azure provides the following services for mobile computing: API Management, which publishes APIs to developers, partners, and employees securely and at scale; Azure Notification Hubs, which send push notifications to any platform from any backend; Azure Cognitive Search, a cloud-based search that uses AI models; Azure Cognitive Services, which add smart API capabilities to enable contextual interactions; Spatial Anchors, which create multi-user, spatially aware mixed reality experiences; App Service, which creates cloud apps for web and mobile; Azure Maps, which adds location context to data through its APIs; and Azure Communication Services, which powers Microsoft Teams.

Among these, we find that the Notification Hub, which sends push notifications to any platform from the backend, has several desirable features that target our requirements. It reaches all major platforms, including iOS, Android, Windows, Kindle, and Baidu. It works with any backend, in the cloud or on-premises. It can push to millions of devices with a fast broadcast from a single API call. It can customize push notifications by customer, language, and location. It can dynamically define and notify customer segments and can target any audience with dynamic tags. It can scale instantly to millions of mobile devices. It carries a variety of compliance certifications. It works with the Apple Push Notification Service, Google Cloud Messaging, and the Microsoft Push Notification Service. It makes localization easier with templates. It is designed for massive scale and can be enhanced with security and backup services.

Usually there are multiple backend systems that want to send push notifications to the mobile applications. The push delivery system remains consistent across all the backend systems and the notification consumers. It involves a Service Bus for publishing and subscribing to events; the subscribers are mobile backend systems that translate the events into push notifications. It also involves a notification hub, with which the devices register and from which they receive notifications.

This workflow is described by the following steps:

SendMessage(connectionString)

ReceiveMessageAndSendNotification(connectionString)

InitNotificationAsync() which is run from a store application that receives notifications from the WebJobs in the backend systems and sends out notifications on a notifications channel for applications.
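A rough sketch of the SendMessage and ReceiveMessageAndSendNotification steps, assuming the azure-servicebus Python SDK (v7), a hypothetical queue name, and a stand-in for the notification hub call, might look like this; the workflow above, with WebJobs and a store application, would typically be implemented in .NET instead.

# Publish device-sync events to a Service Bus queue and, on the subscriber
# side, translate them into push notifications.
from azure.servicebus import ServiceBusClient, ServiceBusMessage

CONNECTION_STRING = "<service-bus-connection-string>"   # placeholder
QUEUE_NAME = "device-sync-events"                       # hypothetical queue

def send_message(payload: str) -> None:
    with ServiceBusClient.from_connection_string(CONNECTION_STRING) as client:
        with client.get_queue_sender(QUEUE_NAME) as sender:
            sender.send_messages(ServiceBusMessage(payload))

def receive_message_and_send_notification() -> None:
    with ServiceBusClient.from_connection_string(CONNECTION_STRING) as client:
        with client.get_queue_receiver(QUEUE_NAME, max_wait_time=5) as receiver:
            for msg in receiver:
                push_to_notification_hub(str(msg))      # hypothetical helper
                receiver.complete_message(msg)

def push_to_notification_hub(body: str) -> None:
    print(f"would send push notification: {body}")      # stand-in for the hub call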