Friday, April 30, 2021

Mobile data

Some types of data storage are specific to mobile applications. SQLite is the most popular relational database for mobile devices; it does not provide encryption on its own, but SQLCipher can be layered on top to add it. The Oracle Berkeley database, on the other hand, is an example of a key-value store. Couchbase allows native JSON documents to be stored, while MongoDB Realm and ObjectBox can store objects directly.
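As a minimal sketch of the encryption option, the following assumes the SQLCipher for Android library (net.sqlcipher), which mirrors the standard android.database.sqlite API but takes a passphrase; the database file name and passphrase handling here are purely illustrative.

    import android.content.Context;
    import net.sqlcipher.database.SQLiteDatabase;

    public class EncryptedDbHelper {
        // Opens (or creates) an encrypted SQLite database using SQLCipher.
        public static SQLiteDatabase open(Context context, String passphrase) {
            SQLiteDatabase.loadLibs(context); // load the native SQLCipher libraries first
            return SQLiteDatabase.openOrCreateDatabase(
                    context.getDatabasePath("app.db"), passphrase, null);
        }
    }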

Mobile applications are increasingly becoming smarter on the two popular mobile platforms, each of which has its own app store and finds universal appeal among its customers. The Android platform comes with support for Java programming and the tools associated with software development in this well-established language. Android Studio supports running and debugging the application in emulator mode, which vets the application as thoroughly as testing on a physical device. Modern Android development tools include Kotlin, Coroutines, Dagger-Hilt, Architecture Components, MVVM, Room, Coil and Firebase. The last one offers pre-packaged, open-source bundles of code, called extensions, that automate common development tasks.

The Firebase services that need to be enabled in the Firebase console for our purpose include Phone Auth, Cloud Firestore, Realtime Database, Storage and Composite Indexes. The Android Architecture Components include the following: the Navigation component, which handles in-app navigation with a single Activity; LiveData, whose data objects notify views when the underlying database changes; ViewModel, which stores UI-related data that is not destroyed on UI changes; DataBinding, which generates a binding class for each XML layout file in a module and lets you write code that interacts with views and declaratively binds observable data to UI elements; WorkManager, an API that makes it easy to schedule deferrable, asynchronous tasks that are expected to run even if the app exits or the device restarts; and Room, which provides an object-relational mapping between SQLite tables and plain old Java objects (POJOs).
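As a minimal sketch of Room with LiveData, assuming a hypothetical Note entity and the androidx.room and androidx.lifecycle libraries (each type would live in its own file):

    import java.util.List;
    import androidx.lifecycle.LiveData;
    import androidx.room.Dao;
    import androidx.room.Database;
    import androidx.room.Entity;
    import androidx.room.Insert;
    import androidx.room.PrimaryKey;
    import androidx.room.Query;
    import androidx.room.RoomDatabase;

    @Entity(tableName = "notes")
    public class Note {
        @PrimaryKey(autoGenerate = true) public long id;
        public String title;
    }

    @Dao
    public interface NoteDao {
        @Query("SELECT * FROM notes")
        LiveData<List<Note>> getAll(); // observers are notified when the table changes

        @Insert
        void insert(Note note);
    }

    @Database(entities = {Note.class}, version = 1)
    public abstract class NoteDatabase extends RoomDatabase {
        public abstract NoteDao noteDao();
    }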

Dependency injection is handled via Dagger-Hilt, which brings inversion of control through Dagger's dependency injection, and Hilt's ViewModel support, which injects dependencies into a ViewModel.
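A minimal Hilt sketch, reusing the hypothetical NoteDao from the Room example above (the @Module that actually provides the DAO binding is omitted for brevity):

    import javax.inject.Inject;
    import androidx.lifecycle.ViewModel;
    import dagger.hilt.android.HiltAndroidApp;
    import dagger.hilt.android.lifecycle.HiltViewModel;

    @HiltAndroidApp  // triggers Hilt code generation for the whole app
    public class MyApp extends android.app.Application {}

    @HiltViewModel   // Hilt constructs this ViewModel and injects its dependencies
    public class NotesViewModel extends ViewModel {
        private final NoteDao noteDao;

        @Inject
        public NotesViewModel(NoteDao noteDao) {
            this.noteDao = noteDao;
        }
    }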

The Firebase services used here include Cloud Messaging for sending notifications to the client application, Cloud Firestore for a flexible, scalable NoSQL cloud database to store and sync data, Cloud Storage for storing and serving user-generated content, and Authentication for creating an account with a mobile number.
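As a small illustration of a Cloud Firestore write from the Android client (the collection name and fields are made up for the example):

    import java.util.HashMap;
    import java.util.Map;
    import android.util.Log;
    import com.google.firebase.firestore.FirebaseFirestore;

    public class FirestoreExample {
        private static final String TAG = "FirestoreExample";

        public static void saveNote(String title) {
            FirebaseFirestore db = FirebaseFirestore.getInstance();
            Map<String, Object> note = new HashMap<>();
            note.put("title", title);
            db.collection("notes").add(note) // writes locally and syncs to the cloud
              .addOnSuccessListener(ref -> Log.d(TAG, "Saved as " + ref.getId()))
              .addOnFailureListener(e -> Log.w(TAG, "Save failed", e));
        }
    }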

Kotlin serialization converts classes to and from JSON. Its runtime library provides the core serialization API, and companion libraries support different serialization formats. Coil-Kt is used as an image-loading library for Android, backed by Kotlin Coroutines.

#codingexercise: https://1drv.ms/w/s!Ashlm-Nw-wnWzA9GsoSyfDDANrXf?e=vu6Pah

Thursday, April 29, 2021

Synchronization of state with remote (continued...)

 

The key features of data synchronization include the following:

1.      Data scoping and partitioning:   Typically, the enterprise stores contain far more data than the client devices and their applications need, which calls for scoping the data that needs to be synchronized. There are two ways to do this. First, we restrict the synchronization to only those tables that pertain to the user from that client; any data that does not pertain to the user can be skipped. Second, the data that needs to be synchronized from those tables is minimized.
Partitioning can also be used to reduce the data that is synchronized. As is typical with partitioning, it can be horizontal or vertical. Usually, vertical partitioning is done before horizontal partitioning because it trims the columns that need to be synchronized. The set of columns is easily found by comparing what is required for that user against what is used by the application. After the columns are decided, a filter can be applied to reduce the rowset, again with the help of predicates that involve the user clause. Reducing the scope and partition of the data also reduces the errors introduced by way of conflicts, which improves performance.

2.      Data compression:  Another way to reduce already scoped and partitioned data is to reduce the number of bytes used for its transfer, and data compression does exactly that. It is useful for reducing both time and cost, although it incurs some overhead in the compression and decompression routines. Some of these routines may be more expensive for a mobile device than for the server. Also, some data types are easy to compress while others are not, so compression helps only in cases where compressible data types are used (see the sketch after this list).

3.      Data transformation: By the same argument as above, it is easier for the server to process and transform the data because of its compute and storage resources. Therefore, conversion to a format that is suitable for mobile devices is more easily done on the server side. Such transformations might even include conversion to data types that are compression friendly. Also, numerical data might be converted to strings if the mobile devices find strings easier to handle.

4.      Transactional integrity: This means that either all the changes are committed or none of them are. Transactional changes occur in isolation and should not affect others. Once the transaction is committed, its effects persist even across failures. Maintaining transactional behavior over the network involves retries, which is not efficient; it is easier to enforce transactions within the store. If the synchronization involves remote transactions, then a rollback on the remote requires rolling back the entire synchronization and retrying during the next synchronization. When databases allow net changes to be captured, there is an option to not look at the transaction log if the individual transactions in the log are not important and only the initial and final states matter. If the order of the changes made is also important, then transaction logs matter as well; a transaction log can be read in chronological order and replayed on the destination database.
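As a minimal sketch of the compression step mentioned in item 2, using the standard java.util.zip classes (the payload is illustrative):

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.GZIPOutputStream;

    public class SyncCompression {
        // Compresses a synchronization payload before transfer.
        public static byte[] compress(String payload) throws IOException {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
                gzip.write(payload.getBytes(StandardCharsets.UTF_8));
            }
            return buffer.toByteArray();
        }

        public static void main(String[] args) throws IOException {
            // repetitive, text-like data compresses well; binary or random data may not
            String rows = "id,name,price\n1,widget,2.50\n2,widget,2.50\n3,widget,2.50\n";
            byte[] compressed = compress(rows);
            System.out.println(rows.length() + " bytes -> " + compressed.length + " bytes");
        }
    }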

#codingexercise

Given clock-hand positions for different points of time as pairs A[i][0] and A[i][1], where the order of the hands does not matter but the angle they enclose does, count the number of pairs of points of time where the enclosed angles are the same.

    import java.util.Arrays;

    public static int[] getClockHandsDelta(int[][] A) {
        int[] angles = new int[A.length];
        for (int i = 0; i < A.length; i++) {
            // the enclosed angle is the absolute difference between the two hand positions
            angles[i] = Math.max(A[i][0], A[i][1]) - Math.min(A[i][0], A[i][1]);
        }
        return angles;
    }

    public static int NChooseK(int n, int k) {
        if (k < 0 || k > n || n == 0) return 0;
        if (k == 0 || k == n) return 1;
        return Factorial(n) / (Factorial(n-k) * Factorial(k));
    }

    public static int Factorial(int n) {
        if (n <= 1) return 1;
        return n * Factorial(n-1);
    }

    public static int countPairsWithIdenticalAnglesDelta(int[] angles) {
        Arrays.sort(angles); // equal angles become adjacent runs
        int count = 1;
        int result = 0;
        for (int i = 1; i < angles.length; i++) {
            if (angles[i] == angles[i-1]) {
                count += 1;
            } else {
                result += NChooseK(count, 2); // pairs within the run that just ended
                count = 1;
            }
        }
        result += NChooseK(count, 2); // pairs within the final run
        return result;
    }

    // Sample run:
    int[][] A = new int[5][2];
    A[0][0] = 1;    A[0][1] = 2;
    A[1][0] = 2;    A[1][1] = 4;
    A[2][0] = 4;    A[2][1] = 3;
    A[3][0] = 2;    A[3][1] = 3;
    A[4][0] = 1;    A[4][1] = 3;
    // angles:        1 2 1 1 2
    // sorted angles: 1 1 1 2 2
    // result:        4




Wednesday, April 28, 2021

Synchronization of state with remote (continued...)


The choice of synchronization technique depends on the situation. One of the factors that plays into this is the synchronization mode. There are two main modes of synchronization: snapshot and net change. A snapshot is the data as of a point in time; the data in a snapshot does not change, so it is useful for comparison. This mode makes it possible to move a large amount of data from one system to another, which is the case when the data has not changed at the remote location. Since snapshots might contain a large amount of data, a good network connection is required to transfer them. Updates to a product catalog or price list are a great use case for snapshot synchronization because the updates are collected in a snapshot, transferred, and loaded all at once on the destination store.

The net-changes mode of synchronization can be considered slightly more efficient than snapshot synchronization. In this mode, only the changed data is sent between the source and destination data stores, which reduces the network bandwidth and connection times. If the data changes quite often on one server, only the initial and final states are required to create the changes that can then be made on the destination. Both modes are bidirectional, so the changes made in a local store can be propagated to the enterprise store. The net-changes mode does not take into consideration the changes made in individual transactions; if those were important, transaction-log-based synchronization may work better.
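A minimal net-change sketch over JDBC, assuming a hypothetical items table with a last_modified column and a stored last-sync timestamp:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Timestamp;
    import java.time.Instant;

    public class NetChangeSync {
        // Fetches only the rows changed since the last synchronization.
        public static void syncChangedRows(Connection source, Instant lastSync) throws SQLException {
            String sql = "SELECT id, name, price FROM items WHERE last_modified > ?";
            try (PreparedStatement stmt = source.prepareStatement(sql)) {
                stmt.setTimestamp(1, Timestamp.from(lastSync));
                try (ResultSet rows = stmt.executeQuery()) {
                    while (rows.next()) {
                        // apply each changed row to the destination store here
                        System.out.println("changed row: " + rows.getLong("id"));
                    }
                }
            }
        }
    }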

Transmission of data is also important to the effectiveness of a synchronization technique. If an application is able to synchronize without the involvement of the user, it will work on any network, wired or wireless; otherwise, the latter usually requires human intervention to set up a connection. There are two types of data propagation methods: session-based and message-based. The session-based synchronization method requires a direct connection. The updates can go both ways, and they can be acknowledged. The synchronization resumes even after a disruption; the point from which it resumes is usually the last committed transaction. The connection for this data propagation method can be established well in advance.

Message-based synchronization requires the receiver to take the message and perform the changes. When this is done, a response is sent back. Messages help when there is no reliable network connection. The drawback is that there is no control over when the messages will be acted upon and responded to.

Tuesday, April 27, 2021

Synchronization of state with remote (continued...)

 

Another mechanism to keep the state in sync across local and remote is the publisher-subscriber model. This model assumes that there is a master copy of the data, maintained by the publisher, and that updates can be bidirectional, allowing the publisher to update the data for the subscribers and vice versa.

The publisher is responsible for determining which datasets have external access; when they are made available, they are called publications. Different scopes of datasets can be published to different subscribers, and they can be specified at runtime with the help of parameters. In such cases, subscribers map to different partitions of data. If the data overlaps, then the subscribers see a shared state. It is possible to have near real-time sharing of publications across subscribers on overlapped data with the help of versioning. Conflicts between updates with conflicting versions are easily resolved by a latest-wins strategy (see the sketch below).
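A minimal sketch of versioned, latest-wins conflict resolution (the record shape and field names are invented for illustration):

    import java.time.Instant;

    public class VersionedRecord {
        public final String key;
        public final String value;
        public final Instant updatedAt; // version stamp assigned at write time

        public VersionedRecord(String key, String value, Instant updatedAt) {
            this.key = key;
            this.value = value;
            this.updatedAt = updatedAt;
        }

        // Latest-wins: the copy with the newer version stamp survives the conflict.
        public static VersionedRecord resolve(VersionedRecord local, VersionedRecord remote) {
            return local.updatedAt.isAfter(remote.updatedAt) ? local : remote;
        }
    }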

Common synchronization configurations also vary quite widely. Such a configuration refers to the arrangement of publisher and subscriber data. The publisher-subscriber model allows both peer-to-peer and hierarchical configurations. Two hierarchical configurations are quite popular: the first is the network (tree) topology and the second is the hub-and-spoke topology. Both are useful for many subscribers. Unlike the hierarchical configurations, a peer-to-peer configuration does not have a single authoritative data store, and data updates do not necessarily make it to all the subscribers. Peer-to-peer configurations are best suited for fewer subscribers. Some of the challenges with peer-to-peer configurations include maintaining data integrity, implementing conflict detection and resolution, and programming the synchronization logic. Generally, these are handled by messaging algorithms such as Paxos, together with some notion of message sequencing such as a vector clock, and a gossip protocol.
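A minimal vector-clock sketch for ordering updates among peers (the node identifiers and the API are illustrative, not from any particular library):

    import java.util.HashMap;
    import java.util.Map;

    public class VectorClock {
        private final Map<String, Integer> counters = new HashMap<>();

        // A node increments its own counter on every local update.
        public void increment(String nodeId) {
            counters.merge(nodeId, 1, Integer::sum);
        }

        // On receiving a peer's clock, take the element-wise maximum.
        public void merge(VectorClock other) {
            other.counters.forEach((node, c) -> counters.merge(node, c, Integer::max));
        }

        // True if this clock happened before the other: every counter is <= the
        // peer's and the clocks are not identical. If neither happened before the
        // other, the updates are concurrent and need conflict resolution.
        public boolean happenedBefore(VectorClock other) {
            boolean allLessOrEqual = counters.entrySet().stream()
                .allMatch(e -> e.getValue() <= other.counters.getOrDefault(e.getKey(), 0));
            return allLessOrEqual && !counters.equals(other.counters);
        }
    }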

Efficiency in data synchronization in these configurations and architectures comes from determining what data changes, how to scope it, and how to reduce the traffic associated with propagating the change.  It is customary to have a synchronization layer on the client, a synchronization middleware on the server, and a network connection during the synchronization process that supports bidirectional updates. The basic synchronization process involves the initiation of synchronization – either on-demand or on a periodic basis, the preparation of data and its transmission to a server with authentication, the execution of the synchronization logic on the server-side to determine the updates and the transformations, the persistence of the changed data over a data adapter to one or more data stores, the detection and resolution of conflicts and finally the relaying of the results of the synchronization back to the client application.

Monday, April 26, 2021

Synchronization of state with remote


Introduction: Persistent data storage enables users to access enterprise data without being connected to the network, but that data is prone to becoming stale. A bidirectional refresh against the master data is required, and one way to achieve it is periodic synchronization, a technique that propagates updates on the data on both the local and the remote side. In this article, we review the nuances of such synchronization.

Description: The benefits of synchronization over an always-online solution are quite clear: reduced data transfer over the network, reduced load on the enterprise server, faster data access, and increased control over data availability. It is less understood that there are different types of synchronization depending on the type of data. For example, synchronization may be initiated for personal information management (PIM) data such as email and calendar entries, as opposed to application files. The latter can be treated as artifacts that application-independent synchronization services can refresh. Several such products are available, and they do not require user involvement for a refresh. This means one or more files and applications can be set up for synchronization on remote devices, although these are usually one-way transfers.

Data synchronization, on the other hand, performs a bidirectional exchange, and sometimes transformation, between two data stores. This is our focus area in this article. The server data store is usually larger because it holds data for more than one user, while the local data store is usually limited by the size of the mobile device. The data transfer occurs over synchronization middleware or a synchronization layer: the middleware is set up on the server while the layer is hosted on the client. This is the most common way for smart applications to access corporate data.

Synchronization might be treated as a web service with the usual three tiers comprising the client, the middle tier, and the enterprise data. When the data is synchronized between an enterprise server and a persistent data store on the client, a modular layer on the client can provide a simple, easy-to-use client API to control the process with little or no interaction from the client application. This layer may need to be written or rewritten natively for the host depending on whether the client is a mobile phone, laptop, or some other such device. With a simple invocation of the synchronization layer, a client application can expect the data in the local store to be refreshed.
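A sketch of what such a client API might look like (all names here are hypothetical, not from any particular product):

    import java.time.Instant;
    import java.util.List;

    // The client application drives synchronization through one simple entry point.
    public interface SyncClient {
        SyncResult synchronize(SyncRequest request) throws SyncException;
    }

    class SyncRequest {
        List<String> tables;    // data scoping: only these tables are refreshed
        Instant lastSyncTime;   // net-change mode: only rows changed since this instant
    }

    class SyncResult {
        int rowsSent;           // local changes pushed to the server
        int rowsReceived;       // server changes applied to the local store
        List<String> conflicts; // conflicts detected and resolved by the middleware
    }

    class SyncException extends Exception {
        SyncException(String message) { super(message); }
    }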

The synchronization middleware resides on the server, and this is where the bulk of the synchronization logic is written. There can be more than one data store behind the middleware on the server side, and there can be more than one client on the client side. Some of the typical features of this server-side implementation include data scoping, conflict detection and resolution, data transformation, data compression, and security. These features are maintained alongside server performance and scalability. Two common forms of synchronization middleware are a standalone server application and a servlet running in a servlet engine. The standalone server is more tightly coupled to the operating system and provides better performance for large data. The J2EE application servers rely on an outside servlet engine and are better suited for high-volume, low-payload data changes.

The last part of this synchronization solution is the data backend. While it is typically internal to the synchronization server, it is called out because it might involve more than one data store, technology, and access mechanism, such as object-relational mapping.

 

Sunday, April 25, 2021

Planning for onboarding an existing text summarization service to Azure public cloud

Problem statement: This article is a continuation of an exploration of the proper commissioning of a text summarization service in the Azure public cloud. While the earlier article focused on the technical options available to expand and implement the text summarization service, including the algorithm involved and its evaluation and comparison to a similar service on a different cloud, this article is specifically about onboarding the service to the Azure public cloud in the most efficient, reliable, available, and cost-effective manner. It follows up on the online training taken towards a certification in Windows Azure Fundamentals.

Article: Onboarding a service such as this one to the Azure public cloud is all about improving its deployment: using the proper subscription, planning for capacity and demand, optimizing the Azure resources, monitoring the service health, setting up a management group, access control, and the security and privacy of the services, and establishing the pricing controls and the support options. We look at these in more detail now.

Proper subscription: Many of the rate limits, quotas, and service availability levels are quite sufficient in the very first subscription tier. The Azure management console has a specific set of options to determine the scale required for the service.

Resources and resource groups: The allocation of a resource group, identity, and access control is certainly a requirement for onboarding a service. It is equally important to use the pricing calculator and the TCO calculator in the Azure public cloud to determine the costs. Some back-of-the-envelope calculations in terms of the bytes per request, the number of requests per second, the latency, the recovery time and recovery point, MTTR, and MTBF help with determining the requirements and the resource management.
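A worked example of such a back-of-the-envelope estimate (every number below is an assumption for illustration, not a measurement of the actual service):

    public class CapacityEstimate {
        public static void main(String[] args) {
            double bytesPerRequest = 10 * 1024;   // assume ~10 KB per summarization request
            double requestsPerSecond = 100;       // assumed peak load
            double secondsPerDay = 86_400;

            double peakBytesPerSecond = bytesPerRequest * requestsPerSecond;
            double bytesPerDay = peakBytesPerSecond * secondsPerDay;

            System.out.printf("Peak bandwidth: %.2f MB/s%n", peakBytesPerSecond / 1e6); // ~1.02 MB/s
            System.out.printf("Daily transfer: %.2f GB/day%n", bytesPerDay / 1e9);      // ~88.47 GB/day
        }
    }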

Optimizing the Azure resources: Much of this is automated. If we are deploying a Python Django application and a Node.js frontend application, then it is important to make use of an API gateway, load balancer, proxy and scalability options, certificates, domain-name resources, and so on. The use of resources specific to the service, as well as those that enhance its abilities, must be methodically checked off against the checklist that one can draw from the Azure management portal.

Monitoring the service health: Metrics specific to the text summarization service, such as the size of the text condensed, the mode of delivery, the number of documents submitted to the system, the load on the service in terms of its distribution statistics, and other such measures, will help determine whether the service requires additional resources or when something goes wrong. Alerts can be set up for thresholds so that we can remain passive until we get an alert.

Management group, identity, and access control: Even if there is only one person in charge of the service, setting up a management group and user and access controls formalizes and detaches that person's role so that anyone else can take on the administrator role. This option will also help set up registrations and notifications to that account so that it is easier to pass the responsibility around.

Security and privacy: The text summarization service happens to be a stateless, transparent transformer and transfer-learning service that does not retain any data from the customer, so it does not need any further actions towards security and privacy. TLS set up on the service and the use of proper certificates along with domain names will help keep it independent of the compute resources.

Advisor: Azure has an advisor capability that recommends the efficiencies possible with the deployment after the above-mentioned steps have been taken. This helps in streamlining the operations and reducing cost.

Conclusion: Service onboarding is critical to the proper functioning of the service, both in terms of its cost and its benefits. When the public cloud knowledge-center articles are followed meticulously for the use of the Azure management portal in the deployment of the service, the service stands to improve its return on investment.