Sunday, January 2, 2022

 Identity and payment

One of the most important interactions between individuals is payment. A payment service recognizes both the payer and the payee based on their identity. The service delivers a comprehensive one-stop shop for all forms of exchanges, which makes it easy and consistent for small businesses and individuals to discover and sign up for the program. The payment service virtualizes payments across geographies, policies, statutes and limitations while facilitating the mode of payment and receipts for the individuals.
Identity and Access Management (IAM) is a critical requirement for this service, without which the owner reference cannot be resolved. But this identity and access management does not need to be rewritten as part of the services offered. An identity cloud is a good foundation for IAM integration, and existing membership providers across participating businesses can safely be resolved with this solution.
The use of a leading IAM provider for the identity cloud could help integrate identity-resolution capability into this service. The translation of the owner to the owner_id required by the payments service is automatically resolved by referencing the identifier issued by the cloud IAM. Since the IAM is consistent and accurate, the mappings are straightforward and one-to-one. The user does not need to sign in to generate the owner_id. It can be resolved through the membership provider owned by the IAM, which might be on-premises for a client's organization, such as with Active Directory integration, or a multi-tenant offering from their own identity cloud. Since enterprise applications are expected to integrate with Active Directory or an IAM provider, the identity cloud can be considered global for this purpose, but the identities from different organizations must be isolated from each other in the identity cloud. The offload of identity to an IAM is a clean separation of concerns for the payment service.
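As a rough sketch of this separation of concerns, the payment service can depend on a narrow resolution interface. The IdentityCloudClient type and its resolveOwnerId call below are hypothetical stand-ins for whatever SDK the chosen IAM provider exposes; a minimal sketch in Java:

import java.util.Optional;

// Hypothetical client for the identity cloud; real IAM SDKs differ.
interface IdentityCloudClient {
    Optional<String> resolveOwnerId(String tenant, String userPrincipalName);
}

class PaymentOwnerResolver {
    private final IdentityCloudClient iam;

    PaymentOwnerResolver(IdentityCloudClient iam) {
        this.iam = iam;
    }

    // The payment service keeps only the opaque owner_id issued by the IAM
    // and maps it one-to-one to the payer or payee; it never stores credentials.
    String ownerIdFor(String tenant, String userPrincipalName) {
        return iam.resolveOwnerId(tenant, userPrincipalName)
                  .orElseThrow(() -> new IllegalStateException("Unknown principal: " + userPrincipalName));
    }
}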

But the definition of payments and the definition of identities in these cases must share some overlap in the financial accounts from which the payments originate, which prevents either the payment service or the identity service from doing away with the other's participation. A microservices architectural style can resolve this interdependency with an exchange of calls between the two, but there is no resiliency or business continuity unless each is highly available to the other. Instead, if financial account information were to become a form of identity, even a distributed ledger would be sufficient to do away with both.

The difference between a B2B and a B2C reward-points service stands out further in this regard when an employee can use the same sign-in without having to sign into a bank as well. With the integration of enterprise-wide IAM with an identity cloud provider and the availability of attributes via SAML, the mapping of transactions to identity becomes automatic from the user-experience point of view, so only the Identity and Access Management service is used by the frontend. The payment service operates in the background with the payer and payee passed as references.

We assume that the payment service maintains financial information accurately in its layer. The service, accumulating and redeeming from a notion of balance associated with an individual, will update the same source of truth. This is neither a synchronous transaction nor a single repository, and it must involve some form of reconciliation even while the identity may disappear. Both the identity and payment services recognize the same individual for the transactions because of a mashup presented to the user and a backend registration between them. With the notion of payments included within an identity, there is more competition, less monopoly, and a more deregulated economy. The payment service's use of a notion of identity differs widely from that of the identity services, as each plays up its own capabilities with this representation. This leads to a more diverse set of experiences and ultimately a win-win for the individual.

The payment transactions must have change data capture and some form of audit trail. This is essential for post-transaction resolution, reconciliation, and investigations into their occurrence. The identity services facilitate a session identifier that can be used with single sign-on for repeated transactions. This is usually obtained as a hash of the session cookie and is provided by the authentication and authorization server. The session identifier can be requested as part of the login process or by a separate API call to a session endpoint. An active session removes the need to re-authenticate regardless of the transactions performed. It provides a familiar end-user experience and functionality. The session can also be used with user-agent features or extensions that assist with authentication, such as a password manager or a two-factor device reader.
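To make the session identifier concrete, here is a minimal sketch in Java of deriving one as a hash of the session cookie value; the derivation and names are illustrative, not any particular provider's scheme:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HexFormat;

public class SessionIdentifierSketch {
    // Derive an opaque session identifier from the session cookie value (Java 17+ for HexFormat).
    static String sessionIdFrom(String sessionCookie) throws Exception {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        byte[] hash = digest.digest(sessionCookie.getBytes(StandardCharsets.UTF_8));
        return HexFormat.of().formatHex(hash);
    }

    public static void main(String[] args) throws Exception {
        // Placeholder cookie value; a real deployment would read it from the session store.
        System.out.println(sessionIdFrom("example-session-cookie-value"));
    }
}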

Finally, both services can offer mobile applications, cloud services, and B2B multi-tenant SaaS offerings to their customers, with or without each other, and with little or no restriction to their individual capabilities.

Saturday, January 1, 2022

Git Pull Request Iteration Changes 

Problem statement: Git is a popular tool for software developers to share and review code before committing it to the mainline source code repository. Collaborators often exchange feedback and comments on a web user interface by creating a pull-request branch that captures the changes to the source branch. As it accrues the feedback and consequent modifications, it gives reviewers an opportunity to finalize and settle on the changes before they are merged to the target branch. This change log is independent of the source and the target branch because both can be modified while the change capture is a passive recorder of those iterations. It happens to be called a pull request because every change made to the source branch is pulled in as an iteration. The problem is that when the iterations increase to a large number, the updates made to the pull request become a sore point for the reviewers, as they cannot see all the modifications in the latest revision. If the pull-request iteration changes could be coalesced into fewer iterations, the pull request becomes easier to follow and looks cleaner in retrospect. The purpose of this document is to find ways to do so, even though the iteration log is by design independent of the requestor and the reviewer.

Solution: A pull-request workflow consists of the following:  

Step 1. Pull the changes to your local machine. 

Step 2. Create a "branch" version 

Step 3. Commit the changes 

Step 4. Push the changes to a remote branch 

Step 5. Open a pull-request between the source and the target branch. 

Step 6. Receive feedback and make iterations for the changes desired. 

Step 7. Complete the pull-request by merging the final iteration to the master. 

When the iterations for Step 6 and Step 7 grow to a large number, a straightforward way to reduce iterations is to close one pull-request and then start another, but developers and collaborators will need to track multiple pull-requests. Since this presents a moving target to complete, it would be preferable to modify a pull-request in place.

The Azure DevOps Git REST API offers a way to retrieve the pull-request iterations but not a way to edit them. For example, GET https://dev.azure.com/{organization}/{project}/_apis/git/repositories/{repositoryId}/pullRequests/{pullRequestId}/iterations/{iterationId}/changes?api-version=5.1 gives the changes between two iterations, but this is not something we can edit.
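For illustration, the iteration changes can be fetched with a plain HTTP client; a minimal sketch in Java, assuming a personal access token and placeholder identifiers in place of the elided values:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class PullRequestIterationChangesClient {
    public static void main(String[] args) throws Exception {
        // Placeholder identifiers; substitute your own organization, project,
        // repository, pull request and iteration values.
        String organization = "my-org";
        String project = "my-project";
        String repositoryId = "my-repo";
        int pullRequestId = 1;
        int iterationId = 2;

        String url = String.format(
                "https://dev.azure.com/%s/%s/_apis/git/repositories/%s/pullRequests/%d/iterations/%d/changes?api-version=5.1",
                organization, project, repositoryId, pullRequestId, iterationId);

        // Azure DevOps REST calls are commonly authorized with a personal access token
        // sent as Basic credentials (empty user name, PAT as the password).
        String pat = System.getenv("AZURE_DEVOPS_PAT");
        String basic = Base64.getEncoder()
                .encodeToString((":" + pat).getBytes(StandardCharsets.UTF_8));

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(url))
                .header("Authorization", "Basic " + basic)
                .GET()
                .build();

        // The response lists the change entries between iterations; it is read-only.
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}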

The git SDK also offers a similar construct to view the GitPullRequestIterationChanges but these are not modifiable. 

Instead, we can run the following commands: 

cd /my/fork

git checkout master

git commit -va -m "Coalescing PR comments"

git push

If the commits need to be coalesced, this can be done with a git rebase (for example, an interactive rebase that squashes commits); the --squash option serves a similar purpose for merges, while git pull --rebase replays local commits on top of the upstream branch. Bringing in changes from the master can be done either via rebase or merge and, depending on the choice, this can mean accepting the HEAD revision for the former or the tail revision for the latter.

The git log --first-parent option can help view only the merge commits in the log.

If the pull request is being created from a fork, the pull-request author can grant permissions at the time of pull-request creation to allow contributors to upstream repositories to make commits to the pull request's compare branch.


Friday, December 31, 2021

This is a continuation of a series of articles on operational engineering aspects of Azure public cloud computing that included the most recent discussion on Azure Data Lake, which is a full-fledged general availability service that provides Service Level Agreements similar to those expected from others in the category. This article focuses on Azure Data Lake, which is suited to store and handle Big Data. This is built over Azure Blob Storage, so it provides native support for web-accessible documents. It is not a massive virtual data warehouse, but it powers a lot of analytics and is the centerpiece of most solutions that conform to the Big Data architectural style. In this section, we continue our focus on the programmability aspects of Azure Data Lake.

We discussed that there are two forms of programming with Azure Data Lake – one that leverages U-SQL and another that leverages open-source programs such as those for Apache Spark. U-SQL unifies the benefits of SQL with the expressive power of your own code. SQLCLR improves the programmability of U-SQL by allowing users to write user-defined operators such as functions, aggregates and data types that can be used in conventional SQL expressions or invoked directly from a SQL statement.

The other form of programming largely applies to HDInsight, as opposed to U-SQL for Azure Data Lake Analytics (ADLA), and targets data in batch processing, often involving map-reduce algorithms on Hadoop clusters. Also, Hadoop is inherently batch processing while the Microsoft stack allows streaming as well. Some open-source platforms like Flink, Kafka, Pulsar and StreamNative support stream operators. While Kafka uses stream processing as a special case of batch processing, Flink does just the reverse. Apache Flink also provides a SQL abstraction over its Table API. A sample Flink pipeline might look like this:

final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// tuples is a previously built List<Tuple2<String, Integer>> of input records
env.fromCollection(tuples)
   .flatMap(new ExtractHashTags())      // emit (hashtag, 1) pairs
   .keyBy(0)                            // key by the hashtag
   .timeWindow(Time.seconds(30))        // 30-second tumbling windows
   .sum(1)                              // count occurrences per hashtag per window
   .filter(new FilterHashTags())
   .timeWindowAll(Time.seconds(30))
   .apply(new GetTopHashTag())
   .print();

env.execute("top-hashtags");

Notice the use of pipelined execution and the writing of functions that process the input on a per-element basis. A sample function looks like this:

package org.apache.pulsar.functions.api.examples;

import java.util.function.Function;

public class ExclamationFunction implements Function<String, String> {
    @Override
    public String apply(String input) {
        return String.format("%s!", input);
    }
}

Both forms have their purpose and the choice depends on the stack used for the analytics.

 

Thursday, December 30, 2021

This is a continuation of a series of articles on operational engineering aspects of Azure public cloud computing that included the most recent discussion on Azure Data Lake, which is a full-fledged general availability service that provides Service Level Agreements similar to those expected from others in the category. This article focuses on Azure Data Lake, which is suited to store and handle Big Data. This is built over Azure Blob Storage, so it provides native support for web-accessible documents. It is not a massive virtual data warehouse, but it powers a lot of analytics and is the centerpiece of most solutions that conform to the Big Data architectural style. In this section, we continue our focus on the programmability aspects of Azure Data Lake.

The power of Azure Data Lake is better demonstrated by the U-SQL queries that can be written without any consideration that they are being applied at Big Data scale. U-SQL unifies the benefits of SQL with the expressive power of your own code. SQLCLR improves the programmability of U-SQL. Conventional SQL expressions like SELECT, EXTRACT, WHERE, HAVING, GROUP BY and DECLARE can be used as usual, while C# expressions bring in user-defined types (UDTs), user-defined functions (UDFs) and user-defined aggregates (UDAs). These types, functions and aggregates can be used directly in a U-SQL script. For example, SELECT Convert.ToDateTime(Convert.ToDateTime(@dt).ToString("yyyy-MM-dd")) AS dt, dt AS olddt FROM @rs0; where @dt is a datetime variable makes the best of both C# and SQL. The power of SQL expressions can never be overstated for many business use-cases, and they suffice by themselves, but SQL programmability means we can even move all the processing into C# and have the SQL script be just an invocation.

The trouble with analytics pipelines is that developers prefer open-source solutions to build them. When we start accruing digital assets in the form of U-SQL scripts, the transition to working with something like Apache Spark might not be straightforward or easy. The Azure analytics layer consists of both HDInsight and Azure Data Lake Analytics (ADLA), which target data differently. HDInsight works on managed Hadoop clusters and allows developers to write map-reduce with open source. ADLA is native to Azure and enables C# and SQL over job services. We will also recall that Hadoop was inherently batch processing while the Microsoft stack allowed streaming as well. The steps to transform U-SQL scripts to Apache Spark include the following:

- Transform the job orchestration pipeline to include the new Spark programs.

- Find the differences between how U-SQL and Spark manage your data.

- Transform the U-SQL scripts to Spark, choosing from Azure Data Factory Data Flow, Azure HDInsight Hive, Azure HDInsight Spark or Azure Databricks services (a rough sketch follows below).

With these steps, it is possible to have the best of both worlds while leveraging the benefits of each.
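As a rough illustration of the last step, a simple U-SQL extract-filter-aggregate script maps to a handful of Dataset operations in Spark's Java API; the abfss:// paths and column names below are placeholders, not an original script:

import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class UsqlToSparkSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("usql-to-spark-sketch")
                .getOrCreate();

        // Roughly the Spark equivalent of a U-SQL EXTRACT ... SELECT ... GROUP BY script.
        Dataset<Row> rows = spark.read()
                .option("header", "true")
                .csv("abfss://container@account.dfs.core.windows.net/input/sales.csv");

        rows.filter(col("amount").cast("double").gt(100))
            .groupBy("customer")
            .count()
            .write()
            .mode("overwrite")
            .parquet("abfss://container@account.dfs.core.windows.net/output/large-sales");

        spark.stop();
    }
}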

Wednesday, December 29, 2021

This is a continuation of a series of articles on operational engineering aspects of Azure public cloud computing that included the most recent discussion on Azure Data Lake, which is a full-fledged general availability service that provides Service Level Agreements similar to those expected from others in the category. This article focuses on Azure Data Lake, which is suited to store and handle Big Data. This is built over Azure Blob Storage, so it provides native support for web-accessible documents. It is not a massive virtual data warehouse, but it powers a lot of analytics and is the centerpiece of most solutions that conform to the Big Data architectural style. In this section, we continue our focus on Data Lake monitoring and usages.

The monitoring for the Azure Data Lake leverages the monitoring for the storage account. Azure Storage Analytics performs logging and provides metric data for a storage account. This data can be used to trace requests, analyze usage trends, and diagnose issues with the storage account.

The power of Azure Data Lake is better demonstrated by the U-SQL queries that can be written without any consideration that they are being applied at Big Data scale. U-SQL unifies the benefits of SQL with the expressive power of your own code. This is said to work very well with all kinds of data stores – file, object and relational. U-SQL works on the Azure ecosystem, which involves Azure Data Lake Storage as the foundation and the analytics layer over it. The Azure analytics layer consists of both HDInsight and Azure Data Lake Analytics (ADLA), which target data differently. HDInsight works on managed Hadoop clusters and allows developers to write map-reduce with open source. ADLA is native to Azure and enables C# and SQL over job services. We will also recall that Hadoop was inherently batch processing while the Microsoft stack allowed streaming as well. The benefit of Azure storage is that it spans several kinds of data formats and stores. ADLA has several other advantages over managed Hadoop clusters in addition to working with a store for the whole universe of data. It enables limitless scale and enterprise-grade features with easy data preparation. ADLA is built on Apache YARN, scales dynamically and supports a pay-per-query model. It supports Azure AD for access control, and U-SQL allows programmability with C#.

U-SQL supports big data analytics, which generally has the characteristics of requiring processing of any kind of data, allowing the use of custom algorithms, and scaling to any size efficiently.
This lets queries be written for a variety of big data analytics. In addition, it supports SQL for Big Data, which allows querying over structured data, and it enables scaling and parallelization. While Hive supported HiveQL, the Microsoft Sqoop connector enabled SQL over big data, and Apache Calcite became a SQL adapter, U-SQL seems to improve the query language itself. It can unify querying over structured and unstructured data. It has declarative SQL and can execute local and remote queries. It increases productivity and agility. It brings in features from T-SQL, Hive SQL, and SCOPE, which has been Microsoft's internal Big Data language. U-SQL is extensible, and it can be extended with C# and .NET.
If we look at the pattern of separating the query from the data source, we quickly see it's no longer just a consolidation of data sources. It also pushes down the query to the data sources and thus can act as a translator. Projections, filters and joins can now take place where the data resides. This was a design decision that came from the need to support heterogeneous data sources. Moreover, it gives a consistent, unified view of the data to the user.

SQLCLR improves the programmability of U-SQL. Conventional SQL expressions like SELECT, EXTRACT, WHERE, HAVING, GROUP BY and DECLARE can be used as usual, while C# expressions bring in user-defined types (UDTs), user-defined functions (UDFs) and user-defined aggregates (UDAs). These types, functions and aggregates can be used directly in a U-SQL script. For example, SELECT Convert.ToDateTime(Convert.ToDateTime(@dt).ToString("yyyy-MM-dd")) AS dt, dt AS olddt FROM @rs0; where @dt is a datetime variable makes the best of both C# and SQL. The power of SQL expressions can never be overstated for many business use-cases, and they suffice by themselves, but SQL programmability means we can even move all the processing into C# and have the SQL script be just an invocation. This requires the assembly to be registered and versioned. U-SQL runs code in x64 format. An uploaded assembly DLL and resource file, such as a different runtime, a native assembly or a configuration file, can be at most 400 MB. The total size of all registered resources cannot be greater than 3 GB. There can only be one version of any given assembly. This is sufficient for many business cases, which can often be written in the form of a UDF that takes simple parameters and outputs a simple datatype. These functions can even keep state between invocations. U-SQL comes with a test SDK and, together with the local run SDK, script-level tests can be authored. Azure Data Lake Tools for Visual Studio enables us to create U-SQL script test cases. A test data source can also be specified for these tests.

Tuesday, December 28, 2021

This is a continuation of a series of articles on operational engineering aspects of Azure public cloud computing that included the most recent discussion on Azure Data Lake, which is a full-fledged general availability service that provides Service Level Agreements similar to those expected from others in the category. This article focuses on Azure Data Lake, which is suited to store and handle Big Data. This is built over Azure Blob Storage, so it provides native support for web-accessible documents. It is not a massive virtual data warehouse, but it powers a lot of analytics and is the centerpiece of most solutions that conform to the Big Data architectural style. In this section, we focus on Data Lake monitoring.

As we might expect from the use of Azure storage account, the monitoring for the Azure Data Lake leverages the monitoring for the storage account. Azure Storage Analytics performs logging and provides metric data for a storage account. This data can be used to trace requests, analyze usage trends, and diagnose issues with the storage account. 

The storage account must enable monitoring individually for each service that needs to be monitored. Blob, queue, table and file services are all subject to monitoring. The aggregated data is stored in a well-known blob designated for logging and in well-known tables, which may be accessed using the Blob service and Table service APIs. There is a 20 TB limit for the metrics, and this is in addition to the capacity provisioned for data, so resizing is not a concern. When we monitor a storage service, the service health, capacity, availability and performance of the service are studied. The service health can be observed from the portal, and notifications can be subscribed to. The $MetricsCapacityBlob table enables capacity monitoring for the blob service. Storage Metrics records this data once per day. The capacity is measured in bytes, and both the ContainerCount and ObjectCount are available per daily entry. Availability is monitored in the hourly and minute metrics tables that record primary transactions against blobs, tables and queues. The availability data is a column in these tables. Performance is measured in the AverageE2ELatency and AverageServerLatency columns. The E2ELatency is recorded only for successful requests and includes the time that the client takes to send the data and receive acknowledgements from the storage service, while the server latency excludes this client and network time. A high value for the first and a low value for the second implies that the client is slow, or the network connectivity is poor. Nagle's algorithm is a TCP optimization on the sender; it is designed to reduce network congestion by coalescing small send requests into larger TCP segments, so small segments are held back until a larger segment is available to send the data. But it does not work well with delayed acknowledgements, which are an optimization on the receiver side. When the receiver delays the ack and the sender waits for the ack to send a small segment, the data transfer gets stalled. Turning off these optimizations can improve table, blob and queue performance.
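As a rough illustration, the daily capacity metrics can be read back from the well-known table with the Azure Tables SDK for Java; a minimal sketch, assuming the $MetricsCapacityBlob table is reachable with a standard storage connection string:

import com.azure.data.tables.TableClient;
import com.azure.data.tables.TableClientBuilder;
import com.azure.data.tables.models.TableEntity;

public class CapacityMetricsSketch {
    public static void main(String[] args) {
        TableClient table = new TableClientBuilder()
                .connectionString(System.getenv("AZURE_STORAGE_CONNECTION_STRING"))
                .tableName("$MetricsCapacityBlob")
                .buildClient();

        // Each daily entry carries the blob service capacity in bytes along with
        // the container and object counts.
        for (TableEntity entity : table.listEntities()) {
            System.out.println(entity.getPartitionKey() + "/" + entity.getRowKey()
                    + " capacity=" + entity.getProperty("Capacity")
                    + " containers=" + entity.getProperty("ContainerCount")
                    + " objects=" + entity.getProperty("ObjectCount"));
        }
    }
}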

Requests to create blobs for logging and requests to create table entities for metrics are billable. Older logging and metrics data can be archived or truncated. As with all data in Azure storage, this is set via a retention policy. Metrics are stored for both the service and the API operations of that service, which include the percentages and the counts of certain status messages. These features can help analyze the cost aspect of the usage.

 

 

Monday, December 27, 2021

 

This is a continuation of a series of articles on operational engineering aspects of Azure public cloud computing that included the most recent discussion on Azure Data Lake, which is a full-fledged general availability service that provides Service Level Agreements similar to those expected from others in the category. This article focuses on Azure Data Lake, which is suited to store and handle Big Data. This is built over Azure Blob Storage, so it provides native support for web-accessible documents. It is not a massive virtual data warehouse, but it powers a lot of analytics and is the centerpiece of most solutions that conform to the Big Data architectural style.

Gen 2 is the current standard for building enterprise data lakes on Azure. A data lake must store petabytes of data while handling bandwidths up to gigabytes of data transfer per second. The hierarchical namespace of the object storage helps organize objects and files into a deep hierarchy of folders for efficient data access. The naming convention recognizes these folder paths by including the folder separator character in the name itself. With this organization and folder access directly in the object store, the performance of the overall usage of the data lake is improved. The Azure Blob File System driver for Hadoop is a thin shim over the Azure Data Lake Storage interface that supports file system semantics over blob storage. Fine-grained access control lists and Active Directory integration round out the data security considerations. Data management and analytics form the core scenarios supported by Data Lake. For multi-region deployments, it is recommended to have the data land in one region and then be replicated globally using AzCopy, Azure Data Factory or third-party products which assist with migrating data from one place to another. The best practices for Azure Data Lake involve evaluating feature support and known issues, optimizing for data ingestion, considering data structures, performing ingestion, processing and analysis from several data sources, and leveraging monitoring telemetry.
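As a brief illustration of the hierarchical namespace, directories can be created and populated as first-class objects with the azure-storage-file-datalake Java SDK; a minimal sketch, assuming a Gen2 account, its key in environment variables, and an existing file system named raw:

import com.azure.storage.common.StorageSharedKeyCredential;
import com.azure.storage.file.datalake.DataLakeDirectoryClient;
import com.azure.storage.file.datalake.DataLakeFileSystemClient;
import com.azure.storage.file.datalake.DataLakeServiceClient;
import com.azure.storage.file.datalake.DataLakeServiceClientBuilder;

public class HierarchicalNamespaceSketch {
    public static void main(String[] args) {
        // Placeholder account details; the dfs endpoint targets the hierarchical namespace.
        String account = System.getenv("ADLS_ACCOUNT");
        StorageSharedKeyCredential credential =
                new StorageSharedKeyCredential(account, System.getenv("ADLS_KEY"));

        DataLakeServiceClient service = new DataLakeServiceClientBuilder()
                .endpoint("https://" + account + ".dfs.core.windows.net")
                .credential(credential)
                .buildClient();

        // Folders are part of the path name itself, so deep hierarchies can be
        // created and accessed directly without touching every blob underneath.
        DataLakeFileSystemClient fileSystem = service.getFileSystemClient("raw");
        DataLakeDirectoryClient directory = fileSystem.createDirectory("landing/2021/12/27");
        directory.createFile("events.json");
    }
}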

Azure Data Lake supports query acceleration and an analytics framework. It significantly improves data processing by only retrieving data that is relevant to an operation. This cascades to reduced time and processing power for the end-to-end scenarios that are necessary to gain critical insights into stored data. Both ‘filtering predicates' and ‘column projections’ are enabled, and SQL can be used to describe them. Only the data that meets these conditions is transmitted. A request processes only one file, so joins, aggregates and other query operators are not supported, but the data can be in formats such as CSV or JSON. The query acceleration feature isn’t limited to Data Lake Storage. It is supported even on blobs in storage accounts that form the persistence layer below the containers of the data lake. Even accounts without a hierarchical namespace are supported by the Azure Data Lake query acceleration feature. Query acceleration is part of the data lake, so applications can be switched with one another, and the data selectivity and improved latency continue across the switch. Since the processing is on the side of the Data Lake, the pricing model for query acceleration differs from that of the normal transactional model.
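As a sketch of how an application pushes the predicate and projection down to the service, the azure-storage-blob Java SDK exposes a query call that streams back only the matching rows; the container, blob and query below are placeholders over a hypothetical header-less CSV layout:

import com.azure.storage.blob.BlobClient;
import com.azure.storage.blob.BlobServiceClientBuilder;
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class QueryAccelerationSketch {
    public static void main(String[] args) throws Exception {
        BlobClient blob = new BlobServiceClientBuilder()
                .connectionString(System.getenv("AZURE_STORAGE_CONNECTION_STRING"))
                .buildClient()
                .getBlobContainerClient("telemetry")
                .getBlobClient("readings.csv");

        // Filtering predicate and column projection expressed in SQL; only rows
        // that satisfy the condition are transmitted back to the client.
        String query = "SELECT _1, _3 FROM BlobStorage WHERE _3 > 100";
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(blob.openQueryInputStream(query)))) {
            reader.lines().forEach(System.out::println);
        }
    }
}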

Gen2 also supports Premium block blob storage accounts that are ideal for big data analytics applications and workloads. These require low latency and a high number of transactions. Workloads can be interactive, IoT, streaming analytics, artificial intelligence and machine learning.