Monday, January 3, 2022

This is a continuation of a series of articles on the operational engineering aspects of Azure public cloud computing, including the most recent discussion on Azure Data Lake, which is a full-fledged general availability service that provides Service Level Agreements similar to others in its category. This article focuses on Azure Data Lake, which is suited to storing and handling Big Data. It is built over Azure Blob Storage, so it provides native support for web-accessible documents. It is not a massive virtual data warehouse, but it powers a lot of analytics and is the centerpiece of most solutions that conform to the Big Data architectural style.

Azure Data Lake supports query acceleration and an analytics framework. It significantly improves data processing by retrieving only the data that is relevant to an operation. This cascades into reduced time and processing power for the end-to-end scenarios that are necessary to gain critical insights into stored data. Both filtering predicates and column projections are enabled, and SQL can be used to describe them. Only the data that meets these conditions is transmitted. A request processes only one file, so joins, aggregates and other multi-file query operators are not supported, but the file can be in formats such as CSV or JSON. The query acceleration feature is not limited to Data Lake Storage. It is supported even on blobs in the storage accounts that form the persistence layer below the containers of the data lake, including accounts without a hierarchical namespace. Because query acceleration is part of the data lake, applications can be switched with one another, and the data selectivity and improved latency carry over across the switch. Since the processing happens on the Data Lake side, the pricing model for query acceleration differs from the normal transactional model.
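
As a rough illustration of pushing a filtering predicate and a column projection down to the service, the following sketch uses the Java Storage SDK. It assumes the azure-storage-blob client library at a version that exposes openQueryInputStream; the account endpoint, container name ("data") and blob name ("records.csv") are placeholders.

import com.azure.identity.DefaultAzureCredentialBuilder;
import com.azure.storage.blob.BlobClient;
import com.azure.storage.blob.BlobServiceClient;
import com.azure.storage.blob.BlobServiceClientBuilder;
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class QueryAccelerationSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoint and names; replace with a real account, container and CSV blob.
        BlobServiceClient service = new BlobServiceClientBuilder()
                .endpoint("https://<account>.blob.core.windows.net")
                .credential(new DefaultAzureCredentialBuilder().build())
                .buildClient();
        BlobClient blob = service.getBlobContainerClient("data").getBlobClient("records.csv");

        // Only the rows and columns matching this SQL are transferred back to the client.
        String sql = "SELECT _1, _2 FROM BlobStorage WHERE _2 = 'error'";
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(blob.openQueryInputStream(sql)))) {
            reader.lines().forEach(System.out::println);
        }
    }
}

The same kind of call can be issued against blobs in accounts without a hierarchical namespace, which is the point made above about query acceleration not being limited to Data Lake Storage.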

There are three client tools for working with Azure Data Lake Storage:

1. The Azure Portal, which provides the convenience of a web user interface and can explore all forms of blobs, tables, queues and files.

2. The Azure Storage Explorer, which can be downloaded and is just as useful for exploration as the Azure Portal.

3. The Microsoft Visual Studio Cloud Explorer, which supports exploring blobs, tables and queues but not files.

A number of third-party tools are also available for working with Azure Storage data.

The known issues with using Data Lake Storage Gen2 include the following:

1. The similarities and contrasts between the Gen2, NFS 3.0 and Data Lake Storage APIs, where all of them can operate on the same data but cannot write to the same instance of a file. The Gen2 or NFS 3.0 APIs can write to a file, but it won't be visible to the Get Block List blob API unless the file is being overwritten with a zero-truncate option.

2. The Put Blob (Page), Put Page, Get Page Ranges, Incremental Copy Blob and Put Page From URL APIs are not supported. Storage accounts that have a hierarchical namespace do not permit unmanaged VM disks to be added to the account.

3. Access control lists (ACLs) are widely used with storage entities by virtue of assignment or inheritance. Both operations allow access controls to be set in bulk via recursion. ACLs can be set via Azure Storage Explorer, PowerShell, the Azure CLI, and the .NET, Java and Python SDKs, but not via the Azure Portal, which manages resources rather than the data in containers. A recursive ACL update via the SDK is sketched after this list.

4. If anonymous read access has been granted to a data container, then ACLs will have no effect on read requests, but they can still be applied to write requests.

5. Only the latest versions of AzCopy, such as v10 onwards, and Azure Storage Explorer v1.6.0 or higher are supported.
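
To illustrate item 3, here is a minimal sketch of a bulk, recursive ACL update with the Java SDK. It assumes the azure-storage-file-datalake library, a file system named "lake" and a directory named "raw"; the endpoint and ACL string are placeholders, and method names may differ slightly across SDK versions.

import com.azure.identity.DefaultAzureCredentialBuilder;
import com.azure.storage.file.datalake.DataLakeDirectoryClient;
import com.azure.storage.file.datalake.DataLakeServiceClient;
import com.azure.storage.file.datalake.DataLakeServiceClientBuilder;
import com.azure.storage.file.datalake.models.PathAccessControlEntry;
import java.util.List;

public class RecursiveAclSketch {
    public static void main(String[] args) {
        // Placeholder endpoint; the file system and directory names are assumptions for illustration.
        DataLakeServiceClient service = new DataLakeServiceClientBuilder()
                .endpoint("https://<account>.dfs.core.windows.net")
                .credential(new DefaultAzureCredentialBuilder().build())
                .buildClient();
        DataLakeDirectoryClient directory = service
                .getFileSystemClient("lake")
                .getDirectoryClient("raw");

        // Grant read and execute to the owning group on the directory and everything beneath it.
        List<PathAccessControlEntry> acl =
                PathAccessControlEntry.parseList("user::rwx,group::r-x,other::---");
        directory.setAccessControlRecursive(acl);
    }
}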

Third-party applications are best advised to use the REST APIs or SDKs since they will continue to be supported. Deletion of logs from the storage account can be performed with one of the above tools, the REST APIs or the SDKs, but the setting for retention days is not supported. The Windows Azure Storage Blob (WASB) driver works only with Blob APIs, and enabling multi-protocol access on Data Lake Storage won't mitigate the issue of the WASB driver not working with some common cases of the data lake. If the parent folder of soft-deleted files or folders is renamed, then they won't display correctly in the portal, but PowerShell and the CLI can be used to restore them. An account with an event subscription will not be able to read from secondary storage endpoints, but this can be mitigated by removing the event subscription.

Sunday, January 2, 2022

 Identity and payment

One of the most important interactions between individuals is payment. A payment service recognizes both the payer and the payee based on their identity. The service delivers a comprehensive one-stop shop for all forms of exchange. This makes it easy and consistent for small businesses and individuals to follow and sign up for its program. The payment service virtualizes payments across geographies, policies, statutes and limitations while facilitating the mode of payment and receipts for the individuals.  
Identity and Access Management (IAM) is a critical requirement for this service, without which the owner reference cannot be resolved. But this identity and access management does not need to be rewritten as part of the services offered. An identity cloud is a good foundation for IAM integration, and existing membership providers across participating businesses can safely be resolved with this solution.  
The use of a leading IAM provider for the identity cloud could help with the integration of identity-resolution capability for this service. The translation of the owner to the owner_id required by the payment service is automatically resolved by referencing the identifier issued by the cloud IAM. Since the IAM is consistent and accurate, the mappings are straightforward one-to-one. The user does not need to sign in to generate the owner_id. It can be resolved through integration with the membership provider owned by the IAM, which might be on-premises for a client's organization, such as with Active Directory integration, or a multi-tenant offering from their own identity cloud. Since enterprise applications are expected to integrate with Active Directory or the IAM provider, the identity cloud can be considered global for this purpose, but the identities from different organizations will need to be isolated from each other in the identity cloud. Offloading identity to an IAM is a clean separation of concerns for the payment service. 
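
A purely hypothetical sketch of this resolution is shown below; both interfaces and all names are invented for illustration and do not belong to any particular IAM or payment SDK.

// Hypothetical types; neither interface is part of a real IAM or payment library.
interface IdentityCloudClient {
    // Resolves a user principal (for example, an email or UPN from the membership provider)
    // to the stable identifier issued by the cloud IAM.
    String resolveSubjectId(String userPrincipalName);
}

final class OwnerResolver {
    private final IdentityCloudClient iam;

    OwnerResolver(IdentityCloudClient iam) {
        this.iam = iam;
    }

    // The mapping is one-to-one because the IAM identifier is consistent and accurate,
    // so the owner_id required by the payment service is simply the IAM subject id.
    String toOwnerId(String userPrincipalName) {
        return iam.resolveSubjectId(userPrincipalName);
    }
}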

But the definition of payments and the definition of identities in these cases must share some overlap in the financial accounts from which the payments originate, which prevents either the payment service or the identity service from doing away with the other's participation. A microservices architectural style can resolve this interdependency with an exchange of calls between the two, but there is no resiliency or continuity of business unless each is highly available to the other. Instead, if financial account information were to become a form of identity, even a distributed ledger would be sufficient to do away with both.  

The difference between a B2B and a B2C reward-points service stands out further in this regard when an employee can use the same sign-in without having to sign into a bank as well. With the integration of enterprise-wide IAM to an identity cloud provider and the availability of attributes via SAML, the mapping of transactions to identity becomes automatic from the user-experience point of view, leaving only the Identity and Access Management service in use at the frontend. The payment service operates in the background with the payer and payee passed as references.  

We assume that the payment service maintains financial information accurately in its layer. The service, accumulating and redeeming from a notion of balance associated with an individual, will update the same source of truth. This is neither a synchronous transaction nor a single repository and must involve some form of reconciliation, while the identity may even disappear. Both the identity and payment services recognize the same individual for the transactions because of a mashup presented to the user and a backend registration between them. With the notion of payments included within an identity, there is more competition, less monopoly, and a more deregulated economy. The payment service's use of the notion of identity differs widely from that of the identity services, as each plays up its own capabilities with this representation. This leads to a more diverse set of experiences and ultimately a win-win for the individual.  

The payment transactions must have change data capture and some form of audit trail. This is essential to post-transaction resolution, reconciliation, and investigations into their occurrence. The identity service facilitates a session identifier that can be used with single sign-on for repeated transactions. This is usually obtained as a hash of the session cookie and is provided by the authentication and authorization server. The session identifier can be requested as part of the login process or by a separate API call to a session endpoint. An active session removes the need to re-authenticate regardless of the transactions performed. It provides a familiar end-user experience and functionality. The session can also be used with user-agent features or extensions that assist with authentication, such as a password manager or a two-factor device reader.   
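
As a sketch only, a separate API call to a session endpoint after login might look like the following; the endpoint path, header and response shape are all hypothetical and not a documented API of any particular provider.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SessionIdentifierSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical session endpoint on the authorization server.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://login.example.com/api/v1/sessions/me"))
                .header("Cookie", "sid=<session-cookie>")   // cookie obtained at login
                .GET()
                .build();

        // The response body is assumed to carry a session identifier derived as a hash of the
        // session cookie, which can then accompany repeated payment calls without re-authentication.
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}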

Finally, both services can manifest as mobile applications, cloud services, and B2B multi-tenant SaaS offerings to their customers, with or without each other and with little or no restriction to their individual capabilities. 

Saturday, January 1, 2022

Git Pull Request Iteration Changes 

Problem statement: Git is a popular tool for software developers to share and review code before committing it to the mainline source code repository. Collaborators often exchange feedback and comments on a web user interface by creating a pull-request branch that captures the changes to the source branch. As it accrues the feedback and the consequent modifications, it gives reviewers an opportunity to finalize and settle on the changes before they are merged into the target branch. This change log is independent of the source and the target branch because both can be modified while the change capture is a passive recorder of those iterations. It happens to be called a pull request because every change made to the source branch is pulled in as an iteration. The problem is that when the iterations grow to a large number, the updates made to the pull request become a sore point for reviewers, as they cannot see all the modifications in the latest revision. If the pull request iteration changes could be coalesced into fewer iterations, the pull request would be easier to follow and look cleaner in retrospect. The purpose of this document is to find ways to do so, given that the change capture is, by design, independent of the requestor and the reviewer. 

Solution: A pull-request workflow consists of the following:  

Step 1. Pull the changes to your local machine. 

Step 2. Create a "branch" version 

Step 3. Commit the changes 

Step 4. Push the changes to a remote branch 

Step 5. Open a pull-request between the source and the target branch. 

Step 6. Receive feedback and make iterations for the desired changes. 

Step 7. Complete the pull-request by merging the final iteration into the master branch. 

When the iterations in Steps 6 and 7 grow to a large number, a straightforward way to reduce them is to close one pull-request and start another, but developers and collaborators will then need to track multiple pull-requests. Since this presents a moving target to complete, it would be preferable to modify a pull-request in place.  

The Azure DevOps Git REST API offers an option to retrieve the pull-request iterations but not a way to edit them. For example, GET https://dev.azure.com/{organization}/{project}/_apis/git/repositories/{repositoryId}/pullRequests/{pullRequestId}/iterations/{iterationId}/changes?api-version=5.1 gives the changes between two iterations, but this is not something we can edit. 
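
A minimal read-only call to this endpoint from Java could look like the sketch below; the organization, project, repository, pull request and iteration values are stand-in placeholders, and a personal access token is assumed for basic authentication.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

public class PullRequestIterationChangesSketch {
    public static void main(String[] args) throws Exception {
        // Stand-in values for organization, project, repository, pull request and iteration.
        String url = "https://dev.azure.com/myorg/myproject/_apis/git/repositories/myrepo"
                + "/pullRequests/101/iterations/2/changes?api-version=5.1";
        // Basic authentication with a personal access token (empty user name, PAT as password).
        String token = Base64.getEncoder()
                .encodeToString((":" + "<personal-access-token>").getBytes());

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(url))
                .header("Authorization", "Basic " + token)
                .GET()
                .build();

        // The JSON response lists the change entries between two iterations; it is read-only.
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}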

The Git client SDK also offers a similar construct to view GitPullRequestIterationChanges, but these are not modifiable. 

Instead, we can run the following commands: 

cd /my/fork 

git checkout master  

git commit -va -m "Coalescing PR comments" 

git push 

If the commits need to be coalesced, this can be done with the git pull --rebase command. The squash option applies to merge. Bringing in changes from the master can be done either via rebase or merge, and depending on the case, this can mean accepting the HEAD revision for the former or the TAIL revision for the latter. 

The git log --first-parent command can help view only the merge commit in the log. 

If the pull request is being created from a fork, the pull-request author can grant permissions at the time of creation to allow contributors to the upstream repository to make commits to the pull request's compare branch. 


Friday, December 31, 2021

This is a continuation of a series of articles on the operational engineering aspects of Azure public cloud computing, including the most recent discussion on Azure Data Lake, which is a full-fledged general availability service that provides Service Level Agreements similar to others in its category. This article focuses on Azure Data Lake, which is suited to storing and handling Big Data. It is built over Azure Blob Storage, so it provides native support for web-accessible documents. It is not a massive virtual data warehouse, but it powers a lot of analytics and is the centerpiece of most solutions that conform to the Big Data architectural style. In this section, we continue our focus on the programmability aspects of Azure Data Lake.

We discussed that there are two forms of programming with Azure Data Lake – one that leverages U-SQL and another that leverages open-source programs such as those for Apache Spark. U-SQL unifies the benefits of SQL with the expressive power of your own code. SQLCLR improves the programmability of U-SQL by allowing the user to write user-defined operators such as functions, aggregations and data types that can be used in conventional SQL expressions or require only an invocation from a SQL statement. 

The other form of programming largely applies to HDInsight, as opposed to U-SQL for Azure Data Lake Analytics (ADLA), and targets data in batch processing, often involving map-reduce algorithms on Hadoop clusters. Also, Hadoop is inherently batch-oriented, while the Microsoft stack allows streaming as well. Some open-source frameworks like Flink, Kafka, Pulsar and StreamNative support stream operators. While Kafka treats stream processing as a special case of batch processing, Flink does just the reverse. Apache Flink also provides a SQL abstraction over its Table API. A sample Flink query might look like this:

final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// tuples is assumed to be a List<Tuple2<String, Integer>> prepared elsewhere;
// ExtractHashTags, FilterHashTags and GetTopHashTag are user-defined operators.
DataStream<Tuple2<String, Integer>> source = env.fromCollection(tuples);

source.flatMap(new ExtractHashTags())
      .keyBy(0)
      .timeWindow(Time.seconds(30))
      .sum(1)
      .filter(new FilterHashTags())
      .timeWindowAll(Time.seconds(30))
      .apply(new GetTopHashTag())
      .print();

env.execute("top-hashtag");

Notice the use of pipelined execution and the writing of functions that process the input on a per-element basis. A sample function looks like this:

package org.apache.pulsar.functions.api.examples;

import java.util.function.Function;

public class ExclamationFunction implements Function<String, String> {
    @Override
    public String apply(String input) {
        return String.format("%s!", input);
    }
}

Both forms have their purpose and the choice depends on the stack used for the analytics.

 

Thursday, December 30, 2021

This is a continuation of a series of articles on the operational engineering aspects of Azure public cloud computing, including the most recent discussion on Azure Data Lake, which is a full-fledged general availability service that provides Service Level Agreements similar to others in its category. This article focuses on Azure Data Lake, which is suited to storing and handling Big Data. It is built over Azure Blob Storage, so it provides native support for web-accessible documents. It is not a massive virtual data warehouse, but it powers a lot of analytics and is the centerpiece of most solutions that conform to the Big Data architectural style. In this section, we continue our focus on the programmability aspects of Azure Data Lake.

The power of Azure Data Lake is better demonstrated by the U-SQL queries that can be written without any consideration that they are being applied at Big Data scale. U-SQL unifies the benefits of SQL with the expressive power of your own code, and SQLCLR improves the programmability of U-SQL. Conventional SQL expressions like SELECT, EXTRACT, WHERE, HAVING, GROUP BY and DECLARE can be used as usual, while C# expressions bring user-defined types (UDTs), user-defined functions (UDFs) and user-defined aggregates (UDAs). These types, functions and aggregates can be used directly in a U-SQL script. For example, SELECT Convert.ToDateTime(Convert.ToDateTime(@dt).ToString("yyyy-MM-dd")) AS dt, dt AS olddt FROM @rs0; where @dt is a datetime variable, makes the best of both C# and SQL. The power of SQL expressions cannot be overstated for many business use-cases, and they often suffice by themselves, but having the SQLCLR programmability means we can even take the processing all the way into C# and have the SQL script be just an invocation. 

The trouble with analytics pipelines is that developers prefer open-source solutions to build them. When we start accruing digital assets in the form of U-SQL scripts, the transition to working with something like Apache Spark might not be straightforward or easy. The Azure analytics layer consists of both HDInsight and Azure Data Lake Analytics (ADLA), which target data differently. HDInsight works on managed Hadoop clusters and allows developers to write map-reduce with open source. ADLA is native to Azure and enables C# and SQL over job services. We will also recall that Hadoop is inherently batch-oriented, while the Microsoft stack allows streaming as well. The steps to transform U-SQL scripts to Apache Spark include the following:

- Transform the job orchestration pipeline to include the new Spark programs.

- Find the differences between how U-SQL and Spark manage your data.

- Transform the U-SQL scripts to Spark. Choose from one of the Azure Data Factory Data Flow, Azure HDInsight Hive, Azure HDInsight Spark or Azure Databricks services.

With these steps, it is possible to have the best of both worlds while leveraging the benefits of each.
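
As a rough sketch of the last step, a simple U-SQL extract-filter-output pattern could be expressed with the Spark Java API roughly as follows; the abfss path, column name and output location are assumptions for illustration, not a prescribed migration.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.col;

public class UsqlToSparkSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("usql-to-spark-sketch")
                .getOrCreate();

        // Rough equivalent of a U-SQL EXTRACT from a CSV in the data lake (path is a placeholder).
        Dataset<Row> rows = spark.read()
                .option("header", "true")
                .csv("abfss://lake@<account>.dfs.core.windows.net/input/searchlog.csv");

        // Rough equivalent of a SELECT ... WHERE in U-SQL (column name assumed).
        Dataset<Row> filtered = rows.filter(col("duration").gt(100));

        // Rough equivalent of OUTPUT ... TO ... in U-SQL.
        filtered.write().mode("overwrite")
                .csv("abfss://lake@<account>.dfs.core.windows.net/output/");

        spark.stop();
    }
}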