Thursday, January 6, 2022

This is a continuation of a series of articles on the operational engineering aspects of Azure public cloud computing, the most recent of which discussed Azure DNS, a full-fledged general-availability service that provides Service Level Agreements similar to those expected from others in its category. In this article, we discuss delegation.

Azure DNS allows hosting a DNS zone and managing the DNS records for a domain in Azure. The domain must be delegated to Azure DNS from the parent domain so that DNS queries for that domain can reach Azure DNS. Since Azure DNS isn't the domain registrar, delegation must be configured properly. A domain registrar is a company that provides internet domain names, and an internet domain is purchased for legal ownership. The registrar must be configured to delegate the domain to Azure DNS.
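Because delegation means handing the parent's NS records over to the Azure DNS name servers, the typical first step is to create the zone in Azure and read back the name servers it was assigned. The following is a minimal sketch with the Python management SDK; it assumes the azure-mgmt-dns and azure-identity packages, and the subscription ID, resource group, and zone name are placeholders.

# Create a zone and print the name servers to configure at the registrar.
# Assumes azure-mgmt-dns and azure-identity are installed; names are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.dns import DnsManagementClient
from azure.mgmt.dns.models import Zone

client = DnsManagementClient(DefaultAzureCredential(), "<subscription-id>")

zone = client.zones.create_or_update("my-rg", "contoso.com", Zone(location="global"))

# These are the name servers to enter as the delegation (NS records) at the registrar.
print(zone.name_servers)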

The domain name system is a hierarchy of domains that begins with the root domain, denoted by a ‘.’, followed by the top-level domains such as ‘com’, ‘net’, and ‘org’. Second-level domains include ‘org.uk’, ‘co.jp’, and so on. The domains in the DNS hierarchy are hosted using separate DNS zones. A DNS zone is used to host the DNS records for a particular domain.

There are two types of DNS servers: 1) an authoritative DNS server, which hosts DNS zones and answers DNS queries only for records in those zones, and 2) a recursive DNS server, which doesn’t host DNS zones but queries the authoritative servers for answers. Azure DNS is an authoritative DNS service.

DNS clients in PCs or mobile devices call a recursive DNS server for the DNS queries their applications need. When a recursive DNS server receives a query for a DNS record, it finds the nameserver for the named domain by starting at the root nameserver and then walking down the hierarchy by following NS referrals. DNS maintains a special type of record called an NS (name server) record, which lets a parent zone point to the nameservers for a child zone. Setting up the NS records for the child zone in the parent zone is called delegating the domain. Each delegation has two copies of the NS records: one in the parent zone pointing to the child, and another in the child zone itself. The records in the child zone are called the authoritative NS records and sit at the apex of the child zone.
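A delegation can be verified by querying the child zone's NS records and checking that they match the name servers assigned by Azure DNS. A small sketch using the dnspython package (an assumption; any resolver tool would do) follows, with a placeholder zone name.

# Quick delegation check; assumes the dnspython package is installed.
import dns.resolver

answers = dns.resolver.resolve("contoso.com", "NS")
for record in answers:
    # Each entry should be one of the Azure DNS name servers assigned to the zone.
    print(record.target)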

Azure DNS records help with name resolution of services and resources. Azure DNS can also manage DNS records for external services, and it supports private DNS zones, which allow custom domain names to be used within private virtual networks.
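Record management itself is a simple create-or-update call. The sketch below adds an A record set to the zone created earlier; the resource group, zone, host name, and IP address are placeholders.

# Create an A record set in the hosted zone; all names and the IP are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.dns import DnsManagementClient
from azure.mgmt.dns.models import RecordSet, ARecord

client = DnsManagementClient(DefaultAzureCredential(), "<subscription-id>")

client.record_sets.create_or_update(
    "my-rg",
    "contoso.com",
    "www",   # relative record set name
    "A",     # record type
    RecordSet(ttl=3600, a_records=[ARecord(ipv4_address="10.0.0.4")]),
)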

It supports record sets, where an alias record set can be configured to refer to an Azure resource. If the IP address of the underlying resource changes, the alias record set updates itself during DNS resolution.

The DNS protocol prevents the assignment of a CNAME record at the zone apex. This restriction presents a problem when load-balanced applications sit behind a Traffic Manager, whose profile requires the creation of a CNAME record. This can be mitigated with alias records, which can be created at the zone apex.
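As a hedged sketch, an alias record set at the apex (the "@" name) can point at a Traffic Manager profile by its resource ID instead of by CNAME; the management SDK expresses alias records through the target_resource field. The subscription, resource group, and profile names below are placeholders.

# An alias A record at the zone apex that targets a Traffic Manager profile.
from azure.identity import DefaultAzureCredential
from azure.mgmt.dns import DnsManagementClient
from azure.mgmt.dns.models import RecordSet, SubResource

client = DnsManagementClient(DefaultAzureCredential(), "<subscription-id>")

profile_id = (
    "/subscriptions/<subscription-id>/resourceGroups/my-rg"
    "/providers/Microsoft.Network/trafficManagerProfiles/my-profile"
)

client.record_sets.create_or_update(
    "my-rg",
    "contoso.com",
    "@",   # the zone apex, where a CNAME is not allowed
    "A",
    RecordSet(ttl=3600, target_resource=SubResource(id=profile_id)),
)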


Wednesday, January 5, 2022

This is a continuation of a series of articles on the operational engineering aspects of Azure public cloud computing, the most recent of which discussed Azure Data Lake, a full-fledged general-availability service that provides Service Level Agreements similar to those expected from others in its category. This article focuses on Azure DNS, a hosting service for DNS domains that provides name resolution using Microsoft Azure infrastructure. It lets us manage DNS records, but it is not a service for buying a domain name; for that, an App Service domain works well. Once domains are available, they can be hosted in Azure DNS for record management.

While Azure DNS has many features, including activity logs, resource locking, and Azure RBAC, DNSSEC is not supported; in most cases the use of HTTPS/TLS reduces the need for it. If DNSSEC is required, the DNS zones can be hosted with third-party DNS hosting providers.

Azure DNS records help with name resolution of services and resources. Azure DNS can also manage DNS records for external services, and it supports private DNS zones, which allow custom domain names to be used within private virtual networks.

It supports record sets, where an alias record set can be configured to refer to an Azure resource. If the IP address of the underlying resource changes, the alias record set updates itself during DNS resolution.

Each resource must be given a name. With records in the domain name server, the resource becomes reachable and resolvable from a virtual network. The following options are available to configure the DNS settings for a resource: 1) using only the host file, which resolves the name locally, 2) using a private DNS zone to override the DNS resolution for a private endpoint, with the zone linked to the virtual network, and 3) using a DNS forwarder with a rule to use the DNS zone in another DNS server. It is preferable not to override a zone that resolves public endpoints, because if connectivity to the public DNS goes down, those public endpoints will become unreachable. That is why a different zone name, such as a ‘privatelink’ subdomain of ‘core.windows.net’, is recommended. Multiple zones with the same name for different virtual networks would need manual operations to merge the DNS records.
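For option 2, the private zone and its virtual-network link can also be scripted. The sketch below assumes the azure-mgmt-privatedns package; the subscription, resource group, virtual network, and link names are placeholders, and the zone uses the 'privatelink' naming convention mentioned above.

# Create a private DNS zone and link it to a virtual network (option 2).
# Assumes azure-mgmt-privatedns; all names are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.privatedns import PrivateDnsManagementClient

client = PrivateDnsManagementClient(DefaultAzureCredential(), "<subscription-id>")

zone_name = "privatelink.blob.core.windows.net"

# Private zones are global resources.
client.private_zones.begin_create_or_update(
    "my-rg", zone_name, {"location": "global"}
).result()

vnet_id = (
    "/subscriptions/<subscription-id>/resourceGroups/my-rg"
    "/providers/Microsoft.Network/virtualNetworks/my-vnet"
)

# Link the zone to the virtual network so its records resolve from that network.
client.virtual_network_links.begin_create_or_update(
    "my-rg",
    zone_name,
    "my-vnet-link",
    {
        "location": "global",
        "virtual_network": {"id": vnet_id},
        "registration_enabled": False,
    },
).result()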

A common problem with traditional DNS records is dangling records. The DNS records that haven’t been updated to reflect changes to IP addresses are called dangling records. With a traditional DNS zone record, the target IP or CNAME no longer exists. It requires manual updates which can be costly. A delay in updating DNS records can potentially cause an extended outage for the users. Alias records avoid this situation by tightly coupling the lifecycle of a DNS record with an Azure resource.

The DNS protocol prevents the assignment of a CNAME record at the zone apex. This restriction presents a problem when load-balanced applications sit behind a Traffic Manager, whose profile requires the creation of a CNAME record. This can be mitigated with alias records, which can be created at the zone apex.

 

Tuesday, January 4, 2022

 

This is a continuation of a series of articles on the operational engineering aspects of Azure public cloud computing, the most recent of which discussed Azure Data Lake, a full-fledged general-availability service that provides Service Level Agreements similar to those expected from others in its category. This article continues the focus on Azure Data Lake, which is suited to storing and handling Big Data. It is built over Azure Blob Storage, so it provides native support for web-accessible documents. It is not a massive virtual data warehouse, but it powers a lot of analytics and is the centerpiece of most solutions that conform to the Big Data architectural style.

This article talks about data ingestion from one location to another in Azure Data Lake Gen 2 using Azure Synapse Analytics. The Gen 2 store is the source data store and requires a corresponding storage account. Azure Synapse Analytics provides many features for data analysis and integration, but its pipelines are even more helpful for working with data.

In Azure Synapse Analytics, we create a linked service, which is a definition of the connection information to another service. When we add Azure Synapse Analytics and Azure Data Lake Gen 2 as linked services, we enable the data to flow continuously over the connection without requiring additional routines. The Azure Synapse Analytics UX has a Manage tab where the option to create a linked service is provided under External Connections. The Azure Data Lake Storage Gen 2 connection supports several authentication types, such as an account key, a service principal, or a managed identity. The connection can be tested prior to use.
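Under the hood, the linked service created in the Manage tab is just a JSON definition. A hedged sketch of the shape of an account-key-based Data Lake Gen 2 linked service is shown below as a Python dictionary; the account name and key are placeholders, and the same shape appears in the Synapse Studio code view.

# Approximate shape of an Azure Data Lake Storage Gen 2 linked service
# definition using account-key authentication; all values are placeholders.
adls_gen2_linked_service = {
    "name": "AzureDataLakeStorageGen2LinkedService",
    "properties": {
        "type": "AzureBlobFS",  # the linked-service type used for Data Lake Gen 2
        "typeProperties": {
            "url": "https://<storage-account>.dfs.core.windows.net",
            "accountKey": {
                "type": "SecureString",
                "value": "<storage-account-key>",
            },
        },
    },
}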

The pipeline definition in Azure Synapse describes the logical flow for an execution of a set of activities. We require a copy activity in the pipeline to ingest data from Azure Data Lake Gen 2 into a dedicated SQL pool. A pipeline option is available under the Orchestrate tab, and activities are associated with the pipeline created there. The Move and Transform option in the Activities pane has a copy-data option that can be dragged onto the pipeline canvas. The copy activity’s source must be defined as a new Azure Data Lake Storage Gen 2 data store, with delimited text specified as the format, along with the file path of the source data and whether the first row is a header.
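The pipeline built on the canvas also reduces to JSON. A rough sketch of its shape follows, with placeholder dataset names, a delimited-text source, and a dedicated SQL pool sink; exact property names may vary with the Synapse release.

# Approximate shape of a pipeline with one copy activity that ingests
# delimited text from Data Lake Gen 2 into a dedicated SQL pool.
# The referenced datasets are placeholders defined elsewhere in the workspace.
ingest_pipeline = {
    "name": "CopyFromAdlsToSqlPool",
    "properties": {
        "activities": [
            {
                "name": "CopyData",
                "type": "Copy",
                "inputs": [{"referenceName": "SourceDelimitedTextDataset",
                            "type": "DatasetReference"}],
                "outputs": [{"referenceName": "DedicatedSqlPoolTable",
                             "type": "DatasetReference"}],
                "typeProperties": {
                    "source": {"type": "DelimitedTextSource"},
                    "sink": {"type": "SqlDWSink"},
                },
            }
        ]
    },
}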

With the pipeline configured this way, a debug run can be executed before the artifacts are published, which verifies that everything is correct. Once the pipeline has run successfully, the publish-all option can be selected to publish the entities to the Synapse Analytics service. When the ‘successfully published’ message appears, we can move on to triggering and monitoring the pipeline.

A trigger can be manually invoked with the Trigger Now option. When this is done, the monitor tab will display the pipeline run along with links under the Actions column. The details of the copy operation can then be viewed. The data written to the dedicated SQL pool can then be verified to be correct.
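The same trigger-and-monitor step can be scripted. The sketch below assumes the azure-synapse-artifacts package and a pipeline named as in the earlier sketch; the workspace endpoint is a placeholder, and the client and method names follow that SDK but should be treated as an approximation rather than a definitive implementation.

# Trigger a pipeline run and poll its status (hedged sketch).
import time

from azure.identity import DefaultAzureCredential
from azure.synapse.artifacts import ArtifactsClient

client = ArtifactsClient(
    credential=DefaultAzureCredential(),
    endpoint="https://<workspace-name>.dev.azuresynapse.net",
)

run = client.pipeline.create_pipeline_run("CopyFromAdlsToSqlPool")

# Poll until the run reaches a terminal state.
while True:
    status = client.pipeline_run.get_pipeline_run(run.run_id).status
    if status not in ("Queued", "InProgress"):
        break
    time.sleep(15)

print(f"Pipeline run finished with status: {status}")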

 

Monday, January 3, 2022

This is a continuation of a series of articles on the operational engineering aspects of Azure public cloud computing, the most recent of which discussed Azure Data Lake, a full-fledged general-availability service that provides Service Level Agreements similar to those expected from others in its category. This article focuses on Azure Data Lake, which is suited to storing and handling Big Data. It is built over Azure Blob Storage, so it provides native support for web-accessible documents. It is not a massive virtual data warehouse, but it powers a lot of analytics and is the centerpiece of most solutions that conform to the Big Data architectural style.

Azure Data Lake supports query acceleration and an analytics framework. It significantly improves data processing by retrieving only the data that is relevant to an operation. This cascades into reduced time and processing power for the end-to-end scenarios that are necessary to gain critical insights into stored data. Both filtering predicates and column projections are supported, and SQL can be used to describe them; only the data that meets these conditions is transmitted. A request processes only one file, so joins, aggregates, and other query operators are not supported, but the file can be in a format such as CSV or JSON. The query acceleration feature isn’t limited to Data Lake Storage: it is supported even on blobs in the storage accounts that form the persistence layer below the containers of the data lake, and even accounts without a hierarchical namespace are supported. Because query acceleration is part of the data lake, applications can be switched with one another, and the data selectivity and improved latency continue across the switch. Since the processing happens on the side of the Data Lake, the pricing model for query acceleration differs from the normal transactional model.
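A hedged sketch of query acceleration from Python follows, using the query_blob operation of the azure-storage-blob package. The connection string, container, blob name, and CSV column names are invented for illustration; the query pushes a filtering predicate and a column projection down to the service so only matching data is transmitted.

# Query acceleration: project one column and filter rows server-side.
from azure.storage.blob import BlobClient, DelimitedTextDialect

blob = BlobClient.from_connection_string(
    "<connection-string>", container_name="telemetry", blob_name="readings.csv"
)

input_format = DelimitedTextDialect(delimiter=",", quotechar='"',
                                    lineterminator="\n", has_header=True)

# Only the projected column of the rows matching the predicate is returned.
reader = blob.query_blob(
    "SELECT Temperature FROM BlobStorage WHERE SensorId = 'S1'",
    blob_format=input_format,
)
print(reader.readall().decode("utf-8"))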

There are three different client tools for working with Azure Data Lake Storage: 1) the Azure Portal, which provides the convenience of a web user interface and can explore all forms of blobs, tables, queues, and files, 2) Azure Storage Explorer, which can be downloaded and is just as useful for exploration as the Azure Portal, and 3) the Microsoft Visual Studio Cloud Explorer, which supports exploring blobs, tables, and queues, but not files. There are also a number of third-party tools available for working with Azure Storage data.

The known issues with using Gen2 data lake storage include the following:

1. The Blob APIs, NFS 3.0, and Data Lake Storage Gen 2 APIs can operate on the same data but cannot write to the same instance of a file. A file written with the Gen 2 or NFS 3.0 APIs won’t be visible to the Get Block List blob API unless the file is being overwritten with the zero-truncate option.

2. The PUT Blob (Page), Put Page, Get Page Ranges, Incremental Copy Blob and Put Page From URL APIs are not supported. Storage accounts that have a hierarchical namespace cannot permit unmanaged VM disks to be added to the account.

3. Access control lists (ACLs) are widely used with storage entities by virtue of assignment or inheritance. Both operations allow access controls to be set in bulk via recursion. ACLs can be set via Azure Storage Explorer, PowerShell, the Azure CLI, and the .NET, Java, and Python SDKs, but not via the Azure Portal, which manages resources rather than data containers (a sketch with the Python SDK appears at the end of this post).

4. If anonymous read access has been granted to a data container, then ACLs will have no effect on read requests, but they can still be applied to write requests.

5. Only the latest versions of AzCopy (v10 onwards) and Azure Storage Explorer (v1.6.0 or higher) are supported.

Third-party applications are best advised to use the REST APIs or SDKs since they will continue to be supported. Deletion of logs from a storage account can be performed with one of the above tools, the REST APIs, or the SDKs, but the setting for retention days is not supported. The Windows Azure Storage Blob (WASB) driver works only with Blob APIs, and setting multiprotocol access on Data Lake Storage won’t mitigate the issue of the WASB driver not working with some common Data Lake cases. If the parent folder of soft-deleted files or folders is renamed, then they won’t display correctly in the portal, but PowerShell and the CLI can be used to restore them. An account with an event subscription will not be able to read from secondary storage endpoints, but this can be mitigated by removing the event subscription.
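As referenced in item 3 above, here is a minimal sketch of setting ACLs in bulk via recursion with the azure-storage-filedatalake Python SDK; the account, file system, directory names, and ACL string are placeholders.

# Apply an ACL to a directory and everything beneath it (recursive).
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://<storage-account>.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)

directory = service.get_file_system_client("raw").get_directory_client("sales")

# The ACL string follows the POSIX-style user/group/other notation.
directory.update_access_control_recursive(acl="user::rwx,group::r-x,other::---")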

Sunday, January 2, 2022

 Identity and payment

One of the most important interactions between individuals is payments. A payment service recognizes both the payer and the payee based on their identity. The service delivers a comprehensive one-stop shop for all forms of exchange, which makes it easy and consistent for small businesses and individuals to follow and sign up for its program. The payment service virtualizes payments across geographies, policies, statutes, and limitations while facilitating the mode of payment and receipts for the individuals.
Identity and Access Management is a critical requirement for this service without which the owner reference cannot be resolved. But this identity and access management does not need to be rewritten as part of the services offered. An identity cloud is a good foundation for IAM integration and existing membership providers across participating businesses can safely be resolved with this solution.  
The use of a leading IAM provider for the identity cloud could help with integrating identity-resolution capability into this service. The translation of the owner to the owner_id required by the payments service is automatically resolved by referencing the identifier issued by the cloud IAM. Since the IAM is consistent and accurate, the mappings are straightforward and one-to-one. The user does not need to sign in to generate the owner_id. It can be resolved by integrating the membership provider owned by the IAM, which might be on-premises for a client’s organization, such as with Active Directory integration, or a multi-tenant offering from their own identity cloud. Since enterprise applications are expected to integrate with the Active Directory or IAM provider, the identity cloud can be considered global for this purpose, but identities from different organizations will need to be isolated from each other within it. Offloading identity to an IAM is a clean separation of concerns for the payment service.

But the definition of payments and the definition of identities in these cases must share some overlap in the financial accounts from which the payments originate, which means neither the payment service nor the identity service can do away with the other’s participation. A microservices architectural style can resolve this interdependency with an exchange of calls between the two, but there is no resiliency or continuity of business unless each is highly available to the other. Instead, if financial account information were to become a form of identity, even a distributed ledger would be sufficient to do away with both.

The difference between a B2B and a B2C reward-points service stands out further in this regard, where an employee can use the same sign-in without also having to sign in to a bank. With the integration of enterprise-wide IAM into an identity cloud provider and the availability of attributes via SAML, the mapping of transactions to identity becomes automatic from the user-experience point of view, leaving only the Identity and Access Management service in use at the frontend. The payment service operates in the background with the payer and payee passed as referrals.

We assume that the payment service maintains financial information accurately in its layer. The service accumulating and redeeming from a notion of balance associated with an individual will update the same source of truth. This is neither a synchronous transaction nor a single repository and must involve some form of reconciliation, even while the identity may disappear. Both the identity and payment services recognize the same individual for the transactions because of a mashup presented to the user and a backend registration between them. With the notion of payments included within an identity, there is more competition, less monopoly, and a more deregulated economy. The payment service’s use of a notion of identity differs widely from that of the identity services, as each plays up its own capabilities with this representation. This leads to more diverse experiences and ultimately a win-win for the individual.

The payment transactions must have change data capture and some form of audit trail. This is essential to post-transaction resolution, reconciliation, and investigations into their occurrence. The identity services facilitate a session identifier that can be used with single sign-on for repeated transactions. This is usually obtained as a hash of the session cookie and is provided by the authentication and authorization server. The session identifier can be requested as part of the login process or by a separate API call to a session endpoint. An active session removes the need to re-authenticate regardless of the transactions performed. It provides a familiar end-user experience and functionality. The session can also be used with user-agent features or extensions that assist with authentication, such as a password manager or a two-factor device reader.

Finally, both services can offer mobile applications, cloud services, and B2B multi-tenant SaaS offerings to their customers, with or without each other, and with little or no restriction on their individual capabilities.

Saturday, January 1, 2022

Git Pull Request Iteration Changes 

Problem statement: Git is a popular tool for software developers to share and review code before committing it to the mainline source code repository. Collaborators often exchange feedback and comments in a web user interface by creating a pull request whose branch captures the changes to the source branch. As it accrues feedback and the consequent modifications, it gives reviewers an opportunity to finalize and settle on the changes before they are merged to the target branch. This change log is independent of the source and the target branch, because both can be modified while the change capture is a passive recorder of those iterations. It happens to be called a pull request because every change made to the source branch is pulled in as an iteration. The problem is that when the iterations grow to a large number, the updates made to the pull request become a sore point for the reviewers, as they cannot see all the modifications in the latest revision. If the pull request iteration changes could be coalesced into fewer iterations, the pull request becomes easier to follow and looks cleaner in retrospect. The purpose of this document is to find ways to do so, given that the iteration log is by design independent of the requestor and the reviewer.

Solution: A pull-request workflow consists of the following:  

Step 1. Pull the changes to your local machine. 

Step 2. Create a "branch" version 

Step 3. Commit the changes 

Step 4. Push the changes to a remote branch 

Step 5. Open a pull-request between the source and the target branch. 

Step 6.  Receive and make iterations for changes desired. 

Step 7. Complete the pull-request by merging the final iteration to the master. 

When the iterations for Step 6 and Step 7 grow to a large number, a straightforward way to reduce iterations is to close one pull request and start another, but developers and collaborators will then need to track multiple pull requests. Since this presents a moving target to complete, it would be preferable to modify a pull request in place. 

Git REST API offers an option to retrieve the pull-request revisions but not a way to edit them. For example, we have GET https://dev.azure.com/{organization}/{project}/_apis/git/repositories/{repositoryId}/pullRequests/{pullRequestId}/iterations/{iterationId}/changes?api-version=5.1 that gives the changes between two iterations but this is not something we can edit. 
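For completeness, a hedged sketch of calling that endpoint with a personal access token follows; the organization, project, repository, and ID values are placeholders, and basic authentication with a PAT is one common way to authorize Azure DevOps REST calls.

# Read-only retrieval of the changes between pull-request iterations.
# All identifiers and the personal access token (PAT) are placeholders.
import requests

url = (
    "https://dev.azure.com/{organization}/{project}/_apis/git/repositories/"
    "{repositoryId}/pullRequests/{pullRequestId}/iterations/{iterationId}/changes"
).format(
    organization="<org>",
    project="<project>",
    repositoryId="<repo-id>",
    pullRequestId="<pr-id>",
    iterationId="<iteration-id>",
)

response = requests.get(url, params={"api-version": "5.1"}, auth=("", "<pat>"))
response.raise_for_status()
print(response.json())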

The git SDK also offers a similar construct to view the GitPullRequestIterationChanges but these are not modifiable. 

Instead, we can run the following commands: 

cd /my/fork 

git checkout master  

git commit -va -m "Coalescing PR comments" 

git push 

If the commits need to be coalesced, this can be done with the git pull --rebase command; the squash option is for merge. Bringing in changes from the master can be done either via rebase or merge and, depending on the case, this can mean accepting the HEAD revision for the former or the TAIL revision for the latter. 

The git log --first-parent option can help view only the merge commits in the log. 

If the pull request is being created from a fork, the pull-request author can grant permissions at the time of creation to allow contributors to upstream repositories to make commits to the pull request's compare branch. 
