Thursday, June 8, 2023

 

Sample Spark code for Databricks notebook:

Read a binary file (a pickled Keras tokenizer):

!pip install Keras-Preprocessing

import pickle

file_location = "abfss://container@storageaccount.dfs.core.windows.net/path/to/file.bin"

df = spark.read.format("binaryFile").option("pathGlobFilter", "*.bin").load(file_location)

val = df.rdd.map(lambda row: bytes(row.content)).first()

print(type(val))

tokenizer = pickle.loads(bytearray(val))

print(repr(tokenizer))

print(type(tokenizer))

print(tokenizer.word_index)

'''

<class 'bytes'>

<keras_preprocessing.text.Tokenizer object at 0x7f9d34bbb340>

<class 'keras_preprocessing.text.Tokenizer'>

{'key': value,

'''

Write CSV (Scala):

val mount_root = "/mnt/ContainerName/DirectoryName"

df.coalesce(1).write.format("csv").option("header","true").mode("overwrite").save(s"dbfs:$mount_root/")
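A PySpark equivalent of the Scala write above might look like the following sketch, assuming the same container is already mounted at the mount point shown:

# PySpark sketch: write a single CSV to the mounted path
mount_root = "/mnt/ContainerName/DirectoryName"
(df.coalesce(1)
   .write.format("csv")
   .option("header", "true")
   .mode("overwrite")
   .save("dbfs:" + mount_root + "/"))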

 

Sample Spark code with a SAS URL for a large CSV in external storage:

import requests

 

CHUNK_SIZE=4096

filename = "filename1.csv"

with requests.get("<sas-url>", stream=True) as resp:

  if resp.ok:

    with open("/dbfs/" + filename, "wb") as f:

      for chunk in resp.iter_content(chunk_size=CHUNK_SIZE):

        f.write(chunk)

display(spark.read.csv("dbfs:/" + filename, header=True, inferSchema=True))       

 

 

 

For extra-large items, download to DBFS and work with Python utilities:

import requests

import os

CHUNK_SIZE=4096

filename = "filename2"

if not os.path.isfile("/dbfs/" + filename):
    print("downloading file...")
    with requests.get("<sas-url>", stream=True) as resp:
        if resp.ok:
            with open("/dbfs/" + filename, "wb") as f:
                for chunk in resp.iter_content(chunk_size=CHUNK_SIZE):
                    f.write(chunk)
print("file found...")

file found…
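Once the file is on DBFS, ordinary Python file utilities can work against the local /dbfs/ path. A minimal sketch, counting lines in the downloaded file and assuming the same filename as above:

# Stream through the file via the /dbfs/ fuse path without loading it into memory
with open("/dbfs/" + filename, "rb") as f:
    line_count = sum(1 for _ in f)
print(line_count)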

 

Wednesday, June 7, 2023

Options to access external storage from Databricks notebook:

 

This article describes the ways to access Azure Data Lake Storage from an Azure Databricks notebook.

One of the favored techniques for accessing external storage is pass-through authentication.

We need a credential and a connection to the external storage so that we can write Spark applications in Python. The application runs on a cluster, so the configuration of the cluster is our first consideration. Starting with a single-user cluster and enabling integrated pass-through authentication with Azure Active Directory lets the Spark application recognize the developer account as native to Azure. This does away with any syntax for passing credentials or forming connections, so only the code for reading or writing is required.
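For example, with passthrough enabled on a single-user cluster, a read might be as simple as this sketch (the abfss path is a placeholder):

# No credentials are passed explicitly; the cluster's passthrough identity is used
file_location = "abfss://container@storageaccount.dfs.core.windows.net/path/to/file.csv"
df = spark.read.format("csv").option("header", "true").load(file_location)
display(df)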

 

The following sample shows how to mount the external storage with several credential options, along with references to public documentation:

configs = {
    # Option 1: credential passthrough for ADLS Gen1
    # https://learn.microsoft.com/en-us/azure/databricks/data-governance/credential-passthrough/adls-passthrough
    "fs.adl.oauth2.access.token.provider.type": "CustomAccessTokenProvider",
    "fs.adl.oauth2.access.token.custom.provider": spark.conf.get("spark.databricks.passthrough.adls.tokenProviderClassName"),
    # Option 2: credential passthrough for ADLS Gen2
    "fs.azure.account.auth.type": "CustomAccessToken",
    "fs.azure.account.custom.token.provider.class": spark.conf.get("spark.databricks.passthrough.adls.gen2.tokenProviderClassName"),
    # Option 3: OAuth with a service principal (do not combine with the passthrough keys above;
    # "fs.azure.account.auth.type" would otherwise appear twice in the same dict)
    # https://learn.microsoft.com/en-us/azure/databricks/dbfs/mounts
    # "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-id>",
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope-name>", key="<service-credential-key-name>"),
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
    # Option 4: personal access token or AAD token
    # https://learn.microsoft.com/en-us/azure/databricks/administration-guide/access-control/tokens
    "key.for.access.token": "<bearer-token-from-az-account-get-access-token-cli-command>"
}

 

dbutils.fs.mount(

    source="wasbs://<container>@<storageaccount>.blob.core.windows.net/path/to/resources", # abfss for dfs

    mount_point = "/mnt/resources",

    extra_configs=configs

)

 

df = spark.read.format("binaryFile").option("pathGlobFilter", "*.bin").load("dbfs:/mnt/resources/file")

df.show()

dbutils.fs.unmount("/mnt/resources")

Tuesday, June 6, 2023

Azure Resource - overview for configuring App Services

Introduction

Azure App Services host small-footprint web applications and must be configured properly to improve the developer experience and reduce operational engineering overhead.

The following guidelines include information on the settings for the Terraform-defined app_services data structure, so that there are fewer exceptions.

 

A sample app_service definition would look something like this in a Terraform variables file:

```

  app-projectx-nonprod-1 = {

    name                 = "app-projectx-nonprod-1"

    tags                 = {

      Category = "analytical"

    }

    resource_group_key   = "rg-projectx-nonprod-01"

    app_service_plan_key = "ASP-projectx-nonprod-01"

    https_only           = true

    identity = {

      type         = "SystemAssigned"

      identity_ids = null

    }

    mi_access = {

      keyvault_key = "kv-projectx-nonprod-01"

    }

    app_settings = {

    }

  }

```

 

Attributes

Notice that the tags are inherited from the global definition and merged with what is defined in the section above. The costcenter and environment tags are set globally. The cradle is best determined by the app author and is worth calling out specifically in this app_service resource.

 

The resource group should be chosen such that the app_service has the same lifespan as the other resources in the resource group. Choosing a new resource group for every app_service is not advisable, but if the resource group is shared, your application will likely linger long after it has served its purpose.

 

The App Service plan represents the computing resources you pay for to host your app_service. You save money by putting multiple apps into one App Service plan. If there are resources to handle the load, app_services can be added to the same app_service_plan, and when one app_service_plan is full, new apps can spill over to another. The SKU of the app_service plan determines the maximum number of applications that can be added: 8, 16, 32, and so on. It is safe to put multiple applications into the same plan if the vCPU usage is regular. This is true for both non-production and production environments. If you do not want the app_service to share the same fate as others in the plan, a new app_service_plan can be created. You could also isolate your app into a new plan when the app is resource-intensive or has different scalability requirements from those in the existing plan.

 

Secure the traffic to be https_only. This does not change even when the app_service is accessed only through an application gateway. It is also preferable to have a custom domain common to both the application gateway and the app_service, with the hostname of the app_service passed through. Application Gateways have concerns beyond the app_service, so your settings should not override the defaults to begin with; once the gateway is functional, the URL for your application will be made available. App services making calls to other app_services via the azurewebsites.net URL will likely use this new custom domain name when it is made available by the gateway.

 

Your network for the app_service does not need to be made private or require private links and virtual network integration. Unlike other resources, the IP address allocated to the app_service is not part of the subnets defined for the infrastructure. An app service that is reachable by public IP address can be fully secured with access restrictions such that the only IP address allowed to call it is that of the gateway. Please refrain from changing network settings for unreachability or for specifying them in your certificates. The default certificate and address for your app_service are sufficient until the gateway consolidates the traffic.

 

You do not need to request certificates specific to your application name. That is usually a concern for the application gateway and not the app_service. Also, the default provisioning of TLS over azurewebsites.net is sufficient for ensuring encrypted traffic regardless of the origin.

 

The connections that your applications make to data sources are optimally routed by the network regardless of whether your app is the source or the sink. No additional steps need to be taken at the network level. On the other hand, access control via identity helps to secure the connections that the application makes with data stores and other Azure resources. Please do not use service principals. The identity block in the section above generates a managed identity that is granted access to resources such as Key Vaults. A role such as Key Vault Secrets User is used to assign role-based access to the managed identity of the app_service. No new credentials or connections need to be made by the app_service, and it can read the secrets directly without passing any credentials. The keyvault_key specified in the mi_access section above determines the specific Key Vault to which the app_service will have access.
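As an illustration, application code running on the app_service might read a secret with its managed identity and no stored credentials. This is only a sketch using the azure-identity and azure-keyvault-secrets libraries; the vault URL and secret name are placeholders:

# Sketch only: vault URL and secret name are hypothetical placeholders
from azure.identity import DefaultAzureCredential   # resolves to the managed identity on the app_service
from azure.keyvault.secrets import SecretClient

credential = DefaultAzureCredential()
client = SecretClient(vault_url="https://kv-projectx-nonprod-01.vault.azure.net/", credential=credential)
secret = client.get_secret("example-secret-name")   # requires the Key Vault Secrets User role assignment
print(secret.value is not None)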

 

Some app_settings are customizable for the app_service, but the use of environment variables is discouraged as much as possible. Specifically, client IDs, credentials, passwords, and environment names do not need to be passed in. These are available to be read dynamically and, in many cases, are not required at all.

Alerts for your application have been included with your app_service and sample dashboards are available in the DSI IaC Common repository so that your application is observable and the time for troubleshooting is reduced.

 

Conclusion

With the above guidelines, we remove the unnecessary inclusions that creep into the app_service configuration and call out the essentials so that there can be more focus on the business logic.

Monday, June 5, 2023

Isolation and access control in Greenfield projects:

Innovation is critical to businesses that seek to break new ground or gain a competitive advantage. Prototyping is central to innovation, as new ideas are best incubated and demonstrated with new projects. The leap between version 1 of a product and its user acceptance is easier when end-users can take it for a spin without being tethered to existing infrastructure, processes, and practices. A Greenfield project is one that unwraps a cloud offering for end-users from scratch. With isolation and access control, the offering can be tried out in a sandbox and unraveled, with many functionalities, into an offering described by terms such as runtime-in-a-box or cloud-in-a-box; the end-user is given the option to treat the entire instance as a private resource, a personal toy if you will, but with the same capabilities as a shared instance for various organizations, teams, and members.

The setup and deployment of tools on hosts were facilitated by installers that benefited from commit-and-rollback transactions and a software-maker-defined order of execution, bringing with them several decades of experience in gaining end-user acceptance. Cloud solutions have rarely been seen as personal computing that warrants a similar experience for end-users, and they tend to be shared between teams and members with subscriptions and resource groups. Fortunately, cloud adoption has brought about significant strides in the popularity of change tracking and control technologies and manifestations of existing and new cloud resources.

Infrastructure-as-Code, or IaC for short, is a declarative paradigm: a language for describing infrastructure and the state that it must achieve. The service that understands this language supports tags, RBAC, declarative syntax, locks, policies, and logs for the resources and their create, update, and delete operations, which can be exposed via the command-line interface, scripts, web requests, and the user interface. The declarative style also helps to boost agility, productivity, and quality of work within organizations.

Terraform's appeal is that it can be used with multiple IaC providers for end-to-end integration. For example, it can deploy Azure Functions and a storage account with Azure, manage Microsoft Azure Active Directory users and groups, and provision repositories in GitHub with teams that correspond to those users and groups. It is the poetry-like brevity of the IaC that makes it easier to explain the sequences and dependencies for describing the concepts for solutions to problems.

Isolation and access control are not specific to cloud artifacts; they have traditionally been used with both code and data. The innovation is leveraging GitHub repositories and teams to introduce change tracking and control into processes that were previously as hidden as the setup and deployment of products and cloud solutions, along with the organization and structure that a filesystem brings, without requiring transactions and instead relying on robust idempotent operations. One of the benefits of using an independent repository is that we can achieve simultaneous publishing across replications or other scenarios, including Continuous Integration/Continuous Deployment.

For example, the following example in IaC describes the concept of a massive copy tool as a personal cloud utility for moving terabytes of data in hours, with the convenience of isolation and access control for an end-user and her private collaborators.

Sunday, June 4, 2023

 

Packaging a cloud solution as a blueprint:

The setup and deployment of tools on hosts were facilitated by installers that benefited from commit-and-rollback transactions and a software-maker-defined order of execution, bringing with them several decades of experience in gaining end-user acceptance. Cloud solutions have rarely been seen as personal computing that warrants a similar experience for end-users. This article explains the methods for packaging and shipping cloud software solutions.

Azure Blueprints can be leveraged to allow an engineer or architect to sketch a project's design parameters and define a repeatable set of resources that implements and adheres to an organization's standards, patterns, and requirements. It is a declarative way to orchestrate the deployment of various resource templates and other artifacts such as role assignments, policy assignments, ARM templates, and resource groups. Blueprint objects are stored in Cosmos DB and replicated to multiple Azure regions. Since it is designed to set up the environment, it is different from resource provisioning. This package fits nicely into a CI/CD pipeline.

While Azure Functions allow extensions via new resources, the Azure Resource provider and ARM APIs provide extensions via existing resources. This eliminates the need to introduce new processes around new resources and is a significant win for reusability and user convenience. Resources and their extensions can be written only in Bicep and ARM templates. Bicep provides more concise syntax and improved type safety, but Bicep files compile to ARM templates, which are the de facto standard to declare and use Azure resources and are supported by the unified Azure Resource Manager. Bicep is a new domain-specific language that was recently developed for authoring ARM templates with an easier syntax, and it is typically used for resource deployments to Azure. Either or both JSON and Bicep can be used to author ARM templates, and while JSON is ubiquitous, Bicep can only be used with Resource Manager templates. In fact, Bicep tooling converts Bicep templates into standard JSON templates for ARM resources by a process called transpilation. This conversion happens automatically, but it can also be manually invoked. Bicep is succinct, so it provides a further incentive. The use of built-in functions, conditions, and loops for repetitive resources infuses logic into the ARM templates.

With the standardization of the template, it can bring consistency across services and their resources, with added benefits like policy-as-code and repeated deployments across clouds and regions. The need for region-agnostic deployments cannot be over-emphasized for foundational services that struggle with limitations. There are many clouds and regions to support, and the task of deployment could carry significant cost when services groan without the availability of suitable ARM templates.

Other infrastructure providers like Kubernetes have a language that articulates state so that the control loop can reconcile these resources. The resources can be generated and infused with specific configuration and secrets using a configMap generator and a secret generator, respectively. For example, it can take an existing application.properties file and generate a configMap that can be applied to new resources. Kustomization allows us to override the registry for all images used in the containers of an application. There are two advantages to using it. First, it allows us to configure the individual components of the application without requiring changes in them. Second, it allows us to combine components from different sources and overlay them, or even override certain configurations. The kustomize tool provides this feature. Kustomize can add ConfigMaps and Secrets to deployments using their specific generators. Kustomize is a static declaration. It allows adding labels across components. We can choose groups of Kubernetes resources dynamically using selectors, but they must be declared as YAML. This kustomization YAML is usually stored as manifests and applied on existing components, so they refer to other YAMLs. Arguably, YAML is the most succinct format for templates.

 

Azure Blueprints differ from ARM templates in that the former help with environment setup while the latter help with resource provisioning. A blueprint is a package that comprises artifacts declaring resource groups, policies, role assignments, and ARM template deployments. It can be composed, versioned, and included in continuous integration and continuous delivery pipelines. The components of the package can be assigned to a subscription in a single operation, audited, and tracked. Although the components can be individually registered, the Blueprint maintains a relationship to the template and an active connection.

There are two categories within the Blueprint: definitions for deployment that explain what should be deployed, and definitions for assignments that explain what was deployed. A previous effort to author ARM templates becomes reusable in an Azure Blueprint. In this way, the Blueprint becomes bigger than just the templates and allows reusing an existing process to manage new resources.

A Blueprint focuses on standards, patterns, and requirements. The design can be reused to maintain consistency and compliance. It differs from an Azure policy in that it supports parameters with policies and initiatives. A policy is a self-contained manifest that governs resource properties during deployment and for already existing resources. Resources within a subscription adhere to the requirements and standards. When a Blueprint comprises resource templates and Azure policy along with parameters, it becomes holistic in cloud governance. 

With the help of a blueprint, we can create resources as deployment stamps. The names of the resources can be parameterized to differentiate one stamp from another, as many times as the blueprint is invoked. All the resources will be mapped to the same resource group to ensure that their lifecycle is managed and held for the duration of the resource group.

Most of the deployment of cloud resources can be declarative, with little or no requirement for scripting. This is preferable because it makes the rollouts idempotent and better tracked and managed via the portal. Scripts are only unavoidable when they involve resource provisioning that is outside the scope of the public cloud. For example, a GitHub repository is not a cloud resource, and sometimes keeping track of the artifacts during the rollout process might be necessary because other storage might not be as preferable as the source-control filesystem. It is preferable to write the script in the form of commit, rollback, and repair options, along the lines of the sketch below.
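A purely illustrative Python skeleton of that shape might look like this; the repository-creation step is hypothetical and stands in for any provisioning that falls outside the declarative templates:

# Illustrative skeleton only: each step records what it created so it can be undone or repaired
def commit(state):
    state["repo_created"] = True        # e.g., create the GitHub repository
    return state

def rollback(state):
    if state.get("repo_created"):       # undo only what was actually created
        state["repo_created"] = False
    return state

def repair(state):
    if not state.get("repo_created"):   # idempotent: re-create only if missing
        return commit(state)
    return state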

Another feature of the deployment logic is the state reconciliation with the existing technologies. It is preferable to start with a clean state each time the rollout begins. This avoids state reconciliation altogether but when existing resources must be repaired, it is possible to leverage state persistence for the reconciliation.

Lastly, the levels are built from the ground up, so the teardown must proceed in the opposite direction so that the dependencies can be done away with. With principles such as resource-acquisition-is-initialization, this reverse order can be enforced.

Saturday, June 3, 2023

 

This article describes the ways to access Azure Data Lake Storage from an Azure Databricks notebook. A developer writing an application in Python in an Azure Databricks notebook might find it frustrating to find the right syntax to connect, read, and write to external storage such as a storage account on the Azure public cloud. It is also confusing to go through at least three different documentation sources, including those from the Azure public cloud, Databricks, and Apache Spark.

We need a credential and a connection to the external storage so that we can write Spark applications in Python. The application runs on a cluster, so the configuration of the cluster is our first consideration. Starting with a single-user cluster and enabling integrated pass-through authentication with Azure Active Directory lets the Spark application recognize the developer account as native to Azure. This does away with any syntax for passing credentials or forming connections, so only the code for reading or writing is required.

For example, the script to read would be:

#set the data lake file location:

file_location = "abfss://container@storageaccount.dfs.core.windows.net/path/to/file.csv"

#read in the data to dataframe df

df = spark.read.format("csv").option("inferSchema", "true").option("header","true").option("delimiter",",").load(file_location)

#display the dataframe

display(df)

# When the file is read with format("binaryFile") instead of CSV, the raw bytes can be deserialized:
import pickle
val = df.rdd.map(lambda row: bytes(row.content)).first()
tokenizer = pickle.loads(bytearray(val))

Notice the protocol is Azure Blob File System or abfs for short. An additional s at the end indicates that the ABFS Hadoop client driver will always use the Transport Layer Security irrespective of the authentication method chosen.

This technique is preferred over the traditional technique of mounting the remote storage as a filesystem, which involves an example like this (Scala):

val mount_root = "/mnt/ContainerName/DirectoryName"

df.coalesce(1).write.format("csv").option("header","true").mode("overwrite").save(s"dbfs:$mount_root/")

or, when used with an on-behalf-of credential, as:

val configs = Map(

  "fs.azure.account.auth.type" -> "CustomAccessToken",

  "fs.azure.account.custom.token.provider.class" -> spark.conf.get("spark.databricks.passthrough.adls.gen2.tokenProviderClassName")

)

 

// Optionally, <directory-name> can be added to the source URI of a mount point.

dbutils.fs.mount(

  source = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/",

  mountPoint = "/mnt/<mount-name>",

  extraConfigs = configs)

 

If abfss or the mount point is not used, the Python script can still make a direct call to Azure Storage as if it were running on any host outside Spark. Here the credentials need to be passed correctly. The Azure Identity Python library usually recommends one of two ways to determine the credential:

First, using DefaultAzureCredential to go through all possible sources for determining it, or

Second, chaining specific sources to look up credentials.

This can be explained with the help of a credential used with a BlobClient to access the external storage, as shown here:

The following environment variables will need to be set:

· AZURE_CLIENT_ID
· AZURE_TENANT_ID
· AZURE_USERNAME
· AZURE_PASSWORD

as explained in https://learn.microsoft.com/en-us/python/api/azure-identity/azure.identity.environmentcredential?view=azure-python

These can be set on an already created cluster by following the menu as:

Select your cluster => click Edit => Advanced Options => edit or enter new environment variables => Confirm and Restart.

Spark specific configuration is also helpfully explained in https://docs.databricks.com/clusters/configure.html#spark-config

Given that the username and password are sensitive information, it is better to store them securely and access them dynamically. Another option is to use tokens for credentials.
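For instance, on Databricks the values can be kept in a secret scope and read at runtime; the scope and key names below are placeholders:

# Placeholder scope/key names; dbutils is available in Databricks notebooks
username = dbutils.secrets.get(scope="<scope-name>", key="storage-username")
password = dbutils.secrets.get(scope="<scope-name>", key="storage-password")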

For example, Databricks needs,

· DATABRICKS_SERVER_HOSTNAME
· DATABRICKS_HTTP_PATH
· DATABRICKS_TOKEN

to make a connection, as sketched below.
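A sketch of such a connection, assuming the databricks-sql-connector package and the environment variables listed above, might look like:

# %pip install databricks-sql-connector (run in its own notebook cell)
import os
from databricks import sql

with sql.connect(
    server_hostname=os.environ["DATABRICKS_SERVER_HOSTNAME"],
    http_path=os.environ["DATABRICKS_HTTP_PATH"],
    access_token=os.environ["DATABRICKS_TOKEN"],
) as connection:
    with connection.cursor() as cursor:
        cursor.execute("SELECT 1")   # simple probe to confirm the connection works
        print(cursor.fetchall())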

A sample credential and client to access a storage account would then read something like this:

%pip install azure-storage-blob azure-identity

from azure.storage.blob import BlobServiceClient

from azure.identity import ChainedTokenCredential, EnvironmentCredential, AzureCliCredential, ManagedIdentityCredential, DefaultAzureCredential

credential_chain = (

    # Try EnvironmentCredential first

    EnvironmentCredential(),

    # Fallback to Azure CLI if EnvironmentCredential fails

    AzureCliCredential()

)

credential = ChainedTokenCredential(*credential_chain)

default_credential = DefaultAzureCredential(exclude_interactive_browser_credential=False)

client = BlobServiceClient("https://mystorageaccount.blob.core.windows.net/", credential=credential) # or default_credential

blob_client = client.get_blob_client(container="mycontainer", blob="path/to/file.csv")

try:

    stream = blob_client.download_blob()

    print("Blob found.")

except Exception as ex:

    print(ex)

    print("No blob found.")

Thursday, June 1, 2023

Azure Resource Mover for cross-region movement of resources:

Business Continuity and Disaster Recovery standards of the public cloud recommend backup and restore with Azure Backup and Azure Site Recovery vaults, and maintaining a paired-region presence in modes such as active-passive or active-pilot. The cross-region movement of resources is talked about less from that angle, but it frequently supports ever-evolving business needs, including those from mergers and acquisitions. Azure Resource Mover greatly reduces the customer time and effort needed to move resources across regions by providing built-in dependency analysis, planning, and testing that ensures the resources are prepared to move and then successfully moved to the desired region. Azure Resource Mover also provides a seamless and consistent single-pane-of-glass experience that allows move orchestration for a range of Azure resources, including virtual machines and databases. By virtue of its automation and single-click move capabilities, it further reduces that time and effort. This technique comes with the ability to rename resources and customize IP addresses in the target location, which are important features to streamline and simplify the process.

Often used to take advantage of new Azure region expansions to be closer to end-customers by reducing latency, and to consolidate workloads into single regions to support mergers and acquisitions, Azure Resource Mover facilitates a wide range of resource types but does have its restrictions. It is important to recognize the displacement involved, whether from one resource group to another, one subscription to another, or one region to another, because different displacements come with different automations and as features of different resources. For example, Azure Resource Mover can move VMs and related disks, but connected clusters of the Microsoft.Kubernetes resource type cannot be moved across regions. Multi-region clusters are also possible with some resource types. Child resources cannot be moved independently of their parent resources. The naming convention indicates whether a resource is a child or a parent because child resources are prefixed with the parent resource type. For example, a Microsoft.ServiceBus/namespaces/queues resource cannot be moved independently of a Microsoft.ServiceBus/namespaces resource.

Some resources can be moved with the help of their templates. This is the case with API Management resources. The resource is exported with its template, then adapted to the target region and recreated as a new resource. The same instance name can be maintained if the source is deleted. Otherwise, the destination will require custom domain CNAME records to point to the new API Management instance.

Databases also get a different treatment. Usually, a cross-region read replica is used to move an existing server. If the service is provisioned with geo-redundant backup storage, geo-restore can be used to restore it in other regions. Another way to move resources is to clone an existing region to another region, as in the case of an IoT hub. Resources like key vaults can be created one per region, so they are often left out of the resource-movement inventory. Public IP address configurations can be moved, but the addresses cannot be retained. Finally, Recovery Services vaults can be disabled in the originating region and recreated in the target region.