Wednesday, June 21, 2023

Large files and Azure Databricks users


When data analytics users want to analyze large amounts of data, they choose an Azure Databricks workspace and browse the data from remote, virtually unlimited storage. Usually this is an Azure storage account or a data lake, and the users find it convenient to pass through their Azure Active Directory credentials and work with the files by referring to their locations as

·        wasbs://container@storageaccount.blob.core.windows.net/prefix/to/blob for blobs and

·        abfss://container@storageaccount.dfs.core.windows.net/path/to/file for files

This is sufficient to read with Spark as:

df = spark.read.format("binaryFile").option("pathGlobFilter", "*.p").load(file_location)

and to write, but the Spark configuration limits messages to a maximum of 2 GB, so unless the file is partitioned, large files usually cause trouble for users when that same statement fails.
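One way to anticipate this is to check object sizes before attempting the single-row binaryFile load. The sketch below is illustrative only: the threshold constant and the fallback are assumptions, and dbutils is available only inside a Databricks notebook.

# binaryFile materializes each file as a single row, so objects approaching
# the ~2 GB message limit are better handled outside this code path.
MAX_SINGLE_ROW_BYTES = 2 * 1024**3  # assumed threshold for this sketch

entries = dbutils.fs.ls(file_location)
small = [e.path for e in entries if e.size < MAX_SINGLE_ROW_BYTES]
large = [e.path for e in entries if e.size >= MAX_SINGLE_ROW_BYTES]

if small:
    df = spark.read.format("binaryFile").option("pathGlobFilter", "*.p").load(small)
# Entries in 'large' fall back to the SAS-based transfer described below.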

In such cases, users must turn to a few other options if they want to continue using their AAD credentials.

First, they can use a shared access signature (SAS) URL, created with user delegation, to read the file. For example,

import os
import requests

CHUNK_SIZE = 4 * 1024 * 1024  # stream in 4 MB chunks; adjust as needed

if not os.path.isfile("/dbfs/" + filename):
    print("downloading file...")
    # Stream the blob through the SAS URL straight onto DBFS-backed local storage.
    with requests.get(sasUrl, stream=True) as resp:
        if resp.ok:
            with open("/dbfs/" + filename, "wb") as f:
                for chunk in resp.iter_content(chunk_size=CHUNK_SIZE):
                    f.write(chunk)
print("file found...")

but it is somewhat harder to write interactively to a SAS URL, because the request headers that the documentation requires for writing with a SAS token tend to draw errors from the service. So the writes need to be done locally and the file uploaded afterwards:

from azure.storage.blob import ContainerClient
# https://learn.microsoft.com/en-us/python/api/azure-storage-blob/azure.storage.blob.containerclient?view=azure-python

sas_url = "https://account.blob.core.windows.net/mycontainer?sv=2015-04-05&st=2015-04-29T22%3A18%3A26Z&se=2015-04-30T02%3A23%3A26Z&sr=b&sp=rw&sip=168.1.5.60-168.1.5.70&spr=https&sig=Z%2FRHIX5Xcg0Mq2rqI3OlWTjEg2tYkboXr1P9ZUXDtkk%3D"

container_client = ContainerClient.from_container_url(sas_url)

# Upload the locally written file to the container addressed by the SAS URL.
with open(SOURCE_FILE, "rb") as data:
    blob_client = container_client.upload_blob(name="myblob", data=data)
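For multi-gigabyte files, the same upload can be parallelized by the SDK. A minimal sketch, assuming the file was first written under /dbfs as in the download example; max_concurrency is only a tuning knob, not a requirement.

from azure.storage.blob import ContainerClient

container_client = ContainerClient.from_container_url(sas_url)

# Upload the locally written file; the SDK splits it into blocks and commits
# them, with max_concurrency controlling how many blocks upload in parallel.
with open("/dbfs/" + filename, "rb") as data:
    container_client.upload_blob(
        name=filename,
        data=data,
        overwrite=True,
        max_concurrency=8,  # illustrative degree of parallelism
    )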

 

Second, they can use the equivalent of mounting the remote storage as a local filesystem, but in this case they must leverage a Databricks access connector, and an admin must grant the Databricks user permission to use it in their workspace. The users can then continue to use their AAD credentials across the Databricks workspace, the access connector, and the external storage, as sketched below.
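Once the access connector and the corresponding external location are configured, no tokens are needed in the notebook at all. A minimal sketch, assuming the admin has granted the user access to an external location backed by the path below; the path itself is illustrative.

# Credentials flow from the user's AAD identity through the workspace and the
# access connector, so the external storage can be browsed and read directly.
external_path = "abfss://container@storageaccount.dfs.core.windows.net/path/to/"

display(dbutils.fs.ls(external_path))  # browse with the user's own identity
df = spark.read.format("binaryFile").option("pathGlobFilter", "*.p").load(external_path)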

Third-party options like smart_open document ways to do this, such as the following:
!pip install smart_open[all]

!pip install azure-storage-blob

import azure.storage.blob
from smart_open import open

# stream from Azure Blob Storage
connect_str = ("BlobEndpoint=https://storageaccount.blob.core.windows.net;"
               "SharedAccessSignature=<sasToken>")

transport_params = {
    'client': azure.storage.blob.BlobServiceClient.from_connection_string(connect_str),
}

filename = 'azure://container/path/to/file'

# stream content *into* Azure Blob Storage (write mode):
with open(filename, 'wb', transport_params=transport_params) as fout:
    fout.write(b'contents written here')

# stream content back *out of* Azure Blob Storage (read mode):
for line in open(filename, transport_params=transport_params):
    print(line)

but this fails with an error that the write using the SAS URL is not permitted, even though the SAS URL was generated with write permission.

 

Therefore, leveraging the Spark framework as much as possible and working with SAS URLs for the upload and download of large files is preferable. Fortunately, transferring a file of about 10 GB takes only a couple of minutes.

Tuesday, June 20, 2023

How to resolve IaC shortcomings? Part 5

A previous article discussed a resolution to IaC shortcomings when declaring resources with configuration not yet supported by an IaC repository. This article continues that discussion with Terraform's native support for extensibility and discusses the order and repetition involved in IaC deployments.

IaC is an agreement between the IaC provider and the resource provider. An attribute of a resource can only be applied when the IaC provider applies it in the way the resource provider expects and the resource provider provisions it in the way the IaC provider expects. In many cases this holds, but some attributes can get out of sync, resulting in unsuccessful deployments of what might seem to be correct declarations.

For instance, some attributes of a resource can be specified via the IaC provider but go completely ignored by the resource provider. If two attributes can be specified, the resource provider reserves the right to prioritize one over the other. Even when a resource attribute is correctly specified, the resource provider could mandate destroying the existing resource and creating a new one. A more common case is one where the IaC wants to add a new property for all resources of a specific resource type, but there are already existing resources that do not have that property initialized. In such a case, applying the IaC change to add the new property will fail for the existing instances but succeed for the new ones. Only by running the IaC twice, once to detect and initialize the missing property on the existing resources and a second time to correctly report the new property, will the IaC start succeeding in subsequent deployments.

Destroying an existing resource and creating a new one is also sometimes required to keep the state in sync with the IaC. If the resource is missing from the state, it might be interpreted as a resource that was not in the IaC to begin with and require a destroy before the IaC-recognized creation occurs.

It is possible to get the best of both worlds with a folder structure that separates the Terraform templates into a folder called ‘module’ and the resource provider templates into another folder at the same level, named something like ‘subscription-deployments’, which includes native blueprints and templates. The GitHub workflow definitions then handle either location properly and trigger the workflow on any changes to either of these locations.

The native support for extensibility depends on naming and logic.  

Naming is facilitated with canned prefixes/suffixes and a dynamic random string that makes each rollout independent of the previous one. Some examples include:

resource "random_string" "unique" { 

  count   = var.enable_static_website && var.enable_cdn_profile ? 1 : 0 

  length  = 8 

  special = false 

  upper   = false 

} 

  

Logic can be written out in PowerShell, which is the de facto automation language for the Azure public cloud. A pseudo-resource can then be added to invoke this logic as follows:

resource "null_resource" "add_custom_domain" { 

  count = var.custom_domain_name != null ? 1 : 0 

  triggers = { always_run = timestamp() } 

  depends_on = [ 

    azurerm_app_service.web-app 

  ] 

  

  provisioner "local-exec" { 

    command = "pwsh ${path.module}/Setup-AzCdnCustomDomain.ps1" 

    environment = { 

      CUSTOM_DOMAIN      = var.custom_domain_name 

      RG_NAME            = var.resource_group_name 

      FRIENDLY_NAME      = var.friendly_name 

      STATIC_CDN_PROFILE = var.cdn_profile_name 

    } 

  } 

} 

  

PowerShell scripts can help with both the deployment and the pipeline automation. There are a few caveats with scripts, because the general preference is for declarative and idempotent IaC rather than scripts, so extensibility must be given the same due consideration as customization.

All scripts can be stored in folders with names ending with ‘scripts’. 
These measures are sufficient to address the above-mentioned shortcomings in Infrastructure-as-Code.