Wednesday, June 21, 2023

Large files and Azure Databricks users


When data analytics users want to analyze large amounts of data, they choose an Azure Databricks workspace and browse the data from remote, virtually unlimited storage. Usually this is an Azure storage account or a data lake, and the users find it convenient to pass through their Azure Active Directory (AAD) credentials and work with the files using locations such as

· wasbs://container@storageaccount.blob.core.windows.net/prefix/to/blob for blobs, and

· abfss://container@storageaccount.dfs.core.windows.net/path/to/file for files

This is sufficient to read with Spark as:

df = spark.read.format("binaryFile").option("pathGlobFilter", "*.p").load(file_location)

and to write, but the Spark configuration limits a single message to 2 GB, so unless the file is partitioned, large files usually cause trouble for the users and the same statement fails.
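
One way to anticipate this is to check the file sizes before attempting the Spark read and fall back to another path for anything near that limit. A minimal sketch, assuming file_location points at a directory and using an illustrative 2 GB threshold:

# dbutils.fs.ls returns FileInfo entries with a size attribute in bytes.
TWO_GB = 2 * 1024 * 1024 * 1024
oversized = [f.path for f in dbutils.fs.ls(file_location) if f.size >= TWO_GB]
if oversized:
    print("too large for a single Spark record, use a SAS download instead:", oversized)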

In such cases, the users must make use of a few options if they want to continue using their AAD credentials.

First, they can use a shared access signature (SAS) URL, created with user delegation, to read the file. For example:

import os
import requests

CHUNK_SIZE = 4 * 1024 * 1024  # stream the download in 4 MB chunks

if not os.path.isfile("/dbfs/" + filename):
    print("downloading file...")
    with requests.get(sasUrl, stream=True) as resp:
        if resp.ok:
            with open("/dbfs/" + filename, "wb") as f:
                for chunk in resp.iter_content(chunk_size=CHUNK_SIZE):
                    f.write(chunk)
else:
    print("file found...")

but it is somewhat harder to write interactively to a SAS URL, because the request headers that the documentation requires for writing with a SAS token are easy to get wrong and the service rejects the request. So the writes need to happen locally first and the file then uploaded, for example with the azure-storage-blob SDK:

from azure.storage.blob import ContainerClient

# https://learn.microsoft.com/en-us/python/api/azure-storage-blob/azure.storage.blob.containerclient?view=azure-python
sas_url = "https://account.blob.core.windows.net/mycontainer?sv=2015-04-05&st=2015-04-29T22%3A18%3A26Z&se=2015-04-30T02%3A23%3A26Z&sr=b&sp=rw&sip=168.1.5.60-168.1.5.70&spr=https&sig=Z%2FRHIX5Xcg0Mq2rqI3OlWTjEg2tYkboXr1P9ZUXDtkk%3D"

container_client = ContainerClient.from_container_url(sas_url)

# SOURCE_FILE is the locally written file to be uploaded.
with open(SOURCE_FILE, "rb") as data:
    blob_client = container_client.upload_blob(name="myblob", data=data)
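
The SDK splits a large upload into blocks automatically, so the same call works for multi-gigabyte files as long as the SAS grants write permission.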

 

Second, they can use the equivalent of mounting the remote storage as a local filesystem, but in this case they must leverage a Databricks access connector, and the admin must grant the Databricks user permission to use it in their workspace. The users can then continue to use their AAD credentials spanning the Databricks workspace, the access connector and the external storage.
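
Once the admin has created the access connector, a storage credential and an external location over the container, the path can be used directly from the notebook. A minimal sketch with illustrative names:

# The abfss path below is governed by the external location, so access is
# checked against the grants on the signed-in user's AAD identity.
path = "abfss://container@storageaccount.dfs.core.windows.net/path/to/dir"
display(dbutils.fs.ls(path))  # filesystem-style listing, no mount needed
df = spark.read.format("binaryFile").option("pathGlobFilter", "*.p").load(path)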

Third-party options like smart_open describe ways to do this as well, for example:
!pip install smart_open[all]

!pip install azure-storage-blob

from smart_open import open
import os
import azure.storage.blob

# stream from Azure Blob Storage
connect_str = (
    "BlobEndpoint=https://storageaccount.blob.core.windows.net;"
    "SharedAccessSignature=<sasToken>"
)

transport_params = {
    'client': azure.storage.blob.BlobServiceClient.from_connection_string(connect_str),
}

filename = 'azure://container/path/to/file'

# stream content *into* Azure Blob Storage (write mode):
with open(filename, 'wb', transport_params=transport_params) as fout:
    fout.write(b'contents written here')

# stream content back out of Azure Blob Storage (read mode):
for line in open(filename, transport_params=transport_params):
    print(line)

but this fails with an error that the write using the SAS URL is not permitted, even though the SAS URL was generated with write permission.

 

Therefore, leveraging the Spark framework as much as possible and working with SAS URLs for the upload and download of large files is preferable. Fortunately, the transfer of a file of about 10 GB takes only a couple of minutes.
