Saturday, June 3, 2023

 

This article describes the ways to access Azure Data Lake Storage from an Azure Databricks notebook. A developer writing a Python application in an Azure Databricks notebook can find it frustrating to track down the right syntax to connect to, read from, and write to external storage such as a storage account on the Azure public cloud. It is also confusing to work through at least three different documentation sources: those for the Azure public cloud, Databricks, and Apache Spark.

We need a credential and a connection to the external storage before we can write Spark applications in Python, and because the application runs on a cluster, the configuration of that cluster is our first consideration. Starting with a single-user cluster and enabling Azure Active Directory credential passthrough lets the Spark application recognize the developer's account as native to Azure. This does away with any syntax for passing credentials or forming connections, so only the code for reading or writing is required.
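Before running any read, it helps to confirm that credential passthrough is actually enabled on the attached cluster. A minimal sketch from a Python notebook cell (the configuration key is the one Databricks uses for passthrough clusters; the "false" default is returned when the key is not set at all):

# Check whether Azure AD credential passthrough is enabled on this cluster
passthrough_enabled = spark.conf.get("spark.databricks.passthrough.enabled", "false")
print("Credential passthrough enabled:", passthrough_enabled)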

With passthrough in place, for example, the script to read is simply:

import pickle

# Set the data lake file location (ADLS Gen2 over the abfss protocol)
file_location = "abfss://container@storageaccount.dfs.core.windows.net/path/to/file.csv"

# Read the data into the dataframe df
df = (spark.read.format("csv")
      .option("inferSchema", "true")
      .option("header", "true")
      .option("delimiter", ",")
      .load(file_location))

# Display the dataframe
display(df)

# Take the binary 'content' column from the first row and unpickle it
val = df.rdd.map(lambda row: bytes(row.content)).first()
tokenizer = pickle.loads(bytearray(val))
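Under the same passthrough setup, writing back to the data lake needs no extra credential syntax either. A minimal sketch, assuming a hypothetical output path in the same container:

# Write the dataframe back to ADLS Gen2 over abfss (hypothetical output path)
output_location = "abfss://container@storageaccount.dfs.core.windows.net/path/to/output/"

df.write.format("csv").option("header", "true").mode("overwrite").save(output_location)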

Notice that the protocol is Azure Blob File System, or ABFS for short. The additional s at the end indicates that the ABFS Hadoop client driver will always use Transport Layer Security (TLS), irrespective of the authentication method chosen.

This technique is preferred over the traditional technique of mounting the remote storage as a filesystem, which looks like this:

// Mount root under DBFS where the container is mounted
val mount_root = "/mnt/ContainerName/DirectoryName"

// Write the dataframe as a single CSV file to the mounted path
df.coalesce(1).write.format("csv").option("header", "true").mode("overwrite").save(s"dbfs:$mount_root/")

or, when used with an on-behalf-of (passthrough) credential, like this:

val configs = Map(
  "fs.azure.account.auth.type" -> "CustomAccessToken",
  "fs.azure.account.custom.token.provider.class" ->
    spark.conf.get("spark.databricks.passthrough.adls.gen2.tokenProviderClassName")
)

// Optionally, <directory-name> can be added to the source URI of a mount point.
dbutils.fs.mount(
  source = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/",
  mountPoint = "/mnt/<mount-name>",
  extraConfigs = configs)
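Once mounted, the mount point can be verified or removed from a Python cell with dbutils. A small sketch, reusing the placeholder mount name from above:

# List existing mount points to confirm the new mount appears
for m in dbutils.fs.mounts():
    print(m.mountPoint, "->", m.source)

# Browse the mounted directory
display(dbutils.fs.ls("/mnt/<mount-name>"))

# Remove the mount when it is no longer needed
dbutils.fs.unmount("/mnt/<mount-name>")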

 

If abfss or a mount point is not used, the Python script can still make a direct call to Azure Storage as if it were running on any host outside Spark. Here the credentials need to be passed explicitly. The Azure Identity library for Python generally recommends one of two ways to determine the credential:

First, use DefaultAzureCredential, which goes through all possible sources to determine the credential, or

Second, chain specific credential sources that are tried in order.

This can be illustrated with a credential used together with a blob client to access the external storage, as shown further below.

The following environment variables will need to be set:

· AZURE_CLIENT_ID
· AZURE_TENANT_ID
· AZURE_USERNAME
· AZURE_PASSWORD

as explained in https://learn.microsoft.com/en-us/python/api/azure-identity/azure.identity.environmentcredential?view=azure-python

These can be set on an already created cluster by following the menu:

Select your cluster => Edit => Advanced Options => edit or enter new environment variables => Confirm and Restart.

Spark-specific configuration is also helpfully explained at https://docs.databricks.com/clusters/configure.html#spark-config
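After the cluster restarts, a notebook cell can confirm that the variables are visible and will be picked up by EnvironmentCredential. A minimal sketch:

import os
from azure.identity import EnvironmentCredential

# Confirm the cluster-level environment variables reached the notebook
for name in ("AZURE_CLIENT_ID", "AZURE_TENANT_ID", "AZURE_USERNAME", "AZURE_PASSWORD"):
    print(name, "is set" if os.environ.get(name) else "is missing")

# EnvironmentCredential reads these variables automatically
credential = EnvironmentCredential()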

Given that the username and password are sensitive information, it is better to store them securely, for example in a Databricks secret scope, and retrieve them dynamically at run time. Another option is to use tokens for credentials.

For example, a Databricks connection needs

· DATABRICKS_SERVER_HOSTNAME
· DATABRICKS_HTTP_PATH
· DATABRICKS_TOKEN

to make a connection, as sketched below.
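A minimal sketch of both ideas, assuming a hypothetical secret scope named my-scope that holds the token and the databricks-sql-connector package installed on the cluster:

from databricks import sql

# Pull the sensitive token from a Databricks secret scope instead of hard-coding it
# (the scope and key names here are hypothetical)
access_token = dbutils.secrets.get(scope="my-scope", key="databricks-token")

# Open a connection using the three values listed above
connection = sql.connect(
    server_hostname="<databricks-server-hostname>",
    http_path="<databricks-http-path>",
    access_token=access_token,
)

with connection.cursor() as cursor:
    cursor.execute("SELECT 1")
    print(cursor.fetchall())

connection.close()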

A sample credential and client to access the storage account would then read something like this:

%pip install azure-storage-blob azure-identity

from azure.storage.blob import BlobServiceClient
from azure.identity import (
    ChainedTokenCredential,
    EnvironmentCredential,
    AzureCliCredential,
    ManagedIdentityCredential,  # could also be added to the chain
    DefaultAzureCredential,
)

# Try EnvironmentCredential first, then fall back to the Azure CLI
credential_chain = (
    EnvironmentCredential(),
    AzureCliCredential(),
)
credential = ChainedTokenCredential(*credential_chain)

# Alternatively, let DefaultAzureCredential walk through all known sources
default_credential = DefaultAzureCredential(exclude_interactive_browser_credential=False)

# Use either credential with the Blob service client
client = BlobServiceClient("https://mystorageaccount.blob.core.windows.net/", credential=credential)  # or default_credential
blob_client = client.get_blob_client(container="mycontainer", blob="path/to/file.csv")

try:
    stream = blob_client.download_blob()
    print("Blob found.")
except Exception as ex:
    print(ex)
    print("No blob found.")
