This article describes the ways to access Azure Data Lake Storage from an Azure Databricks notebook. A developer writing a Python application in an Azure Databricks notebook can find it frustrating to work out the right syntax to connect to, read from, and write to external storage such as a storage account on the Azure public cloud. It is also confusing to piece together at least three different documentation sources: those of the Azure public cloud, Databricks, and Apache Spark.
We need a credential and a connection to the external storage before we can write Spark applications in Python. The application runs on a cluster, so the configuration of the cluster is our first consideration. Starting with a single-user cluster and enabling credential passthrough to Azure Active Directory lets the Spark application recognize the developer's account as native to Azure. This does away with any syntax for passing credentials or forming connections, so that only the code for reading or writing is required.
For example, the script to read would be:
# set the data lake file location
file_location = "abfss://container@storageaccount.dfs.core.windows.net/path/to/file.csv"

# read the data into dataframe df
df = (spark.read.format("csv")
      .option("inferSchema", "true")
      .option("header", "true")
      .option("delimiter", ",")
      .load(file_location))

# display the dataframe
display(df)

# for a dataframe with a binary content column (for example, one read with format("binaryFile")),
# a pickled object can be deserialized from the first row:
import pickle
val = df.rdd.map(lambda row: bytes(row.content)).first()
tokenizer = pickle.loads(bytearray(val))
Notice that the protocol is Azure Blob File System, or ABFS for short. The additional s at the end indicates that the ABFS Hadoop client driver always uses Transport Layer Security, irrespective of the authentication method chosen.
This technique is preferred over the traditional technique of mounting the remote storage as a filesystem, which looks like this:

val mount_root = "/mnt/ContainerName/DirectoryName"
df.coalesce(1).write.format("csv").option("header", "true").mode("overwrite").save(s"dbfs:$mount_root/")
or, when used with an on-behalf-of (passthrough) credential, as:

val configs = Map(
  "fs.azure.account.auth.type" -> "CustomAccessToken",
  "fs.azure.account.custom.token.provider.class" ->
    spark.conf.get("spark.databricks.passthrough.adls.gen2.tokenProviderClassName")
)

// Optionally, <directory-name> can be added to the source URI of a mount point.
dbutils.fs.mount(
  source = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/",
  mountPoint = "/mnt/<mount-name>",
  extraConfigs = configs)
If neither abfss nor a mount point is used, the Python script can still make a direct call to Azure Storage as if it were running on any host outside Spark. Here the credentials need to be passed explicitly. The Azure Identity library for Python generally recommends one of two ways to determine the credential:
First, using DefaultAzureCredential, which goes through all possible sources to find one; or
Second, chaining specific credential sources to look through in order.
This can be illustrated with a credential used by a BlobClient to access the external storage, as shown further below.
The following environment variables will need to be set:
· AZURE_CLIENT_ID
· AZURE_TENANT_ID
· AZURE_USERNAME
· AZURE_PASSWORD
as explained in https://learn.microsoft.com/en-us/python/api/azure-identity/azure.identity.environmentcredential?view=azure-python
These can be set on an already created cluster by following the menu path:
Select your cluster => Edit => Advanced Options => edit or enter new Environment Variables => Confirm and Restart.
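For instance, the entries typed into the Environment Variables box are plain KEY=value pairs, one per line; the values below are illustrative placeholders, not values from this article:

AZURE_CLIENT_ID=<application-client-id>
AZURE_TENANT_ID=<directory-tenant-id>
AZURE_USERNAME=<user@example.com>
AZURE_PASSWORD=<password>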
Spark-specific configuration is also helpfully explained in https://docs.databricks.com/clusters/configure.html#spark-config
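For instance, a minimal sketch of the Spark properties for reaching ADLS Gen2 with a service principal, set from a notebook, might look like the following; the storage account name, application ID, client secret, and tenant ID are illustrative placeholders, not values from this article:

# these fs.azure.* keys configure the ABFS driver for OAuth with a service principal;
# replace mystorageaccount and the bracketed values with your own
spark.conf.set("fs.azure.account.auth.type.mystorageaccount.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.mystorageaccount.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.mystorageaccount.dfs.core.windows.net", "<application-client-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret.mystorageaccount.dfs.core.windows.net", "<client-secret>")
spark.conf.set("fs.azure.account.oauth2.client.endpoint.mystorageaccount.dfs.core.windows.net",
               "https://login.microsoftonline.com/<tenant-id>/oauth2/token")

# with those properties in place, abfss:// paths on that account can be read directly
df = spark.read.option("header", "true").csv("abfss://container@mystorageaccount.dfs.core.windows.net/path/to/file.csv")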
Given that the username and password are sensitive information, it is better to store them securely and access them dynamically. Another option is to use tokens as credentials.
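As a sketch, assuming a Databricks secret scope named kv-scope that already holds these values (the scope and key names are assumptions for illustration), the secrets utility can supply them at runtime instead of hard-coding them:

import os

# populate the environment variables that EnvironmentCredential reads,
# pulling the values from a secret scope rather than from plain text
os.environ["AZURE_CLIENT_ID"] = dbutils.secrets.get(scope="kv-scope", key="azure-client-id")
os.environ["AZURE_TENANT_ID"] = dbutils.secrets.get(scope="kv-scope", key="azure-tenant-id")
os.environ["AZURE_USERNAME"] = dbutils.secrets.get(scope="kv-scope", key="azure-username")
os.environ["AZURE_PASSWORD"] = dbutils.secrets.get(scope="kv-scope", key="azure-password")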
For example, Databricks needs
· DATABRICKS_SERVER_HOSTNAME
· DATABRICKS_HTTP_PATH
· DATABRICKS_TOKEN
to make a connection.
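A minimal sketch using the Databricks SQL Connector for Python, reading those variables from the environment (the package install line and the placeholder query are assumptions for illustration):

%pip install databricks-sql-connector

import os
from databricks import sql

# open a connection with the three settings listed above
connection = sql.connect(
    server_hostname=os.environ["DATABRICKS_SERVER_HOSTNAME"],
    http_path=os.environ["DATABRICKS_HTTP_PATH"],
    access_token=os.environ["DATABRICKS_TOKEN"],
)
cursor = connection.cursor()
cursor.execute("SELECT 1")  # placeholder query
print(cursor.fetchall())
cursor.close()
connection.close()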
A sample credential and client to access the storage account would then read something like this:

%pip install azure-storage-blob azure-identity

from azure.storage.blob import BlobServiceClient
from azure.identity import (
    ChainedTokenCredential,
    EnvironmentCredential,
    AzureCliCredential,
    ManagedIdentityCredential,
    DefaultAzureCredential,
)

credential_chain = (
    # try EnvironmentCredential first
    EnvironmentCredential(),
    # fall back to the Azure CLI if EnvironmentCredential fails
    AzureCliCredential(),
)
credential = ChainedTokenCredential(*credential_chain)

# alternatively, let DefaultAzureCredential walk through all known sources
default_credential = DefaultAzureCredential(exclude_interactive_browser_credential=False)

client = BlobServiceClient(
    "https://mystorageaccount.blob.core.windows.net/",
    credential=credential,  # or default_credential
)
blob_client = client.get_blob_client(container="myContainer", blob="path/to/file.csv")

try:
    stream = blob_client.download_blob()
    print("Blob found.")
except Exception as ex:
    print(ex)
    print("No blob found.")