This article describes ways to access Azure Data Lake Storage from an
Azure Databricks notebook.
One of the favored techniques for accessing external storage is
pass-through authentication.
Writing a Spark application in Python against external storage normally
requires a credential and a connection. The application runs on a cluster, so
the configuration of the cluster is our first consideration. Starting with a
single-user cluster and enabling integrated pass-through authentication with
Azure Active Directory lets the Spark application recognize the developer
account as native to Azure. This does away with any syntax for passing
credentials or forming connections, so that only the code for reading or
writing is required.
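As a minimal sketch of what "only the code for reading" can look like on a passthrough-enabled cluster, the snippet below reads directly from an abfss:// path without any credential settings. The helper names abfss_uri and read_binary_files are hypothetical and not part of any Databricks API; the spark argument is assumed to be the session that a Databricks notebook provides, and the container, account, and path values are placeholders.

```python
def abfss_uri(container: str, account: str, path: str = "") -> str:
    """Build an abfss:// URI for a path in an ADLS Gen2 container (hypothetical helper)."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


def read_binary_files(spark, uri: str, glob: str = "*.bin"):
    """Read files matching `glob` under `uri` as binary records.

    `spark` is assumed to be the SparkSession of a notebook attached to a
    passthrough-enabled cluster, so no credentials are passed here.
    """
    return (
        spark.read.format("binaryFile")
        .option("pathGlobFilter", glob)
        .load(uri)
    )


# In a notebook (placeholders left as-is):
# df = read_binary_files(spark, abfss_uri("<container>", "<storageaccount>", "path/to/resources"))
# df.show()
```

The read runs under the identity of the signed-in user, so it succeeds only if that account has been granted access to the storage path.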
The following sample mounts the storage with pass-through settings, lists the
alternative authentication settings for comparison, and includes references to
the public documentation:
configs = {
    # Credential passthrough, ADLS Gen1:
    # https://learn.microsoft.com/en-us/azure/databricks/data-governance/credential-passthrough/adls-passthrough
    "fs.adl.oauth2.access.token.provider.type": "CustomAccessTokenProvider",
    "fs.adl.oauth2.access.token.custom.provider":
        spark.conf.get("spark.databricks.passthrough.adls.tokenProviderClassName"),

    # Credential passthrough, ADLS Gen2:
    "fs.azure.account.auth.type": "CustomAccessToken",
    "fs.azure.account.custom.token.provider.class":
        spark.conf.get("spark.databricks.passthrough.adls.gen2.tokenProviderClassName"),

    # Alternative, OAuth: https://learn.microsoft.com/en-us/azure/databricks/dbfs/mounts
    # This group also sets "fs.azure.account.auth.type". A dict keeps only the
    # last value for a repeated key, so enable this group INSTEAD of the
    # ADLS Gen2 passthrough group above, not together with it.
    # "fs.azure.account.auth.type": "OAuth",
    # "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    # "fs.azure.account.oauth2.client.id": "<application-id>",
    # "fs.azure.account.oauth2.client.secret":
    #     dbutils.secrets.get(scope="<scope-name>", key="<service-credential-key-name>"),
    # "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token",

    # Alternative, personal access token or AAD token (the key name is a placeholder):
    # https://learn.microsoft.com/en-us/azure/databricks/administration-guide/access-control/tokens
    # "key.for.access.token":
    #     "<bearer-token-from-quote-az-space-account-space-get-access-token-quote-cli-command>",
}
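dbutils.fs.mount(
    # ADLS Gen2 uses the abfss scheme with the dfs endpoint. For Blob storage use
    # wasbs://<container>@<storageaccount>.blob.core.windows.net/path/to/resources
    source="abfss://<container>@<storageaccount>.dfs.core.windows.net/path/to/resources",
    mount_point="/mnt/resources",
    extra_configs=configs,
)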
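# Read through the mount point; mounted paths live under dbfs:/mnt/.
df = (
    spark.read.format("binaryFile")
    .option("pathGlobFilter", "*.bin")
    .load("dbfs:/mnt/resources/file")
)
df.show()

# Remove the mount when it is no longer needed.
dbutils.fs.unmount("/mnt/resources")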
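The sample above gathers settings for several authentication methods in one place. ADLS Gen2 passthrough and OAuth both set fs.azure.account.auth.type, so only one of those groups can be in effect for a given mount. One way to keep them apart, sketched here with hypothetical helper names, is to build each group in its own function and pass exactly one result as extra_configs. The values that come from spark.conf.get and dbutils.secrets.get in the sample are taken as plain arguments so the helpers can be checked outside a notebook, and tenant_id stands in for the tenantId segment of the endpoint.

```python
def passthrough_gen2_configs(token_provider_class: str) -> dict:
    """Settings for ADLS Gen2 credential passthrough (hypothetical helper).

    In a notebook, `token_provider_class` would come from
    spark.conf.get("spark.databricks.passthrough.adls.gen2.tokenProviderClassName").
    """
    return {
        "fs.azure.account.auth.type": "CustomAccessToken",
        "fs.azure.account.custom.token.provider.class": token_provider_class,
    }


def oauth_configs(client_id: str, client_secret: str, tenant_id: str) -> dict:
    """Settings for OAuth with client credentials (hypothetical helper).

    In a notebook, `client_secret` would come from dbutils.secrets.get(...).
    """
    return {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": client_id,
        "fs.azure.account.oauth2.client.secret": client_secret,
        "fs.azure.account.oauth2.client.endpoint":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }


# In a notebook, choose one group per mount, for example:
# configs = passthrough_gen2_configs(
#     spark.conf.get("spark.databricks.passthrough.adls.gen2.tokenProviderClassName"))
```

Keeping the groups in separate functions makes it explicit which authentication method a mount uses and avoids one group silently overriding another inside a single dictionary.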