Wednesday, June 7, 2023

Options to access external storage from a Databricks notebook:

 

This article describes ways to access Azure Data Lake Storage from an Azure Databricks notebook.

One of the favored techniques for accessing external storage is credential pass-through authentication.

Writing a Spark application in Python requires a credential and a connection to the external storage, and the application runs on a cluster, so the cluster configuration is the first consideration. Starting with a single-user cluster and enabling Azure Active Directory credential pass-through lets the Spark application treat the developer's account as a native Azure identity, which removes any syntax for passing credentials or forming connections; only the code that reads or writes the data is required.
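For example, with credential pass-through enabled on the cluster, a read can reference the storage location directly, with no explicit credentials in the notebook (a minimal sketch; the container, storage account, and path are placeholders):

# Credential pass-through: the notebook user's Azure AD identity authorizes the
# request, so no keys, secrets, or mounts are needed in the code itself.
df = (spark.read
      .format("csv")
      .option("header", "true")
      .load("abfss://<container>@<storageaccount>.dfs.core.windows.net/path/to/data"))
df.show()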

 

The following sample shows the configuration entries for mounting the storage under several credential options, along with references to the public documentation:

# Keep only ONE of the option sets below in `configs`; duplicate keys in a
# Python dict silently overwrite earlier entries, so the alternatives must
# not be combined.
configs = {
    # Option 1: credential pass-through, ADLS Gen1
    # https://learn.microsoft.com/en-us/azure/databricks/data-governance/credential-passthrough/adls-passthrough
    "fs.adl.oauth2.access.token.provider.type": "CustomAccessTokenProvider",
    "fs.adl.oauth2.access.token.custom.provider": spark.conf.get("spark.databricks.passthrough.adls.tokenProviderClassName"),

    # Option 2: credential pass-through, ADLS Gen2
    "fs.azure.account.auth.type": "CustomAccessToken",
    "fs.azure.account.custom.token.provider.class": spark.conf.get("spark.databricks.passthrough.adls.gen2.tokenProviderClassName"),

    # Option 3: OAuth with a service principal
    # https://learn.microsoft.com/en-us/azure/databricks/dbfs/mounts
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-id>",
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope-name>", key="<service-credential-key-name>"),
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token",

    # Option 4: personal access token or Azure AD token
    # https://learn.microsoft.com/en-us/azure/databricks/administration-guide/access-control/tokens
    "key.for.access.token": "<bearer-token-from-'az account get-access-token'-CLI-command>"
}
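For the access-token option, the bearer token referenced above is typically fetched with the Azure CLI command az account get-access-token and then supplied to the configuration. A sketch, assuming the Azure CLI is installed and logged in wherever the snippet runs:

import subprocess

# Fetch an Azure AD access token for Azure Storage with the Azure CLI;
# https://storage.azure.com/ is the resource URI for Azure Storage.
token = subprocess.run(
    ["az", "account", "get-access-token",
     "--resource", "https://storage.azure.com/",
     "--query", "accessToken", "--output", "tsv"],
    capture_output=True, text=True, check=True,
).stdout.strip()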

 

dbutils.fs.mount(
    source="wasbs://<container>@<storageaccount>.blob.core.windows.net/path/to/resources",  # use abfss://...dfs.core.windows.net for ADLS Gen2
    mount_point="/mnt/resources",
    extra_configs=configs
)
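A quick check confirms the mount before reading from it, for example:

# List active mounts and the contents of the new mount point
display(dbutils.fs.mounts())
display(dbutils.fs.ls("/mnt/resources"))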

 

# Read from the mounted path (the mount point above is /mnt/resources)
df = spark.read.format("binaryFile").option("pathGlobFilter", "*.bin").load("dbfs:/mnt/resources/file")
df.show()

dbutils.fs.unmount("/mnt/resources")
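Mounting is only one of the options; the same settings can also be applied per storage account on the Spark session and the data addressed directly by its abfss URI, with no mount point. A sketch using the OAuth values from the configuration above (placeholders as before):

# Direct access without a mount: scope the OAuth settings to the storage
# account, then read with an abfss:// URI.
account = "<storageaccount>.dfs.core.windows.net"
spark.conf.set(f"fs.azure.account.auth.type.{account}", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{account}",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{account}", "<application-id>")
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{account}",
               dbutils.secrets.get(scope="<scope-name>", key="<service-credential-key-name>"))
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{account}",
               "https://login.microsoftonline.com/<tenant-id>/oauth2/token")

df = spark.read.format("binaryFile").load("abfss://<container>@<storageaccount>.dfs.core.windows.net/path/to/resources")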
