Wednesday, August 30, 2023

Databricks and Active Directory passthrough authentication.


Azure Databricks is used to process, store, clean, share, analyze, model, and monetize datasets, with solutions ranging from business intelligence to machine learning. It is also used to build and deploy data engineering workflows, machine learning models, analytics dashboards, and more.

It connects to external storage locations, including Azure Data Lake Storage. Users logged in to the Azure Databricks instance can execute Python code and use the Spark platform to view a tabular representation of the data stored in various formats on the external storage accounts. When they refer to a file on the external storage account, they need not specify credentials to connect; their logged-in credential can be passed through to the remote storage account. For example: spark.read.format("parquet").load("abfss://container@storageAccount.dfs.core.windows.net/external-location/path/to/data")
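As a minimal sketch, a notebook cell on such a cluster can read the data with the logged-in user's identity alone; the container, storage account, and path are the same placeholders as in the example above:

# Read a Parquet dataset from ADLS Gen2; no credentials appear in the code
# because the notebook user's Azure AD token is passed through by the cluster.
df = (spark.read
          .format("parquet")
          .load("abfss://container@storageAccount.dfs.core.windows.net/external-location/path/to/data"))

df.printSchema()
display(df.limit(10))   # display() is a Databricks notebook helper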

This feature requires two settings:

1. When a compute cluster is created to execute the Python code, the checkbox to pass through the user's credentials must be checked.

2. The cluster must also have the flag spark.databricks.passthrough.adls set to true (see the sketch after this list).
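Both settings are normally applied in the cluster creation UI, but as a minimal sketch they can also be supplied when defining a cluster through the Databricks Clusters REST API. The workspace URL, token, runtime version, and node type below are placeholders, the spark_conf entry is the flag named in step 2, and whether a given workspace still accepts it is subject to the UI changes described next:

import requests

# Placeholders -- replace with your workspace URL, a personal access token,
# and runtime/node identifiers that are valid in your workspace.
DATABRICKS_HOST = "https://<databricks-instance>"
TOKEN = "<personal-access-token>"

payload = {
    "cluster_name": "passthrough-cluster",
    "spark_version": "<runtime-version>",
    "node_type_id": "<node-type>",
    "num_workers": 2,
    # The flag from step 2 above; the checkbox in step 1 may map to additional
    # cluster attributes depending on the workspace and runtime version.
    "spark_conf": {
        "spark.databricks.passthrough.adls": "true"
    },
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
)
resp.raise_for_status()
print(resp.json())   # returns the new cluster_id on success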

Until recently, the cluster configuration UI allowed this flag to be set, but the configuration for passthrough changed with the new UI that introduced Unity Catalog, a unified access control mechanism. Passthrough credentials and Unity Catalog are mutually exclusive. In most cases the flag can no longer be set when creating new clusters with the new UI, and this broke the implicit login that authenticated the current user to the remote storage. The token provider used earlier was named by spark.databricks.passthrough.adls.gen2.tokenProviderClassName; with the new UI, the login requires a more elaborate configuration. Users running the earlier clusters against the newer Databricks UI encounter a 403 (Forbidden) error.
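For reference, this is roughly how that older token provider was wired up when mounting storage from a notebook on a passthrough cluster; the mount point name is made up, and the container and storage account are the same placeholders used earlier:

# Older, passthrough-based pattern: the token provider class is read from the
# cluster configuration and handed to the ABFS driver as a custom provider.
configs = {
    "fs.azure.account.auth.type": "CustomAccessToken",
    "fs.azure.account.custom.token.provider.class":
        spark.conf.get("spark.databricks.passthrough.adls.gen2.tokenProviderClassName"),
}

dbutils.fs.mount(
    source="abfss://container@storageAccount.dfs.core.windows.net/",
    mount_point="/mnt/passthrough",   # hypothetical mount point
    extra_configs=configs,
)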

The newer configuration is the following:

spark.hadoop.fs.azure.account.oauth2.client.id.<datalake>.dfs.core.windows.net <sp client id>

spark.hadoop.fs.azure.account.auth.type.<datalake>.dfs.core.windows.net OAuth

spark.hadoop.fs.azure.account.oauth.provider.type.<datalake>.dfs.core.windows.net org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider

spark.hadoop.fs.azure.account.oauth2.client.secret.<datalake>.dfs.core.windows.net {{secrets/yoursecretscope/yoursecretname}}

spark.hadoop.fs.azure.account.oauth2.client.endpoint.<datalake>.dfs.core.windows.net https://login.microsoftonline.com/<tenant>/oauth2/token
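The same settings can also be applied at the session level from a notebook; note that the spark.hadoop. prefix is dropped when a key is set on the Spark session rather than in the cluster configuration. The placeholders below mirror the cluster settings above:

storage_account = "<datalake>"   # placeholder for the ADLS Gen2 account name

spark.conf.set(f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net",
               "<sp client id>")
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net",
               dbutils.secrets.get(scope="yoursecretscope", key="yoursecretname"))
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net",
               "https://login.microsoftonline.com/<tenant>/oauth2/token")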

This requires a secret to be created, which can be done via the https://<databricks-instance>#secrets/createScope URL. The value referenced in the configuration above would then be {{secrets/yoursecretscope/yoursecretname}}.
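A quick sanity check from a notebook that the scope and secret are visible to the workspace (the names are the placeholders used above):

# List the available secret scopes and the keys inside the referenced scope,
# then confirm the secret can be read.
print([s.name for s in dbutils.secrets.listScopes()])
print([item.key for item in dbutils.secrets.list("yoursecretscope")])

value = dbutils.secrets.get(scope="yoursecretscope", key="yoursecretname")
print(len(value))   # the secret value itself is redacted if printed in a notebook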

Finally, the 403 error also requires that the networking be checked. If the Databricks workspace and the storage account are in different virtual networks, the storage account's firewall must allow-list both the public and private subnets of the Databricks instance.
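A simple way to confirm whether the 403 persists after the networking change is to list the container root from a notebook and inspect the error; the path placeholders match the earlier example:

try:
    dbutils.fs.ls("abfss://container@storageAccount.dfs.core.windows.net/")
    print("Storage account reachable and authorized.")
except Exception as e:
    if "403" in str(e):
        print("403 Forbidden: check the service principal's role assignment and the "
              "storage account firewall rules for the Databricks public and private subnets.")
    else:
        raise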

