Azure Databricks is used to process, store, clean, share, analyze, model, and monetize datasets, with solutions ranging from business intelligence to machine learning. It is used to build and deploy data engineering workflows, machine learning models, analytics dashboards, and more.
It connects to different external storage locations, including Azure Data Lake Storage. Users logged in to the Azure Databricks instance can execute Python code and use the Spark platform to view tabular representations of data stored in various formats on the external storage accounts. When they refer to a file on the external storage account, they need not specify credentials to connect; their logged-in credential can be passed through to the remote storage account. For example:

spark.read.format("parquet").load("abfss://container@storageAccount.dfs.core.windows.net/external-location/path/to/data")
This feature required two settings:

1. When a compute cluster is created to execute the Python code, the checkbox to pass through the credentials must be checked.

2. The cluster must also have the flag spark.databricks.passthrough.adls set to true (a sketch for verifying this from a notebook follows below).
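A quick way to confirm the second setting from a notebook attached to such a cluster is to read the flag back, as in the minimal sketch below; it only assumes the flag name used in the cluster configuration above.

# Sketch: check whether ADLS credential passthrough is enabled on the attached cluster.
# spark.conf.get falls back to the supplied default if the flag was never set.
enabled = spark.conf.get("spark.databricks.passthrough.adls", "false")
print(f"ADLS passthrough enabled: {enabled}")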
Until recently, the Spark configuration UI allowed this flag to be set, but the configuration for passthrough changed with the new UI that introduced Unity Catalog, a unified access control mechanism. Credential passthrough and Unity Catalog are mutually exclusive. In most cases the flag can no longer be set when creating new clusters with the new UI, and this affects the implicit login required to authenticate the current user to the remote storage. The token provider used earlier was spark.databricks.passthrough.adls.gen2.tokenProviderClassName; with the new UI, the login requires a more elaborate configuration. The error code encountered by users when using the earlier clusters with the newer version of the Databricks UI is 403.
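The failure typically surfaces on the first read against the storage account. A minimal sketch of spotting it from a notebook, with the container, account, and path as placeholders:

# Sketch: attempt a read and surface a 403 from the storage account.
# The container, storage account, and path below are placeholders.
path = "abfss://container@storageAccount.dfs.core.windows.net/external-location/path/to/data"
try:
    spark.read.format("parquet").load(path).show(5)
except Exception as e:
    if "403" in str(e):
        print("403 from storage: the cluster has no usable credential for this account.")
    else:
        raise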
The newer configuration is the following:

spark.hadoop.fs.azure.account.oauth2.client.id.<datalake>.dfs.core.windows.net <sp client id>
spark.hadoop.fs.azure.account.auth.type.<datalake>.dfs.core.windows.net OAuth
spark.hadoop.fs.azure.account.oauth.provider.type.<datalake>.dfs.core.windows.net org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
spark.hadoop.fs.azure.account.oauth2.client.secret.<datalake>.dfs.core.windows.net {{secrets/yoursecretscope/yoursecretname}}
spark.hadoop.fs.azure.account.oauth2.client.endpoint.<datalake>.dfs.core.windows.net https://login.microsoftonline.com/<tenant>/oauth2/token
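The same OAuth settings can also be applied at the session level from a notebook instead of on the cluster (the spark.hadoop. prefix is dropped when set through spark.conf). The sketch below uses placeholder values for the storage account, tenant, client ID, and the secret scope and key names:

# Minimal sketch: configure ABFS OAuth (client credentials) for one storage account
# at the session level. All names below are placeholders.
storage_account = "<datalake>"
tenant_id = "<tenant>"
client_id = "<sp client id>"
client_secret = dbutils.secrets.get(scope="yoursecretscope", key="yoursecretname")

suffix = f"{storage_account}.dfs.core.windows.net"
spark.conf.set(f"fs.azure.account.auth.type.{suffix}", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{suffix}",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{suffix}", client_id)
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{suffix}", client_secret)
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{suffix}",
               f"https://login.microsoftonline.com/{tenant_id}/oauth2/token")

df = spark.read.format("parquet").load(
    f"abfss://container@{suffix}/external-location/path/to/data")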
Either way, this requires a secret to be created, which is possible via the https://<databricks-instance>#secrets/createScope URL. The value used in the configuration would then be {{secrets/yoursecretscope/yoursecretname}}.
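Before wiring the secret into the cluster configuration, it can be handy to confirm that the scope and key are visible from a notebook. A small sketch, reusing the placeholder scope and key names:

# Sketch: confirm the secret scope and key exist from a notebook.
# "yoursecretscope" and "yoursecretname" are placeholders.
print([s.name for s in dbutils.secrets.listScopes()])
print([k.key for k in dbutils.secrets.list("yoursecretscope")])
client_secret = dbutils.secrets.get(scope="yoursecretscope", key="yoursecretname")  # redacted if printed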
Finally, a 403 error also requires that the networking be checked. If the Databricks workspace and the storage account are in different virtual networks, the storage account's network rules must allow-list both the public and private subnets of the Databricks instance.
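One way to script that allow-listing is sketched below with the azure-mgmt-storage Python SDK; the subscription, resource group, account name, and subnet resource IDs are placeholders, and exact model names may vary between SDK versions:

# Rough sketch: add the Databricks subnets to the storage account's network rules.
# All identifiers below are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import StorageAccountUpdateParameters, VirtualNetworkRule

subscription_id = "<subscription-id>"
resource_group = "<resource-group>"
account_name = "<datalake>"
databricks_subnet_ids = ["<public-subnet-resource-id>", "<private-subnet-resource-id>"]

client = StorageManagementClient(DefaultAzureCredential(), subscription_id)
account = client.storage_accounts.get_properties(resource_group, account_name)
rules = account.network_rule_set

# Append an Allow rule for each Databricks subnet that is not already listed.
existing = {r.virtual_network_resource_id for r in rules.virtual_network_rules}
for subnet_id in databricks_subnet_ids:
    if subnet_id not in existing:
        rules.virtual_network_rules.append(
            VirtualNetworkRule(virtual_network_resource_id=subnet_id, action="Allow"))

client.storage_accounts.update(
    resource_group, account_name,
    StorageAccountUpdateParameters(network_rule_set=rules))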