When data analytics users want to analyze large amounts of data, they often choose an Azure Databricks workspace and browse the data from remote, virtually unlimited storage. Usually this is an Azure storage account or a data lake, and the users find it convenient to pass through their Azure Active Directory credentials and work with the files by their location, such as
· wasbs://container@storageaccount.blob.core.windows.net/prefix/to/blob for blobs, and
· abfss://container@storageaccount.dfs.core.windows.net/path/to/file for files.
This is sufficient to read with Spark, for example:

file_location = "abfss://container@storageaccount.dfs.core.windows.net/path/to/file"  # the abfss location from above
df = spark.read.format("binaryFile").option("pathGlobFilter", "*.p").load(file_location)

and to write as well, but the Spark configuration caps messages at about 2 GB, so unless the file is partitioned, large files usually cause trouble for the users and the same statement fails.
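The limit in question appears to correspond to spark.rpc.message.maxSize, which defaults to 128 MiB and cannot be raised beyond roughly 2047 MiB. The short check below is only a sketch, assuming a Databricks notebook where spark is the active SparkSession and that this is indeed the setting being hit:

# spark.rpc.message.maxSize is a static setting; raising it means adding it to the
# cluster's Spark config before startup, and even then it tops out near 2 GB.
print(spark.sparkContext.getConf().get("spark.rpc.message.maxSize", "128"))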
In such cases, the users have a few options if they want to continue using their AAD credentials.
First, they can use a shared access signature (SAS) URL that they create with user delegation to read the file. For example:

import os
import requests

CHUNK_SIZE = 4 * 1024 * 1024  # stream in chunks of a few MB; filename and sasUrl are defined elsewhere

if not os.path.isfile("/dbfs/" + filename):
    print("downloading file...")
    # Stream the blob through the user-delegation SAS URL onto the DBFS mount.
    with requests.get(sasUrl, stream=True) as resp:
        if resp.ok:
            with open("/dbfs/" + filename, "wb") as f:
                for chunk in resp.iter_content(chunk_size=CHUNK_SIZE):
                    f.write(chunk)
else:
    print("file found...")
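If a user-delegation SAS URL is not already at hand, one way to mint one with the azure-storage-blob SDK is sketched below. This is an assumption-laden sketch rather than part of the original workflow: it presumes the azure-identity package is available, that the signed-in AAD identity is allowed to request a user delegation key (for example via a Storage Blob Data role), and that the account, container and blob names are placeholders.

from datetime import datetime, timedelta
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient, BlobSasPermissions, generate_blob_sas

account_url = "https://storageaccount.blob.core.windows.net"
service = BlobServiceClient(account_url, credential=DefaultAzureCredential())

# A user-delegation key is signed with the caller's AAD credential, not the account key.
start = datetime.utcnow()
expiry = start + timedelta(hours=1)
udk = service.get_user_delegation_key(key_start_time=start, key_expiry_time=expiry)

sas_token = generate_blob_sas(
    account_name="storageaccount",
    container_name="container",
    blob_name="path/to/file",
    user_delegation_key=udk,
    permission=BlobSasPermissions(read=True),
    expiry=expiry,
)
sasUrl = f"{account_url}/container/path/to/file?{sas_token}"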
It is somewhat harder, however, to write interactively to a SAS URL, because the request headers that the documentation requires for writing with a SAS token seem to get rejected. So the writes need to happen locally and the file must then be uploaded.
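For reference, a direct write over the SAS URL corresponds to the Put Blob REST operation, which requires at least the x-ms-blob-type request header. The snippet below is only a sketch, assuming a block blob small enough to send in a single request and a SAS that actually carries write permission:

import requests

with open(SOURCE_FILE, "rb") as data:
    # Put Blob rejects the request unless x-ms-blob-type is set.
    resp = requests.put(sasUrl, data=data, headers={"x-ms-blob-type": "BlockBlob"})
    resp.raise_for_status()

With the azure-storage-blob SDK, the write-locally-then-upload route looks like this: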
from azure.storage.blob import ContainerClient

# https://learn.microsoft.com/en-us/python/api/azure-storage-blob/azure.storage.blob.containerclient?view=azure-python
sas_url = "https://account.blob.core.windows.net/mycontainer?sv=2015-04-05&st=2015-04-29T22%3A18%3A26Z&se=2015-04-30T02%3A23%3A26Z&sr=b&sp=rw&sip=168.1.5.60-168.1.5.70&spr=https&sig=Z%2FRHIX5Xcg0Mq2rqI3OlWTjEg2tYkboXr1P9ZUXDtkk%3D"
container_client = ContainerClient.from_container_url(sas_url)

# Upload the locally written file to the container through the SAS.
with open(SOURCE_FILE, "rb") as data:
    blob_client = container_client.upload_blob(name="myblob", data=data)
Second, they can use the equivalent of mounting the remote storage as a local filesystem, but in this case they must leverage a Databricks access connector, and the admin must grant the Databricks user permission to use it in their workspace. The users can then continue to use their AD credentials across the Databricks workspace, the access connector, and the external storage.
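Once the external location is wired up this way, the same abfss path can be browsed and read directly under the user's own identity. The snippet below is a sketch that assumes the grants above are already in place and uses placeholder account and container names:

# List the external location and read it with the passed-through AAD identity.
path = "abfss://container@storageaccount.dfs.core.windows.net/path/"
display(dbutils.fs.ls(path))
df = spark.read.format("binaryFile").option("pathGlobFilter", "*.p").load(path)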
Third-party options like smart_open also document ways to do this, for example:
!pip install smart_open[all]
!pip install azure-storage-blob

import os
import azure.storage.blob
from smart_open import open

# stream from Azure Blob Storage
connect_str = "BlobEndpoint=https://storageaccount.blob.core.windows.net;SharedAccessSignature=<sasToken>"
transport_params = {
    'client': azure.storage.blob.BlobServiceClient.from_connection_string(connect_str),
}

filename = 'azure://container/path/to/file'

# stream content *into* Azure Blob Storage (write mode):
with open(filename, 'wb', transport_params=transport_params) as fout:
    fout.write(b'contents written here')

# stream content back out of Azure Blob Storage (read mode):
for line in open(filename, transport_params=transport_params):
    print(line)
but this fails with an error saying that the write using the SAS URL is not permitted, even though the SAS URL was generated with write permission.
Therefore, leveraging the Spark framework as much as possible and falling back to SAS URLs for uploading and downloading large files is preferable. Fortunately, transferring a file of about 10 GB takes only a couple of minutes.