This article is a continuation of the series on the Azure Data Platform and discusses data ingestion for data lakes.
Azure Data Lake Storage Gen2 (ADLS Gen2), or the data lake for short, can hold petabytes of data stored as files and folders with file-level security and scale. Azure Data Factory (ADF) can load data into the data lake with high throughput and massive parallelization. A point-to-point copy of gigabytes of data over the internet can take several minutes, so the scale-out that ADF provides makes it all the more appealing. This article discusses some of the considerations for the data ingestion that inevitably occurs when loading a data lake.
First, the prerequisites for data ingestion must be called out. At a minimum, an Azure subscription and a storage account with ADLS Gen2 (the hierarchical namespace) enabled are required. The source of the data can be another cloud store or an on-premises system. ADF can work with many types of data sources, but we will focus on ingesting files and folders of varying size and number, up to petabytes in total.
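As a rough sketch of that prerequisite, the storage account can be provisioned with the hierarchical namespace turned on, which is what makes it an ADLS Gen2 data lake. The snippet below assumes the azure-mgmt-storage Python SDK; the subscription, resource group, account name, and region are placeholders.

```python
# Sketch: provision the ADLS Gen2 prerequisite with azure-mgmt-storage.
# All names below are placeholders for illustration.
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import StorageAccountCreateParameters, Sku

subscription_id = "<subscription-id>"   # placeholder
resource_group = "rg-datalake-demo"     # hypothetical resource group
account_name = "datalakedemoaccount"    # hypothetical storage account name

storage_client = StorageManagementClient(DefaultAzureCredential(), subscription_id)

# is_hns_enabled=True turns on the hierarchical namespace, which gives the
# account file- and folder-level semantics, i.e. makes it a data lake.
poller = storage_client.storage_accounts.begin_create(
    resource_group,
    account_name,
    StorageAccountCreateParameters(
        sku=Sku(name="Standard_LRS"),
        kind="StorageV2",
        location="eastus",
        is_hns_enabled=True,
    ),
)
account = poller.result()
print(account.name, account.primary_endpoints.dfs)
```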
Creating a data factory from the Azure portal is easy to follow with the steps outlined in the user interface. The ADF home page has an Ingest tile that launches the Copy Data tool. In its properties window, the built-in copy task can be selected and run on demand or on a recurring schedule.
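For readers who prefer code over the portal, a minimal sketch of creating the factory with the azure-mgmt-datafactory Python SDK is shown below; the resource names are hypothetical, and the Copy Data tool simply generates a pipeline with a copy activity inside such a factory.

```python
# Sketch: create the data factory that the portal's Ingest tile works against.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import Factory

subscription_id = "<subscription-id>"   # placeholder
resource_group = "rg-datalake-demo"     # hypothetical
factory_name = "adf-ingest-demo"        # hypothetical

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# Equivalent of creating the data factory in the portal; the Copy Data tool
# then generates a pipeline with a copy activity inside this factory.
factory = adf_client.factories.create_or_update(
    resource_group, factory_name, Factory(location="eastus")
)
print(factory.provisioning_state)
```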
The data source must be specified, and this can point to an existing on-premises S3-compatible store that provides the file system holding the files and folders to be copied. This requires an integration runtime along with an access key and secret. There are three types of integration runtime: Azure, self-hosted, and Azure-SSIS. A self-hosted integration runtime can run copy activities between a cloud data store and a data store in a private network, and it dispatches transform activities against compute resources on-premises. The Azure integration runtime handles copy and data flow activities between cloud data stores, while the Azure-SSIS integration runtime is dedicated to executing SSIS packages. The self-hosted integration runtime makes only outbound HTTPS connections to the internet, so it can sit behind the firewall without requiring inbound connectivity from the cloud. It runs on the Windows operating system, and a single logical instance can be associated with multiple on-premises machines in active-active mode for high availability.
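A minimal sketch of wiring this up with the same SDK follows, continuing with the adf_client from the previous snippet: a logical self-hosted integration runtime plus linked services for the S3-compatible source and the ADLS Gen2 sink. The connector model names, endpoints, and credentials here are assumptions for illustration and can vary with SDK versions; in practice secrets would come from Key Vault rather than inline strings.

```python
# Sketch: register a self-hosted IR and the source/sink linked services.
from azure.mgmt.datafactory.models import (
    IntegrationRuntimeResource, SelfHostedIntegrationRuntime,
    IntegrationRuntimeReference, LinkedServiceResource, SecureString,
    AmazonS3CompatibleLinkedService, AzureBlobFSLinkedService,
)

# Logical self-hosted IR; the IR software is then installed on one or more
# on-premises Windows machines and registered with the key that ADF issues.
adf_client.integration_runtimes.create_or_update(
    resource_group, factory_name, "SelfHostedIR",
    IntegrationRuntimeResource(
        properties=SelfHostedIntegrationRuntime(description="On-premises IR")
    ),
)

# Source: S3-compatible storage reached through the self-hosted IR.
# The endpoint and credentials are placeholders.
adf_client.linked_services.create_or_update(
    resource_group, factory_name, "S3CompatibleSource",
    LinkedServiceResource(
        properties=AmazonS3CompatibleLinkedService(
            service_url="https://s3.onprem.example.com",   # hypothetical endpoint
            access_key_id="<access-key>",
            secret_access_key=SecureString(value="<secret>"),
            connect_via=IntegrationRuntimeReference(
                type="IntegrationRuntimeReference", reference_name="SelfHostedIR"
            ),
        )
    ),
)

# Sink: the ADLS Gen2 account created earlier (account key as a placeholder).
adf_client.linked_services.create_or_update(
    resource_group, factory_name, "DataLakeSink",
    LinkedServiceResource(
        properties=AzureBlobFSLinkedService(
            url="https://datalakedemoaccount.dfs.core.windows.net",
            account_key="<storage-account-key>",
        )
    ),
)
```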
The transfer of data should be in binary mode, and the copy should recursively traverse the source files and folders. The destination must be configured as the data lake by pointing to the Azure subscription and the storage account with the ADLS Gen2 option specified. The destination folder structure can preserve that of the source, and a preview option helps verify that the data will be copied correctly. ADF can also extract zip archives before writing them to the data lake. The copy operation is tracked as a task: the pipeline comprises the copy task and a monitoring task. When the pipeline run completes, its activity details can be viewed and the run can be rerun if necessary.
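A sketch of the pipeline that the Copy Data tool would generate could look like the following, again continuing from the earlier snippets: binary datasets over the two linked services, a copy activity that reads recursively and preserves the source hierarchy at the destination, and an on-demand run. The dataset, bucket, container, and path names are assumptions.

```python
# Sketch: binary datasets, a recursive hierarchy-preserving copy, and a run.
from azure.mgmt.datafactory.models import (
    DatasetResource, BinaryDataset, DatasetReference, LinkedServiceReference,
    AmazonS3CompatibleLocation, AzureBlobFSLocation,
    PipelineResource, CopyActivity, BinarySource, BinarySink,
    AmazonS3CompatibleReadSettings, AzureBlobFSWriteSettings,
)

# Source dataset: everything under a bucket on the S3-compatible store.
adf_client.datasets.create_or_update(
    resource_group, factory_name, "SourceFiles",
    DatasetResource(properties=BinaryDataset(
        linked_service_name=LinkedServiceReference(
            type="LinkedServiceReference", reference_name="S3CompatibleSource"),
        location=AmazonS3CompatibleLocation(bucket_name="landing", folder_path="exports"),
    )),
)

# Sink dataset: a file system (container) in the ADLS Gen2 account.
adf_client.datasets.create_or_update(
    resource_group, factory_name, "DataLakeFiles",
    DatasetResource(properties=BinaryDataset(
        linked_service_name=LinkedServiceReference(
            type="LinkedServiceReference", reference_name="DataLakeSink"),
        location=AzureBlobFSLocation(file_system="raw", folder_path="exports"),
    )),
)

# Copy activity in binary mode: recursive traversal on read, source folder
# structure preserved on write.
copy_activity = CopyActivity(
    name="CopyToDataLake",
    inputs=[DatasetReference(type="DatasetReference", reference_name="SourceFiles")],
    outputs=[DatasetReference(type="DatasetReference", reference_name="DataLakeFiles")],
    source=BinarySource(store_settings=AmazonS3CompatibleReadSettings(recursive=True)),
    sink=BinarySink(store_settings=AzureBlobFSWriteSettings(copy_behavior="PreserveHierarchy")),
)

adf_client.pipelines.create_or_update(
    resource_group, factory_name, "IngestPipeline",
    PipelineResource(activities=[copy_activity]),
)

# Trigger an on-demand run; the run_id is used below for monitoring.
run = adf_client.pipelines.create_run(resource_group, factory_name, "IngestPipeline")
print(run.run_id)
```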
Every activity run reports its status, copy duration, throughput, data read and written, files read and written, peak connections for both read and write, the parallel copies used, the data integration units consumed, and the queue and transfer durations, giving complete information on the activities performed for monitoring and troubleshooting.
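As a sketch of retrieving those metrics programmatically for the run started above, the activity runs can be queried by pipeline run id. The output keys shown are the usual copy-activity output fields and are read defensively, since their availability can vary by run.

```python
# Sketch: query the pipeline run and its copy-activity metrics.
from datetime import datetime, timedelta
from azure.mgmt.datafactory.models import RunFilterParameters

pipeline_run = adf_client.pipeline_runs.get(resource_group, factory_name, run.run_id)
print("Pipeline run status:", pipeline_run.status)

activity_runs = adf_client.activity_runs.query_by_pipeline_run(
    resource_group, factory_name, run.run_id,
    RunFilterParameters(
        last_updated_after=datetime.utcnow() - timedelta(days=1),
        last_updated_before=datetime.utcnow() + timedelta(days=1),
    ),
).value

for activity_run in activity_runs:
    out = activity_run.output or {}
    print(activity_run.activity_name, activity_run.status)
    # Typical copy-activity output fields; .get() tolerates missing keys.
    for key in ("copyDuration", "throughput", "dataRead", "filesRead",
                "dataWritten", "filesWritten", "usedParallelCopies",
                "usedDataIntegrationUnits"):
        print(" ", key, out.get(key))
```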