Friday, April 7, 2023

This article continues the series on the Azure Data Platform and discusses data ingestion for data lakes.

Azure Data Lake Storage Gen2 (ADLS Gen2), or the data lake for short, can hold petabytes of data stored as files and folders with file-level security and scale. Azure Data Factory (ADF) can load data into the data lake with high throughput and massive parallelization. A point-to-point copy of gigabytes of data over the internet can take several minutes, so the scale-out that ADF provides makes it all the more appealing to use. This article discusses some of the considerations involved in loading data into a data lake.

First, the prerequisites for data ingestion must be called out. At a minimum, an Azure subscription and a storage account with Azure Data Lake Storage Gen2 enabled are required. The source of the data can be another cloud storage service or an on-premises system. ADF can work with many types of data sources, but the focus here is on ingesting files and folders that vary in size and number and add up to petabytes.
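
As a quick illustration of the storage prerequisite, the following is a minimal sketch that provisions a storage account with the hierarchical namespace enabled using the Azure SDK for Python, assuming recent azure-identity and azure-mgmt-storage packages are installed and the caller has rights on the subscription. The resource group, account name, and region are hypothetical placeholders.

# Sketch: provision an ADLS Gen2-capable storage account (hierarchical namespace on).
# Resource group, account name, and region below are hypothetical placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import StorageAccountCreateParameters, Sku

subscription_id = "<subscription-id>"
client = StorageManagementClient(DefaultAzureCredential(), subscription_id)

poller = client.storage_accounts.begin_create(
    "rg-datalake",                      # hypothetical resource group
    "contosodatalake",                  # hypothetical, globally unique account name
    StorageAccountCreateParameters(
        sku=Sku(name="Standard_LRS"),
        kind="StorageV2",
        location="eastus",
        is_hns_enabled=True,            # hierarchical namespace makes it a data lake
    ),
)
account = poller.result()
print(account.primary_endpoints.dfs)    # the Data Lake (dfs) endpoint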

Creating a data factory from the Azure portal is straightforward, with the steps outlined in the user interface. The home page has an Ingest tile that launches the Copy Data tool. In the properties window, the built-in copy task is available and can be run on demand or on a periodic schedule. The data source must then be specified; it can point to an existing S3-compatible storage system on-premises that provides the file system holding the files and folders to be copied. This requires an integration runtime along with an access key and secret.

There are three types of integration runtime: Azure, self-hosted, and Azure-SSIS. A self-hosted integration runtime can run copy activities between a cloud data store and a data store in a private network, and it can dispatch transform activities to compute resources on-premises. The Azure integration runtime handles activities between cloud data stores, including mapping data flows, while the Azure-SSIS integration runtime is dedicated to running SSIS packages. The self-hosted integration runtime makes only outbound HTTP connections to the internet, so it can sit behind the firewall without requiring a direct inbound link from the cloud. It runs on the Windows operating system, and a single logical instance can be associated with multiple physical on-premises machines in active-active mode.
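
The same wiring can also be scripted instead of clicked through. The following is a minimal sketch with the azure-mgmt-datafactory SDK that registers an S3-compatible source behind a self-hosted integration runtime and an ADLS Gen2 sink as linked services. The factory name, runtime name, endpoint, and credentials are hypothetical placeholders, and the self-hosted runtime named SelfHostedIR is assumed to be already installed and registered with the factory.

# Sketch: register the source and sink as ADF linked services.
# Names, keys, and URLs are hypothetical placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    LinkedServiceResource, AmazonS3LinkedService, AzureBlobFSLinkedService,
    IntegrationRuntimeReference, SecureString,
)

adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg, factory = "rg-datalake", "adf-ingest"      # hypothetical names

# S3-compatible source reached through the self-hosted integration runtime.
s3_ls = LinkedServiceResource(properties=AmazonS3LinkedService(
    service_url="https://s3.onprem.example.com",            # hypothetical endpoint
    access_key_id="<access-key>",
    secret_access_key=SecureString(value="<secret>"),
    connect_via=IntegrationRuntimeReference(
        type="IntegrationRuntimeReference", reference_name="SelfHostedIR"),
))
adf.linked_services.create_or_update(rg, factory, "OnPremS3", s3_ls)

# ADLS Gen2 sink addressed by its dfs endpoint.
adls_ls = LinkedServiceResource(properties=AzureBlobFSLinkedService(
    url="https://contosodatalake.dfs.core.windows.net",     # hypothetical account
    account_key="<storage-account-key>",
))
adf.linked_services.create_or_update(rg, factory, "DataLakeSink", adls_ls)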

For a plain file copy, the transfer is done in binary mode and recursively traverses the files and folders. The destination is configured as the data lake by pointing to the Azure subscription and the storage account with the ADLS Gen2 option specified. The destination folder structure can preserve that of the origin, and a preview option confirms that the data will be copied correctly. ADF can also extract zip files before writing them to the data lake. The copy operation is tracked as a task, and the resulting pipeline comprises the copy task and its monitoring. When the pipeline run completes successfully, its activity details can be viewed and the run can be repeated if necessary.
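
Continuing the same hypothetical names, the sketch below defines binary source and sink datasets, a copy activity that reads recursively and preserves the folder hierarchy at the destination, and a pipeline that is then triggered on demand. The dataset, bucket, and file-system names are placeholders, not a prescription.

# Sketch: a binary, recursive, hierarchy-preserving copy pipeline, reusing the
# hypothetical names (rg-datalake, adf-ingest, OnPremS3, DataLakeSink) from above.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    DatasetResource, BinaryDataset, AmazonS3Location, AzureBlobFSLocation,
    LinkedServiceReference, DatasetReference, CopyActivity, BinarySource,
    BinarySink, AmazonS3ReadSettings, AzureBlobFSWriteSettings, PipelineResource,
)

adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg, factory = "rg-datalake", "adf-ingest"

# Source and sink are plain Binary datasets: no schema, just files and folders.
src_ds = DatasetResource(properties=BinaryDataset(
    linked_service_name=LinkedServiceReference(
        type="LinkedServiceReference", reference_name="OnPremS3"),
    location=AmazonS3Location(bucket_name="landing", folder_path="exports"),
))
sink_ds = DatasetResource(properties=BinaryDataset(
    linked_service_name=LinkedServiceReference(
        type="LinkedServiceReference", reference_name="DataLakeSink"),
    location=AzureBlobFSLocation(file_system="raw", folder_path="exports"),
))
adf.datasets.create_or_update(rg, factory, "S3Files", src_ds)
adf.datasets.create_or_update(rg, factory, "LakeFiles", sink_ds)

# Recursive read on the source, hierarchy-preserving write on the sink.
copy = CopyActivity(
    name="CopyToLake",
    inputs=[DatasetReference(type="DatasetReference", reference_name="S3Files")],
    outputs=[DatasetReference(type="DatasetReference", reference_name="LakeFiles")],
    source=BinarySource(store_settings=AmazonS3ReadSettings(recursive=True)),
    sink=BinarySink(store_settings=AzureBlobFSWriteSettings(copy_behavior="PreserveHierarchy")),
)
adf.pipelines.create_or_update(rg, factory, "IngestPipeline", PipelineResource(activities=[copy]))
run = adf.pipelines.create_run(rg, factory, "IngestPipeline")
print(run.run_id)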

Every activity run reports its status, copy duration, throughput, data read and written, files read and written, peak connections for both read and write, the number of parallel copies used, the data integration units consumed, and the queue and transfer durations, giving complete information for monitoring and troubleshooting.
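
These metrics appear in the Monitor views and can also be queried programmatically. The sketch below assumes the run identifier returned by the create_run call above and reads the copy activity output, whose field names mirror the metrics just listed.

# Sketch: pull copy metrics for a pipeline run; run_id is a placeholder for the
# value returned by create_run, and the output keys are copy activity output fields.
from datetime import datetime, timedelta
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import RunFilterParameters

adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg, factory, run_id = "rg-datalake", "adf-ingest", "<run-id>"

window = RunFilterParameters(
    last_updated_after=datetime.utcnow() - timedelta(days=1),
    last_updated_before=datetime.utcnow() + timedelta(days=1),
)
result = adf.activity_runs.query_by_pipeline_run(rg, factory, run_id, window)
for activity in result.value:
    out = activity.output or {}
    print(activity.activity_name, activity.status,
          out.get("dataRead"), out.get("dataWritten"),
          out.get("filesRead"), out.get("filesWritten"),
          out.get("throughput"), out.get("usedParallelCopies"),
          out.get("usedDataIntegrationUnits"), out.get("copyDuration"))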
