Cluster computing

Saturday, April 8, 2023

Using Azure Data Factory to upload and transform data to Azure Data Lake Storage Gen2.

This is a continuation of the articles on the Azure Data Platform as they appear here. It discusses data security and compliance for ADF and data lakes.

ADF can work with many types of data sources and can ingest files and folders of varying size and number, up to petabytes of data. Microsoft Purview can be used to govern, protect, and manage data estates. It provides integrated coverage and helps address the fragmentation of data, whether at rest or in transit. This kind of solution helps an organization protect sensitive data across clouds, applications, and devices, identify data risks, and manage regulatory compliance requirements. It helps to create an up-to-date map of the entire data estate that includes data classification and end-to-end lineage, to identify sensitive data, and to create a secure environment in which data consumers can find valuable data and generate insights about how the data is stored and used. With data in ADF and a data lake, such a map is very helpful for meeting compliance standards such as SOC, ISO, HITRUST, FedRAMP, and HIPAA.

ADF can be connected to Microsoft Purview. There are two options to do so:

1. Connect to a Microsoft Purview account from within Azure Data Factory, or

2. Register the Data Factory in Microsoft Purview.

Either option can be complemented with Azure Monitor based alerts.

There are two prerequisites for connecting ADF to a Microsoft Purview account: the user must hold the Owner or Contributor role on the ADF, and the ADF must have a system-assigned managed identity enabled. Making the connection requires only the Azure subscription, which is used to locate the available Purview accounts, one of which is then selected. The connection information is stored in the ADF resource. ADF’s managed identity is used to authenticate lineage push operations from the ADF to the Microsoft Purview account. For this to work, the Data Curator role on the root collection of the Microsoft Purview account must be assigned to the managed identity of the ADF.

Both this connection and the Purview integration capabilities can be monitored. The default integration capability is data lineage for pipelines. When a pipeline is executed, its lineage information is transmitted to the Purview account. The search bar at the top center of the Data Factory authoring UI can be used to search for data and perform actions on it. This is very helpful for understanding the data based on metadata, lineage, and annotations. Many organizations rely heavily on tagging and metadata, even to the point of specifying paths and dedicating storage containers for such information.

Once data has been found through the Microsoft Purview search, it is possible to create a linked service, dataset, or data flow over it.

Every copy activity run in ADF reports its status, copy duration, throughput, data read, files read, data written, files written, peak connections for both read and write, the parallel copies used, the data integration units used, and the queue and transfer durations. Together these provide complete information on the activities performed, for monitoring or troubleshooting.
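The metrics listed above can be pulled out of a copy activity's output and put into a compact summary for monitoring. The following is a minimal sketch, not a definitive implementation: it assumes the run's output is already available as a Python dictionary whose keys follow the names used in the Copy activity monitoring output (for example `dataRead`, `filesWritten`, `usedParallelCopies`, `detailedDurations`). The `CopyRunSummary` class, the `summarize_copy_run` function, and the sample payload are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class CopyRunSummary:
    """Hypothetical container for the metrics of one copy activity run."""
    status: str
    copy_duration_s: float
    throughput_kbps: float
    data_read_bytes: int
    data_written_bytes: int
    files_read: int
    files_written: int
    source_peak_connections: int
    sink_peak_connections: int
    parallel_copies: int
    data_integration_units: int
    queue_duration_s: float
    transfer_duration_s: float


def summarize_copy_run(status: str, output: Dict[str, Any]) -> CopyRunSummary:
    """Build a summary from a copy activity output dictionary.

    Key names are assumed to match the Copy activity output; any key
    that is missing falls back to zero instead of raising an error.
    """
    # Queue and transfer durations are nested under the first execution detail.
    details = output.get("executionDetails") or [{}]
    durations = details[0].get("detailedDurations", {})
    return CopyRunSummary(
        status=status,
        copy_duration_s=output.get("copyDuration", 0),
        throughput_kbps=output.get("throughput", 0.0),
        data_read_bytes=output.get("dataRead", 0),
        data_written_bytes=output.get("dataWritten", 0),
        files_read=output.get("filesRead", 0),
        files_written=output.get("filesWritten", 0),
        source_peak_connections=output.get("sourcePeakConnections", 0),
        sink_peak_connections=output.get("sinkPeakConnections", 0),
        parallel_copies=output.get("usedParallelCopies", 0),
        data_integration_units=output.get("usedDataIntegrationUnits", 0),
        queue_duration_s=durations.get("queuingDuration", 0),
        transfer_duration_s=durations.get("transferDuration", 0),
    )


# Made-up sample payload, shaped like a copy activity output.
SAMPLE_OUTPUT: Dict[str, Any] = {
    "dataRead": 1048576,
    "dataWritten": 1048576,
    "filesRead": 4,
    "filesWritten": 4,
    "sourcePeakConnections": 2,
    "sinkPeakConnections": 2,
    "copyDuration": 16,
    "throughput": 64.0,
    "usedDataIntegrationUnits": 4,
    "usedParallelCopies": 1,
    "executionDetails": [
        {"detailedDurations": {"queuingDuration": 4, "transferDuration": 10}}
    ],
}

if __name__ == "__main__":
    print(summarize_copy_run("Succeeded", SAMPLE_OUTPUT))
```

Falling back to zero for missing keys keeps the summary usable for failed or partial runs, where some metrics may not be reported at all.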
