This is a continuation of a series of articles on the operational engineering aspects of the Azure public cloud, the most recent of which discussed Azure Data Lake, a generally available service that offers Service Level Agreements comparable to others in its category. This article continues the focus on Azure Data Lake, which is suited to storing and handling Big Data. Because it is built over Azure Blob Storage, it provides native support for web-accessible documents. It is not a massive virtual data warehouse, but it powers a great deal of analytics and is the centerpiece of most solutions that conform to the Big Data architectural style.
This article discusses data ingestion from one location to another in Azure Data Lake Storage Gen 2 using Azure Synapse Analytics. The Gen 2 store serves as the source data store and requires a corresponding storage account. Azure Synapse Analytics provides many features for data analysis and integration, but its pipelines are especially helpful for working with data.
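For reference, a minimal sketch of reaching the Gen 2 source account from Python is shown below, assuming the azure-identity and azure-storage-file-datalake packages; the storage account and container names are placeholders.

```python
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# ADLS Gen 2 endpoints use the "dfs" suffix rather than the "blob" suffix.
service = DataLakeServiceClient(
    account_url="https://<storage-account>.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)

# List the files in the container that holds the data to be ingested.
file_system = service.get_file_system_client("source-data")
for path in file_system.get_paths():
    print(path.name)
```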
In Azure Synapse Analytics, we create a linked service, which is a definition of the connection information to another service. When we add Azure Synapse Analytics and Azure Data Lake Storage Gen 2 as linked services, we enable data to flow continuously over the connection without requiring additional routines. The Azure Synapse Analytics UX has a Manage tab where the option to create a linked service is provided under External Connections. The Azure Data Lake Storage Gen 2 connection supports an account key, a service principal, or a managed identity as authentication types. The connection can be tested prior to use.
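Behind the Manage tab, the linked service is a JSON definition. The sketch below mirrors that shape as a Python dictionary; the linked service name, account URL, and key are placeholders, and AzureBlobFS is the type used for Azure Data Lake Storage Gen 2 connections. Account-key authentication is shown here, but a service principal or a managed identity works equally well.

```python
import json

# A rough sketch of the linked service definition the Manage tab produces.
linked_service = {
    "name": "SourceDataLakeGen2",          # hypothetical linked service name
    "properties": {
        "type": "AzureBlobFS",             # type used for ADLS Gen 2 connections
        "typeProperties": {
            "url": "https://<storage-account>.dfs.core.windows.net",
            "accountKey": {
                "type": "SecureString",
                "value": "<account-key>",  # better referenced from Key Vault
            },
        },
    },
}

print(json.dumps(linked_service, indent=2))
```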
A pipeline definition in Azure Synapse describes the logical flow for executing a set of activities. We require a copy activity in the pipeline to ingest data from Azure Data Lake Storage Gen 2 into a dedicated SQL pool. A pipeline option is available under the Orchestrate tab and must be selected before activities can be associated with it. The Move and Transform section of the Activities pane has a Copy data option that can be dragged onto the pipeline canvas. The copy activity must be defined with a new source data store pointing to Azure Data Lake Storage Gen 2. Delimited text must be specified as the format, along with the file path of the source data and whether the first row has a header.
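The canvas configuration corresponds to a pipeline definition roughly like the sketch below, again expressed as a Python dictionary with placeholder pipeline and dataset names; DelimitedTextSource and SqlDWSink are the copy-activity types used for a delimited text source and a dedicated SQL pool sink.

```python
import json

# A sketch of the pipeline behind the canvas: a single Copy activity reading
# delimited text from the Gen 2 source dataset and writing to the SQL pool.
pipeline = {
    "name": "IngestFromDataLake",          # hypothetical pipeline name
    "properties": {
        "activities": [
            {
                "name": "CopyDelimitedText",
                "type": "Copy",
                "inputs": [
                    {"referenceName": "SourceDelimitedText", "type": "DatasetReference"}
                ],
                "outputs": [
                    {"referenceName": "DedicatedSqlPoolTable", "type": "DatasetReference"}
                ],
                "typeProperties": {
                    "source": {"type": "DelimitedTextSource"},
                    "sink": {"type": "SqlDWSink"},
                },
            }
        ]
    },
}

print(json.dumps(pipeline, indent=2))
```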
With the pipeline configured this way, a debug run can be executed before the artifacts are published to verify that everything is correct. Once the pipeline has run successfully, the Publish All option can be selected to publish the entities to the Synapse Analytics service. When the successfully published message appears, we can move on to triggering and monitoring the pipeline.
A trigger can be manually invoked with the Trigger Now option. When this is done, the Monitor tab displays the pipeline run along with links under the Actions column. The details of the copy operation can then be viewed, and the data written to the dedicated SQL pool can be verified to be correct.
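The same trigger-and-monitor loop can also be approximated from code. The sketch below assumes the azure-synapse-artifacts package, a placeholder workspace endpoint, and the hypothetical pipeline name used above; the exact client surface may vary between SDK versions.

```python
import time

from azure.identity import DefaultAzureCredential
from azure.synapse.artifacts import ArtifactsClient

client = ArtifactsClient(
    credential=DefaultAzureCredential(),
    endpoint="https://<workspace-name>.dev.azuresynapse.net",
)

# Equivalent of the Trigger Now option: start a run and keep its run id.
run = client.pipeline.create_pipeline_run("IngestFromDataLake")

# Equivalent of the Monitor tab: poll the run until it reaches a final state.
while True:
    status = client.pipeline_run.get_pipeline_run(run.run_id).status
    print("Pipeline run status:", status)
    if status in ("Succeeded", "Failed", "Cancelled"):
        break
    time.sleep(30)
```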