This is a continuation of a series of articles on the operational engineering aspects of the Azure public cloud, the most recent of which discussed Azure Data Lake, a generally available service that offers Service Level Agreements comparable to others in its category. This article continues the focus on Azure Data Lake, which is suited to storing and handling Big Data. Because it is built over Azure Blob Storage, it provides native support for web-accessible documents. It is not a massive virtual data warehouse, but it powers a great deal of analytics and is the centerpiece of most solutions that conform to the Big Data architectural style.
This article discusses data ingestion from one location to another in Azure Data Lake Storage Gen 2 using Azure Synapse Analytics. The Gen 2 store serves as the source data store and requires a corresponding storage account. Azure Synapse Analytics provides many features for data analysis and integration, but its pipelines are especially helpful for working with data.
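For reference, a minimal sketch of reaching the Gen 2 source account from Python is shown below, assuming the azure-identity and azure-storage-file-datalake packages; the storage account and container names are placeholders.

```python
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# ADLS Gen 2 endpoints use the "dfs" suffix rather than the "blob" suffix.
service = DataLakeServiceClient(
    account_url="https://<storage-account>.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)

# List the files in the container that holds the data to be ingested.
file_system = service.get_file_system_client("source-data")
for path in file_system.get_paths():
    print(path.name)
```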
In Azure Synapse Analytics, we create a linked service, which is a definition of the connection information to another service. When we add Azure Synapse Analytics and Azure Data Lake Storage Gen 2 as linked services, we enable data to flow continuously over the connection without requiring additional routines. The Azure Synapse Analytics UX has a Manage tab where the option to create a linked service is provided under External Connections. The Azure Data Lake Storage Gen 2 connection supports an account key, a service principal, or a managed identity as authentication types. The connection can be tested prior to use.
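Behind the Manage tab, the linked service is a JSON definition. The sketch below mirrors that shape as a Python dictionary; the linked service name, account URL, and key are placeholders, and AzureBlobFS is the type used for Azure Data Lake Storage Gen 2 connections. Account-key authentication is shown here, but a service principal or a managed identity works equally well.

```python
import json

# A rough sketch of the linked service definition the Manage tab produces.
linked_service = {
    "name": "SourceDataLakeGen2",          # hypothetical linked service name
    "properties": {
        "type": "AzureBlobFS",             # type used for ADLS Gen 2 connections
        "typeProperties": {
            "url": "https://<storage-account>.dfs.core.windows.net",
            "accountKey": {
                "type": "SecureString",
                "value": "<account-key>",  # better referenced from Key Vault
            },
        },
    },
}

print(json.dumps(linked_service, indent=2))
```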
A pipeline definition in Azure Synapse describes the logical flow for executing a set of activities. We require a copy activity in the pipeline to ingest data from Azure Data Lake Storage Gen 2 into a dedicated SQL pool. A pipeline option is available under the Orchestrate tab and must be selected before activities can be associated with it. The Move and Transform section of the Activities pane has a Copy data option that can be dragged onto the pipeline canvas. The copy activity must be defined with a new source data store pointing to Azure Data Lake Storage Gen 2. Delimited text must be specified as the format, along with the file path of the source data and whether the first row has a header.
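The canvas configuration corresponds to a pipeline definition roughly like the sketch below, again expressed as a Python dictionary with placeholder pipeline and dataset names; DelimitedTextSource and SqlDWSink are the copy-activity types used for a delimited text source and a dedicated SQL pool sink.

```python
import json

# A sketch of the pipeline behind the canvas: a single Copy activity reading
# delimited text from the Gen 2 source dataset and writing to the SQL pool.
pipeline = {
    "name": "IngestFromDataLake",          # hypothetical pipeline name
    "properties": {
        "activities": [
            {
                "name": "CopyDelimitedText",
                "type": "Copy",
                "inputs": [
                    {"referenceName": "SourceDelimitedText", "type": "DatasetReference"}
                ],
                "outputs": [
                    {"referenceName": "DedicatedSqlPoolTable", "type": "DatasetReference"}
                ],
                "typeProperties": {
                    "source": {"type": "DelimitedTextSource"},
                    "sink": {"type": "SqlDWSink"},
                },
            }
        ]
    },
}

print(json.dumps(pipeline, indent=2))
```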
With the pipeline configured this way, a debug run can be executed before the artifacts are published to verify that everything is correct. Once the pipeline has run successfully, the Publish All option can be selected to publish the entities to the Synapse Analytics service. When the successfully published message appears, we can move on to triggering and monitoring the pipeline.
A trigger can be manually invoked with the Trigger Now option. When this is done, the Monitor tab displays the pipeline run along with links under the Actions column. The details of the copy operation can then be viewed, and the data written to the dedicated SQL pool can be verified to be correct.
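The same trigger-and-monitor loop can also be approximated from code. The sketch below assumes the azure-synapse-artifacts package, a placeholder workspace endpoint, and the hypothetical pipeline name used above; the exact client surface may vary between SDK versions.

```python
import time

from azure.identity import DefaultAzureCredential
from azure.synapse.artifacts import ArtifactsClient

client = ArtifactsClient(
    credential=DefaultAzureCredential(),
    endpoint="https://<workspace-name>.dev.azuresynapse.net",
)

# Equivalent of the Trigger Now option: start a run and keep its run id.
run = client.pipeline.create_pipeline_run("IngestFromDataLake")

# Equivalent of the Monitor tab: poll the run until it reaches a final state.
while True:
    status = client.pipeline_run.get_pipeline_run(run.run_id).status
    print("Pipeline run status:", status)
    if status in ("Succeeded", "Failed", "Cancelled"):
        break
    time.sleep(30)
```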