A previous post introduced some of the best practices for using the Azure Data
Platform. It covered various options for structured and unstructured storage.
This article covers some of the considerations regarding data in transit.
Azure Data Factory is frequently used to extract, transform, and load (ETL)
data into the cloud. Even if there are on-premises SSIS data tasks to perform,
Azure Data Factory can help to migrate the data from on-premises to the cloud.
Several components within Azure Data Factory help to meet this goal. Linked
Services provide connections to external resources that contain the datasets to
work with. A Pipeline has one or more activities that can be triggered to
control or transform the data. The Integration Runtime provides the compute
environment in which data flow, transformation, and movement execute. It can be
Azure-based, self-hosted, or Azure-SSIS, the last of which can lift and shift
existing SSIS workloads. The pipeline and activities define actions to perform
on the
data. For example, a pipeline could contain a set of activities that ingest and
clean log data and then kick off a mapping data flow to analyze the log data.
When the pipeline is deployed and scheduled, all the activities can be managed
as a set instead of each one individually. Activities can be grouped into data
movement activities, data transformation activities, and control activities.
Each activity can take zero or more input and output datasets. Azure Data
Factory enables us to author workflows that orchestrate complex ETL, ELT and
data integration tasks in a flexible way that involves graphical and code-based
data pipelines with Continuous Integration / Continuous Deployment support.
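The log example above can be sketched in code. The following is a minimal
illustration, not a deployable definition: it builds a pipeline as a plain
Python dictionary shaped like Data Factory's JSON pipeline format and groups
the activities into the three categories mentioned. The pipeline, activity, and
dataset names are hypothetical, the `typeProperties` that a real activity needs
are omitted, and the type-to-group mapping lists only a few activity types.

```python
import json
from collections import defaultdict

# Assumed mapping from an activity "type" to the three groups named in the
# text. Only a handful of types are listed; others are reported as "unknown".
ACTIVITY_GROUPS = {
    "Copy": "data movement",
    "ExecuteDataFlow": "data transformation",
    "DatabricksNotebook": "data transformation",
    "ForEach": "control",
    "IfCondition": "control",
}


def activity(name, type_, inputs=(), outputs=(), depends_on=()):
    """Build one activity entry; inputs and outputs may both be empty."""

    def dataset_ref(dataset_name):
        return {"referenceName": dataset_name, "type": "DatasetReference"}

    return {
        "name": name,
        "type": type_,
        # Run only after every listed upstream activity has succeeded.
        "dependsOn": [
            {"activity": upstream, "dependencyConditions": ["Succeeded"]}
            for upstream in depends_on
        ],
        "inputs": [dataset_ref(d) for d in inputs],
        "outputs": [dataset_ref(d) for d in outputs],
    }


# Hypothetical pipeline for the log example: ingest first, then analyze.
pipeline = {
    "name": "LogAnalysisPipeline",
    "properties": {
        "activities": [
            activity("IngestLogs", "Copy",
                     inputs=["RawLogs"], outputs=["StagedLogs"]),
            activity("AnalyzeLogs", "ExecuteDataFlow",
                     depends_on=["IngestLogs"]),
        ]
    },
}


def group_activities(pipeline_definition):
    """Return activity names keyed by movement, transformation, or control."""
    groups = defaultdict(list)
    for act in pipeline_definition["properties"]["activities"]:
        group = ACTIVITY_GROUPS.get(act["type"], "unknown")
        groups[group].append(act["name"])
    return dict(groups)


if __name__ == "__main__":
    print(json.dumps(pipeline, indent=2))
    print(group_activities(pipeline))
```

Keeping the definition as data is what allows the activities to be deployed,
scheduled, and versioned as a set rather than one by one.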
Let us take a scenario for a company that has a variety of data from a variety
of sources and requires automation for data ingestion and analysis in the
cloud. It requires expertise in business analysis, data engineering, and data
science to define an analytics solution. For this purpose, it leverages the new
data analytics platform in Azure that has been discussed so far. The current
practice in this company captures a variety of data about manufacturing and
marketing and stores it in a centralized repository. The size of this
repository limits the data capture to about one week’s worth of data, and the
repository supports the JSON, CSV, and text formats. Additional data exists in
another cloud, in publicly available object storage. The
company is expecting to meet the following objectives with regard to data
storage, data movement, and data analytics and insights. The data storage must
hold months of data, to the tune of petabytes, and support access control at
the file level. Data must be regularly ingested from both the on-premises
storage and AWS. The existing connectivity and accessibility to data cannot be
changed. The analytics platform must support Spark and also be available in the
other cloud. Security demands that the workspace used for analytics be made
available only to the head office.
A possible solution for the above scenario is to store the data in Azure Data
Lake Storage Gen2, because it scales to petabytes of data and comes with a
hierarchical namespace and POSIX-like access control lists. The data can be
copied or moved with Azure Data Factory, using a self-hosted integration
runtime that runs on-premises and can access the on-premises storage privately.
For the data that resides in the other cloud, Azure Data Factory can use the
built-in Azure integration runtime to access it. There are many services to
choose from for the analytics solution, but Databricks is chosen here because
it could potentially work in both public clouds. The premium plan for
Databricks can restrict workspace use to the head office only. It also supports
Azure AD credential passthrough when it is used to secure access to data lake
storage.
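The choice of integration runtime per data source can be illustrated with a
short sketch. It builds linked-service definitions as plain Python dictionaries
in the shape of Data Factory's JSON format, where `connectVia` names the
runtime to use. Several things here are assumptions rather than details from
the scenario: the service names, the self-hosted runtime name, the use of a
file share on-premises (`FileServer`), and S3 as the object storage in AWS
(`AmazonS3`). Connection settings and credentials are deliberately left empty.

```python
# Name of the default Azure integration runtime in Data Factory.
AZURE_IR = "AutoResolveIntegrationRuntime"
# Hypothetical name for the self-hosted runtime installed on-premises.
SELF_HOSTED_IR = "OnPremSelfHostedIR"


def linked_service(name, type_, on_premises):
    """Build a linked service that connects through the appropriate runtime."""
    ir_name = SELF_HOSTED_IR if on_premises else AZURE_IR
    return {
        "name": name,
        "properties": {
            "type": type_,
            # Host names, keys, and secrets would go here; omitted on purpose.
            "typeProperties": {},
            "connectVia": {
                "referenceName": ir_name,
                "type": "IntegrationRuntimeReference",
            },
        },
    }


def runtime_for(service):
    """Return the name of the integration runtime a linked service uses."""
    return service["properties"]["connectVia"]["referenceName"]


# Assumed sources and sink for the scenario described above.
services = [
    linked_service("OnPremFileShare", "FileServer", on_premises=True),
    linked_service("OtherCloudObjectStore", "AmazonS3", on_premises=False),
    linked_service("DataLake", "AzureBlobFS", on_premises=False),
]

if __name__ == "__main__":
    for service in services:
        print(service["name"], "->", runtime_for(service))
```

Only the on-premises source goes through the self-hosted runtime, which is how
the existing private connectivity can stay unchanged while the cloud-hosted
source and the data lake use the Azure runtime.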