A previous post introduced some of the best practices for using the Azure Data Platform. It covered
various options for structured and unstructured storage. This article covers
some of the considerations regarding data in transit.
Azure Data Factory is frequently used to
extract, transform, and load data into the cloud. Even when there are on-premises SSIS
data tasks to perform, Azure Data Factory can help migrate the data from
on-premises to the cloud. Several components within Azure Data
Factory help meet this goal. Linked services provide connections to
the external resources that contain the datasets to work with. A pipeline holds one or
more activities that can be triggered to control or transform the data. The
integration runtime provides the compute environment in which data flow,
transformation, and movement execute; it can be Azure-hosted,
self-hosted, or Azure-SSIS, the last of which can lift and shift existing
SSIS workloads.

The pipeline and its activities define the actions to perform on the
data. For example, a pipeline could contain a set of activities that ingest and
clean log data and then kick off a mapping data flow to analyze that data.
When the pipeline is deployed and scheduled, all of its activities can be managed
as a set instead of individually. Activities can be grouped into data
movement activities, data transformation activities, and control activities,
and each activity can take zero or more input and output datasets. Azure Data
Factory thus enables us to author workflows that orchestrate complex ETL, ELT, and
data integration tasks in a flexible way, through both graphical and code-based
data pipelines, with Continuous Integration / Continuous Deployment support.
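
As a minimal sketch of how these pieces fit together, the snippet below uses the azure-mgmt-datafactory Python SDK to register a linked service, two datasets, and a pipeline with a single copy activity, and then triggers an on-demand run. The subscription, resource group, factory, storage account, dataset, and pipeline names are placeholders, not part of any real deployment.

```python
# pip install azure-identity azure-mgmt-datafactory
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    AzureBlobDataset, AzureStorageLinkedService, BlobSink, BlobSource,
    CopyActivity, DatasetReference, DatasetResource, LinkedServiceReference,
    LinkedServiceResource, PipelineResource, SecureString,
)

# Placeholder names -- substitute your own subscription, resource group,
# data factory, and storage account details.
subscription_id = "<subscription-id>"
rg_name = "rg-dataplatform"
df_name = "adf-dataplatform"

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# Linked service: the connection to the external storage resource.
storage_ls = LinkedServiceResource(
    properties=AzureStorageLinkedService(
        connection_string=SecureString(
            value="DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>"
        )
    )
)
adf_client.linked_services.create_or_update(rg_name, df_name, "StorageLinkedService", storage_ls)
ls_ref = LinkedServiceReference(type="LinkedServiceReference", reference_name="StorageLinkedService")

# Datasets: the input and output locations the copy activity works with.
ds_in = DatasetResource(properties=AzureBlobDataset(
    linked_service_name=ls_ref, folder_path="logs/raw", file_name="events.csv"))
adf_client.datasets.create_or_update(rg_name, df_name, "InputLogs", ds_in)

ds_out = DatasetResource(properties=AzureBlobDataset(
    linked_service_name=ls_ref, folder_path="logs/staged"))
adf_client.datasets.create_or_update(rg_name, df_name, "StagedLogs", ds_out)

# Pipeline: a single copy (data movement) activity from input to output.
copy_activity = CopyActivity(
    name="CopyLogs",
    inputs=[DatasetReference(type="DatasetReference", reference_name="InputLogs")],
    outputs=[DatasetReference(type="DatasetReference", reference_name="StagedLogs")],
    source=BlobSource(),
    sink=BlobSink(),
)
adf_client.pipelines.create_or_update(
    rg_name, df_name, "IngestLogsPipeline", PipelineResource(activities=[copy_activity]))

# Trigger an on-demand run; in practice a schedule or event trigger would do this.
run = adf_client.pipelines.create_run(rg_name, df_name, "IngestLogsPipeline", parameters={})
print("Pipeline run started:", run.run_id)
```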
Let us take the scenario of a company that has a variety of
data from a variety of sources and requires automation for data ingestion and
analysis in the cloud. Defining an analytics solution for it calls for expertise
in business analysis, data engineering, and data science, and for this purpose
the company leverages the Azure data analytics platform that has been discussed
so far. The company's current practice captures a variety of data about
manufacturing and marketing and stores it in a centralized repository. The size
of this repository limits the capture to about one week's worth of data, in
JSON, CSV, and text formats. Additional data also exists in another cloud, in
publicly available object storage. The company expects to meet the following
objectives with regard to data storage, data movement, and data analytics and
insights. The data storage must hold months of data, to the tune of petabytes,
and support access control at the file level. Data must be regularly ingested
from both the on-premises repository and AWS, and the existing connectivity and
accessibility to that data cannot be changed. The analytics platform must
support Spark and be available to the other cloud. Finally, security demands
that the workspace used for analytics be made available only to the head office.
A possible solution for the above scenario stores the data in
Azure Data Lake Storage Gen2, because it scales to petabytes of data and comes
with a hierarchical namespace and POSIX-like access control lists.
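
As an illustration of that file-level access control, here is a small sketch using the azure-storage-file-datalake Python SDK; the storage account, filesystem, paths, and Azure AD object IDs below are placeholders.

```python
# pip install azure-identity azure-storage-file-datalake
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Placeholder storage account, filesystem, paths, and Azure AD object IDs.
service = DataLakeServiceClient(
    account_url="https://<storage-account>.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)
filesystem = service.get_file_system_client("manufacturing")

# Directory-level ACL: grant an Azure AD group read and execute on one folder.
directory = filesystem.get_directory_client("logs/2024")
directory.set_access_control(
    acl="user::rwx,group::r-x,other::---,group:<aad-group-object-id>:r-x"
)

# File-level ACL: the hierarchical namespace allows POSIX-like entries per file too.
file_client = filesystem.get_file_client("logs/2024/plant-a.csv")
file_client.set_access_control(
    acl="user::rw-,group::r--,other::---,user:<aad-user-object-id>:r--"
)

print(directory.get_access_control()["acl"])
```

Because the namespace is hierarchical, the directory-level entry determines whether the group can traverse into the folder at all, while the file-level entry narrows access to a single object.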
The data can be copied or moved with Azure Data Factory using a
self-hosted integration runtime that runs on-premises and can access the
on-premises storage privately. Even when certain data sits in the other cloud,
Azure Data Factory can leverage its built-in Azure integration runtime to
access it.
There are many services to choose from for the analytics layer, but Databricks
is a good fit here because it is available in both public clouds. The Databricks
premium tier can restrict workspace access so that it is available only to the
head office, and it also supports Azure AD credential passthrough when securing
access to the data lake storage.
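
As a brief, hypothetical usage sketch, a notebook on a premium-tier workspace with Azure AD credential passthrough enabled could then read the ingested data directly from the lake; the container, account, path, and column name below are made up for illustration.

```python
# Runs inside an Azure Databricks notebook, where `spark` is predefined.
# The cluster is assumed to have Azure AD credential passthrough enabled,
# so the ADLS Gen2 ACLs set earlier govern what this user can read.
logs = (
    spark.read
    .option("header", "true")
    .csv("abfss://manufacturing@<storage-account>.dfs.core.windows.net/logs/2024/")
)

# "plant" is a hypothetical column in the ingested log data.
logs.groupBy("plant").count().show()
```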