This is a continuation of a series of articles on the operational engineering aspects of the Azure public cloud, the most recent of which discussed Azure SQL Edge, a generally available service with Service Level Agreements comparable to others in its category. This article focuses on Azure Data Lake, which is suited to storing and handling Big Data. It is built over Azure Blob Storage, so it provides native support for web-accessible documents. It is not a massive virtual data warehouse, but it powers a great deal of analytics and is the centerpiece of most solutions that conform to the Big Data architectural style.
The Gen 1 Data Lake was not integrated with Blob Storage, but Gen 2 is. Gen 2 adds support for file-system semantics and file-level security. Since these features are provided by Blob Storage, they come with the best practices of storage engineering, including replication groups, high availability, tiered data storage and storage classes, and aging and retention policies.
Gen 2 is the current standard for building enterprise data lakes on Azure. A data lake must store petabytes of data while handling data transfer at rates of gigabytes per second. The hierarchical namespace of the object storage helps organize objects and files into a deep hierarchy of folders for efficient data access. The naming convention recognizes these folder paths by including the folder separator character in the name itself. With this organization and direct folder-level access to the object store, the overall performance of the data lake improves.
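To make the folder organization concrete, the following is a minimal sketch in Python using the azure-storage-file-datalake SDK; the account, container, and folder names are placeholders rather than part of any particular deployment.

# A minimal sketch of creating a folder hierarchy in an ADLS Gen2 account;
# "<storage-account>", the "raw" file system and the folder path are placeholders.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# The account URL uses the "dfs" endpoint exposed by the hierarchical namespace.
service = DataLakeServiceClient(
    account_url="https://<storage-account>.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)

# File systems map to Blob containers; directories are real objects, not just
# name prefixes, so renames and ACLs apply to a whole subtree.
file_system = service.get_file_system_client("raw")
directory = file_system.create_directory("sales/2021/05/04")

# Upload a file under the newly created folder path.
file_client = directory.create_file("transactions.csv")
file_client.upload_data(b"id,amount\n1,42.50\n", overwrite=True)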
Both the object store containers and the containers exposed by Data Lake are transparently available to applications and services. Blob Storage features such as diagnostic logging, access tiers, and lifecycle management policies are available to the account. The integration with Blob Storage is only one aspect of Azure Data Lake's integration.
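Because the account is backed by Blob Storage, blob-level features can be applied to the same data. The following is a small sketch, assuming the azure-storage-blob SDK and placeholder account and blob names, that moves a blob to a cooler access tier; in practice, a lifecycle management policy would automate this based on age.

# A small sketch of applying a Blob Storage feature (access tiers) to data in
# the same account; the account, container and blob names are placeholders.
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

blob_service = BlobServiceClient(
    account_url="https://<storage-account>.blob.core.windows.net",
    credential=DefaultAzureCredential(),
)

# The same container created through the Data Lake API is visible here.
blob = blob_service.get_blob_client(
    container="raw", blob="sales/2021/05/04/transactions.csv"
)

# Move colder data to the Cool tier.
blob.set_standard_blob_tier("Cool")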
Many other services are also integrated with Azure Data Lake to support data ingestion, data analytics, and reporting with visual representations. Data management and analytics form the core scenarios supported by Data Lake. Fine-grained access control lists and Active Directory integration round out the data security considerations.
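As a sketch of that fine-grained access control, the snippet below sets a POSIX-style ACL on a directory for a specific Azure AD principal using the azure-storage-file-datalake SDK; the object ID, account, and paths are placeholders.

# A sketch of setting a POSIX-style ACL on a folder; "<aad-object-id>" and the
# account and folder names are placeholders.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://<storage-account>.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)
directory = service.get_file_system_client("raw").get_directory_client("sales")

# Grant read/execute on the folder to a specific Azure AD principal while
# keeping the owner's permissions intact.
directory.set_access_control(
    acl="user::rwx,group::r-x,other::---,user:<aad-object-id>:r-x,mask::r-x"
)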
Even if the data lake comprises only a few data asset types, some planning is required to avoid the dreaded data swamp.
Governance and organization are key to avoiding this situation. When there are many data systems of significant size, a robust data catalog system is required. Since Data Lake is a PaaS service, it can support multiple accounts with no additional overhead. A minimum of three lakes is recommended during the discovery and design phase for the following reasons:
1. Isolation of data environments and predictability
2. Features and functionality at the storage account level, or regional versus global data lakes
3. The use of a data catalog, data governance, and project tracking tools
For multi-region deployments, it is recommended to land the data in one region and then replicate it globally using AzCopy, Azure Data Factory, or third-party products that assist with migrating data from one place to another.
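As one illustration of such replication, the sketch below triggers a server-side copy between two storage accounts with the azure-storage-blob SDK; the account names, paths, and SAS token are placeholders, and AzCopy or Azure Data Factory remain the better fit for bulk movement.

# One illustrative option: a server-side copy of a blob from the landing
# region's account to a secondary region's account. All names and the SAS
# token below are placeholders.
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

destination = BlobServiceClient(
    account_url="https://<secondary-account>.blob.core.windows.net",
    credential=DefaultAzureCredential(),
)
dest_blob = destination.get_blob_client(
    container="raw", blob="sales/2021/05/04/transactions.csv"
)

# The source URL must be readable by the destination service, e.g. via a SAS token.
source_url = (
    "https://<primary-account>.blob.core.windows.net/"
    "raw/sales/2021/05/04/transactions.csv?<sas-token>"
)
dest_blob.start_copy_from_url(source_url)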
The best practices for Azure Data Lake involve evaluating feature support and known issues, optimizing for data ingestion, considering data structures, performing ingestion, processing, and analysis from several data sources, and leveraging monitoring telemetry.
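As a small sketch of that last point, the snippet below queries ingress and egress metrics for the storage account with the azure-monitor-query SDK; the subscription, resource group, and account names in the resource ID are placeholders.

# A minimal sketch of pulling ingress/egress telemetry for the account;
# the resource ID components are placeholders.
from datetime import timedelta
from azure.identity import DefaultAzureCredential
from azure.monitor.query import MetricsQueryClient, MetricAggregationType

client = MetricsQueryClient(DefaultAzureCredential())
resource_id = (
    "/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
    "/providers/Microsoft.Storage/storageAccounts/<storage-account>"
)

response = client.query_resource(
    resource_id,
    metric_names=["Ingress", "Egress"],
    timespan=timedelta(days=1),
    granularity=timedelta(hours=1),
    aggregations=[MetricAggregationType.TOTAL],
)

# Print hourly totals for each metric.
for metric in response.metrics:
    for series in metric.timeseries:
        for point in series.data:
            print(metric.name, point.timestamp, point.total)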