This is a continuation of a series of articles on the operational engineering aspects of the Azure public cloud, the most recent of which discussed Azure SQL Edge, a generally available service with Service Level Agreements comparable to others in its category. This article focuses on Azure Data Lake, which is suited to storing and handling Big Data. It is built over Azure Blob Storage, so it provides native support for web-accessible documents. It is not a massive virtual data warehouse, but it powers a great deal of analytics and is the centerpiece of most solutions that conform to the Big Data architectural style.
The Gen 1 Data Lake was not integrated with Blob Storage, but Gen 2 is. Gen 2 adds support for file-system semantics and file-level security. Since these features are provided by Blob Storage, they come with the best practices of storage engineering, including replication groups, high availability, tiered data storage and storage classes, and aging and retention policies.
Gen 2 is the current standard for building enterprise data lakes on Azure. A data lake must store petabytes of data while handling data transfer at rates of gigabytes per second. The hierarchical namespace of the object storage helps organize objects and files into a deep hierarchy of folders for efficient data access. The naming convention recognizes these folder paths by including the folder separator character in the name itself. With this organization and direct folder-level access to the object store, the overall performance of the data lake improves.
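To make the folder organization concrete, the following is a minimal sketch in Python using the azure-storage-file-datalake SDK; the account, container, and folder names are placeholders rather than part of any particular deployment.

# A minimal sketch of creating a folder hierarchy in an ADLS Gen2 account;
# "<storage-account>", the "raw" file system and the folder path are placeholders.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# The account URL uses the "dfs" endpoint exposed by the hierarchical namespace.
service = DataLakeServiceClient(
    account_url="https://<storage-account>.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)

# File systems map to Blob containers; directories are real objects, not just
# name prefixes, so renames and ACLs apply to a whole subtree.
file_system = service.get_file_system_client("raw")
directory = file_system.create_directory("sales/2021/05/04")

# Upload a file under the newly created folder path.
file_client = directory.create_file("transactions.csv")
file_client.upload_data(b"id,amount\n1,42.50\n", overwrite=True)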
Both the object store containers and the containers exposed by Data Lake are transparently available to applications and services. Blob Storage features such as diagnostic logging, access tiers, and lifecycle management policies are available to the account. The integration with Blob Storage is only one aspect of Azure Data Lake's integration.
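Because the account is backed by Blob Storage, blob-level features can be applied to the same data. The following is a small sketch, assuming the azure-storage-blob SDK and placeholder account and blob names, that moves a blob to a cooler access tier; in practice, a lifecycle management policy would automate this based on age.

# A small sketch of applying a Blob Storage feature (access tiers) to data in
# the same account; the account, container and blob names are placeholders.
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

blob_service = BlobServiceClient(
    account_url="https://<storage-account>.blob.core.windows.net",
    credential=DefaultAzureCredential(),
)

# The same container created through the Data Lake API is visible here.
blob = blob_service.get_blob_client(
    container="raw", blob="sales/2021/05/04/transactions.csv"
)

# Move colder data to the Cool tier.
blob.set_standard_blob_tier("Cool")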
Many other services are also integrated with Azure Data Lake to support data ingestion, data analytics, and reporting with visual representations. Data management and analytics form the core scenarios supported by Data Lake. Fine-grained access control lists and Active Directory integration round out the data security considerations.
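As a sketch of that fine-grained access control, the snippet below sets a POSIX-style ACL on a directory for a specific Azure AD principal using the azure-storage-file-datalake SDK; the object ID, account, and paths are placeholders.

# A sketch of setting a POSIX-style ACL on a folder; "<aad-object-id>" and the
# account and folder names are placeholders.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://<storage-account>.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)
directory = service.get_file_system_client("raw").get_directory_client("sales")

# Grant read/execute on the folder to a specific Azure AD principal while
# keeping the owner's permissions intact.
directory.set_access_control(
    acl="user::rwx,group::r-x,other::---,user:<aad-object-id>:r-x,mask::r-x"
)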
Even if the data lake comprises only a few data asset types, some planning is required to avoid the dreaded data swamp.
Governance and organization are key to avoiding this situation. When there are many data systems of significant size, a robust data catalog system is required. Since Data Lake is a PaaS service, it can support multiple accounts with no additional overhead. A minimum of three lakes is recommended during the discovery and design phase for the following reasons:
1. Isolation of data environments and predictability
2. Features and functionality at the storage account level, or regional versus global data lakes
3. The use of a data catalog, data governance, and project tracking tools
For multi-region deployments, it is recommended to land the data in one region and then replicate it globally using AzCopy, Azure Data Factory, or third-party products that assist with migrating data from one place to another.
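As one illustration of such replication, the sketch below triggers a server-side copy between two storage accounts with the azure-storage-blob SDK; the account names, paths, and SAS token are placeholders, and AzCopy or Azure Data Factory remain the better fit for bulk movement.

# One illustrative option: a server-side copy of a blob from the landing
# region's account to a secondary region's account. All names and the SAS
# token below are placeholders.
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

destination = BlobServiceClient(
    account_url="https://<secondary-account>.blob.core.windows.net",
    credential=DefaultAzureCredential(),
)
dest_blob = destination.get_blob_client(
    container="raw", blob="sales/2021/05/04/transactions.csv"
)

# The source URL must be readable by the destination service, e.g. via a SAS token.
source_url = (
    "https://<primary-account>.blob.core.windows.net/"
    "raw/sales/2021/05/04/transactions.csv?<sas-token>"
)
dest_blob.start_copy_from_url(source_url)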
The best practices for Azure Data Lake involve evaluating feature support and known issues, optimizing for data ingestion, considering data structures, performing ingestion, processing, and analysis from several data sources, and leveraging monitoring telemetry.
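As a small sketch of that last point, the snippet below queries ingress and egress metrics for the storage account with the azure-monitor-query SDK; the subscription, resource group, and account names in the resource ID are placeholders.

# A minimal sketch of pulling ingress/egress telemetry for the account;
# the resource ID components are placeholders.
from datetime import timedelta
from azure.identity import DefaultAzureCredential
from azure.monitor.query import MetricsQueryClient, MetricAggregationType

client = MetricsQueryClient(DefaultAzureCredential())
resource_id = (
    "/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
    "/providers/Microsoft.Storage/storageAccounts/<storage-account>"
)

response = client.query_resource(
    resource_id,
    metric_names=["Ingress", "Egress"],
    timespan=timedelta(days=1),
    granularity=timedelta(hours=1),
    aggregations=[MetricAggregationType.TOTAL],
)

# Print hourly totals for each metric.
for metric in response.metrics:
    for series in metric.timeseries:
        for point in series.data:
            print(metric.name, point.timestamp, point.total)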