This is a continuation of a series of articles on the operational engineering aspects of the Azure public cloud, including the most recent discussion on Azure Data Lake, a full-fledged, generally available service that provides Service Level Agreements comparable to others in its category. This article focuses on Azure Data Lake, which is suited to storing and handling Big Data. It is built over Azure Blob Storage, so it provides native support for web-accessible documents. It is not a massive virtual data warehouse, but it powers a great deal of analytics and is the centerpiece of most solutions that conform to the Big Data architectural style.
Gen 2 is the current standard for building Enterprise Data Lakes on Azure. A data lake must store petabytes of data while handling bandwidths of gigabytes of data transfer per second. The hierarchical namespace of the object storage helps organize objects and files into a deep hierarchy of folders for efficient data access. The naming convention recognizes these folder paths by including the folder separator character in the name itself. With this organization, and with folder access directly against the object store, the overall performance of data lake usage improves.
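As a concrete illustration of this folder layout, the sketch below creates a nested directory path and uploads a file into it with the Python SDK. It is a minimal sketch, assuming the azure-storage-file-datalake and azure-identity packages; the account, container, folder, and file names are placeholders.

# Minimal sketch: organizing data under a hierarchical namespace with the
# Python SDK (azure-storage-file-datalake). Names below are placeholders.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://<storage-account>.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)

# A container in the data lake acts as a file system.
fs = service.get_file_system_client("raw")

# Folders are first-class objects; the separator character in the name
# creates the hierarchy.
directory = fs.create_directory("sales/region=emea/year=2025")

# Upload a file into the folder.
file_client = directory.create_file("orders.csv")
with open("orders.csv", "rb") as data:
    file_client.upload_data(data, overwrite=True)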
The Azure Blob File System (ABFS) driver for Hadoop is a thin shim over the Azure Data Lake Storage interface that supports file system semantics over blob storage. Fine-grained access control lists and Active Directory integration round out the data security considerations. Data management and analytics form the core scenarios supported by Data Lake.
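To make the access control point concrete, here is a minimal sketch of setting a POSIX-style ACL on a folder with the same Python SDK; the account name, folder name, and the Azure AD object ID are placeholders.

# Minimal sketch: granting an Azure AD principal read/execute access on a
# folder via a POSIX-style ACL. All identifiers below are placeholders.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://<storage-account>.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)
directory = service.get_file_system_client("raw").get_directory_client(
    "sales/region=emea"
)

# Owner, group, other, plus one named Azure AD principal (object ID placeholder).
directory.set_access_control(
    acl="user::rwx,group::r-x,other::---,"
        "user:00000000-0000-0000-0000-000000000000:r-x"
)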
For multi-region deployments, it is recommended to land the data in one region and then replicate it globally using AzCopy, Azure Data Factory, or third-party products that assist with migrating data from one place to another.
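AzCopy or Data Factory would normally drive this replication at scale; purely as an illustration of the same server-side copy pattern, here is a minimal Python sketch that asks a destination account to copy a blob from a source account URL. The account names, container, blob path, and SAS token are placeholders.

# Minimal sketch: server-side copy of a single blob from a source region to a
# destination region using azure-storage-blob. Placeholders throughout.
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobClient

destination = BlobClient(
    account_url="https://<destination-account>.blob.core.windows.net",
    container_name="raw",
    blob_name="sales/orders.csv",
    credential=DefaultAzureCredential(),
)

# The source must be readable by the destination service, e.g. via a SAS URL.
source_url = (
    "https://<source-account>.blob.core.windows.net/raw/sales/orders.csv"
    "?<sas-token>"
)
destination.start_copy_from_url(source_url)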
The best practices for Azure Data Lake involve evaluating feature support and known issues, optimizing for data ingestion, considering data structures, performing ingestion, processing, and analysis from several data sources, and leveraging monitoring telemetry.
Azure Data Lake also supports query acceleration for analytics frameworks. It significantly improves data processing by retrieving only the data that is relevant to an operation, which cascades into reduced time and processing power for the end-to-end scenarios needed to gain critical insights into stored data. Both filtering predicates and column projections are supported, and SQL can be used to describe them; only the data that meets these conditions is transmitted. A request processes only one file, so joins, aggregates, and other multi-file query operators are not supported, but the file can be in a format such as CSV or JSON. The query acceleration feature is not limited to Data Lake Storage; it is also supported on blobs in the storage accounts that form the persistence layer below the containers of the data lake, even when those accounts do not have a hierarchical namespace. Because query acceleration is part of the data lake rather than of any one application, applications can be swapped for one another and the data selectivity and improved latency carry over across the switch. Since the processing happens on the Data Lake side, the pricing model for query acceleration differs from the normal transactional model.
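As a rough sketch of how an application uses query acceleration, the snippet below pushes a filtering predicate and a column projection down to the storage service with the Python SDK's query API. The account, container, file path, and column names are placeholders, and the CSV is assumed to have a header row.

# Minimal sketch: query acceleration from Python. The SQL expression carries a
# filtering predicate and a column projection; the storage service evaluates it
# and only the matching rows and columns are transmitted back.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeFileClient, DelimitedTextDialect

file_client = DataLakeFileClient(
    account_url="https://<storage-account>.dfs.core.windows.net",
    file_system_name="raw",
    file_path="sales/region=emea/year=2025/orders.csv",
    credential=DefaultAzureCredential(),
)

# Describe the input as CSV with a header so columns can be named in the query.
input_format = DelimitedTextDialect(delimiter=",", quotechar='"', has_header=True)

reader = file_client.query_file(
    "SELECT OrderId, Amount FROM BlobStorage WHERE Amount > 100",
    file_format=input_format,
)
print(reader.readall().decode("utf-8"))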
Gen2 also supports premium block blob storage accounts, which are ideal for big data analytics applications and workloads that require low latency and a high number of transactions, such as interactive workloads, IoT, streaming analytics, and artificial intelligence and machine learning.