This is a continuation of a series of articles on the operational engineering aspects of the Azure public cloud, which included a recent discussion on Azure Data Lake, a full-fledged, generally available service with Service Level Agreements comparable to others in its category. This article focuses on Azure Data Lake, which is suited to storing and handling Big Data. It is built over Azure Blob Storage, so it provides native support for web-accessible documents. It is not a massive virtual data warehouse, but it powers a great deal of analytics and is the centerpiece of most solutions that conform to the Big Data architectural style.
Azure Data Lake supports query acceleration and an analytics framework. It significantly improves data processing by retrieving only the data that is relevant to an operation, which cascades into reduced time and processing power for the end-to-end scenarios needed to gain critical insights into stored data. Both filtering predicates and column projections are enabled, and SQL can be used to describe them; only the data that meets these conditions is transmitted. A request processes only one file, so joins, aggregates, and other multi-file query operators are not supported, but the data can be in formats such as CSV or JSON. The query acceleration feature is not limited to Data Lake Storage: it is also supported on blobs in the storage accounts that form the persistence layer below the containers of the data lake, even those without a hierarchical namespace. Because query acceleration is part of the data lake, applications can be swapped for one another, and the data selectivity and improved latency carry over across the switch. Since the processing happens on the Data Lake side, the pricing model for query acceleration differs from the normal transactional model.
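As a rough illustration of filtering predicates and column projections, the sketch below uses the Azure Storage Python SDK (azure-storage-blob) and its BlobClient.query_blob method. The connection string, container, blob, and column names are placeholder assumptions, not values from any real account.

# A minimal sketch of query acceleration with azure-storage-blob.
# All names below are placeholders.
from azure.storage.blob import BlobClient, DelimitedTextDialect

blob = BlobClient.from_connection_string(
    conn_str="<connection-string>",        # placeholder
    container_name="datalake-container",   # placeholder
    blob_name="sales/records.csv",         # placeholder
)

# Describe the CSV layout of the stored blob and the desired output.
input_format = DelimitedTextDialect(
    delimiter=",", quotechar='"', lineterminator="\n", has_header=True
)
output_format = DelimitedTextDialect(
    delimiter=",", quotechar='"', lineterminator="\n"
)

# A filtering predicate and a column projection, both expressed in SQL.
# Only the rows and columns that satisfy the query are transmitted back.
reader = blob.query_blob(
    "SELECT Region, Amount FROM BlobStorage WHERE Amount > 1000",
    blob_format=input_format,
    output_format=output_format,
)
print(reader.readall().decode("utf-8"))

Note how the request targets a single CSV blob: the SQL can select and filter within that one file, but it cannot join it to another.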
There are three different client tools for working with Azure Data Lake Storage:
1. The Azure Portal, which provides the convenience of a web user interface and explores all forms of blobs, tables, queues, and files.
2. Azure Storage Explorer, which can be downloaded and is just as useful for exploration as the Azure Portal.
3. The Microsoft Visual Studio Cloud Explorer, which supports exploring blobs, tables, and queues, but not files.
A number of third-party tools are also available for working with Azure Storage data, and the same exploration can be done programmatically, as sketched below.
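As a hedged sketch of that programmatic route, the azure-storage-file-datalake package can enumerate the paths in a Gen2 file system; the connection string and file system name here are placeholders.

# A minimal sketch of exploring a Data Lake Storage Gen2 file system
# with azure-storage-file-datalake. Names are placeholders.
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient.from_connection_string("<connection-string>")
file_system = service.get_file_system_client("my-file-system")  # placeholder

# Enumerate every path (directory or file) under the root, recursively.
for path in file_system.get_paths(recursive=True):
    kind = "dir " if path.is_directory else "file"
    print(kind, path.name)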
The known issues with using Data Lake Storage Gen2 include the following:
1. The Gen2 APIs, NFS 3.0, and the Blob Storage APIs can all operate on the same data, but they cannot write to the same instance of a file. A file written through the Gen2 or NFS 3.0 APIs won't be visible to the Get Block List Blob API unless the file is being overwritten with a zero-truncate option.
2. The Put Blob (Page), Put Page, Get Page Ranges, Incremental Copy Blob, and Put Page From URL APIs are not supported. Storage accounts that have a hierarchical namespace also do not permit unmanaged VM disks to be added to the account.
3. Access control lists (ACLs) are widely used with storage entities, by virtue of assignment or inheritance, and both operations allow access controls to be set in bulk via recursion. ACLs can be set via Azure Storage Explorer, PowerShell, the Azure CLI, and the .NET, Java, and Python SDKs, but not via the Azure Portal, which manages resources rather than data containers (see the Python sketch after this list).
4. If anonymous read access has been granted to a data container, then ACLs will have no effect on read requests, but they can still be applied to write requests.
5. Only the latest versions of AzCopy (v10 onwards) and Azure Storage Explorer (v1.6.0 or higher) are supported.
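Following up on item 3, here is a minimal sketch of recursive ACL assignment with the Python SDK (azure-storage-file-datalake); the file system, directory path, and ACL string are illustrative assumptions.

# A minimal sketch of setting ACLs in bulk via recursion with the
# Python SDK. Directory and ACL values are illustrative placeholders.
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient.from_connection_string("<connection-string>")
directory = service.get_file_system_client("my-file-system") \
                   .get_directory_client("raw/2023")          # placeholder

# POSIX-style ACL: owner rwx, group r-x, no access for others.
acl = "user::rwx,group::r-x,other::---"

# Applies the ACL to the directory and everything beneath it in one call.
result = directory.set_access_control_recursive(acl=acl)
print(result.counters.files_successful, "files updated")

This covers the bulk-assignment case; inheritance, by contrast, is arranged by setting default ACL entries on a parent directory so that new children pick them up.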
Third-party applications are best advised to use the REST APIs or SDKs, since those will continue to be supported. Deletion of logs from a storage account can be performed with any of the above tools, the REST APIs, or an SDK, but setting the number of retention days is not supported. The Windows Azure Storage Blob (WASB) driver works only with the Blob APIs, and enabling multiprotocol access on Data Lake Storage won't mitigate the issue of the WASB driver not working with some common Data Lake cases. If the parent folder of soft-deleted files or folders is renamed, they won't display correctly in the portal, but PowerShell and the CLI can be used to restore them, and the SDK offers an equivalent route, as sketched below. An account with an event subscription will not be able to read from secondary storage endpoints, but this can be mitigated by removing the event subscription.
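As a hedged sketch of that restore path via the SDK rather than PowerShell or the CLI, the snippet below assumes a recent version of azure-storage-file-datalake that exposes list_deleted_paths and undelete_path; all names are placeholders.

# A minimal sketch of listing and restoring soft-deleted paths with the
# Python SDK. Assumes soft delete is enabled on the account and a recent
# SDK version that provides these calls; names are placeholders.
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient.from_connection_string("<connection-string>")
file_system = service.get_file_system_client("my-file-system")  # placeholder

# Each soft-deleted path carries a deletion id that the restore call needs.
for deleted in file_system.list_deleted_paths():
    file_system.undelete_path(deleted.name, deleted.deletion_id)
    print("restored", deleted.name)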