Monday, January 3, 2022

This is a continuation of a series of articles on the operational engineering aspects of Azure public cloud computing, the most recent of which discussed Azure Data Lake, a full-fledged, generally available service that provides Service Level Agreements comparable to others in its category. This article continues the focus on Azure Data Lake, which is suited to storing and handling Big Data. Because it is built over Azure Blob Storage, it provides native support for web-accessible documents. It is not a massive virtual data warehouse, but it powers a lot of analytics and is the centerpiece of most solutions that conform to the Big Data architectural style.

Azure Data Lake supports query acceleration and an analytics framework. It significantly improves data processing by retrieving only the data that is relevant to an operation. This cascades into reduced time and processing power for the end-to-end scenarios that are necessary to gain critical insights into stored data. Both ‘filtering predicates’ and ‘column projections’ are enabled, and SQL can be used to describe them. Only the data that meets these conditions is transmitted. A request processes only one file, so joins, aggregates and other query operators are not supported, but the file can be in a format such as CSV or JSON. The query acceleration feature isn’t limited to Data Lake Storage; it is supported even on blobs in the storage accounts that form the persistence layer below the containers of the data lake, including accounts without a hierarchical namespace. Because query acceleration is part of the data lake, applications can be switched with one another, and the data selectivity and improved latency carry across the switch. Since the processing is on the side of the Data Lake, the pricing model for query acceleration differs from that of the normal transactional model.
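As a quick illustration, here is a minimal sketch of query acceleration against a CSV blob using the azure-storage-blob Python SDK; the connection string, container and blob names, column names and threshold are all hypothetical placeholders:

# Minimal sketch of query acceleration, assuming the azure-storage-blob
# Python SDK (v12.4 or later) and a hypothetical account, container and blob.
from azure.storage.blob import BlobClient, DelimitedTextDialect

blob = BlobClient.from_connection_string(
    conn_str="<connection-string>",            # assumption: connection string auth
    container_name="telemetry",                # hypothetical container
    blob_name="readings/2022/01/data.csv",     # hypothetical CSV blob
)

# Describe how the delimited input should be parsed and how results come back.
input_format = DelimitedTextDialect(delimiter=",", quotechar='"',
                                    lineterminator="\n", has_header=True)
output_format = DelimitedTextDialect(delimiter=",", quotechar='"',
                                     lineterminator="\n", has_header=False)

# Filtering predicate and column projection expressed in SQL; the service
# evaluates them server-side and streams back only the matching subset.
reader = blob.query_blob(
    "SELECT DeviceId, Temperature FROM BlobStorage WHERE Temperature > 90",
    blob_format=input_format,
    output_format=output_format,
)
print(reader.readall().decode("utf-8"))

Because the filtering happens on the service side, only the bytes that satisfy the predicate and projection cross the wire, which is where the latency and cost savings come from.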

There are three different client tools for working with Azure Data Lake Storage:

1. The Azure Portal, which provides the convenience of a web user interface and can explore all forms of blobs, tables, queues and files.

2. Azure Storage Explorer, which can be downloaded and is just as useful for exploration as the Azure Portal.

3. The Microsoft Visual Studio Cloud Explorer, which supports exploring blobs, tables and queues but not files.

There are also a number of third-party tools available for working with Azure Storage data.

The known issues with using Gen2 data lake storage include the following:

1. The similarities and contrasts between the Data Lake Storage Gen2 APIs, NFS 3.0 and the Blob APIs: all of them can operate on the same data, but they cannot write to the same instance of a file. The Gen2 or NFS 3.0 APIs can write to a file, but that file won’t be visible to the Get Block List Blob API unless the file is being overwritten with a zero-truncate option (see the Get Block List sketch after this list).

2. The Put Blob (Page), Put Page, Get Page Ranges, Incremental Copy Blob and Put Page From URL APIs are not supported. Storage accounts that have a hierarchical namespace do not permit unmanaged VM disks to be added to the account.

3. Access Control Lists (ACLs) are widely used with storage entities by virtue of assignment or inheritance, and both operations allow access controls to be set in bulk via recursion. ACLs can be set via Azure Storage Explorer, PowerShell, the Azure CLI, and the .NET, Java and Python SDKs, but not via the Azure Portal, which manages resources rather than data containers (see the ACL sketch after this list).

4. If anonymous read access has been granted to a data container, then ACLs will have no effect on read requests, but they can still be applied to write requests.

5. Only the latest versions of AzCopy (v10 onwards) and Azure Storage Explorer (v1.6.0 or higher) are supported.
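To make the first issue concrete, here is a minimal sketch, assuming the azure-storage-file-datalake and azure-storage-blob Python SDKs with a hypothetical file system and path, of writing a file through the Gen2 API and then asking the Blob API for its block list:

# Minimal sketch of the Get Block List limitation from issue 1; the connection
# string, file system ("raw") and path are hypothetical placeholders.
from azure.storage.filedatalake import DataLakeServiceClient
from azure.storage.blob import BlobClient

conn_str = "<connection-string>"

# Write a file through the Data Lake Storage Gen2 API.
dlake = DataLakeServiceClient.from_connection_string(conn_str)
file_client = dlake.get_file_system_client("raw").get_file_client("events/day1.json")
file_client.upload_data(b'{"event": "sample"}', overwrite=True)

# Ask for the block list of the same path through the Blob API.
blob = BlobClient.from_connection_string(conn_str,
                                         container_name="raw",
                                         blob_name="events/day1.json")
committed, uncommitted = blob.get_block_list("committed")
print(len(committed))  # data written via the Gen2 API is not reported as blocks here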
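And for the third issue, a minimal sketch of setting and recursively updating an ACL with the azure-storage-file-datalake Python SDK; the file system, directory and object ID are hypothetical placeholders:

# Minimal sketch of setting ACLs through the Python SDK; the entries grant a
# hypothetical security principal read and execute access.
from azure.storage.filedatalake import DataLakeServiceClient

dlake = DataLakeServiceClient.from_connection_string("<connection-string>")
directory = dlake.get_file_system_client("raw").get_directory_client("events")

acl = ("user::rwx,group::r-x,other::---,"
       "user:22aa22aa-bbbb-cccc-dddd-1111aaaa1111:r-x")

# Apply the entries to the directory itself ...
directory.set_access_control(acl=acl)

# ... and in bulk to everything underneath it via recursion.
directory.update_access_control_recursive(acl=acl)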

Third-party applications are best advised to use the REST APIs or SDKs since those will continue to be supported. Deletion of logs from the storage account can be performed with one of the above tools, the REST APIs or the SDKs, but configuring the number of retention days is not supported. The Windows Azure Storage Blob (WASB) driver works only with the Blob APIs, and enabling multi-protocol access on Data Lake Storage won’t mitigate the issue of the WASB driver not working with some common cases of Data Lake. If the parent folder of soft-deleted files or folders is renamed, then they won’t display correctly in the portal, but PowerShell and the CLI can be used to restore them. An account with an event subscription will not be able to read from secondary storage endpoints, but this can be mitigated by removing the event subscription.
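Because the retention setting is not available, pruning the diagnostic logs becomes a periodic chore. Below is a minimal sketch of deleting aged blobs from the $logs container with the azure-storage-blob Python SDK; the connection string and the thirty-day window are assumptions:

# Minimal sketch of pruning storage analytics logs older than a hypothetical
# thirty-day window from the $logs container.
from datetime import datetime, timedelta, timezone
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<connection-string>")
logs = service.get_container_client("$logs")

cutoff = datetime.now(timezone.utc) - timedelta(days=30)
for blob in logs.list_blobs():
    if blob.last_modified < cutoff:
        logs.delete_blob(blob.name)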
