This post discusses some key considerations for using Azure Data Lake Storage Gen2, a highly scalable and cost-effective data lake solution for big data analytics.
The first and foremost consideration is that a data lake is
not appropriate where a data warehouse is the better fit. They are complementary solutions that work
together to help us derive key insights from our data. A data lake is a store
for all types of data from various sources, regardless of whether that data is
structured or unstructured. For example, a
retail company can store the past five years' worth of sales data in a data lake,
process data from social media
to extract new consumption trends and intelligence from retail analytics
solutions, and then generate a highly structured data set that is suitable to
store in the warehouse. ADLS Gen2 offers fast performance and Hadoop-compatible
access through the hierarchical namespace, lower cost, and security
with fine-grained access controls and native AAD integration. It can serve as a
hyperscale repository that stores petabytes of data and sustains hundreds of
Gbps of throughput.
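The fine-grained access controls mentioned above are POSIX-style ACLs on directories and files. A minimal sketch using the Azure SDK for Python (azure-storage-file-datalake and azure-identity) follows; the account URL, file system name, directory path, and AAD group object ID are hypothetical placeholders, not values from this article.

from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Hypothetical account and paths, used only to show the shape of the call.
ACCOUNT_URL = "https://contosodatalake.dfs.core.windows.net"

service = DataLakeServiceClient(account_url=ACCOUNT_URL,
                                credential=DefaultAzureCredential())
directory = service.get_file_system_client("curated").get_directory_client("sales/2023")

# Grant a specific AAD group read/execute on this directory only;
# the GUID below stands in for a real group object ID.
directory.set_access_control(
    acl="user::rwx,group::r-x,other::---,"
        "group:00000000-0000-0000-0000-000000000000:r-x"
)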
Other considerations include what the data lake will store
and for how long, what portion of the data is used in analytical workloads, who
needs access to which parts of the data lake, which analytical
workloads will run on the data lake, what the transaction and analytics
patterns of those workloads are, and what budget is available.
Organizing and managing data in the data lake is also a
key concern for this single data store. Some users of the store claim end-to-end ownership of the
pipeline, while others rely on a central team that manages, operates, and governs the
data lake. This calls for different topologies of the data lake, such as
centralized or federated data lake strategies, implemented with single or
multiple storage accounts and with globally shared or regionally specific
footprints.
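As a concrete illustration of a centralized, single-account topology, the sketch below creates one file system (container) per zone with per-domain directories underneath, using the Azure SDK for Python. The zone and domain names are assumptions for illustration, not a prescribed layout.

from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

ACCOUNT_URL = "https://contosodatalake.dfs.core.windows.net"  # hypothetical account
ZONES = ["raw", "enriched", "curated"]           # lake zones governed by the central team
DOMAINS = ["sales", "marketing", "supplychain"]  # business domains that own their pipelines

service = DataLakeServiceClient(account_url=ACCOUNT_URL,
                                credential=DefaultAzureCredential())

for zone in ZONES:
    # One file system per zone; raises ResourceExistsError if it already exists.
    fs = service.create_file_system(file_system=zone)
    for domain in DOMAINS:
        # Each domain team gets its own directory, governed centrally at the account level.
        fs.create_directory(domain)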
When the customers of the data lake are both internal and
external, the scenarios may be subject to different requirements for security
and compliance, query patterns, access, and cost and
billing. Storage accounts can be created in different subscriptions for
development and production environments and be subject to different SLAs. The
account boundaries determine whether logical sets of data are managed in a
unified or isolated manner. Subscription limits and quotas apply to resources
used with the Azure data lake, such as VM cores and ADF instances. Managed
identities and service principals (SPNs) that manage or write to the lake must have
different privileges from identities that merely read the data.
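To show the reader side of that split, the sketch below lists and inspects paths with an identity that is assumed to hold only a data-plane reader role (for example, Storage Blob Data Reader) on the account, so writes would be denied; the account URL and paths are placeholders.

from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# DefaultAzureCredential picks up a managed identity or SPN from the environment.
# That identity is assumed to have a reader role only.
ACCOUNT_URL = "https://contosodatalake.dfs.core.windows.net"

service = DataLakeServiceClient(account_url=ACCOUNT_URL,
                                credential=DefaultAzureCredential())
fs = service.get_file_system_client("curated")

for path in fs.get_paths(path="sales", recursive=True):
    print(path.name, path.content_length)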
Backup and archive can be handled with scripts that
create action and filter objects to apply blob tiering to block blobs matching
given criteria; the target storage account can be provisioned in a brand-new
resource group based upon input variables.
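A minimal sketch of such an archiving script, written with the Azure SDK for Python and expressing the filter and tiering action directly in code, might look like the following; the account URL, container, prefix, and age threshold are assumptions for illustration.

from datetime import datetime, timedelta, timezone
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

ACCOUNT_URL = "https://contosodatalake.blob.core.windows.net"  # hypothetical account
CUTOFF = datetime.now(timezone.utc) - timedelta(days=180)      # assumed archive threshold

service = BlobServiceClient(account_url=ACCOUNT_URL,
                            credential=DefaultAzureCredential())
container = service.get_container_client("raw")

# Filter: block blobs under a given prefix that have not been modified recently.
for blob in container.list_blobs(name_starts_with="sales/2019/"):
    if blob.blob_type == "BlockBlob" and blob.last_modified < CUTOFF:
        # Action: move the blob to the Archive access tier.
        container.get_blob_client(blob.name).set_standard_blob_tier("Archive")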