This post discusses some key considerations for using Azure Data Lake Storage Gen2, a highly scalable and cost-effective data lake solution for big data analytics.
The first and foremost consideration is that a data lake is
not appropriate where a data warehouse is the better fit. They are complementary solutions that work
together to help us derive key insights from our data. A data lake is a store
for all types of data from various sources, regardless of whether that data is
structured or unstructured. For example, a
retail company can store the past five years' worth of sales data in a data lake,
process data from social media
to extract new consumption trends and intelligence from retail analytics
solutions, and then generate a highly structured data set that is suitable to
store in the warehouse. ADLS Gen2 offers fast performance and Hadoop-compatible
access through the hierarchical namespace, lower cost, and security
with fine-grained access controls and native AAD integration. It can serve as a
hyperscale repository that stores petabytes of data and sustains hundreds of
Gbps of throughput.
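The fine-grained access controls mentioned above are POSIX-style ACLs on directories and files. A minimal sketch using the Azure SDK for Python (azure-storage-file-datalake and azure-identity) follows; the account URL, file system name, directory path, and AAD group object ID are hypothetical placeholders, not values from this article.

from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Hypothetical account and paths, used only to show the shape of the call.
ACCOUNT_URL = "https://contosodatalake.dfs.core.windows.net"

service = DataLakeServiceClient(account_url=ACCOUNT_URL,
                                credential=DefaultAzureCredential())
directory = service.get_file_system_client("curated").get_directory_client("sales/2023")

# Grant a specific AAD group read/execute on this directory only;
# the GUID below stands in for a real group object ID.
directory.set_access_control(
    acl="user::rwx,group::r-x,other::---,"
        "group:00000000-0000-0000-0000-000000000000:r-x"
)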
Other considerations include what the data lake will store
and for how long, what portion of the data is used in analytical workloads, who
needs access to which parts of the data lake, which analytical
workloads will run on the data lake, what the transaction and analytics
patterns of those workloads are, and what budget is available.
Organizing and managing data in the data lake is also a
key concern for this single data store. Some users of the store claim end-to-end ownership of the
pipeline, while others rely on a central team that manages, operates, and governs the
data lake. This calls for different topologies of the data lake, such as
centralized or federated data lake strategies, implemented with single or
multiple storage accounts and with globally shared or regionally specific
footprints.
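As a concrete illustration of a centralized, single-account topology, the sketch below creates one file system (container) per zone with per-domain directories underneath, using the Azure SDK for Python. The zone and domain names are assumptions for illustration, not a prescribed layout.

from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

ACCOUNT_URL = "https://contosodatalake.dfs.core.windows.net"  # hypothetical account
ZONES = ["raw", "enriched", "curated"]           # lake zones governed by the central team
DOMAINS = ["sales", "marketing", "supplychain"]  # business domains that own their pipelines

service = DataLakeServiceClient(account_url=ACCOUNT_URL,
                                credential=DefaultAzureCredential())

for zone in ZONES:
    # One file system per zone; raises ResourceExistsError if it already exists.
    fs = service.create_file_system(file_system=zone)
    for domain in DOMAINS:
        # Each domain team gets its own directory, governed centrally at the account level.
        fs.create_directory(domain)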
When the customers of the data lake are both internal and
external, the scenarios may be subject to different requirements for security
and compliance, query patterns, access, and cost and
billing. Storage accounts can be created in different subscriptions for
development and production environments and be subject to different SLAs. The
account boundaries determine whether logical sets of data are managed in a
unified or isolated manner. Subscription limits and quotas apply to resources
used with the Azure data lake, such as VM cores and ADF instances. Managed
identities and service principals (SPNs) that manage or write to the lake must have
different privileges from identities that merely read the data.
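To show the reader side of that split, the sketch below lists and inspects paths with an identity that is assumed to hold only a data-plane reader role (for example, Storage Blob Data Reader) on the account, so writes would be denied; the account URL and paths are placeholders.

from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# DefaultAzureCredential picks up a managed identity or SPN from the environment.
# That identity is assumed to have a reader role only.
ACCOUNT_URL = "https://contosodatalake.dfs.core.windows.net"

service = DataLakeServiceClient(account_url=ACCOUNT_URL,
                                credential=DefaultAzureCredential())
fs = service.get_file_system_client("curated")

for path in fs.get_paths(path="sales", recursive=True):
    print(path.name, path.content_length)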
Backup and archive can be handled with scripts that
create action and filter objects to apply blob tiering to block blobs matching
given criteria; the target storage account can be provisioned in a brand-new
resource group based upon input variables.
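A minimal sketch of such an archiving script, written with the Azure SDK for Python and expressing the filter and tiering action directly in code, might look like the following; the account URL, container, prefix, and age threshold are assumptions for illustration.

from datetime import datetime, timedelta, timezone
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

ACCOUNT_URL = "https://contosodatalake.blob.core.windows.net"  # hypothetical account
CUTOFF = datetime.now(timezone.utc) - timedelta(days=180)      # assumed archive threshold

service = BlobServiceClient(account_url=ACCOUNT_URL,
                            credential=DefaultAzureCredential())
container = service.get_container_client("raw")

# Filter: block blobs under a given prefix that have not been modified recently.
for blob in container.list_blobs(name_starts_with="sales/2019/"):
    if blob.blob_type == "BlockBlob" and blob.last_modified < CUTOFF:
        # Action: move the blob to the Archive access tier.
        container.get_blob_client(blob.name).set_standard_blob_tier("Archive")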