The infrastructural challenges of working with
data modernization tools and products have often mandated simplicity in the
overall deployment. Consider an application such as a Streaming Data Platform.
Its on-premises deployment includes several components for the ingestion
store and the analytics computing platform, as well as metrics and
management dashboards, that are often independently sourced and require a great
deal of tuning. The same applies to performance improvements in data lakes and
event-driven frameworks, although by design they are elastic, pay-per-use, and
scalable.
The solution integration for data modernization
often deals with such challenges across heterogeneous products. Solutions often
demand more simplicity and functionality from the product. There are also quite
a few parallels to be drawn between solution integration with the cloud
services and the product development of data platforms and products. These
technical similarities lower the barrier to developing data products. At the
same time, the business needs to make it easier for consumers to plug the
product into their data handling, which drives the product upward into the
solution space, often referred to as the platform space.
With this backdrop, let us see how a data
platform provides data management, processing, and delivery as services within
a data lake architecture that utilizes the scalability of object storage.
Event-based data is by nature unstructured. Data lakes are popular for storing
and handling such data. A data lake is not a massive virtual data warehouse,
but it powers a lot of analytics and is the centerpiece of most solutions that
conform to the Big Data architectural style. A data lake must store petabytes
of data while sustaining throughput of up to gigabytes per second. The
hierarchical namespace of the object storage helps organize objects and files
into a deep hierarchy of folders for efficient data access. The naming
convention recognizes these folder paths by including the folder separator
character in the name itself. With this organization and direct folder access
to the object store, the overall performance of the data lake improves. A thin
shim over the Data Lake Storage interface that supports file system semantics
over blob storage is welcome for organizing and accessing such data. Data
management and analytics form the core scenarios supported by a data lake. For
multi-region deployments, it is recommended to land the data in one region and
then replicate it globally. The best practices for Data
Lake involve evaluating feature support and known issues, optimizing for data
ingestion, considering data structures, performing ingestion, processing and
analysis from several data sources, and leveraging monitoring telemetry. When
the Data Lake supports query acceleration and analytics frameworks, it
significantly improves data processing by retrieving only the data that is
relevant to an operation. This reduces the time and processing power needed for
the end-to-end scenarios that are necessary to gain critical insights into
stored data. Both ‘filtering predicates’ and ‘column projections’ are enabled,
and SQL can be used to describe them. Only the data that meets these conditions
is transmitted. A request processes only one file, so joins, aggregates, and
other multi-file query operators are not supported, but the file can be in a
format such as CSV or JSON. The query
acceleration feature isn’t limited to Data Lake Storage. It is supported even
on Blobs in storage accounts that form the persistence layer below the
containers of the data lake. Even those without hierarchical namespace are
supported by the Data Lake query acceleration feature. Query acceleration is
part of the data lake, so applications can be switched with one another, and
the data selectivity and improved latency continue across the switch. Since the
processing happens on the Data Lake side, the pricing model for query
acceleration differs from the normal transactional model. Fine-grained access
control lists and Active Directory integration round out the data security
considerations.
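The filtering predicate and column projection described above can be sketched
locally. The following is a minimal Python sketch, assuming a hypothetical
in-memory CSV sample and a hypothetical helper named query_single_file. It is
not the storage service's API, only a stand-in that shows how a request against
a single file returns just the matching rows and the requested columns.

```python
import csv
import io

# Hypothetical sample object: the contents of one CSV file in the lake.
SAMPLE_CSV = """device,region,temperature
sensor-1,west,21.5
sensor-2,east,30.1
sensor-3,west,33.7
"""


def query_single_file(csv_text, columns, predicate):
    """Return only the requested columns of the rows that satisfy the predicate.

    Local stand-in for a single-file request such as:
        SELECT device, temperature FROM <file> WHERE temperature > 30
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    results = []
    for row in reader:
        # Filtering predicate: skip rows that do not match.
        if predicate(row):
            # Column projection: keep only the requested columns.
            results.append({name: row[name] for name in columns})
    return results


hot = query_single_file(
    SAMPLE_CSV,
    columns=["device", "temperature"],
    predicate=lambda row: float(row["temperature"]) > 30.0,
)
print(hot)
```

In this sketch the filtering happens in the same process as the caller. With
query acceleration the same selection would run on the storage side, so only
the reduced result crosses the network.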
Data lakes may
serve to reduce complexity in storing data but also introduce new challenges
around managing, accessing, and analyzing data. Deployments fail when these
challenges are not properly addressed. They include:
- The process of procuring, managing, and visualizing data assets is not easy
  to govern.
- The ingestion and querying require performance and latency tuning from time
  to time.
- The realization of business purpose in terms of the time to value can vary,
  often involving coding.
These are addressed by automation and best
practices.