Data lakes are popular for storing and handling Big Data and IoT events. A data lake is not a massive virtual data warehouse, but it powers a great deal of analytics and is the centerpiece of most solutions that follow the Big Data architectural style. A data lake must store petabytes of data while handling throughput of up to gigabytes of data transfer per second. The hierarchical namespace of the object storage helps organize objects and files into a deep hierarchy of folders for efficient data access. The naming convention recognizes these folder paths by including the folder separator character in the object name itself. With this organization, and with folder-level access directly against the object store, overall data lake performance improves. A thin shim over the Data Lake Storage interface that provides file system semantics on top of blob storage is therefore welcome for organizing and accessing such data.
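As a rough sketch, assuming the azure-storage-file-datalake Python SDK and placeholder account, container, and path names, organizing data under the hierarchical namespace looks something like this:

```python
# A minimal sketch of organizing data into a folder hierarchy on a storage
# account with the hierarchical namespace enabled. Account, container, and
# path names are placeholders.
from azure.storage.filedatalake import DataLakeServiceClient

ACCOUNT_URL = "https://<account-name>.dfs.core.windows.net"  # placeholder
SAS_TOKEN = "<sas-token>"                                     # placeholder

service = DataLakeServiceClient(account_url=ACCOUNT_URL, credential=SAS_TOKEN)
file_system = service.get_file_system_client("raw")

# Folder separators in the path become real directories under the
# hierarchical namespace, e.g. raw/iot/2023/06/01/.
directory = file_system.create_directory("iot/2023/06/01")

# Upload a file into the leaf directory.
file_client = directory.create_file("events.json")
with open("events.json", "rb") as data:
    file_client.upload_data(data, overwrite=True)
```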
Data management and analytics form the core scenarios supported by a data lake. For multi-region deployments, it is recommended to land the data in one region and then replicate it globally. Best practices for a data lake include evaluating feature support and known issues, optimizing for data ingestion, considering data structures, performing ingestion, processing, and analysis from several data sources, and leveraging monitoring telemetry.
When the data lake supports query acceleration and an analytics framework, it significantly improves data processing by retrieving only the data that is relevant to an operation. This cascades into reduced time and processing power for the end-to-end scenarios that are necessary to gain critical insights into stored data. Both filtering predicates and column projections are supported, and SQL can be used to describe them; only the data that meets these conditions is transmitted. A request processes only one file, so joins, aggregates, and other multi-file query operators are not supported, but the data can be in formats such as CSV or JSON.
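A hedged sketch of what such a pushdown query might look like from Python, assuming the azure-storage-blob SDK's query_blob support and a CSV blob whose connection string, container, and column names are placeholders:

```python
# Query acceleration sketch: project two columns and filter rows on the
# service side, so only matching data crosses the wire.
from azure.storage.blob import BlobServiceClient, DelimitedTextDialect

service = BlobServiceClient.from_connection_string("<connection-string>")
blob = service.get_blob_client(container="raw",
                               blob="iot/2023/06/01/readings.csv")

# Describe the CSV layout of the stored blob and of the returned result.
input_format = DelimitedTextDialect(delimiter=",", quotechar='"',
                                    lineterminator="\n", has_header=True)
output_format = DelimitedTextDialect(delimiter=",", quotechar='"',
                                     lineterminator="\n", has_header=False)

# Column projection (SELECT DeviceId, Temperature) plus a filtering
# predicate (WHERE Temperature > 75) expressed in SQL.
reader = blob.query_blob(
    "SELECT DeviceId, Temperature FROM BlobStorage WHERE Temperature > 75",
    blob_format=input_format,
    output_format=output_format,
)
print(reader.readall().decode("utf-8"))
```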
Query acceleration is not limited to Data Lake Storage; it is also supported on blobs in the storage accounts that form the persistence layer below the data lake's containers, even when those accounts do not have the hierarchical namespace enabled. Because query acceleration is part of the data lake, applications can be swapped for one another and the improved data selectivity and latency carry over across the switch. Since the processing happens on the data lake side, the pricing model for query acceleration differs from the normal transactional model. Fine-grained access control lists and Active Directory integration round out the data security considerations.
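As an illustration, assuming the azure-storage-file-datalake SDK and a placeholder Azure AD object ID, a POSIX-style ACL can be applied to a directory roughly like this:

```python
# Sketch of applying a fine-grained access control list to a directory;
# the account URL, credential, and object ID are placeholders.
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://<account-name>.dfs.core.windows.net",
    credential="<sas-token>",  # placeholder; Azure AD credentials also work
)
directory = service.get_file_system_client("raw").get_directory_client("iot")

# Grant a specific Azure AD principal read and execute access while keeping
# other users out; the bracketed value is a placeholder object ID.
directory.set_access_control(
    acl="user::rwx,group::r-x,other::---,user:<principal-object-id>:r-x"
)
```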
A checklist helps with migrating sensitive data to the cloud and helps avoid the common pitfalls regardless of the source of the data. It serves as a blueprint for a smooth, secure transition.
Characterizing permitted use is the first step data teams need to take to address data protection for reporting. Modern privacy laws specify not only what constitutes sensitive data but also how that data can be used. Data obfuscation and redaction can help protect against exposure. In addition, data teams must classify the usages and the consumers. Once sensitive data is classified and purpose-based usage scenarios are addressed, role-based access control must be defined to protect future growth.
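As an illustrative, hypothetical sketch, the classification, purpose, and role mapping can be captured as data before any migration begins; the labels, roles, and purposes below are examples, not a prescribed taxonomy:

```python
# Hypothetical classification of columns and purpose-based role policy.
CLASSIFICATION = {
    "customer.email": "sensitive",
    "customer.zip_code": "internal",
    "order.total": "public",
}

# Each role may use certain classifications only for certain purposes.
ROLE_POLICY = {
    "fraud-analyst": {"sensitive": {"fraud-detection"},
                      "internal": {"fraud-detection", "reporting"}},
    "bi-developer": {"internal": {"reporting"}, "public": {"reporting"}},
}

def is_access_allowed(role: str, column: str, purpose: str) -> bool:
    """Allow access only if the role may use this classification for this purpose."""
    label = CLASSIFICATION.get(column, "sensitive")  # default to most restrictive
    return purpose in ROLE_POLICY.get(role, {}).get(label, set())

assert is_access_allowed("fraud-analyst", "customer.email", "fraud-detection")
assert not is_access_allowed("bi-developer", "customer.email", "reporting")
```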
Devising a strategy for governance is the next step; it is meant to deter intruders and to strengthen data protection through encryption and database management. Fine-grained access controls, such as attribute-based or purpose-based controls, also help in this regard.
Embracing a standard for defining data access policies can help limit the explosion of mappings between users and data-access permissions; this gains significance when a monolithic data management environment is migrated to the cloud. Failure to establish such a standard can lead to unauthorized data exposure.
Migrating to the cloud in a single stage, with all of the data moved at once, must be avoided because it is operationally risky. It is critical to develop a plan for incremental migration that facilitates the development, testing, and deployment of a data protection framework which can be applied to ensure proper governance. Decoupling data protection and security policies from the underlying platform allows organizations to tolerate subsequent migrations.
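A hypothetical sketch of that incremental, decoupled approach, with all names illustrative:

```python
# Records move in small batches, and the protection step is a pluggable
# function decoupled from both the source and the destination platform,
# so it survives later migrations.
from typing import Callable, Dict, Iterable, List

Record = Dict[str, str]

def migrate_incrementally(
    batches: Iterable[List[Record]],
    protect: Callable[[Record], Record],
    write_to_cloud: Callable[[List[Record]], None],
) -> None:
    """Protect and ship one batch at a time so no single failure risks the whole dataset."""
    for batch in batches:
        write_to_cloud([protect(record) for record in batch])
```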
There are different types of sanitization, such as redaction, masking, obfuscation, encryption, tokenization, and format-preserving encryption. Among these, static protection, in which clear-text values are sanitized and stored in their modified form, and dynamic protection, in which clear-text data is transformed into ciphertext, are the most used.
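A small illustrative sketch of two of these styles, redaction and masking, in plain Python; the field names are placeholders, and tokenization or format-preserving encryption would require dedicated libraries and are not shown:

```python
import re

def redact(value: str) -> str:
    """Replace the whole value with a fixed placeholder."""
    return "[REDACTED]"

def mask_email(value: str) -> str:
    """Keep the first character and the domain, mask the rest of the local part."""
    return re.sub(r"^(.).*?@", r"\1***@", value)

record = {"name": "Jane Doe", "email": "jane.doe@example.com"}
sanitized = {"name": redact(record["name"]), "email": mask_email(record["email"])}
print(sanitized)  # {'name': '[REDACTED]', 'email': 'j***@example.com'}
```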
Finally, defining and implementing data protection policies brings several additional processes, such as validation, monitoring, logging, reporting, and auditing. Having the right tools and processes in place when migrating sensitive data to the cloud will allay concerns about compliance and provide proof that can be submitted to oversight agencies. Compliance goes beyond applying rules; it becomes an ongoing process of verifying that the applicable laws are observed.