Cluster computing

Saturday, December 31, 2022

Data Modernization

Data technologies in recent years have popularized both structured and unstructured storage. This is fueled by applications that are embracing cloud resources. The two trends are happening simultaneously and are reinforcing each other.

Data modernization means moving data from legacy databases to modern databases. It comes at a time when many databases are doubling their digital footprint. Unstructured data is the biggest contributor to this growth and includes images, audio, video, social media comments, clinical notes, and such others. Organizations have shifted from a data architecture based on relational enterprise-based data warehouses to data lakes based on big data. If the survey from IT spends is to be believed, a great majority of organizations are already on their way towards data modernization with those in the Finance service firms leading the way. These organizations reported data security planning as part of their data modernization activities. They consider the tools and technology that are available in the marketplace as the third most important reason in their decision making.

Drivers for one-time data modernization plans include security and governance, strategy and plan, tools and technology, and talent. Data modernization is a key component of, or reason for, migrating to the cloud. The rate of adoption of external services in the data planning and implementation is about 44% for these organizations.

Data Lakes are popular for storing and handling Big Data and IoT events. It is not a massive virtual data warehouse, but it powers a lot of analytics and is the centerpiece of most solutions that conform to the Big Data architectural style. A data lake must store petabytes of data while handling bandwidths up to Gigabytes of data transfer per second. The hierarchical namespace of the object storage helps organize objects and files into a deep hierarchy of folders for efficient data access. The naming convention recognizes these folder paths by including the folder separator character in the name itself. With this organization and folder access directly to the object store, the performance of the overall usage of data lake is improved. A mere shim over the Data Lake Storage interface that supports file system semantics over blob storage is welcome for organizing and accessing such data. The data management and analytics form the core scenarios supported by Data Lake. For multi-region deployments, it is recommended to have the data landing in one region and then replicated globally. The best practices for Data Lake involve evaluating feature support and known issues, optimizing for data ingestion, considering data structures, performing ingestion, processing and analysis from several data sources and leveraging monitor telemetry. When the Data Lake supports query acceleration and analytics framework, it significantly improves data processing by only retrieving data that is relevant to an operation. This cascades to reduced time and processing power for the end-to-end scenarios that are necessary to gain critical insights into stored data. Both ‘filtering predicates' and ‘column projections’ are enabled, and SQL can be used to describe them. Only the data that meets these conditions are transmitted. A request processes only one file so joins, aggregates and other query operators are not supported but the request can be in any format such as csv or json file formats. The query acceleration feature isn’t limited to Data Lake Storage. It is supported even on Blobs in storage accounts that form the persistence layer below the containers of the data lake. Even those without hierarchical namespace are supported by the Data Lake query acceleration feature. The query acceleration is part of the data lake so applications can be switched with one another, and the data selectivity and improved latency continues across the switch. Since the processing is on the side of the Data Lake, the pricing model for query acceleration differs from that of the normal transactional model. Fine grained access control lists and active directory integration round up the data security considerati

Cluster computing

Saturday, December 31, 2022

No comments:

Post a Comment