Tuesday, June 14, 2022

Data in transit and multitenancy:

This is a continuation of a series of articles on hosting solutions and services on the Azure public cloud, with the most recent discussion on Service Fabric as per the summary here. This article discusses the architectural considerations for a multitenant application.


Data in transit usually goes through a set of operations referred to as Extract-Transform-Load, or ETL for short. In earlier days, online transactional data would be stored in relational databases and the data would eventually make its way to the data warehouses. Data transfers between databases and warehouses commonly used ETL. A designer tool was made available that made it easy to set up the data flows and the operators.
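As a minimal sketch of such a flow, assuming a hypothetical orders table in a transactional SQLite store and a customer_totals rollup table in the warehouse (the table and database names are illustrative, not from any particular product):

import sqlite3

# Hypothetical source (transactional) and target (warehouse) databases.
source = sqlite3.connect("orders_oltp.db")
warehouse = sqlite3.connect("orders_dw.db")

# Extract: pull raw transactional rows from the source.
rows = source.execute(
    "SELECT order_id, customer_id, amount, order_date FROM orders"
).fetchall()

# Transform: aggregate order amounts per customer, a typical warehouse rollup.
totals = {}
for order_id, customer_id, amount, order_date in rows:
    totals[customer_id] = totals.get(customer_id, 0.0) + amount

# Load: write the aggregated facts into the warehouse table.
warehouse.execute(
    "CREATE TABLE IF NOT EXISTS customer_totals (customer_id TEXT PRIMARY KEY, total REAL)"
)
warehouse.executemany(
    "INSERT OR REPLACE INTO customer_totals VALUES (?, ?)", totals.items()
)
warehouse.commit()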


A data warehouse consolidated the data, and all the analysis stacks were built over this data at rest. Nowadays, a pipeline is built to facilitate capturing data, providing streaming access, and providing virtually infinite storage. Companies are showing increased usage of event-based processing and pipelined execution. This form of execution involves acquisition, extraction, integration, analysis, and interpretation.
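A small sketch of such a pipelined, event-driven flow using Python generators; the stage names mirror the phases above, and the sample events and field names are hypothetical:

# Acquisition: ingest raw events as they arrive.
def acquire(events):
    for event in events:
        yield event

# Extraction: pull out only the fields of interest.
def extract(stream):
    for event in stream:
        yield {"tenant": event.get("tenant"), "value": event.get("value", 0)}

# Analysis: compute a running total per tenant as events flow through.
def analyze(stream):
    totals = {}
    for record in stream:
        tenant = record["tenant"]
        totals[tenant] = totals.get(tenant, 0) + record["value"]
        yield tenant, totals[tenant]

raw_events = [{"tenant": "a", "value": 5}, {"tenant": "b", "value": 7},
              {"tenant": "a", "value": 3}]
for tenant, running_total in analyze(extract(acquire(raw_events))):
    print(tenant, running_total)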


As unstructured data storage became popular and map-reduce methods gained traction over this data, many other forms of storage came into the picture, which made data transfers even more necessary and widespread.


Access to the data had to be standardized. Fortunately, with the widespread adoption of object storage, data access became easy and ubiquitous with the S3 API. Object storage was more than web-accessible, limitless cloud storage. It also served as a storage layer where data transfers can be registered, chronicled, studied, and improved, all on-premises over an organization's network with no limit to the number of sites or instances supported. Object storage transformed from a passive storage product into a storage layer that actively builds metadata, maintains organization, rebuilds indexes from data and metadata, and performs optimizations centered on vectorized execution. This we can refer to as data harvesting.
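As a sketch of how registering a transfer might look against an S3-compatible store (the endpoint, bucket, key, and metadata fields here are hypothetical assumptions, not specific to any deployment):

import boto3

# An S3-compatible on-premises object store; the endpoint is illustrative.
s3 = boto3.client("s3", endpoint_url="https://objectstore.example.internal")

# Store the payload and attach metadata describing the transfer, so the
# storage layer can later index, chronicle, and analyze it.
s3.put_object(
    Bucket="transfers",
    Key="tenant-a/2022/06/14/export-001.json",
    Body=b'{"rows": 1024}',
    Metadata={"source": "orders_oltp", "destination": "orders_dw", "tenant": "tenant-a"},
)

# Read the metadata back without downloading the payload.
head = s3.head_object(Bucket="transfers", Key="tenant-a/2022/06/14/export-001.json")
print(head["Metadata"])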


Taken a level deeper, harvesting resembles network packet capture. The salient notion is that data capture is like harvesting. Both make use of proxies as a man-in-the-middle that cause little or no disruption to ongoing operations while facilitating the storage and analysis of data for the future. Data capture goes a step deeper into the application stack by tapping into the network. Most data in transit is already in the form of network packets: the data is fragmented, encrypted, and sent to its destination over protocols, while the endpoints of the network tunnel remain secured by the sender and receiver. A networking layer that publishes data to interested and authorized subscribers before the send and receive decouples the packet sniffer from being specific to the sender or receiver. In such a case, data harvesting moves from data-centric operations to the network layer, where the TCP send and receive queues support publishing and subscribing with added annotations, as sketched below.
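A minimal sketch of that idea, with a hypothetical TransitBus standing in for the publish-subscribe hook at the send path (the class, topic names, and annotations are assumptions for illustration, not an actual TCP queue hook):

from collections import defaultdict

class TransitBus:
    """Publishes an annotated copy of each payload to authorized subscribers."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        # Only interested and authorized subscribers would register here.
        self._subscribers[topic].append(callback)

    def publish(self, topic, payload, annotations):
        for callback in self._subscribers[topic]:
            callback(payload, annotations)

bus = TransitBus()
bus.subscribe("tenant-a/outbound",
              lambda payload, notes: print("captured", len(payload), "bytes", notes))

def send(payload, destination):
    # Publish before the actual network send, annotating the transfer;
    # capture is thereby decoupled from both sender and receiver.
    bus.publish("tenant-a/outbound", payload, {"destination": destination})
    # ... the actual socket send would follow here ...

send(b"fragmented-application-data", "10.0.0.7:443")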

With improved data transfers and ease of use, data can remain in transit, albeit within a more encompassing storage layer. Storage and networking can both implement publisher-subscriber mechanisms.

 

