Saturday, January 7, 2023

 

The infrastructural challenges of working with data modernization tools and products have often mandated simplicity in the overall deployment. Consider an application such as a Streaming Data Platform: its on-premises deployment includes several components for the ingestion store and the analytics computing platform, as well as metrics and management dashboards that are often independently sourced and require a great deal of tuning. The same applies to performance improvements in data lakes and event-driven frameworks, even though by design they are elastic, pay-per-use, and scalable.

The solution integration for data modernization often deals with such challenges across heterogeneous products. Solutions often demand more simplicity and functionality from the product. There are also quite a few parallels to be drawn between solution integration with cloud services and the product development of data platforms and products. With such technical similarities, the barrier to developing data products is lowered; at the same time, the business needs to make it easier for the consumer to plug the product into their data handling, driving the product upward into the solution space, often referred to as the platform space.

With this backdrop, let us see how a data platform provides data management, processing, and delivery as services within a data lake architecture that utilizes the scalability of object storage.

Event-based data is by nature unstructured. Data lakes are popular for storing and handling such data. A data lake is not a massive virtual data warehouse, but it powers a lot of analytics and is the centerpiece of most solutions that conform to the Big Data architectural style. A data lake must store petabytes of data while handling bandwidths of up to gigabytes of data transfer per second. The hierarchical namespace of the object storage helps organize objects and files into a deep hierarchy of folders for efficient data access. The naming convention recognizes these folder paths by including the folder separator character in the name itself. With this organization and folder access directly to the object store, the performance of the overall usage of the data lake is improved. A mere shim over the Data Lake Storage interface that supports file system semantics over blob storage is welcome for organizing and accessing such data.

Data management and analytics form the core scenarios supported by the data lake. For multi-region deployments, it is recommended to have the data land in one region and then be replicated globally. The best practices for a data lake involve evaluating feature support and known issues, optimizing for data ingestion, considering data structures, performing ingestion, processing, and analysis from several data sources, and leveraging monitoring telemetry.

When the data lake supports query acceleration and an analytics framework, it significantly improves data processing by retrieving only the data that is relevant to an operation. This cascades to reduced time and processing power for the end-to-end scenarios that are necessary to gain critical insights into stored data. Both filtering predicates and column projections are enabled, and SQL can be used to describe them. Only the data that meets these conditions is transmitted. A request processes only one file, so joins, aggregates, and other query operators are not supported, but the file can be in a format such as CSV or JSON. The query acceleration feature isn't limited to Data Lake Storage; it is supported even on blobs in the storage accounts that form the persistence layer below the containers of the data lake, and even accounts without a hierarchical namespace are supported. Because query acceleration is part of the data lake, applications can be switched with one another, and the data selectivity and improved latency carry across the switch. Since the processing happens on the side of the data lake, the pricing model for query acceleration differs from the normal transactional model. Fine-grained access control lists and Active Directory integration round out the data security considerations.
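As a rough sketch of what such a query-accelerated read can look like from application code, the snippet below assumes the block-blob query call from the @azure/storage-blob package for Node.js; the connection string variable, container name, blob path, and query text are placeholders for illustration rather than anything prescribed by the platform.

// Minimal sketch: push a filtering predicate and column projection down to the
// storage layer so only matching rows are transmitted. All names are placeholders.
const { BlobServiceClient } = require('@azure/storage-blob');

async function main() {
  const service = BlobServiceClient.fromConnectionString(
    process.env.AZURE_STORAGE_CONNECTION_STRING // assumed environment variable
  );
  const blob = service
    .getContainerClient('telemetry')              // assumed container
    .getBlockBlobClient('events/2023/01/07.csv'); // assumed blob path

  // Only the projected columns of rows matching the predicate come back.
  const response = await blob.query(
    "SELECT deviceId, temperature FROM BlobStorage WHERE temperature > 75"
  );
  response.readableStreamBody.pipe(process.stdout);
}

main().catch(console.error);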

Data lakes may serve to reduce the complexity of storing data, but they also introduce new challenges around managing, accessing, and analyzing data. Deployments fail when these challenges are not properly addressed; they include:

-          The process of procuring, managing and visualizing data assets is not easy to govern.

-          The ingestion and querying require performance and latency tuning from time to time.

-          The realization of business purpose, in terms of time to value, can vary and often involves custom coding.

These are addressed through automation and best practices.

Friday, January 6, 2023

 

Data Modernization – continued

This article picks up the discussion on data modernization with an emphasis on the expanded opportunities to restructure the data. Legacy systems were inherently built as online transaction processing systems and online analytical processing systems, and usually as a monolithic server. With the shift to microservices for application modernization, data can now be owned by individual microservices that can choose the technology stack, and specifically the database, that makes the most sense for that microservice without undue influence or encumbrance from other services. The popularity of unstructured storage – both big data for batch processing and event storage for streaming applications – is evident from the shift to data lakes. That said, this does not mean relational storage is not required.

Event-driven architecture consists of event producers and consumers. Event producers are those that generate a stream of events, and event consumers are those that listen for events.

The scale out can be adjusted to suit the demands of the workload, and the events can be responded to in real time. Producers and consumers are isolated from one another. In some extreme cases, such as IoT, the events must be ingested at very high volumes. There is scope for a high degree of parallelism since the consumers run independently and in parallel, but they are tightly coupled to the events they subscribe to. Network latency for message exchanges between producers and consumers is kept to a minimum. Consumers can be added as necessary without impacting existing ones.

Some of the benefits of this architecture include the following: publishers and subscribers are decoupled; there are no point-to-point integrations; it is easy to add new consumers to the system; consumers can respond to events immediately as they arrive; the system is highly scalable and distributed; and subsystems can maintain independent views of the event stream.

Some of the challenges faced with this architecture include the following: event loss is tolerated, so guaranteed delivery poses a challenge, and some IoT traffic mandates guaranteed delivery. Processing events in order, or exactly once, is another challenge: each consumer type typically runs in multiple instances for resiliency and scalability, which becomes a problem if the processing logic is not idempotent or the events must be processed in order.

Some of the best practices for this architecture, illustrated by the sketch below, include the following: events should be lean and not bloated; services should share only IDs and/or a timestamp; transferring large data between services is an antipattern; and loosely coupled event-driven systems are best.
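A minimal sketch of these practices, using Node's built-in EventEmitter as a stand-in for a real event broker; the event name, fields, and services are invented for illustration:

// Producer and consumer are decoupled through the bus; the event stays lean,
// carrying only an ID and a timestamp rather than the full payload.
const { EventEmitter } = require('events');

const bus = new EventEmitter();

// Consumer: registered independently; more consumers can be added later
// without touching the producer.
bus.on('order-created', ({ orderId, occurredAt }) => {
  console.log(`fulfilment picked up order ${orderId} at ${occurredAt}`);
});

// Producer: publishes a lean event and moves on; it does not know who listens.
function createOrder(orderId) {
  // ... persist the order in this service's own store (omitted) ...
  bus.emit('order-created', { orderId, occurredAt: new Date().toISOString() });
}

createOrder('ord-42');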

The Big Compute architectural style refers to the requirement for many cores to handle the compute for the business, such as for image rendering, fluid dynamics, financial risk modeling, oil exploration, drug design, and engineering stress analysis. The scale out of the computational tasks is achieved by their discrete, isolated, and finite nature, where some input is taken in raw form and processed into an output. The scale out can be adjusted to suit the demands of the workload, and the outputs can be combined, as is customary with map-reduce problems. Since the tasks run independently and in parallel, they are loosely coupled. Network latency for message exchanges between tasks is kept to a minimum. The commodity VMs used from the infrastructure are usually at the higher end of the compute tier. Simulations and number crunching, such as for astronomical calculations, involve hundreds if not thousands of such compute units.
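A toy sketch of that split-process-combine shape follows; the chunking and the squared-sum task are invented purely for illustration, and a real deployment would farm each task out to a pool of compute nodes rather than run them in-process as Promise.all does here.

// Fan out the input into discrete, isolated, finite tasks; fan the partial
// results back in. runTask stands in for work shipped to a compute node.
async function runTask(chunk) {
  // e.g. a numeric computation over one slice of the input
  return chunk.reduce((sum, x) => sum + x * x, 0);
}

async function bigCompute(input, chunkSize) {
  const chunks = [];
  for (let i = 0; i < input.length; i += chunkSize) {
    chunks.push(input.slice(i, i + chunkSize));
  }
  // Scatter: each chunk is an independent task; gather: combine the partial sums.
  const partials = await Promise.all(chunks.map(runTask));
  return partials.reduce((a, b) => a + b, 0);
}

bigCompute([...Array(1000).keys()], 100).then(console.log);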

Some of the benefits of this architecture include the following: 1) high performance due to the parallelization of tasks, 2) the ability to scale out to an arbitrarily large number of cores, 3) the ability to utilize a wide variety of compute units, and 4) dynamic allocation and deallocation of compute.

Some of the challenges faced with this architecture include the following: managing the VM architecture, the volume of number crunching, provisioning thousands of cores in time, and diminishing returns from additional cores.

Some of the best practices for this architecture include the following: expose a well-designed API to the client; auto-scale to handle changes in load; cache semi-static data; use a CDN to host static content; use polyglot persistence when appropriate; and partition data to improve scalability, reduce contention, and optimize performance.

 

Thursday, January 5, 2023

 

This is a continuation of the Walkie Talkie application discussion from a previous post.

A Walkie Talkie application allows us to listen to all the activity on our chosen channel, then hit the big Speak button when it is our turn to talk. Most applications require Wi-Fi or mobile network connections, but this one does not because it leverages the Bluetooth stack.

The point-to-point connectivity is made private by the pairing of Bluetooth devices. Most mobile platforms provide APIs that include the following:

-          Scan for Bluetooth devices

-          Query the local Bluetooth adapter for paired Bluetooth devices

-          Establish RFCOMM channels

-          Connect to the other devices through service discovery

-          Transfer data to and from other devices

-          Manage multiple connections

For Bluetooth-enabled devices to transmit data between each other, they must first form a channel of communication using a pairing process. One device makes itself available for incoming connection requests. Another device finds the discoverable device using a service discovery process. After the discoverable device accepts the pairing request, the two devices complete the bonding process in which they exchange security keys. When the session is complete, the device that initiated the pairing request releases the channel that had linked it to the discoverable device. An application can make use of the APIs by declaring several permissions in the manifest file.  Once the application has permission to the Bluetooth adapter, it can call the APIs for three steps to make a connection. These are:

1.       Find nearby Bluetooth devices

2.       Connect to a Bluetooth device

3.       Transfer data with the connected device
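As a very rough sketch of those three steps, the snippet below assumes a hypothetical rfcomm module, its function names, the service UUID, and the playAudio callback; they stand in for the platform Bluetooth APIs described above and are not a real library.

// Hypothetical rfcomm wrapper standing in for the platform Bluetooth APIs;
// the module, its functions, the UUID, and playAudio are illustrative only.
const rfcomm = require('rfcomm'); // hypothetical module

const WALKIE_TALKIE_SERVICE = '0000110a-0000-1000-8000-00805f9b34fb'; // placeholder UUID

async function connectToPeer(playAudio) {
  // 1. Find nearby Bluetooth devices (including already-paired ones).
  const devices = await rfcomm.discoverDevices();

  // 2. Connect to a device that advertises the walkie-talkie service.
  const peer = devices.find(d => d.services.includes(WALKIE_TALKIE_SERVICE));
  const channel = await rfcomm.connect(peer.address, WALKIE_TALKIE_SERVICE);

  // 3. Transfer data with the connected device.
  channel.on('data', frame => playAudio(frame)); // incoming voice frames
  return chunk => channel.write(chunk);          // send while Speak is held
}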

The application itself will have to provide the user controls, navigation, and experience that are typical of any mobile application: navigation to the home page, the display of the talk button, and refreshes to the page triggered both by user navigation from external applications and by internal navigation within the application.

Finally, the application must demonstrate that it handles all the lifecycle and display events associated with it. If these handlers are correctly written, the user experience will be smooth and satisfying.

Wednesday, January 4, 2023

 Introduction:

Walkie-Talkie applications have demonstrated significant usage during calamities and in remote or rural areas. They became notably popular after hurricanes wreaked havoc on the east coast of the United States. They have features like push-to-talk that make them ideal for point-to-point communication. While messenger applications, including WhatsApp, work best when there is uninterrupted, high-bandwidth connectivity, these applications perform better even with poor network connectivity. Applications like Zello have reached an audience of over 150 million users worldwide. Users, especially the elderly, have found them easy to use, much like the radio calls used to request help and rescue in mission-critical operations.

Many walkie-talkie applications have saturated the market for scenarios where there is mobile network or Wi-Fi connectivity, reaching the end user through a lookup in a centralized directory. However, they hold immense promise for two-party direct communications even in the absence of a 2G or Wi-Fi network if the radio communications can be extended over device-to-device networks such as Bluetooth.

The appeal of radio-like communication over Bluetooth is that it can work on cell phones independently of geography or network. Establishing connectivity between two phones with such an application installed requires only the pairing of the Bluetooth devices. Since there is no external network or line of sight involved, the utility extends to climbers on a mountain or offshore activities with no loss of fidelity, and to a wide variety of handheld smartphones.

Less than 1% of walkie-talkie applications have made this offering, simply because they rely on IP connectivity and a third-party IP provider. Bluetooth stack-based communication eliminates the need for a third party. Since it requires a separate stack from the one used by the Wi-Fi-based applications, these applications have yet to tap into these expanded possibilities.

When this is enabled, it can even help virtualize the communication regardless of whether it occurs between devices or between networks. This becomes a driver for seamless communication that was not possible until now. In such a case, users can look forward to communicating this way during both outages and normal operations.

The push-to-talk feature and the ability to send Morse code have already been perfected by existing applications, so the user interface remains the big button to send the voice transmission. Even Bluetooth pairing is well established on both the Android and iPhone operating systems, and all that is required from these applications is the ability for the user interface to dedicate a communication channel between the two points.

The Bluetooth drivers for the application enable the pairing to be completed and allow transmission for the duration of the send. They provide a convenience over the existing pairing by being stateful when necessary. The endpoints can be private IP addresses, enabling the rest of the communication platform and mode of operation to be independent of the connectivity involved.

Tuesday, January 3, 2023

 

Data lakes are popular for storing and handling Big Data and IoT events. A data lake is not a massive virtual data warehouse, but it powers a lot of analytics and is the centerpiece of most solutions that conform to the Big Data architectural style. A data lake must store petabytes of data while handling bandwidths of up to gigabytes of data transfer per second. The hierarchical namespace of the object storage helps organize objects and files into a deep hierarchy of folders for efficient data access. The naming convention recognizes these folder paths by including the folder separator character in the name itself. With this organization and folder access directly to the object store, the performance of the overall usage of the data lake is improved. A mere shim over the Data Lake Storage interface that supports file system semantics over blob storage is welcome for organizing and accessing such data.

Data management and analytics form the core scenarios supported by the data lake. For multi-region deployments, it is recommended to have the data land in one region and then be replicated globally. The best practices for a data lake involve evaluating feature support and known issues, optimizing for data ingestion, considering data structures, performing ingestion, processing, and analysis from several data sources, and leveraging monitoring telemetry.

When the data lake supports query acceleration and an analytics framework, it significantly improves data processing by retrieving only the data that is relevant to an operation. This cascades to reduced time and processing power for the end-to-end scenarios that are necessary to gain critical insights into stored data. Both filtering predicates and column projections are enabled, and SQL can be used to describe them. Only the data that meets these conditions is transmitted. A request processes only one file, so joins, aggregates, and other query operators are not supported, but the file can be in a format such as CSV or JSON. The query acceleration feature isn't limited to Data Lake Storage; it is supported even on blobs in the storage accounts that form the persistence layer below the containers of the data lake, and even accounts without a hierarchical namespace are supported. Because query acceleration is part of the data lake, applications can be switched with one another, and the data selectivity and improved latency carry across the switch. Since the processing happens on the side of the data lake, the pricing model for query acceleration differs from the normal transactional model. Fine-grained access control lists and Active Directory integration round out the data security considerations.

A checklist helps with migrating sensitive data to the cloud and provides benefits that overcome the common pitfalls regardless of the source of the data. It serves as a blueprint for a smooth, secure transition.


Characterizing permitted use is the first step data teams need to take to address data protection for reporting. Modern privacy laws specify not only what constitutes sensitive data but also how the data can be used. Data obfuscation and redaction can help protect against exposure. In addition, data teams must classify the usages and the consumers. Once sensitive data is classified and purpose-based usage scenarios are addressed, role-based access control must be defined to protect future growth.

Devising a strategy for governance is the next step; it is meant to keep intruders out and to boost data protection by means of encryption and database management. Fine-grained access controls, such as attribute- or purpose-based ones, also help in this regard.

Embracing a standard for defining data access policies can help to limit the explosion of mappings between users and the permissions for data access; this gains significance when a monolithic data management environment is migrated to the cloud. Failure to establish a standard for defining data access policies can lead to unauthorized data exposure.
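As a toy illustration of such a standard, policies can be expressed once as data rather than as one mapping per user; the role names, purposes, and column names below are invented for the sketch.

// Declarative access policy: each entry maps a role and purpose to the columns
// it may read. Role names, purposes, and columns are illustrative only.
const accessPolicies = [
  { role: 'analyst', purpose: 'reporting',     columns: ['region', 'order_total'] },
  { role: 'support', purpose: 'case-handling', columns: ['customer_id', 'email_masked'] },
];

function allowedColumns(role, purpose) {
  const policy = accessPolicies.find(p => p.role === role && p.purpose === purpose);
  return policy ? policy.columns : [];
}

console.log(allowedColumns('analyst', 'reporting')); // ['region', 'order_total']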

Migrating to the cloud in a single stage, with all the data moved at once, must be avoided as it is operationally risky. It is critical to develop a plan for incremental migration that facilitates the development, testing, and deployment of a data protection framework which can be applied to ensure proper governance. Decoupling data protection and security policies from the underlying platform allows organizations to tolerate subsequent migrations.

There are different types of sanitization, such as redaction, masking, obfuscation, encryption, tokenization, and format-preserving encryption. Among these, static protection, in which clear-text values are sanitized and stored in their modified form, and dynamic protection, in which clear-text data is transformed into ciphertext, are the most used.
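A minimal sketch of static masking and redaction follows, assuming a record shape and field names invented purely for illustration.

// Static sanitization: the record is masked before it is stored or shared.
// The record shape and field names are placeholders, not a real schema.
function maskRecord(record) {
  return {
    ...record,
    fullName: '[REDACTED]',                                     // redaction
    cardNumber: record.cardNumber.replace(/\d(?=\d{4})/g, '*'), // mask all but the last four digits
    email: record.email.replace(/^[^@]+/, 'user'),              // obfuscate, keep the domain
  };
}

console.log(maskRecord({
  fullName: 'Jane Doe',
  cardNumber: '4111111111111111',
  email: 'jane@example.com',
}));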

Finally, defining and implementing data protection policies brings several additional processes such as validation, monitoring, logging, reporting, and auditing. Having the right tools and processes in place when migrating sensitive data to the cloud will allay concerns about compliance and provide proof that can be submitted to oversight agencies.

 

Compliance goes beyond applying rules and becomes a process to verify that laws are observed. The right tools and processes can allay concerns about compliance. 

Monday, January 2, 2023

 

Exposing a web application as serverless:

The right tool for a job makes it easy. Serverless-http is a library published to the npm registry that can be used to configure applications for hosting on serverless infrastructure such as AWS Lambda.

This makes it easy to wrap existing APIs and applications and export them with an entry point for a serverless event listener.

A sample using this library is as simple as the following:

import { Router } from '../router'

import serverless from 'serverless-http'

export const run = serverless(Router)

where the Router is an object that defines all the routes and can be instantiated from a server with a runtime such as Koa as shown above or Express as shown below.

const serverless = require('serverless-http');

const express = require('express');

const app = express();

app.use(express.json()); // register your middleware and routes as normal

const handler = serverless(app, { provider: 'azure' });

module.exports.funcName = async (context, req) => {

  context.res = await handler(context, req);

}

Sunday, January 1, 2023

 Data Modernization – continued 

This article picks up the discussion on data modernization with an emphasis on the expanded opportunities to restructure the data. Legacy systems were inherently built as online transaction processing systems and online analytical processing systems, and usually as a monolithic server. With the shift to microservices for application modernization, data can now be owned by individual microservices that can choose the technology stack, and specifically the database, that makes the most sense for that microservice without undue influence or encumbrance from other services. The popularity of unstructured storage – both big data for batch processing and event storage for streaming applications – is evident from the shift to data lakes. That said, this does not mean relational storage is not required.


Data technologies in recent years have popularized both structured and unstructured storage. This is fueled by applications that are embracing cloud resources. The two trends are happening simultaneously and are reinforcing each other. 


Data modernization means moving data from legacy databases to modern databases. It comes at a time when the digital footprint of many databases is doubling. Unstructured data is the biggest contributor to this growth and includes images, audio, video, social media comments, clinical notes, and the like. Organizations have shifted from a data architecture based on relational, enterprise data warehouses to data lakes based on big data. If surveys of IT spending are to be believed, a great majority of organizations are already on their way toward data modernization, with financial services firms leading the way. These organizations reported data security planning as part of their data modernization activities. They consider the tools and technology available in the marketplace the third most important factor in their decision making.


Drivers for one-time data modernization plans include security and governance, strategy and planning, tools and technology, and talent. Data modernization is a key component of, or reason for, migrating to the cloud. The rate of adoption of external services in data planning and implementation is about 44% for these organizations.