Saturday, January 14, 2023

#codingexercise

Merge two sorted lists: 

List<Integer> merge(List<Integer> A, List<Integer> B)
{
    List<Integer> result = new ArrayList<>();
    // Repeatedly take the smaller head element while both lists have items.
    while (A != null && B != null && A.size() > 0 && B.size() > 0)
    {
        int min = A.get(0) <= B.get(0) ? A.remove(0) : B.remove(0);
        result.add(min);
    }
    // Drain whatever remains in A.
    while (A != null && A.size() > 0)
    {
        result.add(A.remove(0));
    }
    // Drain whatever remains in B.
    while (B != null && B.size() > 0)
    {
        result.add(B.remove(0));
    }
    return result;
}

 

Test cases: 

  1. Null, Null -> []
  2. Null, [] -> []
  3. [], Null -> []
  4. [], [] -> []
  5. [1], [] -> [1]
  6. [], [1] -> [1]
  7. [1], [1] -> [1, 1]
  8. [1], [2] -> [1, 2]
  9. [2], [1] -> [1, 2]
  10. [1], [2,3] -> [1, 2, 3]
  11. [1,2], [3] -> [1, 2, 3]
  12. [1,2,3], [] -> [1, 2, 3]
  13. [1,3], [2] -> [1, 2, 3]
  14. [], [1,2,3] -> [1, 2, 3]

Friday, January 13, 2023

 

A few more considerations for using S3 over a document store for basic storage operations.

The previous article introduced cost as the driving factor for leveraging simple storage, aka S3. The document store has many features but is priced based on read and write capacity units. All those features may not be necessary for mere create, update and delete of an object. This results in significant savings even on low-end applications, which typically have a monthly charge as follows:

 

API Gateway: 0.04 USD
Cognito: 10.00 USD
DynamoDB: 75.02 USD
S3: 2.07 USD
Lambda: 0.00 USD
Web Application Firewall: 8.00 USD

It is in this context that we strive to use S3 APIs for ordinary persistence.

The sample code below illustrates the use of the JavaScript SDK for these operations:

import { S3Client, ListObjectsCommand } from "@aws-sdk/client-s3";
import { CognitoIdentityClient } from "@aws-sdk/client-cognito-identity";
import { fromCognitoIdentityPool } from "@aws-sdk/credential-provider-cognito-identity";

const REGION = "us-west-2";

const s3 = new S3Client({
  region: REGION,
  credentials: fromCognitoIdentityPool({
    client: new CognitoIdentityClient({ region: REGION }),
    identityPoolId: "us-west-2:de827e1d-f9b6-4402-bd0e-c7bdce52d8c8",
  }),
});

const docsBucketName = "mybucket";

// Simple string hash used as a stand-in for a stronger digest such as MD5.
const hashCode = (s) =>
  s.split("").reduce((h, c) => (Math.imul(31, h) + c.charCodeAt(0)) | 0, 0);

export const getAllDocuments = async () => {
  try {
    const data = await s3.send(
      new ListObjectsCommand({ Delimiter: "/", Bucket: docsBucketName })
    );

    console.log(JSON.stringify(data, null, 4));
    let results = [];
    if (typeof data !== "undefined" && data.hasOwnProperty("Contents")) {
      results = data.Contents.map((item) => {
        // Derive a deterministic identifier from the object attributes.
        const identifier = item.Key + item.LastModified + item.Owner.ID;
        return {
          'FileSize': item.Size,
          'Name': item.Key,
          'Owner': item.Owner.ID,
          'DateUploaded': item.LastModified,
          'FileName': item.Key,
          'SK': 'Doc#BVNA',
          'PK': hashCode(identifier).toString(),
          'Thumbnail': '/images/LoremIpsum.jpg',
        };
      });
    }
    return results;
  } catch (err) {
    console.log("Error", err);
    return [];
  }
};

 

Unlike the document store, which returns a unique identifier for every item stored, here we must make our own identifier. The file contents and the file attributes together can produce this identifier if we leverage basic cryptographic hash functions such as MD5. Also, unlike the document store, there is no index. Tags and metadata are available for querying purposes, and it is possible to adjust just the tags for state management, but it is even better to record the operations on an uploaded object in a dedicated metadata object in the store, as sketched below.
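As a rough illustration of that identifier-plus-tags approach, the following sketch assumes a Node.js environment, a recent v3 JavaScript SDK, and placeholder bucket and key names; it derives an MD5-based identifier from the object contents and attributes and then records a state tag on the object:

import { createHash } from "crypto";
import {
  S3Client,
  GetObjectCommand,
  PutObjectTaggingCommand,
} from "@aws-sdk/client-s3";

const s3 = new S3Client({ region: "us-west-2" });
const Bucket = "mybucket";          // assumed bucket name
const Key = "docs/sample.pdf";      // hypothetical object key

// Derive an MD5-based identifier from the object contents and attributes.
const makeIdentifier = async () => {
  const obj = await s3.send(new GetObjectCommand({ Bucket, Key }));
  const body = await obj.Body.transformToByteArray();
  return createHash("md5")
    .update(body)
    .update(Key + obj.LastModified)
    .digest("hex");
};

// Record a state transition by adjusting just the tags on the object.
const markProcessed = async (identifier) => {
  await s3.send(
    new PutObjectTaggingCommand({
      Bucket,
      Key,
      Tagging: {
        TagSet: [
          { Key: "identifier", Value: identifier },
          { Key: "state", Value: "processed" },
        ],
      },
    })
  );
};

makeIdentifier().then(markProcessed).catch(console.error);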

Then, it is possible to query just the contents of that specific object with:

 

const S3 = require('aws-sdk/clients/s3');
const s3 = new S3({ region: 'us-west-2' });

// params must specify the Bucket, Key, Expression, ExpressionType,
// InputSerialization and OutputSerialization for the select query;
// a concrete example appears in the following entry.
s3.selectObjectContent(params, (err, data) => {
  if (err) {
    // handle error
    return;
  }
  const eventStream = data.Payload;
  eventStream.on('data', (event) => {
    if (event.Records) {
      // event.Records.Payload is a buffer containing
      // a single record, partial records, or multiple records
      process.stdout.write(event.Records.Payload.toString());
    } else if (event.Stats) {
      console.log(`Processed ${event.Stats.Details.BytesProcessed} bytes`);
    } else if (event.End) {
      console.log('SelectObjectContent completed');
    }
  });
  // Handle errors encountered during the API call
  eventStream.on('error', (err) => {
    switch (err.name) {
      // Check against specific error codes that need custom handling
    }
  });
  eventStream.on('end', () => {
    // Finished receiving events from S3
  });
});

 

This mechanism is sufficient for low overhead persistence of objects in the cloud.

Thursday, January 12, 2023

 

A motivation to use S3 over a document store:

Cost is one of the main drivers for the choice of cloud technologies. Unfortunately, programmability and functionality are often the developer's primary motivations. For example, a document store like DynamoDB is a fully managed NoSQL database service that provides fast and predictable performance with seamless scalability. It might be the convenient choice for schema-less storage, a table representation, and for its frequent pairing with an in-memory cache for low latency. But the operations taken on the resource stored in the table are often just plain and simple create, update, get and delete of the resource. For mere storage of such objects, a web-accessible store like S3 is sufficient.

When we calculate the cost of a small sized application, the monthly charges might appear something like this:

API Gateway: 0.04 USD
Cognito: 10.00 USD
DynamoDB: 75.02 USD
S3: 2.07 USD
Lambda: 0.00 USD
Web Application Firewall: 8.00 USD

In this case, the justification to use S3 is clear from the cost savings for low-overhead resources for which only cloud persistence is necessary.

It is in this context that application modernization has the potential to drive down costs by moving certain persistence to S3 instead of DynamoDB. The only consideration is that realizing these cost savings requires an S3 feature called Amazon S3 Select. The bookkeeping operations on the other objects can be achieved by querying a ledger object that receives progressive updates without deleting earlier entries; a rough sketch of such a ledger follows.
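As a minimal sketch of that ledger object, assuming the v2 JavaScript SDK and placeholder bucket and key names, each bookkeeping entry is appended as a new line and earlier entries are never deleted:

const S3 = require('aws-sdk/clients/s3');
const s3 = new S3({ region: 'us-west-2' });

const Bucket = 'mybucket';           // assumed bucket name
const Key = 'ledger/operations.log'; // hypothetical ledger object key

// Append one bookkeeping entry to the ledger without deleting earlier entries.
const appendToLedger = async (entry) => {
  let existing = '';
  try {
    const data = await s3.getObject({ Bucket, Key }).promise();
    existing = data.Body.toString('utf-8');
  } catch (err) {
    if (err.code !== 'NoSuchKey') throw err; // first entry: the ledger does not exist yet
  }
  const line = JSON.stringify({ ...entry, at: new Date().toISOString() });
  await s3.putObject({ Bucket, Key, Body: existing + line + '\n' }).promise();
};

// Example usage: record an update made to another object.
appendToLedger({ op: 'update', target: 'docs/sample.pdf' }).catch(console.error);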

Using Amazon S3 Select, we can query for a subset of data from an S3 object by using simple SQL expressions. The selectObjectContent API in the AWS SDK for JavaScript is used for this purpose.

Let us use a CSV file named target-file.csv as the key, uploaded as an object to the bucket named my-bucket in the us-west-2 region. This CSV contains entries with username and age attributes. If we were to select users with an age greater than 20, the SQL query would appear as:

SELECT username FROM S3Object WHERE cast(age as int) > 20

With the JavaScript SDK, we write this as:

const S3 = require('aws-sdk/clients/s3');
const s3 = new S3({ region: 'us-west-2' });

// Query parameters built from the example above; FileHeaderInfo: 'USE'
// lets the expression reference the CSV columns by name.
const params = {
  Bucket: 'my-bucket',
  Key: 'target-file.csv',
  ExpressionType: 'SQL',
  Expression: 'SELECT username FROM S3Object WHERE cast(age as int) > 20',
  InputSerialization: { CSV: { FileHeaderInfo: 'USE' } },
  OutputSerialization: { CSV: {} },
};

s3.selectObjectContent(params, (err, data) => {
  if (err) {
    // handle error
    return;
  }
  const eventStream = data.Payload;
  eventStream.on('data', (event) => {
    if (event.Records) {
      // event.Records.Payload is a buffer containing
      // a single record, partial records, or multiple records
      process.stdout.write(event.Records.Payload.toString());
    } else if (event.Stats) {
      console.log(`Processed ${event.Stats.Details.BytesProcessed} bytes`);
    } else if (event.End) {
      console.log('SelectObjectContent completed');
    }
  });
  // Handle errors encountered during the API call
  eventStream.on('error', (err) => {
    switch (err.name) {
      // Check against specific error codes that need custom handling
    }
  });
  eventStream.on('end', () => {
    // Finished receiving events from S3
  });
});

 

Wednesday, January 11, 2023

The infrastructural challenges of working with data modernization tools and products have often mandated simplicity in the overall deployment. Consider an application such as a Streaming Data Platform: its on-premises deployment includes several components for the ingestion store and the analytics computing platform, as well as the metrics and management dashboards, which are often independently sourced and require a great deal of tuning. The same applies to performance improvements in data lakes and event-driven frameworks, although by design they are elastic, pay-per-use, and scalable.

The solution integration for data modernization often deals with such challenges across heterogeneous products. Solutions often demand more simplicity and functionality from the product. There are also quite a few parallels to be drawn between solution integration with cloud services and the product development of data platforms and products. With such technical similarities, the barrier for product development of data products is lowered; simultaneously, the business needs to make it easier for the consumer to plug in the product for their data handling, driving the product upwards into the solution space, often referred to as the platform space.


Event-based data is by nature unstructured data. Data lakes are popular for storing and handling such data. A data lake is not a massive virtual data warehouse, but it powers a lot of analytics and is the centerpiece of most solutions that conform to the Big Data architectural style. A data lake must store petabytes of data while handling bandwidths of up to gigabytes of data transfer per second.


Data lakes may serve to reduce complexity in storing data, but they also introduce new challenges around managing, accessing, and analyzing data. Deployments fail without properly addressing these challenges, which include:

  • The process of procuring, managing and visualizing data assets is not easy to govern.

  • The ingestion and querying require performance and latency tuning from time to time.

  • The realization of business purpose in terms of time to value can vary, often involving custom coding.

These challenges are addressed by automation and best practices.

Tuesday, January 10, 2023

 Content Reselling Platform: 

Introduction:  

This is a proposal for a content provider to help with multi-outlet reselling. Content providers include experts who write books, publish videos, and provide other forms of digital content. Earlier, the media platforms of choice have been YouTube for videos and Instagram for images, but the limitations for these providers have included losses in copyright royalties, declining brand value, reliance on searches from users and unpredictable search rankings, and above all a commoditization of content.

While those content platforms have ubiquitous reachability, they do not give content resellers the ability to customize, repackage, and resell the content via dedicated or boutique channels. End users who must rely on HTTP access to the content also suffer from privacy concerns and website monitoring by networking providers.

On the other hand, consider mobile applications like "The Shift" and "Tripp" that are targeted at specific publishers, resellers, and their clientele who wish not only to own the redistribution of the content but also the channel and the source integration of that content via customizable but templatized applications, web services, and end-user interfaces. These applications serve to provide content at the convenience of wearable computing and with increased efficiency of time to market, offering low-code or no-code solutions over a consistent, governed, and managed multi-tenant platform and a revenue collection framework that is the envy of seller platforms and shopping frameworks.

Monday, January 9, 2023

IDP Integration with cloud

Yesterday, we discussed integrating an IDP with a membership directory. This article talks about integrating the IDP with the cloud.

The IDP integration with a cloud user pool is called identity federation. This is achieved by exchanging an artifact between the IDP and the cloud user pool using the Security Assertion Markup Language (SAML), a login standard that helps users access applications based on sessions established in another context.

Just like the article on IDP integration with a membership directory, the first step involves the creation of a domain. When specifying a domain name, it must be checked for availability.

The IDP will need to have a new application added, such as the SAML Test Connector, preferably one that responds with a signed response. Both the consumer URL validator and the consumer URL must be provided.

Parameters will need to be added to the configuration. The default parameter is the SAML NameID, but a custom parameter, the email address, will need to be added as well. This parameter will be included in the SAML assertion, making it easier for applications to locate and pass it through.

Once set up correctly, the SAML application will be available via the issuer URL. This URL must be specified as the metadata document URL in the cloud setup and establishes a reference to the SAML-based identity provider, aka the IDP.

A set of attribute mappings, like those done between the IDP and the membership provider, needs to be completed. In this case, the e-mail attribute would be mapped; a sketch of this step follows.
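As a rough sketch of this step, assuming an Amazon Cognito user pool, the v3 JavaScript SDK, and placeholder identifiers and URLs, the SAML provider and its attribute mapping could be registered programmatically as follows:

import {
  CognitoIdentityProviderClient,
  CreateIdentityProviderCommand,
} from "@aws-sdk/client-cognito-identity-provider";

const client = new CognitoIdentityProviderClient({ region: "us-west-2" });

// Register the SAML-based IDP on the user pool and map the email attribute.
const registerSamlProvider = async () => {
  await client.send(
    new CreateIdentityProviderCommand({
      UserPoolId: "us-west-2_EXAMPLE",       // placeholder user pool id
      ProviderName: "MySamlIdp",             // placeholder provider name
      ProviderType: "SAML",
      ProviderDetails: {
        // The issuer URL from the IDP, specified as the metadata document URL.
        MetadataURL: "https://idp.example.com/saml/metadata",
      },
      // Map the email attribute carried in the SAML assertion.
      AttributeMapping: { email: "email" },
    })
  );
};

registerSamlProvider().catch(console.error);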

Finally, an application client will be added. Generally, application clients will generate a client secret, but Node.js package manager (npm) based clients generate it themselves instead, so the checkbox must be left unchecked for application clients that use npm. The secure remote password protocol-based authentication must be configured as the authentication flow. An authentication flow that permits tokens to be refreshed will also need to be enabled.

The application client settings will be specific to the enabled identity providers, but generally they include a set of URLs for sign-in and sign-out. If the authorization protocol is OAuth 2.0, the subset of flows such as implicit grant, authorization code grant, and client credentials will need to be selectively enabled. The scopes for these flows will include email and openid. A sketch combining these application client settings follows.
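As a rough sketch of these settings, again assuming an Amazon Cognito user pool, the v3 JavaScript SDK, and placeholder names and URLs:

import {
  CognitoIdentityProviderClient,
  CreateUserPoolClientCommand,
} from "@aws-sdk/client-cognito-identity-provider";

const client = new CognitoIdentityProviderClient({ region: "us-west-2" });

const createAppClient = async () => {
  await client.send(
    new CreateUserPoolClientCommand({
      UserPoolId: "us-west-2_EXAMPLE",      // placeholder user pool id
      ClientName: "my-web-client",          // placeholder client name
      GenerateSecret: false,                // left unchecked for npm-based clients
      ExplicitAuthFlows: [
        "ALLOW_USER_SRP_AUTH",              // secure remote password authentication
        "ALLOW_REFRESH_TOKEN_AUTH",         // allow tokens to be refreshed
      ],
      SupportedIdentityProviders: ["MySamlIdp"],
      CallbackURLs: ["https://app.example.com/signin"],   // sign-in URL
      LogoutURLs: ["https://app.example.com/signout"],    // sign-out URL
      AllowedOAuthFlowsUserPoolClient: true,
      AllowedOAuthFlows: ["code", "implicit"],
      AllowedOAuthScopes: ["email", "openid"],
    })
  );
};

createAppClient().catch(console.error);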

A successful integration can be tested with a validator that checks for the JWT token, access token, and identity token along with the contact URL; for example, by decoding the token claims as sketched below.
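A minimal sketch of such a check, assuming a recent Node.js runtime and an ID token already obtained from the sign-in flow; a production validator would also verify the signature against the user pool's published keys:

// Decode the payload of a JWT (ID or access token) without verifying the signature.
const decodeJwtPayload = (token) => {
  const [, payload] = token.split(".");
  return JSON.parse(Buffer.from(payload, "base64url").toString("utf-8"));
};

// Example usage with a token supplied via an environment variable (assumed).
const claims = decodeJwtPayload(process.env.ID_TOKEN);
console.log(claims.email, claims.token_use, claims.iss, claims.exp);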

Sunday, January 8, 2023

 

How to Integrate an IDP with a membership directory?

Membership directories come in all forms and sizes, such as Active Directory, Google Workspace, LDAP-based directories, Workday, and other Human Resources applications. If we take the example of Google Workspace, it provides APIs as well as a management console that one can use to configure a directory for integration with an IDP.

The process of configuration usually begins with a domain name such as sampledomain.info or sampledomain.net, and these domain registrations are sometimes offered through the management console for a price of about twenty dollars or so. The registration process is automatic but does not happen instantaneously. It remains pending until an external authority registers it, which can take anywhere from one day to a week.

The next step in the configuration of the membership directory is the provisioning of an administrator user. This person will now have an email address with the newly created domain name. With this email-based credential, this person can start adding other users and set the maximum number of members in the directory. Once the directory is created, it will be available programmatically as well.

When the membership directory is ready, the process for integration can begin. This step requires going over to the IDP and creating an application of the type that the membership directory belongs to. Some membership directories are well suited to integrate with a specific IDP and make the automation extremely easy to trigger, follow through, and complete. All membership directories supported by an IDP would begin by asking for the domain name associated with the membership directory.

An authentication step is required to enable programmatic access to the membership directory via the consent page when the automation is triggered. This usually requires the same credentials as the administrator of the membership directory.

Once this configuration is initiated, the next step is to enable provisioning of users. This step is important because the IDP and the membership directory must be in sync. A person registering with the IDP and indicating the membership directory via the domain in their email address must be allowed to create a member in the membership directory. Usually, this is done by the IDP assigning the user a role that the membership directory authorizes to map to a member in the directory. If the member is not found, the role creates a new member for this purpose, as sketched below.
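As a rough sketch of that lookup-or-create behavior, assuming Google Workspace as the membership directory, the googleapis Node.js client, and service-account credentials with domain-wide delegation already configured (all names are placeholders):

const { google } = require("googleapis");

// Auth is assumed to be configured with domain-wide delegation to an admin user.
const auth = new google.auth.GoogleAuth({
  scopes: ["https://www.googleapis.com/auth/admin.directory.user"],
});

const directory = google.admin({ version: "directory_v1", auth });

// Look up the member by email; create one if it does not exist yet.
const ensureMember = async (email, givenName, familyName) => {
  try {
    const { data } = await directory.users.get({ userKey: email });
    return data; // member already exists
  } catch (err) {
    if (err.code !== 404) throw err;
  }
  const { data } = await directory.users.insert({
    requestBody: {
      primaryEmail: email,
      name: { givenName, familyName },
      password: "a-temporary-password", // placeholder; should be rotated on first login
    },
  });
  return data;
};

ensureMember("jane@sampledomain.info", "Jane", "Doe").catch(console.error);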

Enabling automatic provisioning helps during the rollout and keeps the IDP and membership directory in sync by creating, deleting, and editing the record corresponding to the member. Other configuration parameters must also be chosen at this time. These can include additional information about the potential member as well as the groups that they must be part of. The specification of the groups is also associated with rules that determine the default groups a user must be part of. These groups are also helpful when associated with roles.

Finally, the configuration on the IDP requires validation and testing by means of built-in checks as well as by exercising the creation of a new member. This new member must have all the attributes set by the IDP, and this can be verified from the console.