Saturday, January 14, 2023

#codingexercise

Merge two sorted lists: 

List<Integer> merge(List<Integer> A, List<Integer> B)
{
    List<Integer> result = new ArrayList<>();
    // Repeatedly take the smaller head element while both lists have entries.
    while (A != null && B != null && A.size() > 0 && B.size() > 0)
    {
        int min = A.get(0) <= B.get(0) ? A.remove(0) : B.remove(0);
        result.add(min);
    }
    // Drain whatever remains of A.
    while (A != null && A.size() > 0)
    {
        result.add(A.remove(0));
    }
    // Drain whatever remains of B.
    while (B != null && B.size() > 0)
    {
        result.add(B.remove(0));
    }
    return result;
}

 

Test cases: 

  1. Null, Null -> []
  2. Null, [] -> []
  3. [], Null -> []
  4. [], [] -> []
  5. [1], [] -> [1]
  6. [], [1] -> [1]
  7. [1], [1] -> [1, 1]
  8. [1], [2] -> [1, 2]
  9. [2], [1] -> [1, 2]
  10. [1], [2,3] -> [1, 2, 3]
  11. [1,2], [3] -> [1, 2, 3]
  12. [1,2,3], [] -> [1, 2, 3]
  13. [1,3], [2] -> [1, 2, 3]
  14. [], [1,2,3] -> [1, 2, 3]

Friday, January 13, 2023

 

A few more considerations for using S3 over a document store for basic storage operations.

The previous article introduced cost as the driving factor for leveraging simple storage, aka S3. The document store has many features but is priced on read and write capacity units, and all of those features may not be necessary for mere create, update and delete of an object. Using S3 instead results in significant savings even for low-end applications, which typically incur a monthly charge as follows:

 

API Gateway                  0.04 USD
Cognito                     10.00 USD
DynamoDB                    75.02 USD
S3                           2.07 USD
Lambda                       0.00 USD
Web Application Firewall     8.00 USD

It is in this context that we strive to use S3 APIs for ordinary persistence.

The sample code below illustrates the use of the JavaScript SDK for performing these operations:

import { S3Client, ListObjectsCommand } from "@aws-sdk/client-s3";
import { CognitoIdentityClient } from "@aws-sdk/client-cognito-identity";
import { fromCognitoIdentityPool } from "@aws-sdk/credential-provider-cognito-identity";

const REGION = "us-west-2";

// S3 client authorized through a Cognito identity pool.
const s3 = new S3Client({
  region: REGION,
  credentials: fromCognitoIdentityPool({
    client: new CognitoIdentityClient({ region: REGION }),
    identityPoolId: "us-west-2:de827e1d-f9b6-4402-bd0e-c7bdce52d8c8",
  }),
});

const docsBucketName = "mybucket";

export const getAllDocuments = async () => {
  try {
    // List the objects at the root of the bucket.
    const data = await s3.send(
      new ListObjectsCommand({ Delimiter: "/", Bucket: docsBucketName })
    );
    console.log(JSON.stringify(data, null, 4));

    let results = [];
    if (typeof data !== "undefined" && data.hasOwnProperty("Contents")) {
      results = data.Contents.map(function (item) {
        // Derive an identifier from the key, last-modified time and owner;
        // hashCode() is assumed to be a String helper defined elsewhere.
        const identifier = item.Key + item.LastModified + item.Owner.ID;
        return {
          'FileSize': item.Size,
          'Name': item.Key,
          'Owner': item.Owner.ID,
          'DateUploaded': item.LastModified,
          'FileName': item.Key,
          'SK': 'Doc#BVNA',
          'PK': identifier.hashCode().toString(),
          'Thumbnail': '/images/LoremIpsum.jpg'
        };
      });
    }
    return results;
  } catch (err) {
    console.log("Error", err);
    return [];
  }
};

 

Unlike the document store, which returns a unique identifier for every item stored, here we must construct our own identifier. The file contents and the file attributes together can yield this identifier if we leverage a basic cryptographic hash function such as MD5. Also, unlike the document store, there is no index. Tags and metadata are available for querying, and it is possible to adjust just the tags for state management, but it is even better to record the operations on an uploaded object in a dedicated metadata object in the bucket.
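As a minimal sketch of that hashing idea, the helper below derives an identifier from an object's key, size and last-modified time using Node's built-in crypto module; makeIdentifier is a hypothetical helper, not part of the SDK, and the attribute names mirror the ListObjects response used above.

import { createHash } from "crypto";

// Builds a stable identifier for a listed S3 object from its attributes.
function makeIdentifier(item) {
  return createHash("md5")
    .update(item.Key + item.Size + item.LastModified)
    .digest("hex");
}

A record's PK could then be makeIdentifier(item) instead of an ad hoc string hash.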

Then, it is possible to query just the contents of that specific object with:

 

const S3 = require('aws-sdk/clients/s3');
const s3 = new S3({ region: 'us-west-2' });

// Parameters for selecting from the dedicated metadata object;
// the bucket, key and expression below are placeholders.
const params = {
  Bucket: 'mybucket',
  Key: 'metadata.json',
  ExpressionType: 'SQL',
  Expression: 'SELECT * FROM S3Object s',
  InputSerialization: { JSON: { Type: 'LINES' } },
  OutputSerialization: { JSON: {} },
};

s3.selectObjectContent(params, (err, data) => {
  if (err) {
    // handle error
    return;
  }
  const eventStream = data.Payload;
  eventStream.on('data', (event) => {
    if (event.Records) {
      // event.Records.Payload is a buffer containing
      // a single record, partial records, or multiple records
      process.stdout.write(event.Records.Payload.toString());
    } else if (event.Stats) {
      console.log(`Processed ${event.Stats.Details.BytesProcessed} bytes`);
    } else if (event.End) {
      console.log('SelectObjectContent completed');
    }
  });
  // Handle errors encountered during the API call
  eventStream.on('error', (err) => {
    switch (err.name) {
      // Check against specific error codes that need custom handling
    }
  });
  eventStream.on('end', () => {
    // Finished receiving events from S3
  });
});

 

This mechanism is sufficient for low-overhead persistence of objects in the cloud.

Thursday, January 12, 2023

 

A motivation to use S3 over document store:

Cost is one of the main drivers for the choice of cloud technologies. Programmability and functionality, however, are what tend to motivate developers. For example, a document store like DynamoDB is a fully managed NoSQL database service that provides fast and predictable performance with seamless scalability. It might be the convenient choice for schema-less storage, a table representation, and its frequent pairing with an in-memory cache for low latency. But when the operations taken on the resource stored in the table are plain and simple create, update, get and delete, a web-accessible store like S3 is sufficient for storing such objects.
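To make that concrete, here is a minimal sketch of those create, get and delete operations against S3 with the AWS SDK for JavaScript v3; the bucket name and object key are illustrative.

import {
  S3Client,
  PutObjectCommand,
  GetObjectCommand,
  DeleteObjectCommand,
} from "@aws-sdk/client-s3";

const s3 = new S3Client({ region: "us-west-2" });
const Bucket = "mybucket";          // illustrative bucket name
const Key = "items/item1.json";     // illustrative object key

async function demoCrud() {
  // Create or update: write the serialized resource as an object.
  await s3.send(new PutObjectCommand({ Bucket, Key, Body: JSON.stringify({ name: "item1" }) }));

  // Get: read the object back and parse it.
  const { Body } = await s3.send(new GetObjectCommand({ Bucket, Key }));
  console.log(JSON.parse(await Body.transformToString()));

  // Delete: remove the object.
  await s3.send(new DeleteObjectCommand({ Bucket, Key }));
}

demoCrud().catch(console.error);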

When we calculate the cost of a small sized application, the monthly charges might appear something like this:

API Gateway                  0.04 USD
Cognito                     10.00 USD
DynamoDB                    75.02 USD
S3                           2.07 USD
Lambda                       0.00 USD
Web Application Firewall     8.00 USD

In this case, the justification for using S3 is clear from the cost savings for low-overhead resources for which only cloud persistence is necessary.

It is in this context that application modernization has the potential to drive down costs by moving certain persistence to S3 instead of DynamoDB. The only consideration is that realizing these cost savings requires a relatively new feature of S3 called Amazon S3 Select. The bookkeeping operations on the other objects can be achieved by querying a ledger object that accumulates progressive updates without deleting earlier entries, as sketched below.
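As a minimal sketch of that ledger idea, assuming a JSON-lines object named ledger.jsonl in the same bucket (both names are hypothetical), an update appends an entry by rewriting the object with one new line added:

const S3 = require('aws-sdk/clients/s3');
const s3 = new S3({ region: 'us-west-2' });

// Appends one bookkeeping entry to the ledger object without
// touching earlier entries (read, append a line, write back).
async function appendToLedger(entry) {
  const params = { Bucket: 'mybucket', Key: 'ledger.jsonl' };
  let existing = '';
  try {
    const data = await s3.getObject(params).promise();
    existing = data.Body.toString();
  } catch (err) {
    if (err.code !== 'NoSuchKey') throw err; // the first entry creates the object
  }
  const updated = existing + JSON.stringify(entry) + '\n';
  await s3.putObject({ ...params, Body: updated }).promise();
}

Because earlier lines are never removed, the ledger can later be filtered with S3 Select, as described next.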

Using Amazon S3 Select, we can query for a subset of data from an S3 object by using simple SQL expressions. The selectObjectContent API in the AWS SDK for JavaScript is used for this purpose.

Let us use a CSV file uploaded as an S3 object with the key target-file.csv in the bucket named my-bucket in the us-west-2 region. This CSV contains entries with username and age attributes. If we were to select users with an age greater than 20, the SQL query would appear as

SELECT username FROM S3Object WHERE cast(age as int) > 20

With the JavaScript SDK, we write this as:

const S3 = require('aws-sdk/clients/s3');
const s3 = new S3({ region: 'us-west-2' });

// Select users older than 20 from the uploaded CSV;
// the header row supplies the column names for the expression.
const params = {
  Bucket: 'my-bucket',
  Key: 'target-file.csv',
  ExpressionType: 'SQL',
  Expression: 'SELECT username FROM S3Object WHERE cast(age as int) > 20',
  InputSerialization: { CSV: { FileHeaderInfo: 'USE' } },
  OutputSerialization: { CSV: {} },
};

s3.selectObjectContent(params, (err, data) => {
  if (err) {
    // handle error
    return;
  }
  const eventStream = data.Payload;
  eventStream.on('data', (event) => {
    if (event.Records) {
      // event.Records.Payload is a buffer containing
      // a single record, partial records, or multiple records
      process.stdout.write(event.Records.Payload.toString());
    } else if (event.Stats) {
      console.log(`Processed ${event.Stats.Details.BytesProcessed} bytes`);
    } else if (event.End) {
      console.log('SelectObjectContent completed');
    }
  });
  // Handle errors encountered during the API call
  eventStream.on('error', (err) => {
    switch (err.name) {
      // Check against specific error codes that need custom handling
    }
  });
  eventStream.on('end', () => {
    // Finished receiving events from S3
  });
});

 

Wednesday, January 11, 2023

The infrastructural challenges of working with data modernization tools and products have often mandated simplicity in the overall deployment. Consider an application such as a streaming data platform: its on-premises deployment includes several components for the ingestion store and the analytics computing platform, as well as the metrics and management dashboards, which are often independently sourced and require a great deal of tuning. The same applies to performance improvements in data lakes and event-driven frameworks, although by design they are elastic, pay-per-use and scalable.

Solution integration for data modernization often deals with such challenges across heterogeneous products. Solutions often demand more simplicity and functionality from the product. There are also quite a few parallels between solution integration with cloud services and the product development of data platforms and products. With such technical similarities, the barrier to developing data products is lowered, and at the same time the business need to make it easier for consumers to plug in the product for their data handling drives the product upwards into the solution space, often referred to as the platform space.


Event-based data is by nature unstructured. Data lakes are popular for storing and handling such data. A data lake is not a massive virtual data warehouse, but it powers a lot of analytics and is the centerpiece of most solutions that conform to the Big Data architectural style. A data lake must store petabytes of data while handling bandwidths of up to gigabytes of data transfer per second.


Data lakes may serve to reduce the complexity of storing data, but they also introduce new challenges around managing, accessing, and analyzing data. Deployments fail without properly addressing these challenges, which include:

  • The process of procuring, managing and visualizing data assets is not easy to govern. 

  • The ingestion and querying require performance and latency tuning from time to time. 

  • The realization of business purpose, in terms of time to value, can vary and often involves coding. 

These are addressed by automations and best practices.