Monday, March 25, 2019

Today we continue discussing best practices in storage engineering:

620) When a new key is added, it may not impact existing keys, but it does affect the overall space consumption of the listing, depending on the size of the key and the number of entries.

621) Keys can have as many fields as necessary. However, lookups are faster when there are only a few fields to compare.

622) Key comparison can be partial or full. Partial keys are useful for matching duplicates. The number of keys that share the same subkeys can be large, which makes this form of comparison very helpful for grouping entries.

623) Grouping entries also helps with entries that, based on their subkeys, span groups; these comparisons work across groups.

624) The number of entries may grow very large, but the prefix can include more subkeys to narrow the search. This makes searches over these listings efficient, as the sketch after this list illustrates.

625) The number of entries also does not matter, regardless of the number of keys in each entry, as long as the prefix uses a small set of subkeys.

626) Listings are great to use when they are in a single location. However, they are often scoped to a parent container, and if the parent containers are distributed, the listings tend to be multiple. In such cases the effort is repeated.
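
To make items 622 through 624 concrete, here is a minimal sketch, assuming a key is simply an ordered list of subkey fields; the type and function names are illustrative rather than from any particular product:

#include <algorithm>
#include <iostream>
#include <map>
#include <string>
#include <vector>

// A key is an ordered list of subkey fields; comparing only the first
// prefixLen fields gives the partial comparison described in 622.
using Key = std::vector<std::string>;

// Partial comparison: do two keys agree on their first prefixLen subkeys?
bool PrefixEquals(const Key& a, const Key& b, size_t prefixLen)
{
    return a.size() >= prefixLen && b.size() >= prefixLen &&
           std::equal(a.begin(), a.begin() + prefixLen, b.begin());
}

// Group the entries of a listing by a prefix of their keys; entries that
// share the same leading subkeys fall into the same group.
std::map<Key, std::vector<Key>> GroupByPrefix(const std::vector<Key>& listing,
                                              size_t prefixLen)
{
    std::map<Key, std::vector<Key>> groups;
    for (const auto& key : listing)
    {
        Key prefix(key.begin(), key.begin() + std::min(prefixLen, key.size()));
        groups[prefix].push_back(key);
    }
    return groups;
}

int main()
{
    std::vector<Key> listing = {
        {"tenant1", "bucketA", "obj1"},
        {"tenant1", "bucketA", "obj2"},
        {"tenant1", "bucketB", "obj3"},
    };
    // Grouping on the first two subkeys puts obj1 and obj2 together.
    std::cout << GroupByPrefix(listing, 2).size() << " groups\n"; // 2 groups
    return 0;
}

A longer prefix narrows the search further, which is why a small, well-chosen set of subkeys keeps lookups efficient regardless of how many entries the listing holds.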

#codingexercise
Find the number of paths in a matrix from cell (x, y) to the origin (0, 0), moving one step left, down, or diagonally toward the origin:

int GetPaths(int x, int y)
{
    // Along an edge of the matrix there is exactly one path to the origin.
    if (x <= 0 || y <= 0)
        return 1;

    // Sum the paths over the three possible directions.
    return GetPaths(x - 1, y) +
           GetPaths(x - 1, y - 1) +
           GetPaths(x, y - 1);
}
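
As a sanity check, GetPaths(1, 1) returns 3 and GetPaths(2, 2) returns 13; with the diagonal step allowed, these counts match the Delannoy numbers.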


Saturday, March 23, 2019

Today we continue discussing best practices in storage engineering:

611) Background tasks may sometimes need to catch up with current activities. To accommodate the delay, they may either be run up front, so that the changes to be processed remain incremental, or they may be increased in number to divide up the work.

612) The results from the background tasks mentioned above might also take a long time to accumulate. They can be made available as they appear or batched.

613) The load balancer works very well to help background tasks catch up: it avoids overloading any single task and distributes the online activities so that the background tasks carry a light load.

614) The number of background tasks, or their type, should not affect online activities. However, systems have been known to be impacted when such tasks consume memory or delay garbage collection.

615) There is no specific mitigation for one or more background tasks that take up a large share of common resources, but they are generally written to be fault tolerant so that they can pick up from where they left off.
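
As a minimal sketch of item 615, the loop below persists a cursor to a small checkpoint file after each unit of work, so a restarted task resumes from where it left off; the file name and ProcessItem are hypothetical placeholders:

#include <fstream>
#include <iostream>
#include <string>
#include <vector>

const char* kCheckpointFile = "task.checkpoint";

// Read the index of the next item to process; absent a checkpoint, start over.
size_t LoadCheckpoint()
{
    std::ifstream in(kCheckpointFile);
    size_t done = 0;
    if (in >> done)
        return done;
    return 0;
}

// Persist progress durably so a crash loses at most the current item.
void SaveCheckpoint(size_t done)
{
    std::ofstream out(kCheckpointFile, std::ios::trunc);
    out << done;
}

void ProcessItem(const std::string& item)
{
    std::cout << "processing " << item << "\n";
}

int main()
{
    std::vector<std::string> work = {"a", "b", "c", "d"};
    // Pick up from where the task left off, even after a crash or restart.
    for (size_t i = LoadCheckpoint(); i < work.size(); ++i)
    {
        ProcessItem(work[i]);
        SaveCheckpoint(i + 1);
    }
    return 0;
}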

Friday, March 22, 2019

Today we continue discussing best practices in storage engineering:

606) We use data structures to keep the information we want to access in a convenient form. When such a structure is persisted, it mitigates faults in the processing. However, each persisted artifact brings in additional chores and maintenance. On the other hand, it is cheaper to execute the logic that produces the artifact, and that logic can be versioned. Therefore, when there is a trade-off between compute and storage for numerous small and cheap artifacts, it is better to generate them dynamically (a sketch follows this list).

607) The above has far-reaching impact when a number of layers are involved and a cost incurred in a lower layer bubbles up to the top layer.

608) Compute tends to be distributed in nature while storage tends to be local. They can be mutually exclusive in this regard.

609) Compute-oriented processing can scale up or out, while storage has to scale out.

610) Compute-oriented processing can be given priority, but storage tends to remain within its class.

611) Background tasks may sometimes need to catch up with current activities. To accommodate the delay, they may either be run up front, so that the changes to be processed remain incremental, or they may be increased in number to divide up the work.

612) The results from the background tasks mentioned above might also take a long time to accumulate. They can be made available as they appear or batched.

613) The load balancer works very well to help background tasks catch up: it avoids overloading any single task and distributes the online activities so that the background tasks carry a light load.
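
As a minimal sketch of item 606, consider a small derived artifact, here a hypothetical per-object summary: rather than persisting it and taking on the chores of maintaining it, the cheap logic that produces it is kept, versioned, and run on demand:

#include <iostream>
#include <string>

// Hypothetical cheap, versioned logic that derives a small artifact from
// the source data each time it is needed.
std::string BuildSummary(const std::string& objectName)
{
    return "summary-v2 of " + objectName;
}

int main()
{
    // Nothing is persisted: there is no extra artifact to back up, migrate,
    // or repair, and moving to v3 of the logic requires no data migration.
    std::cout << BuildSummary("bucketA/obj1") << "\n";
    return 0;
}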

Thursday, March 21, 2019

Today we continue discussing best practices in storage engineering:

600) As with any product, a storage product also qualifies for the Specific-Measurable-Attainable-Realistic-Timely (SMART) process, where improvements can be measured and the feedback used to improve both the process and the product.

601) As with each and every process, there is some onus, but the rewards generally outweigh the costs when the process is reasoned about and accepted by all. The Six Sigma process, for example, sets a high bar for quality because it eliminates errors progressively.

602) The iterations for Six Sigma were numerous, so it took longer and the results were not always available in the interim. Agile development processes allowed results to be incremental.

603) The agile methodology improved the iterations over features in such a way that they did not impact the rest of the product. This enables faster feature development.

604) The continuous integration and continuous deployment model made individual feature improvements available for use because the changes were built, tested, and deployed in lock step with development.

605) Together, making improvements one change at a time and having each change built, tested, and deployed gives great benefit to the overall product.

606) We use data structures to keep the information we want to access in a convenient form. When such a structure is persisted, it mitigates faults in the processing. However, each persisted artifact brings in additional chores and maintenance. On the other hand, it is cheaper to execute the logic that produces the artifact, and that logic can be versioned. Therefore, when there is a trade-off between compute and storage for numerous small and cheap artifacts, it is better to generate them dynamically.

Wednesday, March 20, 2019

We were discussing the S3 API:

Virtually all storage providers, in cloud and on-premise storage solutions alike, support the S3 Application Programming Interface. With the help of this API, applications can send and receive data without having to worry about storage best practices. They are also able to switch from one storage appliance to another, one on-premise cluster to another, one cloud provider to another, and so on. The API was introduced by Amazon but has become the industry standard, accepted by many storage providers. Even competitors like Azure provide an S3 proxy to allow applications to access their storage with this API.

S3, like any other cloud-based service, has been developed with Representational State Transfer (REST) best practices. However, it does not involve all the provisions of HTTP 2.0 (released) or 3.0 (in progress). Neither does it provide a protocol-like abstraction where layers can be added above or below in a storage stack. A networking stack, on the other hand, has dedicated protocols for each of its layers. A storage stack may comprise, at the very least, an active-workload layer and a backup-workload layer, where the active layer sits higher and can support, say, HTTP, while the backup layer sits lower and supports S3. Perhaps this distinction has been somewhat obfuscated, since object storage can expand its capabilities to both layers.

The S3 API gives developers no affordances for positioning object storage as an object queue, an object cache, an object query engine, a gateway, a log index store, or any of many other such capabilities. API best practices enable automated monitoring, logging, auditing, and more.

If a new storage class is added and its functionality is not on par with the regular S3 storage, then it would warrant an entirely new set of APIs, and these would preferably have a prefix to differentiate them from the rest.

If the storage stack is layered from the active, regular S3 storage at the top, through the less frequently used storage classes below it, down to the glacier or least-used data as the last layer, then the aging of data alone is sufficient to migrate it from the top layer all the way to the bottom without any involvement of the API. That said, the API could give users visibility into the contents of each storage class, along with the additional functionality of placing objects directly in those classes or evicting them. Since the nature of the storage class differentiates the API set, and we decided to use prefix-based API naming conventions to indicate the differentiation, each storage class adds a new set to the existing APIs. On the other hand, policies common to all three storage classes, or functionality that stripes across layers, will be provided either with request attributes targeting a layer and the layers below it or with the help of parameters.
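
Here is a minimal sketch of the prefix-based naming convention and the age-driven migration described above; the class names, prefixes, and age thresholds are illustrative assumptions, not actual S3 definitions:

#include <iostream>
#include <string>

enum class StorageClass { Regular, Infrequent, Glacier };

// Prefix-based naming: each storage class adds its own set of APIs.
std::string ApiName(StorageClass cls, const std::string& operation)
{
    switch (cls)
    {
    case StorageClass::Regular:    return operation;             // e.g. "PutObject"
    case StorageClass::Infrequent: return "IA" + operation;      // e.g. "IAPutObject"
    case StorageClass::Glacier:    return "Glacier" + operation; // e.g. "GlacierPutObject"
    }
    return operation;
}

// Age-based tiering: data migrates down the stack with no API involvement.
StorageClass TierForAge(int ageInDays)
{
    if (ageInDays < 30)  return StorageClass::Regular;
    if (ageInDays < 180) return StorageClass::Infrequent;
    return StorageClass::Glacier;
}

int main()
{
    std::cout << ApiName(TierForAge(200), "GetObject") << "\n"; // GlacierGetObject
    return 0;
}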

Functionality such as deduplication, rsync-style incremental backups, compression, and management will require new APIs, and these do not have to be limited to objects in any one storage class. APIs that automate the workflow of calling more than one API can also be written as coarse-grained APIs. These wrapped APIs can collate functionality for a single layer or across layers. They can also include automation not specific to the control or data path of the storage stack. Together, the new functionality and the wrapped APIs can become one whole set.
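
A minimal sketch of such a wrapped, coarse-grained API follows; the three underlying operations are hypothetical stand-ins rather than real S3 calls:

#include <iostream>
#include <string>

// Hypothetical fine-grained operations, none limited to one storage class.
bool Deduplicate(const std::string& bucket)
{
    std::cout << "dedupe " << bucket << "\n";
    return true;
}

bool Compress(const std::string& bucket)
{
    std::cout << "compress " << bucket << "\n";
    return true;
}

bool IncrementalBackup(const std::string& bucket, const std::string& target)
{
    std::cout << "rsync-style backup of " << bucket << " to " << target << "\n";
    return true;
}

// One wrapped API automates the workflow of calling several APIs,
// collating functionality for a single layer or across layers.
bool OptimizeAndBackup(const std::string& bucket, const std::string& target)
{
    return Deduplicate(bucket) && Compress(bucket) &&
           IncrementalBackup(bucket, target);
}

int main()
{
    OptimizeAndBackup("bucketA", "backup-site");
    return 0;
}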

The S3 API can become a protocol for all storage functionality. Features can be organized as a flat list or by resource-path-qualified functionality, where the resource path may pertain to storage classes. These APIs could also support discoverability, as well as nuances specific to file protocols and content addressability.

https://1drv.ms/w/s!Ashlm-Nw-wnWuT-u1f7DRjBRuvD4

Tuesday, March 19, 2019

We were discussing the S3 API from the previous post.

The S3 API gives developers no affordances for positioning object storage as an object queue, an object cache, an object query engine, a gateway, a log index store, or any of many other such capabilities. API best practices enable automated monitoring, logging, auditing, and more.
This makes the S3 API more useful and mashable than it is today. It highlights the notion that storage is no longer an appliance, a product, or a cloud, but a layer that simplifies and solves most application storage needs without any involvement. Perhaps then it could be called a simpler storage service.
Let us take a closer look at how the API could be structured to facilitate calls to the storage stack in general, calls to an individual storage class, or calls that bridge lateral functionality within the same class.
The S3 command guide provides the core functionality for buckets and objects: the listing, creating, updating, and deleting of objects; the setting of access control lists; the setting of bucket policy, the expiration period, or the life-cycle; and multi-part uploads.
If a new storage class is added and its functionality is not on par with the regular S3 storage, then it would warrant an entirely new set of APIs, and these would preferably have a prefix to differentiate them from the rest.
If the storage stack is layered from the active, regular S3 storage at the top, through the less frequently used storage classes below it, down to the glacier or least-used data as the last layer, then the aging of data alone is sufficient to migrate it from the top layer all the way to the bottom without any involvement of the API. That said, the API could give users visibility into the contents of each storage class, along with the additional functionality of placing objects directly in those classes or evicting them. Since the nature of the storage class differentiates the API set, and we decided to use prefix-based API naming conventions to indicate the differentiation, each storage class adds a new set to the existing APIs. On the other hand, policies common to all three storage classes, or functionality that stripes across layers, will be provided either with request attributes targeting a layer and the layers below it or with the help of parameters.

Monday, March 18, 2019

The Simple Storage Service (S3) API
Virtually all storage providers, in cloud and on-premise storage solutions alike, support the S3 Application Programming Interface. With the help of this API, applications can send and receive data without having to worry about storage best practices. They are also able to switch from one storage appliance to another, one on-premise cluster to another, one cloud provider to another, and so on. The API was introduced by Amazon but has become the industry standard, accepted by many storage providers. Even competitors like Azure provide an S3 proxy to allow applications to access their storage with this API.
This API enjoys widespread popularity for storing binary large objects, also called blobs for short, but that has not stopped providers from putting an S3 façade over other forms of storage. They win an audience with S3, since the application changes are minimal, and retrofit their storage solutions to support the most used aspects of S3.
S3, however, is neither a restriction on the storage type nor true to its roots in object storage. It provides the ability to navigate buckets and objects and adds improvements, such as multi-part upload, to the way data is sent into the object storage. It supports various headers to make the most of the object storage, including sending options along with the payload as well as retrieving information on the stored objects.
S3, like any other cloud-based service, has been developed with Representational State Transfer (REST) best practices. However, it does not involve all the provisions of HTTP 2.0 (released) or 3.0 (in progress). Neither does it provide a protocol-like abstraction where layers can be added above or below in a storage stack. A networking stack, on the other hand, has dedicated protocols for each of its layers. A storage stack may comprise, at the very least, an active-workload layer and a backup-workload layer, where the active layer sits higher and can support, say, HTTP, while the backup layer sits lower and supports S3. Perhaps this distinction has been somewhat obfuscated, since object storage can expand its capabilities to both layers.
Object storage can also support the notions of storage classes, versioning, retention, data aging, deduplication, and compression, yet none of these are featured very well by S3 without a header here, a prefix there, a tag, or a path. This leads to incredible inconsistency, not just between the features but also in their usage by customers.
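
To illustrate that inconsistency, here is a minimal sketch of a single upload touching several of these mechanisms; the request type is hypothetical, though the header names follow the public S3 conventions:

#include <iostream>
#include <map>
#include <string>

struct PutRequest
{
    std::string bucket;
    std::string key;
    std::map<std::string, std::string> headers;
};

int main()
{
    PutRequest put{"bucketA", "obj1", {}};
    // Storage class: a header on the individual request.
    put.headers["x-amz-storage-class"] = "GLACIER";
    // Tags: another header, with its own key=value encoding.
    put.headers["x-amz-tagging"] = "retention=long";
    // Versioning, life-cycle, and retention: separate bucket-level
    // configuration documents, not request headers at all.
    std::cout << "PUT /" << put.bucket << "/" << put.key << "\n";
    for (const auto& [name, value] : put.headers)
        std::cout << name << ": " << value << "\n";
    return 0;
}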