Thursday, March 21, 2019

Today we continue discussing the best practices from storage engineering:

600) As with any product, a storage product also qualifies for the Specific-Measurable-Attainable-Realistic-Timely (SMART) process, where improvements can be measured and the feedback used to improve both the process and the product.

601) As with each and every process, there is some overhead, but the rewards generally outweigh the costs when the process is reasoned about and accepted by all. The Six Sigma process, for example, sets a high bar for quality because it progressively eliminates errors.

602) Six Sigma requires many iterations, so it takes longer and results are not always available in the interim. Agile development processes, by contrast, deliver results incrementally.

603) The agile methodology improved iteration over features in such a way that work on one feature did not impact the rest of the product. This enables faster feature development.

604) The continuous integration and continuous deployment (CI/CD) model made individual feature improvements available for use because changes were built, tested, and deployed in lock step with development.

605) Together, the practice of making improvements one change at a time and having each change built, tested, and deployed brings great benefit to the overall product.

606) We use data structures to keep the information we want to access in a convenient form. When this information is persisted, it mitigates faults in processing. However, each such artifact brings additional chores and maintenance. On the other hand, it is cheaper to execute the logic, and the logic can be versioned. Therefore, when there is a trade-off between compute and storage, favoring recomputation with versioned logic over maintaining persisted artifacts is often the better choice.
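A minimal sketch of this trade-off in Python, where expensive_transform and the cache path are hypothetical stand-ins: when the persisted artifact exists it is read back, otherwise the versioned logic recomputes the result.

import json
import os

def expensive_transform(records):
    # stand-in for logic that is cheap to version and re-run
    return [r * 2 for r in records]

CACHE_PATH = "transform_cache.json"  # hypothetical persisted artifact

def get_result(records, use_cache=True):
    if use_cache and os.path.exists(CACHE_PATH):
        # storage side of the trade-off: fast to read, but an artifact to maintain
        with open(CACHE_PATH) as f:
            return json.load(f)
    result = expensive_transform(records)  # compute side: versioned with the code
    if use_cache:
        with open(CACHE_PATH, "w") as f:
            json.dump(result, f)
    return result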

Wednesday, March 20, 2019

We were discussing the S3 API:

Virtually all storage providers, in cloud and on-premise storage solutions, support the S3 Application Programming Interface. With the help of this API, applications can send and receive data without having to worry about storage best practices. They are also able to switch from one storage appliance to another, one on-premise cluster to another, one cloud provider to another, and so on. The API was introduced by Amazon but has become an industry standard accepted by many storage providers. Even competitors like Azure provide an S3 proxy to allow applications to access their storage with this API.
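As a minimal sketch, assuming boto3 is available and the endpoint and bucket names are placeholders, the same application code can switch backends by changing only the endpoint:

import boto3

def make_client(endpoint_url=None):
    # None targets Amazon S3; any S3-compatible endpoint works the same way
    return boto3.client("s3", endpoint_url=endpoint_url)

aws = make_client()
onprem = make_client("http://minio.local:9000")  # hypothetical on-premise endpoint

for client in (aws, onprem):
    client.put_object(Bucket="demo-bucket", Key="hello.txt",
                      Body=b"same code, different backend")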

S3, like any other cloud-based service, has been developed with Representational State Transfer (REST) best practices. However, it does not involve all the provisions of HTTP/2 (released) or HTTP/3 (in progress). Neither does it provide a protocol-like abstraction where layers can be added above or below in a storage stack. A networking stack, on the other hand, has dedicated protocols for each of its layers. A storage stack may comprise, say, at the very least, an active workload layer and a backup workload layer, where the active layer remains the higher layer and can support, say, HTTP, and the backup remains the lower layer and supports S3. Perhaps this distinction has been somewhat obfuscated, as object storage can expand its capabilities to both layers.

The S3 API gives developers no guidance on how object storage can be positioned as an object queue, an object cache, an object query engine, a gateway, a log index store, and many other such capabilities. API best practices enable automated monitoring, logging, auditing, and more.

If a new storage class is added and its functionality is not on par with the regular S3 storage, then it would need an entirely new set of APIs, and these would preferably have a prefix to differentiate them from the rest.

If the storage stack is layered from the active regular S3 storage at the top, to the less frequently used storage classes at a lower level, and finally to the glacier or least used data as the last layer, then aging of data alone is sufficient to migrate objects from the top layer all the way to the bottom without any involvement of the API. That said, the API could provide visibility to users on the contents of each storage class, along with the additional functionality of direct placement of objects in those classes or their eviction. Since the nature of the storage class differentiates the API set, and we decided to use prefix-based API naming conventions to indicate that differentiation, each storage class adds a new set to the existing APIs. On the other hand, policies common to all three storage classes, or functionality that stripes across layers, will be provided either with request attributes targeting a layer and its lower layers or with the help of parameters.
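As a sketch of aging without API involvement, assuming boto3 and a hypothetical bucket, a single life-cycle policy can move objects down the stack on its own:

import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="demo-bucket",  # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [{
            "ID": "age-down-the-stack",
            "Status": "Enabled",
            "Filter": {"Prefix": ""},  # apply to all objects
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},  # less frequently used layer
                {"Days": 90, "StorageClass": "GLACIER"},      # least used layer
            ],
        }]
    },
)

Once the rule is in place, no per-object calls are needed for the migration itself.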

Functionalities such as deduplication, rsync-style incremental backups, compression, and management will require new APIs, and these do not have to be limited to objects in any one storage class. APIs that automate a workflow of calling more than one API can also be written as coarse-grained APIs. These wrapped APIs can collate functionality for a single layer or across layers. They can also include automation not specific to the control or data path of the storage stack. Together, the new functionality and the wrapped APIs can become one whole set.
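A minimal sketch of such a wrapped API, assuming boto3 and hypothetical bucket names, composes list, copy, and delete into one coarse-grained workflow call:

import boto3

s3 = boto3.client("s3")

def archive_prefix(bucket, prefix, archive_bucket):
    # hypothetical wrapped API: one call that collates several S3 calls
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            s3.copy_object(CopySource={"Bucket": bucket, "Key": obj["Key"]},
                           Bucket=archive_bucket, Key=obj["Key"],
                           StorageClass="STANDARD_IA")
            s3.delete_object(Bucket=bucket, Key=obj["Key"])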

The S3 API can become a protocol for all storage functionalities. The features can be organized as a flat list or by resource-path-qualified functionalities, where the resource path may pertain to storage classes. These APIs could also support discoverability. And they could support nuances specific to file protocols and content addressability.

https://1drv.ms/w/s!Ashlm-Nw-wnWuT-u1f7DRjBRuvD4

Tuesday, March 19, 2019

We were discussing the S3 API from the previous post.

The S3 API gives developers no guidance on how object storage can be positioned as an object queue, an object cache, an object query engine, a gateway, a log index store, and many other such capabilities. API best practices enable automated monitoring, logging, auditing, and more.
This makes the S3 API more useful and mashable than it is today. It highlights the notion that storage is no longer an appliance, a product, or a cloud but a layer that simplifies and solves most application storage needs without any involvement from the application. Perhaps then it could be called a simpler storage service.
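As an illustration of one such positioning, here is a minimal sketch of an object queue built on plain S3 calls, assuming boto3 and a hypothetical bucket; it is not atomic and is only meant to show the idea:

import time
import boto3

s3 = boto3.client("s3")
QUEUE_BUCKET = "demo-queue-bucket"  # hypothetical

def enqueue(payload):
    # a timestamp-sortable key gives approximate FIFO ordering under listing
    key = "queue/%020d" % time.time_ns()
    s3.put_object(Bucket=QUEUE_BUCKET, Key=key, Body=payload)

def dequeue():
    resp = s3.list_objects_v2(Bucket=QUEUE_BUCKET, Prefix="queue/", MaxKeys=1)
    items = resp.get("Contents", [])
    if not items:
        return None
    key = items[0]["Key"]
    body = s3.get_object(Bucket=QUEUE_BUCKET, Key=key)["Body"].read()
    s3.delete_object(Bucket=QUEUE_BUCKET, Key=key)  # not atomic with the read
    return body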
Let us take a closer look at how the API could be structured to facilitate calls made generally to the storage stack, calls made individually to a storage class, or calls that bridge lateral functionalities within the same class.
The S3 command guide provides the core functionality for buckets and objects: the listing, creating, updating, and deleting of objects; the setting of access control lists; the setting of bucket policy; the expiration period or life-cycle; and multi-part uploads.
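A minimal sketch of these core operations with boto3, using a hypothetical bucket (multi-part upload is handled automatically by upload_file for large objects):

import boto3

s3 = boto3.client("s3")
bucket = "demo-bucket"  # hypothetical

s3.create_bucket(Bucket=bucket)                               # create
s3.put_object(Bucket=bucket, Key="notes.txt", Body=b"hello")  # create/update object
print(s3.list_objects_v2(Bucket=bucket).get("Contents", []))  # list
s3.put_bucket_acl(Bucket=bucket, ACL="private")               # access control list
data = s3.get_object(Bucket=bucket, Key="notes.txt")["Body"].read()
s3.delete_object(Bucket=bucket, Key="notes.txt")              # delete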
If a new storage class is added and its functionality is not on par with the regular S3 storage, then it would need an entirely new set of APIs, and these would preferably have a prefix to differentiate them from the rest.
If the storage stack is layered from the active regular S3 storage at the top, to the less frequently used storage classes at a lower level, and finally to the glacier or least used data as the last layer, then aging of data alone is sufficient to migrate objects from the top layer all the way to the bottom without any involvement of the API. That said, the API could provide visibility to users on the contents of each storage class, along with the additional functionality of direct placement of objects in those classes or their eviction. Since the nature of the storage class differentiates the API set, and we decided to use prefix-based API naming conventions to indicate that differentiation, each storage class adds a new set to the existing APIs. On the other hand, policies common to all three storage classes, or functionality that stripes across layers, will be provided either with request attributes targeting a layer and its lower layers or with the help of parameters.
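Direct placement, visibility, and retrieval across classes might look like the following sketch, again assuming boto3 and a hypothetical bucket; restore_object applies to objects that have aged into the glacier layer:

import boto3

s3 = boto3.client("s3")

# direct placement into a colder class at write time
s3.put_object(Bucket="demo-bucket", Key="archive/report.csv",
              Body=b"...", StorageClass="STANDARD_IA")

# visibility into the class of each stored object
resp = s3.list_objects_v2(Bucket="demo-bucket")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj.get("StorageClass", "STANDARD"))

# retrieval from the glacier layer is an explicit, asynchronous restore
s3.restore_object(Bucket="demo-bucket", Key="archive/report.csv",
                  RestoreRequest={"Days": 7})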

Monday, March 18, 2019

The Simple Storage Service (S3) API
Virtually all storage providers, in cloud and on-premise storage solutions, support the S3 Application Programming Interface. With the help of this API, applications can send and receive data without having to worry about storage best practices. They are also able to switch from one storage appliance to another, one on-premise cluster to another, one cloud provider to another, and so on. The API was introduced by Amazon but has become an industry standard accepted by many storage providers. Even competitors like Azure provide an S3 proxy to allow applications to access their storage with this API.
This API has widespread popularity for storing binary large objects, also called blobs for short, but that has not stopped providers from making an S3 façade over other forms of storage. They get the audience with S3, since the application changes are minimal, and retrofit their storage solutions to support the most used aspects of S3.
S3, however, is neither a restriction on the storage type nor true to its roots in object storage. It provides the ability to navigate buckets and objects and adds improvements, such as multi-part upload, to the way data is sent into object storage. It supports various headers to make the most of object storage, which includes sending options along with the payload as well as retrieving information on the stored objects.
S3, like any other cloud-based service, has been developed with Representational State Transfer (REST) best practices. However, it does not involve all the provisions of HTTP/2 (released) or HTTP/3 (in progress). Neither does it provide a protocol-like abstraction where layers can be added above or below in a storage stack. A networking stack, on the other hand, has dedicated protocols for each of its layers. A storage stack may comprise, say, at the very least, an active workload layer and a backup workload layer, where the active layer remains the higher layer and can support, say, HTTP, and the backup remains the lower layer and supports S3. Perhaps this distinction has been somewhat obfuscated, as object storage can expand its capabilities to both layers.
Object storage can also support the notions of storage classes, versioning, retention, data aging, deduplication, and compression, yet none of these are featured very well by S3 without a header here, a prefix there, a tag, or a path, and this leads to incredible inconsistency, not just between the features but also in their usage by customers.

Sunday, March 17, 2019

Operations on inactive entries
Recently I came across an unusual problem of maintaining active and inactive entries. It was unusual because there were two sets that were updated differently and at different times. Both sets were processed separately, and there was no indication whether the sets were mutually exclusive. Although we could assume the operations on the sets were invoked from the same component, they were invoked in different iterations. This meant that the operations taken on a set would be repeated several times.
The component only took actions on the active set. It needed to take an action on the entries that were inactive. This action was added subsequent to the action on the active set. However, the sets were not mutually exclusive, so they had to be differentiated to see what was available in one but not the other. Instead, this was overcome by delegating the separation of the sets to the source. This made it a lot easier to work with the actions on the sets because they would be fetched each time with some confidence that they would be mutually exclusive. There was no confirmation that the source was indeed giving up-to-date sets. This called for a validation of the entries in each set prior to taking the action. The validation was merely a check of whether the entry was active or not.
However, an entry does not remain in the same set forever. It could move from the active set to the inactive set and back. The active set and the inactive set would always correspond to their respective actions. This meant that the actions needed to be inverted between the entries so that they could flip their state between the two processing passes.
There were four cases for trying this out. The first case was when the active set was called twice. The second case was when the inactive set was called twice. The third case was when the active set was followed by the inactive set. The fourth case was when the inactive set was followed by the active set.
With these four cases, the active and the inactive set could have the same operations taken deterministically no matter how many times they were repeated and in what order.
The only task that remained was to ensure that the sets returned from the source were good to begin with. The source was merely subscribed to events that added entries to the sets. However, the events could be called in any order and an arbitrary number of times. The event handling did not always exercise the same logic, so the entries did not appear final in all cases. This contributed to invalid entries in the sets. When the methods used to retrieve the active and inactive sets were made consistent, deterministic, robust, and correct, it became easier to work with the operations on the sets in the calling component.
This concluded the cleaning up of the logic to handle the active and inactive sets.
We now follow up with the improvements to the source when possible:
There are different lists maintained for the active, inactive, failover, and scanned collections. They could all be part of the same collection, synchronized so that all accesses are serialized. Attributes on the entries can describe the state of each entry. If the entries need to be in separate collections, the collections could be synchronized with the same lock.
The operations on the entries may be made available as simpler full-service methods where validation and updates are included, as sketched below. When the source is rewritten, it must ensure that all existing behavior is unchanged.
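A minimal sketch of this consolidation, with hypothetical names, keeps one lock-protected collection and derives the mutually exclusive sets from a state attribute:

import threading
from enum import Enum

class State(Enum):
    ACTIVE = 1
    INACTIVE = 2

class EntryRegistry:
    def __init__(self):
        self._lock = threading.Lock()   # one lock serializes all accesses
        self._entries = {}              # entry id -> State

    def set_state(self, entry_id, state):
        with self._lock:
            self._entries[entry_id] = state

    def snapshot(self, state):
        # the sets are mutually exclusive by construction
        with self._lock:
            return {e for e, s in self._entries.items() if s is state}

    def apply(self, on_active, on_inactive):
        # validation and action in one full-service method; repeated or
        # reordered invocations converge to the same result
        with self._lock:
            for e, s in self._entries.items():
                (on_active if s is State.ACTIVE else on_inactive)(e)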

Saturday, March 16, 2019

Today we continue discussing the best practices from storage engineering:

595) Virtually every service utilized from the infrastructure is a candidate for standardization and consistency, so that one component or vendor in the infrastructure may be replaced with another with little or no disruption.

596) There are a number of stack frames that a developer has to traverse in order to find the code path taken by the execution thread, and they don't always correspond to layers. But if the stack frames can be made simpler for the developer, the storage product as a whole improves tremendously. This is not a rule but a rule of thumb: the simpler, the better.

597) As with all one-point maintenance code, there is bloat and complexity from handling different use cases in the same code. Unfortunately, developers don't have the luxury of rewriting core components without significant investment of time and effort. Therefore, version 1 of the product must always strive to be built right from the get-go.

598) As use cases increase and the business grows, product management pays a lot of attention to sustainable growth in the face of business needs. It is at this cusp of technology and business that system architecture serves its best purpose.

599) There are very few cases where the process goes wrong. On the other hand, there is a lot of advantage to trusting the process. Consequently, the product must be improved with sound processes.

600) As with any product, a storage product also qualifies for the Specific-Measurable-Attainable-Realistic-Timely (SMART) process, where improvements can be measured and the feedback used to improve both the process and the product.

Friday, March 15, 2019

Today we continue discussing the best practices from storage engineering:

588) There are notions of SaaS, PaaS, and IaaS, with clear separation of concerns in the cloud. The same applies to a storage layer in terms of delegating to dedicated products.

589) The organization of the cloud does not limit the number and type of services available from the cloud. The same holds true for features as services within a storage product.

590) The benefits that come with the cloud can also come from a storage product.

591) There are times when the storage product will have an imbalanced load, and it will need to be load balanced. Since this is an ongoing activity, it can be scheduled periodically or triggered when thresholds are crossed, as sketched below.
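A minimal sketch of the threshold check, with hypothetical node loads, that could run on a schedule or in response to alerts:

import statistics

def needs_rebalance(load_by_node, threshold=0.2):
    # trigger when any node deviates from the mean load by more than the threshold
    mean = statistics.mean(load_by_node.values())
    return mean > 0 and any(abs(v - mean) > threshold * mean
                            for v in load_by_node.values())

if needs_rebalance({"node1": 100, "node2": 180, "node3": 95}):
    print("schedule a rebalancing pass")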

592) When the layers of infrastructure and storage services are clearly differentiated, the upper layer may utilize the alerting from the lower layers for health checks and to take corrective actions.

593) There are a number of ways to monitor a system, whether for performance, statistics, or health checks. A system center management suite can consolidate and unify operations management; the storage product merely needs to publish to the system center.

594) There are several formats for metrics and monitoring data, and they are generally proprietary. Utilizing an external stack for these purposes via APIs helps alleviate those concerns from the storage service.
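For example, a minimal sketch that publishes a counter in the widely supported StatsD line protocol, so an external monitoring stack can consume it (host, port, and metric name are placeholders):

import socket

def emit_counter(name, value=1, host="127.0.0.1", port=8125):
    # StatsD line protocol: "<name>:<value>|c" for a counter, sent over UDP
    payload = ("%s:%d|c" % (name, value)).encode()
    socket.socket(socket.AF_INET, socket.SOCK_DGRAM).sendto(payload, (host, port))

emit_counter("storage.requests.put")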

595) Virtually every service utilized from the infrastructure is a candidate for standardization and consistency, so that one component or vendor in the infrastructure may be replaced with another with little or no disruption.