Wednesday, March 27, 2019

Today we continue discussing the best practices in storage engineering:

631) Listing entry values are particularly interesting. In addition to the types of attributes in an entry, we can take advantage of the range of values that these attributes can take. For example, we can reserve boundary values and extremely small values that will not be encountered in the real world, at least in the majority of cases.

632) When the values describe the size of an associated object, the size itself can be arbitrary, and it is not always possible to rule out a size for a user object no matter how unlikely it seems. However, when used together with other attributes such as status, such values become representative of some object state that is otherwise not easily found.
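
As a minimal sketch of 631 and 632, assuming a listing entry with hypothetical size and status attributes, the snippet below reserves a boundary value of the size attribute and pairs it with a reserved status code to mark an object state that real objects are unlikely to produce. The names and values are placeholders for illustration only.

#include <cstdint>
#include <limits>

// Hypothetical listing entry with a size and a status attribute.
struct ListingEntry {
    uint64_t size;     // size of the associated object
    uint32_t status;   // discrete status code
};

// Reserved values that ordinary user objects are not expected to take.
constexpr uint64_t kSentinelSize   = std::numeric_limits<uint64_t>::max();
constexpr uint32_t kStatusOrphaned = 0xFFFFFFFF;

// Marks an entry as orphaned without adding a new attribute.
inline void MarkOrphaned(ListingEntry& e) {
    e.size = kSentinelSize;
    e.status = kStatusOrphaned;
}

// Detects the reserved combination during listing iteration.
inline bool IsOrphaned(const ListingEntry& e) {
    return e.size == kSentinelSize && e.status == kStatusOrphaned;
}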

633) The state of an object is authoritative. If it were not the source of truth, the entries themselves could not be relied on without involving validation logic across entries. There is no problem performing validations, but doing them over and over again not only introduces delays but can be avoided altogether with clean state.

634) The states are not only representative but also unique. The entries are not supposed to be in two or more states at once. It is true that a bitmask can be used to denote conjunctive status, but a forward-only, discrete, singular state is preferable.
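
A small sketch of the preference in 634, with hypothetical state names: each entry holds a single enumerated state, and transitions are allowed to move only forward, instead of a bitmask that lets several statuses be set at once.

// Hypothetical forward-only object states; an entry holds exactly one.
enum class ObjectState { Created = 0, Sealed = 1, Archived = 2, Deleted = 3 };

// Allows a transition only if it moves the state forward.
inline bool Advance(ObjectState& current, ObjectState next) {
    if (static_cast<int>(next) <= static_cast<int>(current))
        return false;  // backward or repeated transitions are rejected
    current = next;
    return true;
}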

635) The attributes in an entry are often added on a case-by-case basis since it is expedient to add a new attribute without affecting others. However, the accessors of the entry should not proliferate along with the attributes. If the normalization of an attribute can serve more than one accessor, it will provide consistency across accesses.

636) Background tasks may be run or canceled. Frequently these tasks need to be canceled. If they do not do proper cleanup, they can leave their results in a bad state. A proper shutdown helps release the resources.
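
A minimal sketch of 636, with hypothetical class and method names: a background task watches a stop flag and always runs its cleanup, so a cancellation does not leave partial results in a bad state.

#include <atomic>
#include <thread>

// Hypothetical background task that can be canceled and always cleans up.
class BackgroundTask {
public:
    void Start() {
        worker_ = std::thread([this] {
            while (!stop_.load()) {
                DoUnitOfWork();          // process one small increment
            }
            Cleanup();                   // runs on cancellation as well
        });
    }
    void Cancel() {
        stop_.store(true);               // request shutdown
        if (worker_.joinable()) worker_.join();
    }
private:
    void DoUnitOfWork() { /* produce partial results */ }
    void Cleanup()      { /* discard or commit partial results, release resources */ }
    std::atomic<bool> stop_{false};
    std::thread worker_;
};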

Tuesday, March 26, 2019

Today we continue discussing the best practices in storage engineering:

626) Listings are great to use when they are in a single location. However, they are often scoped to a parent container. If the parent containers are distributed, the listings tend to be multiple. In such cases the effort is repeated.

627) When the listings are separated by locations, the results from the search may be fewer than the expected total if only one of the locations is searched. This has often been encountered in deployments.

628) The listings do not need to be aggregated across locations in all cases. Sometimes, only the location is relevant and the listing and the search can be scoped to it.

629) Iterating over the listings has proved tedious in most cases, both for the system and for the user. Consequently, either an identifier is used to go directly to the entry in the listing, or a listing is reserved so that only that listing is accessed.

630) The listing can be cleaned up as well. There is no need to let it keep growing with outdated entries that are later archived by age. The cleaning can happen in the background so that list iterations skip over, or do not see, the entries that appear as removed.
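
A small sketch of 630, with a hypothetical removal flag: a background cleaner marks entries as removed, and iteration simply skips the flagged entries so readers do not see them.

#include <vector>

// Hypothetical listing entry with a removal flag set by the background cleaner.
struct Entry {
    long id;
    bool removed = false;
};

// Iteration skips entries that the cleaner has already flagged.
template <typename Fn>
void ForEachLive(const std::vector<Entry>& listing, Fn visit) {
    for (const auto& e : listing) {
        if (e.removed) continue;   // invisible to readers
        visit(e);
    }
}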

631) Listing entry values are particularly interesting. In addition to the types of attributes in an entry, we can take advantage of the range of values that these attributes can take. For example, we can reserve boundary values and extremely small values that will not be encountered in the real world, at least in the majority of cases.

Monday, March 25, 2019

Today we continue discussing the best practices in storage engineering:

620) When a new key is added, it may not impact existing keys, but it does affect the overall space consumption of the listing, depending on the size and number of keys.

621) The keys can have as many fields as necessary. However, lookups are faster when there are only a few fields to compare.

622) Key comparison can be partial or full. Partial keys are useful to match duplicates. The number of keys that share the same subkeys can be large. This form of comparison is very helpful for grouping entries.

623) Grouping of entries also helps with entries that span groups based on subkeys. These groupings work across groups.

624) The number of entries may grow large, but the prefix can include more subkeys to narrow the search. This makes searches on these listings efficient.

625) The number of entries also does not matter relative to the number of keys in each entry, as long as the prefix uses a small set of subkeys.
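
As a minimal sketch of the partial-key comparison in 622 through 625, assuming a hypothetical composite key of (tenant, bucket, object) subkeys, the snippet below compares only the leading subkeys and groups entries by that prefix.

#include <map>
#include <string>
#include <tuple>
#include <utility>
#include <vector>

// Hypothetical composite key made of three subkeys.
struct Key {
    std::string tenant;
    std::string bucket;
    std::string object;
};

// Partial comparison on the leading subkeys only.
inline bool SamePrefix(const Key& a, const Key& b) {
    return std::tie(a.tenant, a.bucket) == std::tie(b.tenant, b.bucket);
}

// Groups entries that share the (tenant, bucket) prefix.
std::map<std::pair<std::string, std::string>, std::vector<Key>>
GroupByPrefix(const std::vector<Key>& keys) {
    std::map<std::pair<std::string, std::string>, std::vector<Key>> groups;
    for (const auto& k : keys)
        groups[std::make_pair(k.tenant, k.bucket)].push_back(k);
    return groups;
}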

626) Listings are great to use when they are in a single location. However, they are often scoped to a parent container. If the parent containers are distributed, the listings tend to be multiple. In such cases the effort is repeated.

#codingexercise
Find paths in a matrix
int GetPaths(int x, int y)
{
    // Base case: cells in the first row or column can be reached in exactly one way.
    if (x <= 0 || y <= 0)
        return 1;

    // Sum over the three possible directions: left, diagonal, and up.
    return GetPaths(x - 1, y) +
           GetPaths(x - 1, y - 1) +
           GetPaths(x, y - 1);
}


Saturday, March 23, 2019

Today we continue discussing the best practices in storage engineering:

611) Background tasks may sometimes need to catch up with the current activities. In order to accommodate the delay, they may either be run up front so that the changes to be processed are incremental, or they can increase in number to divide up the work.

612) The results from the background tasks mentioned above might also take a long time to accumulate. They can be made available as they appear or batched.

613) The load balancer works very well to help background tasks catch up: it avoids overloading a single task and distributes the online activities so that the background tasks carry a light load.

614) The number of background tasks or their type should not affect online activities. However, systems have been known to be impacted when the tasks consume memory or delay garbage collection.

615) There is no specific mitigation for one or more background tasks that take up plenty of shared resources, but generally they are written to be fault tolerant so that they can pick up from where they left off.
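
A minimal sketch of the fault tolerance in 615, with hypothetical checkpoint helpers: the task persists its progress after each batch so that a restarted task can pick up from where it left off instead of starting over.

#include <cstddef>

// Hypothetical persistence helpers; real implementations would write durably.
static size_t g_checkpoint = 0;
size_t LoadCheckpoint() { return g_checkpoint; }             // last completed batch
void   SaveCheckpoint(size_t pos) { g_checkpoint = pos; }    // record progress
bool   ProcessBatch(size_t pos) { return pos < 100; }        // stand-in for real work

// Resumes from the last saved position instead of starting over.
void RunBackgroundTask() {
    size_t pos = LoadCheckpoint();
    while (ProcessBatch(pos)) {
        ++pos;
        SaveCheckpoint(pos);   // after a crash, at most one batch is repeated
    }
}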

Friday, March 22, 2019

Today we continue discussing the best practices in storage engineering:

606) We use data structures to keep the information we want to access in a convenient form. When this is persisted, it mitigates faults in the processing. However, each such artifact brings additional chores and maintenance. On the other hand, it is cheaper to execute the logic, and the logic can be versioned. Therefore, when there is a trade-off between compute and storage for numerous small and cheap artifacts, it is better to generate them dynamically.
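
A small sketch of the trade-off in 606, with hypothetical record and summary names: a small, cheap artifact (here a total size) is generated dynamically from the persisted records on each request instead of being stored as a separate artifact that would need its own upkeep.

#include <numeric>
#include <vector>

// Hypothetical source records that are already persisted.
struct Record { long bytes; };

// The small artifact (total size) is computed on demand rather than
// stored and maintained alongside the records.
long TotalBytes(const std::vector<Record>& records) {
    return std::accumulate(records.begin(), records.end(), 0L,
                           [](long sum, const Record& r) { return sum + r.bytes; });
}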

607) The above has far-reaching impact when there are a number of layers involved and a cost incurred in the lower layer bubbles up to the top layer.

608) Compute tends to be distributed in nature while storage tends to be local. They can be mutually exclusive in this regard.

609) Compute-oriented processing can scale up or out, while storage has to scale out.

610) Compute-oriented processing can get priority, but storage tends to remain in a class.

611) Background tasks may sometimes need to catch up with the current activities. In order to accommodate the delay, they may either be run up front so that the changes to be processed are incremental, or they can increase in number to divide up the work.

612) The results from the background tasks mentioned above might also take a long time to accumulate. They can be made available as they appear or batched.

613) The load balancer works very well to help background tasks catch up: it avoids overloading a single task and distributes the online activities so that the background tasks carry a light load.

Thursday, March 21, 2019

Today we continue discussing the best practices in storage engineering:

600) As with any product, a storage product also qualifies for the Specific-Measurable-Attainable-Realistic-Timely (aka SMART) process, where improvements can be measured and the feedback used to improve the process and the product.

601) As with each and every process, there is some onus, but the rewards generally outweigh the costs when the process is reasoned about and accepted by all. The Six Sigma process, for example, sets a high bar for quality because it eliminates errors progressively.

602) The iterations for Six Sigma were many, so it takes longer and the results are not always available in the interim. Agile development processes allowed results to be incremental.

603) The agile methodology improved the iterations over the features in such a way that they did not impact the rest of the product. This enables faster feature development.

604) The continuous integration and continuous deployment model made the individual feature improvements available for use because the changes were built, tested, and deployed in lockstep with development.

605) Together, the process of making improvements one change at a time and having each change built, tested, and deployed gives great benefit to the overall product.

606) We use data structures to keep the information we want to access in a convenient form. When this is persisted, it mitigates faults in the processing. However, each such artifact brings additional chores and maintenance. On the other hand, it is cheaper to execute the logic, and the logic can be versioned. Therefore, when there is a trade-off between compute and storage for numerous small and cheap artifacts, it is better to generate them dynamically.

Wednesday, March 20, 2019

We were discussing the S3 API:

Virtually all storage providers in cloud and on-premise storage solutions support the S3 Application Programming Interface. With the help of this API, applications can send and receive data without having to worry about storage best practice. They are also able to switch from one storage appliance to another, one on-premise cluster to another, one cloud provider to another, and so on. The API was introduced by Amazon but has become the industry standard, accepted by many storage providers. Even competitors like Azure provide an S3 Proxy to allow applications to access their storage with this API.

S3, like any other cloud-based service, has been developed with the Representational State Transfer (REST) best practice. However, it does not involve all the provisions of HTTP 2.0 (released) or 3.0 (in progress). Neither does it provide a protocol-like abstraction where layers can be added above or below in a storage stack. A networking stack, on the other hand, has dedicated protocols for each of its layers. A storage stack may comprise, say, at the very least, an active workload layer versus a backup workload layer, where the active workload remains as the higher layer and can support, say, HTTP, and the backup remains as the lower layer and supports S3. Perhaps this distinction has been somewhat obfuscated where object storage can expand its capabilities to both layers.

The S3 API makes no overtures to developers on how object storage can be positioned as an object queue, an object cache, an object query engine, a gateway, a log index store, and many other such capabilities. API best practice enables automated monitoring, logging, auditing, and more.

If a new storage class is added and its functionality is not at par with the regular S3 storage, then it would have an entirely new set of APIs, and these would preferably have a prefix to differentiate them from the rest.

If the storage stack is layered from the active regular S3 storage at the top, to the less frequently used storage classes at a lower level, and finally to the glacier or least used data as the last layer, then aging of data alone is sufficient to migrate from the top layer all the way to the bottom without any involvement of the API. That said, the API could provide visibility to the users into the contents of each storage class, along with the additional functionality of direct placement of objects in those classes or their eviction. Since the nature of the storage class differentiates the API set, and we decided to use prefix-based API naming conventions to indicate the differentiation, each storage class adds a new set to the existing APIs. On the other hand, policies common to all three storage classes, or functionality that stripes across layers, will be provided either with request attributes targeting that layer and its lower layers or with the help of parameters.

Functionality such as deduplication, rsync for incremental backups, compression, and management will require new APIs, and these do not have to be limited to objects in any one storage class. APIs that automate the workflow of calling more than one API can also be written as coarse-granularity APIs. These wrapped APIs can collate functionality for a single layer or across layers. They can also include automation not specific to the control or data path of the storage stack. Together, the new functionality and the wrapped APIs can become one whole set.
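
As a minimal sketch of such a wrapped API, the snippet below composes two finer-grained calls into a single coarse-grained move operation. The client type, the call names, and the stub bodies are hypothetical placeholders, not actual S3 SDK signatures.

#include <iostream>
#include <string>

// Hypothetical storage client exposing finer-grained S3-style calls;
// the bodies are stubs standing in for real requests.
struct StorageClient {
    bool CopyObject(const std::string& src, const std::string& key, const std::string& dst) {
        std::cout << "copy " << src << "/" << key << " -> " << dst << "\n";
        return true;
    }
    bool DeleteObject(const std::string& bucket, const std::string& key) {
        std::cout << "delete " << bucket << "/" << key << "\n";
        return true;
    }
};

// Coarse-grained "move" API wrapping the two finer-grained calls,
// so the caller sees one operation with a single success or failure.
bool MoveObject(StorageClient& client, const std::string& src,
                const std::string& dst, const std::string& key) {
    return client.CopyObject(src, key, dst) && client.DeleteObject(src, key);
}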

The S3 API can become a protocol for all storage functionalities. These can be organized as a flat list of features or by resource-path-qualified functionalities, where the resource path may pertain to storage classes. These APIs could also support discoverability, as well as nuances specific to file protocols and content addressability.

https://1drv.ms/w/s!Ashlm-Nw-wnWuT-u1f7DRjBRuvD4