Sunday, January 20, 2019

Today we continue discussing the best practices from storage engineering:

341) Emerging trends like blockchain have no precedent for storage standards. A blockchain is a continuously growing list of records, called blocks, which are linked and secured using cryptography. Since it is resistant to tampering, it can serve as an open, distributed ledger to record transactions between two parties.
342) Blockchain is used for a variety of purposes. Civic, for instance, enables its users to log in with their fingerprints. It uses a variety of smart contracts, a native utility token and new software applications. A Merkle tree is used for attestation.
343) Storage servers for such distributed ledgers are not yet standardized, though any stream-based store would help. The suitability of object storage for storing such ids has not gained popularity yet.
344) A Peer-to-Peer (P2P) network is popular for such ledgers. In production storage, however, peer-to-peer networks were not popular because data accrues based on popularity, not on utility.
345) A P2P network introduces a network-first design where peers are autonomous agents and a protocol enables them to negotiate contracts, transfer data, verify the integrity and availability of remote data, and reward each other with payments. It provides tools to enable all these interactions. Moreover, it enables a file to be distributed as shards across the network, and these shards can be stored using a distributed hash table. The shards themselves need not be stored in this hash table; rather, a distributed network and messaging could facilitate it with location information.
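As a rough sketch (with hypothetical names, not any particular product's API), the shard locations might be tracked in a hash-table-like directory while the shard bytes stay on the peers:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: a directory mapping shard ids to peer locations.
// The shard bytes themselves stay on the peers; only location
// metadata lives in this (conceptually distributed) hash table.
public class ShardDirectory {
    private final Map<String, List<String>> locations = new HashMap<>();

    // Record that a shard is held by the given peers.
    public void publish(String shardId, List<String> peers) {
        locations.put(shardId, peers);
    }

    // Look up which peers hold a shard; empty list if unknown.
    public List<String> locate(String shardId) {
        return locations.getOrDefault(shardId, List.of());
    }
}
```

In a real deployment this map would itself be partitioned across peers by key, which is exactly what the distributed hash table provides.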

Saturday, January 19, 2019

Today we continue discussing the best practices from storage engineering:

Storage units such as files and objects may be stored as shards on different constituents of the network. The location information in such cases becomes part of the messages between the peers.

Distributed Hash Tables (DHTs) have gained widespread popularity for distributing load over a network. There are some well-known implementations of this technology; Kademlia, for instance, proves many of the theorems behind DHTs.

Messages are routed through low-latency paths and use parallel asynchronous queries. Message queuing itself, together with its protocol, is an excellent communication mechanism for a distributed network.

Integrity of the data is verified with the help of Merkle trees and proofs in such distributed frameworks. Others use key-based encryption.

Partial audits reduce overhead in compute and storage. For example, shards of data may be hashed many times to generate the pre-leaves of a Merkle tree; the proof may be compact even though the tree is not. Partial audits therefore save on compute and storage.
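A minimal sketch of the Merkle-tree mechanics behind such audits, assuming SHA-256 and a power-of-two number of shards; the names here are illustrative, not from any particular product:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.List;

// Sketch: build a Merkle root over shard hashes and re-derive it from
// a single leaf plus a compact sibling path -- the basis of a partial
// audit, where only one shard and O(log n) hashes are exchanged.
public class MerkleSketch {
    static byte[] sha256(byte[]... parts) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            for (byte[] p : parts) md.update(p);
            return md.digest();
        } catch (Exception e) { throw new RuntimeException(e); }
    }

    static byte[] leaf(String shard) {
        return sha256(shard.getBytes(StandardCharsets.UTF_8));
    }

    // Root over a power-of-two list of leaf hashes.
    static byte[] root(List<byte[]> level) {
        while (level.size() > 1) {
            List<byte[]> next = new ArrayList<>();
            for (int i = 0; i < level.size(); i += 2)
                next.add(sha256(level.get(i), level.get(i + 1)));
            level = next;
        }
        return level.get(0);
    }

    // Recompute the root from one leaf and its sibling path;
    // leftSibling[i] says whether proof[i] is the left child.
    static byte[] fromProof(byte[] leafHash, List<byte[]> proof, List<Boolean> leftSibling) {
        byte[] h = leafHash;
        for (int i = 0; i < proof.size(); i++)
            h = leftSibling.get(i) ? sha256(proof.get(i), h) : sha256(h, proof.get(i));
        return h;
    }
}
```

The auditor only needs the published root and the sibling path, never the whole tree, which is why the audit stays cheap.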

Data does not always need to be presented in its entirety, and not always from the source of truth. The data may remain completely under the control of the owner while a challenge-proof exchange suffices.

Messages may contain just the hash of the data and the challenge while the responses may contain just the proof. This enables the message exchange to be meaningful while keeping the size within a limit.

Contracts are signed and stored by both parties. Contracts are just as valuable as the sensitive data. Without a contract in place, there is no exchange possible.

Parties interested in forming a contract use a publisher-subscriber model that is topic-based and uses a bloom filter implementation which tests whether an element is a member of a set. 
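A toy bloom filter illustrating the membership test behind such topic matching (the hashing scheme here is an illustrative choice, not a production one):

```java
import java.util.BitSet;

// Sketch of a bloom-filter membership test for topic matching:
// false negatives never occur, false positives can.
public class TopicBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashes;

    public TopicBloomFilter(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    // Derive the i-th bit position for a topic from a seeded hash.
    private int index(String topic, int seed) {
        int h = topic.hashCode() * 31 + seed * 0x9E3779B9;
        return Math.floorMod(h, size);
    }

    public void add(String topic) {
        for (int i = 0; i < hashes; i++) bits.set(index(topic, i));
    }

    // True if the topic may be in the set; false means definitely not.
    public boolean mightContain(String topic) {
        for (int i = 0; i < hashes; i++)
            if (!bits.get(index(topic, i))) return false;
        return true;
    }
}
```

The one-sided error is what makes the filter attractive here: a peer can cheaply rule out topics it is definitely not subscribed to.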

Friday, January 18, 2019

Today we continue discussing the best practices from storage engineering:

326) A higher layer that manages and understands abstractions, provides a namespace, and orders data operations is present in almost all storage management systems.

327) A lower layer comprising the distribution and replication of data, without actually requiring any knowledge of the abstractions maintained by the higher layer, is similarly present in almost all storage products.

328) Similarly, the combination of the two layers described above is almost always separated from the front-end layer interpreting and servicing the user requests.

329) Working with streams is slightly different from working with fixed-size data. A stream is an ordered set of references to segments, and the extents are generally immutable.

330) If a shared resource can be represented as a pie, there are two ways to enhance its usage: first, make the pie bigger and allocate the same fractions of the increment; second, dynamically modify the fractions so that at least some consumers can be guaranteed a minimum share of the resource.
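The two approaches amount to simple arithmetic; the method names below are illustrative:

```java
// Sketch: two ways to grow a tenant's absolute share of a resource
// "pie" -- grow the pie with fixed fractions, or re-weight the slices.
public class ResourcePie {
    // Absolute allocation when total capacity grows but the
    // tenant's fraction stays fixed.
    public static double grownPieShare(double newCapacity, double fraction) {
        return newCapacity * fraction;
    }

    // Absolute allocation when capacity is fixed and weights change;
    // each tenant gets weight / sum-of-weights of the capacity.
    public static double reweightedShare(double capacity, double weight, double totalWeight) {
        return capacity * weight / totalWeight;
    }
}
```

For example, doubling a 100-unit pie while keeping a 25% fraction doubles the tenant's absolute share, whereas re-weighting can guarantee a minimum share without adding capacity.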

331) In order to remove the notion of a centralized storage provider, some storage products have preferred a distributed cloud storage network. A distributed hash table overlays the network.

332) Storage units such as files and objects may be stored as shards on different constituents of the network. The location information in such cases becomes part of the messages between the peers.

333) Distributed Hash Tables (DHTs) have gained widespread popularity for distributing load over a network. There are some well-known implementations of this technology; Kademlia, for instance, proves many of the theorems behind DHTs.
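Kademlia's core idea, the XOR distance metric, can be sketched in a few lines (illustrative only, not the full protocol):

```java
import java.math.BigInteger;

// Sketch of Kademlia's XOR distance: the "distance" between two
// node/key ids is their bitwise XOR, interpreted as an integer.
public class XorDistance {
    public static BigInteger distance(BigInteger a, BigInteger b) {
        return a.xor(b);
    }

    // A node routes a key toward whichever contact is XOR-closer.
    public static BigInteger closer(BigInteger key, BigInteger n1, BigInteger n2) {
        return distance(key, n1).compareTo(distance(key, n2)) <= 0 ? n1 : n2;
    }
}
```

Because XOR is symmetric and unidirectional, every lookup for the same key converges along the same path, which is what makes caching and parallel queries effective.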

334) Messages are routed through low-latency paths and use parallel asynchronous queries. Message queuing itself, together with its protocol, is an excellent communication mechanism for a distributed network.

335) Integrity of the data is verified with the help of Merkle trees and proofs in such distributed frameworks. Others use key-based encryption.

Thursday, January 17, 2019

Today we continue discussing the best practices from storage engineering:

323) The cost of storage is sometimes vague because it does not necessarily encompass all operational costs for all the units; the scope and purpose change from one storage product to another. The cost is not a standard, but we can get comparable values when we take the sum of the costs and divide by the number of units to arrive at a unit price.
324) Cost is always a scalar value and is usually calculated by fixing the parameters of the system. Different studies may use the same parameters yet report widely different results. Therefore, it is not good practice to compare studies unless they measure the same thing under comparable conditions.
325) The total cost of ownership encompasses the cost of operations and is usually not apparent for new instances of the storage product. It applies to products that have been in use for a while and are becoming a liability.
326) A higher layer that manages and understands abstractions, provides a namespace, and orders data operations is present in almost all storage management systems.
327) A lower layer comprising the distribution and replication of data, without actually requiring any knowledge of the abstractions maintained by the higher layer, is similarly present in almost all storage products.
328) Similarly, the combination of the two layers described above is almost always separated from the front-end layer interpreting and servicing the user requests.
329) Working with streams is slightly different from working with fixed-size data. A stream is an ordered set of references to segments, and the extents are generally immutable.
330) If a shared resource can be represented as a pie, there are two ways to enhance its usage: first, make the pie bigger and allocate the same fractions of the increment; second, dynamically modify the fractions so that at least some consumers can be guaranteed a minimum share of the resource.

Wednesday, January 16, 2019

The design of appenders for Object Storage:
Object storage has established itself as a “standard storage” in the enterprise and the cloud. As it brings many of the storage best practices of durability, scalability, availability and low cost to its users, it could expand its role from being a storage layer to one that facilitates data transfer. We focus on the use cases of object storage so that the onerous tasks of using it can be automated. Object storage then transforms from a passive storage layer to one actively participating in data transfers to and from different data sinks and sources, simply by attaching appenders. Earlier, we described data transfer to object storage with the help of connectors. The purpose of this document is to describe appenders that facilitate intelligent automation of data transfer to external data sinks with minimal or no disruption to their operations.
File systems have long been the destination for captured traffic, and while the file system has evolved to stretch over clusters rather than just remote servers, it remains inadequate as blob storage. The use of connectors and appenders enhances the use of object storage and does not compete with the usages of elastic file stores.
We describe the appenders not by the data sink they serve but by the technology behind them. We enumerate synchronous send and receive over protocols, an asynchronous publisher-subscriber model, compressed data transfer, deduplicated data transfer and incremental rsync-based backups, to name a few. We describe their advantages and disadvantages but do not provide a prescription, which allows them to be used on a case-by-case basis and with flexible customizations. In this sense, the appenders can be presented as a library with the object storage so that commodity and proprietary data sinks may choose to receive data rather than actively poll for it.
While public cloud object storage solutions offer cloud-based services such as Amazon’s Simple Notification Service and Azure Notification Hubs, on-premise object storage has the opportunity to make the leap from a standalone storage offering to a veritable solution-integration package. The public cloud offers myriad robust features for its general-purpose cloud services, while the appenders specialize in the usages of object storage and come as an out-of-the-box feature.
The data that flows out of object storage is often pulled by various receivers. That usage continues to be mainstream for object storage. However, we add usages where the appenders push the data to different data sinks. Previously there were in-house scripts for such data transfers. Instead, we make it part of the standard storage and give users just the ability to configure it for their purpose.
Taking the onerous routines of using object storage as a storage layer away from the layer above, across different data sinks, enables a thinner upper layer and more convenience for the end user. The customizations in the upper layer are reduced, and the value additions bubble up the stack.
On-premise object storage is no longer standalone. The public cloud has moved toward embracing on-premise compute resources and their management via System Center integration. The same applies to on-premise object storage as well. Consequently, the technologies behind the appenders are not only transparent but also set up to be monitored and reported via dashboards and charts. This improves visibility into the data transfers while enabling cost accounting.
Object storage offers better features and cost management as it continues to stand out against most competitors in unstructured storage. The appenders lower the costs of usage so that the total cost of ownership is also lowered, making object storage a whole lot more profitable to the end users.

Tuesday, January 15, 2019

Today we continue discussing the best practices from storage engineering:

318) Ingestion engines have a part to play in the larger data processing pipeline that users search with the help of a search engine. The data storage has to be searchable. Therefore, the ingestion engine also annotates the data, classifies the content, classifies for language and tags. The search engine crawls and expands the links in the data. The results are stored back as blobs. These blobs then become publicly searchable. The workflows over the artifacts may be implemented in queues and the overall timing of the tasks may be tightened to make the results available within reasonable time to the end user in their search results.

319) The increase in the data size after annotations and search engine suitability is usually less than double the size of original data.

320) Strong consistency is an aspect of the data, not the operations. A simple copy-on-write mechanism with versions is sufficient to enable all accesses to be seen by all parallel processes in their sequential order.
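A minimal sketch of copy-on-write with versions, where writers publish new immutable versions instead of mutating in place (the names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: copy-on-write with versions. Writers never mutate a
// published version; they append a new one, so every reader sees
// a consistent snapshot and all accesses line up in version order.
public class VersionedValue<T> {
    private final List<T> versions = new ArrayList<>();

    // Publish a new immutable version; returns its version number.
    public synchronized int write(T value) {
        versions.add(value);
        return versions.size() - 1;
    }

    // Read the latest published version.
    public synchronized T readLatest() {
        return versions.get(versions.size() - 1);
    }

    // Read a specific historical version.
    public synchronized T read(int version) {
        return versions.get(version);
    }
}
```

A reader holding version n keeps seeing exactly that snapshot even while writers publish n+1, n+2, and so on, which is what gives parallel processes a sequential order of accesses.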

321) Multi-tenancy is a choice for the workload, not for the infrastructure. If workloads require their own instances of the storage product, they are dividing the resources rather than making the most of them with shared tenancy. Unless there is a significant performance boost for a particular workload, the cost does not justify workloads requiring their own instances of storage products.

322) Along with tenancy, namespaces can also be local or global. Global namespaces tend to be longer and less user-friendly. On the other hand, global namespaces can enforce consistency.

323) The cost of storage is sometimes vague because it does not necessarily encompass all operational costs for all the units; the scope and purpose change from one storage product to another. The cost is not a standard, but we can get comparable values when we take the sum of the costs and divide by the number of units to arrive at a unit price.
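The comparable unit price amounts to summing the cost components and dividing by the number of units; a trivial sketch with illustrative names:

```java
// Sketch: comparable unit cost = sum of all cost components / units.
// Component names and units (e.g. terabytes) are illustrative.
public class UnitCost {
    public static double perUnit(double[] costComponents, double units) {
        double total = 0;
        for (double c : costComponents) total += c;
        return total / units;
    }
}
```

For example, hardware, power and operations costs of 1000, 500 and 500 over 100 units yield a comparable unit price of 20, regardless of how a particular study bucketed the components.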

324) Cost is always a scalar value and is usually calculated by fixing the parameters of the system. Different studies may use the same parameters yet report widely different results. Therefore, it is not good practice to compare studies unless they measure the same thing under comparable conditions.

325) The total cost of ownership encompasses the cost of operations and is usually not apparent for new instances of the storage product. It applies to products that have been in use for a while and are becoming a liability.

Monday, January 14, 2019

Today we continue discussing the best practices from storage engineering:

311) The SSD could be treated as a pool of fast storage that is common to all the processes. Since it is pluggable and external to all hard drives, it can be used dynamically as long as there is any availability.
312) In this sense it is very similar to an L3 cache; however, it is not meant for dynamic partitioning that balances access speed, power consumption and storage capacity. It is not as fast as cache, but it is more flexible than conventional storage and plays a vital role in managing inter-process communication. This is simplified storage.
313) SSDs can make use of different storage media, including flash. The two most common flash types are NOR and NAND. NOR was the first of the two to be developed. It is very fast for reads but not as fast for writes, so it is used most often in places where code will be written once and read a lot. NAND is faster for writes and takes up significantly less space than NOR, which also makes it less expensive. Most flash used in SSDs is of the NAND variety.
314) One of the easiest ways to perform diagnostics on storage devices is to enable diagnostic APIs that do not need any credentials and report resource statistics.
315) These diagnostic queries can even show B-tree information, as long as the statistics are gathered correctly.
316) Blobs, tables and queues are three primary forms of storage. While storage products excel in one or the other forms of storage, only a public cloud provider is best suited to offer all three from the same storage.
317) The ingestion engine is usually built separately from the storage engine. Eventually the results may take the form of unstructured storage, such as user files or blobs, and structured storage, such as tables.
318) Ingestion engines have a part to play in the larger data processing pipeline that users search with the help of a search engine. The data storage has to be searchable. Therefore, the ingestion engine also annotates the data, classifies the content, classifies for language and tags. The search engine crawls and expands the links in the data. The results are stored back as blobs. These blobs then become publicly searchable. The workflows over the artifacts may be implemented in queues, and the overall timing of the tasks may be tightened to make the results available within reasonable time to the end user in their search results.
319) The increase in the data size after annotations and search engine suitability is usually less than double the size of original data.
#codingexercise
import java.util.List;
import java.util.stream.Collectors;

// Returns the input strings whose suffix length matches that of tail.
// getSuffixLength is assumed to be a helper defined elsewhere in the
// exercise that computes the suffix length of interest for a string.
List<String> getSameSuffixLength(List<String> input, String tail) {
    int length = getSuffixLength(tail);
    return input.stream()
                .filter(x -> getSuffixLength(x) == length)
                .collect(Collectors.toList());
}