Tuesday, January 15, 2019

Today we continue discussing best practices from storage engineering:

318) Ingestion engines have a part to play in the larger data processing pipeline whose output users search with the help of a search engine. The stored data has to be searchable. Therefore, the ingestion engine also annotates the data, classifies the content, detects the language, and applies tags. The search engine crawls and expands the links in the data. The results are stored back as blobs, and these blobs then become publicly searchable. The workflows over the artifacts may be implemented with queues, and the overall timing of the tasks may be tightened so that the results are available to the end user in their search results within a reasonable time.
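A minimal sketch of such a queue-based workflow, assuming hypothetical annotation and crawl stages and an in-memory Artifact type, might look like this in Java:

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class IngestionPipeline {
    // Hypothetical work item handed from one stage of the pipeline to the next
    static class Artifact {
        final String blobId;
        Artifact(String blobId) { this.blobId = blobId; }
    }

    private final BlockingQueue<Artifact> annotateQueue = new LinkedBlockingQueue<>();
    private final BlockingQueue<Artifact> crawlQueue = new LinkedBlockingQueue<>();

    // Ingestion hands each artifact to the first stage
    public void submit(Artifact a) throws InterruptedException {
        annotateQueue.put(a);
    }

    public void start() {
        // Stage 1: annotate, classify content and language, tag (details omitted)
        new Thread(() -> {
            try {
                while (true) {
                    Artifact a = annotateQueue.take();
                    crawlQueue.put(a);
                }
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }).start();
        // Stage 2: crawl and expand links, write results back as publicly searchable blobs (details omitted)
        new Thread(() -> {
            try {
                while (true) {
                    Artifact a = crawlQueue.take();
                }
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }).start();
    }
}

Tightening the overall timing then amounts to sizing the queues and the number of worker threads per stage.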

319) The increase in data size after annotation and preparation for the search engine is usually less than double the size of the original data.

320) Strong consistency is an aspect of the data, not of the operations. A simple copy-on-write mechanism with versions is sufficient to ensure that all accesses are seen by all parallel processes in their sequential order.
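A minimal copy-on-write sketch, assuming a simple in-memory value where every write produces a new immutable version, might look like this:

import java.util.concurrent.atomic.AtomicReference;

public class VersionedValue<T> {
    // Each write creates a new immutable snapshot; readers always see a complete version.
    static final class Snapshot<T> {
        final long version;
        final T value;
        Snapshot(long version, T value) { this.version = version; this.value = value; }
    }

    private final AtomicReference<Snapshot<T>> current =
            new AtomicReference<Snapshot<T>>(new Snapshot<T>(0, null));

    // Copy-on-write update: the old snapshot is never modified in place.
    public long write(T newValue) {
        while (true) {
            Snapshot<T> old = current.get();
            Snapshot<T> next = new Snapshot<T>(old.version + 1, newValue);
            if (current.compareAndSet(old, next)) {
                return next.version;
            }
        }
    }

    // All readers observe the versions in a single sequential order.
    public Snapshot<T> read() {
        return current.get();
    }
}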

321) Multi-tenancy is a choice for the workload, not for the infrastructure. If a workload requires its own instance of the storage product, it divides the resources rather than making the most of them with shared tenancy. Unless there is a significant performance boost for a particular workload, the cost does not justify giving workloads their own instances of storage products.

322) Along with tenancy, namespaces can also be local or global. Global namespaces tend to be longer and less user-friendly. On the other hand, global namespaces can enforce consistency.

323) The cost of storage is sometimes vague because it does not necessarily encompass all operational costs for all the units, since the scope and purpose change from one storage product to another. The cost is not a standard, but we can get comparable values when we take the sum of the costs and divide it by the number of units to arrive at a unit price.
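As a worked example with purely hypothetical numbers: if hardware, power, and administration for a cluster add up to $30,000 per month and the cluster provides 1,000 TB of usable capacity, the comparable unit price is $30,000 / 1,000,000 GB = $0.03 per GB per month, which can then be set against other products measured the same way.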

324) Cost is always a scalar value and is usually calculated by fixing parameters of the system. Different studies may use the same parameters but arrive at widely different results. Therefore, it is not good practice to compare studies unless they evaluate the systems under comparable conditions.

325) The total cost of ownership encompasses the cost of operations and is usually not apparent for new instances of the storage product. It matters more for products that have been in use for a while and are becoming a liability.

Monday, January 14, 2019

Today we continue discussing best practices from storage engineering:

311) The SSD could be treated as a pool of fast storage that is common to all the processes. Since it is pluggable and external to the hard drives, it can be used dynamically as long as there is capacity available.
312) In this sense it is very similar to an L3 cache; however, it is not meant for dynamic partitioning that balances access speed, power consumption, and storage capacity. It is not as fast as cache, but it is more flexible than conventional storage and plays a vital role in managing inter-process communication. It is a simplified form of storage.
313) SSDs can make use of different storage media, including flash. The two most common flash types are NOR and NAND. NOR was the first of the two to be developed. It is very fast for reads but not as fast for writes, so it is used most often in places where code is written once and read a lot. NAND is faster for writes and takes up significantly less space than NOR, which also makes it less expensive. Most flash used in SSDs is the NAND variety.
314) One of the easiest ways to perform diagnostics on storage devices is to enable diagnostic APIs that do not need any credentials and report resource statistics (a sketch of such a query appears after this list).
315) These diagnostic queries can even show B-tree information, as long as the statistics are gathered correctly.
316) Blobs, tables, and queues are three primary forms of storage. While storage products excel at one or another of these forms, a public cloud provider is best suited to offer all three from the same storage.
317) The ingestion engine is usually built separately from the storage engine. Eventually the resources may land in unstructured storage such as user files or blobs, and in structured storage such as tables.
318) Ingestion engines have a part to play in the larger data processing pipeline whose output users search with the help of a search engine. The stored data has to be searchable. Therefore, the ingestion engine also annotates the data, classifies the content, detects the language, and applies tags. The search engine crawls and expands the links in the data. The results are stored back as blobs, and these blobs then become publicly searchable. The workflows over the artifacts may be implemented with queues, and the overall timing of the tasks may be tightened so that the results are available to the end user in their search results within a reasonable time.
319) The increase in data size after annotation and preparation for the search engine is usually less than double the size of the original data.
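A minimal sketch of the diagnostic query mentioned in item 314, assuming a hypothetical unauthenticated /diagnostics/stats endpoint exposed by the storage node:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class DiagnosticsProbe {
    public static void main(String[] args) throws Exception {
        // Hypothetical endpoint; no credentials are attached to the request.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://storage-node:8080/diagnostics/stats"))
                .GET()
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        // The body would carry resource statistics, possibly including B-tree page counts.
        System.out.println(response.body());
    }
}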
#codingexercise
List<String> getSameSuffixLength(List<String> input, String tail) {
    int length = getSuffixLength(tail); // getSuffixLength is assumed to be defined elsewhere
    return input.stream()
                .filter(x -> getSuffixLength(x) == length)
                .collect(Collectors.toList());
}


Sunday, January 13, 2019

Today we continue discussing best practices from storage engineering:

308) A garbage collector finds it easier to collect aged data by levels. If there are generations among the pages to be reclaimed, the work is split for the garbage collector so that application operations can continue to perform well.

309) Similarly, aggregated and large pages are easier for the garbage collector to collect than multiple spatially and temporally spread-out pages. If the pages can be bundled or allocated in clusters, the garbage collector can free them all at once when they are marked.
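A minimal sketch of this idea, assuming a hypothetical page pool where pages allocated together are also reclaimed together:

import java.util.ArrayList;
import java.util.List;

public class PageCluster {
    // Pages allocated from the same cluster are marked and freed as one unit.
    private final List<byte[]> pages = new ArrayList<>();
    private volatile boolean marked = false;

    public byte[] allocate(int pageSize) {
        byte[] page = new byte[pageSize];
        pages.add(page);
        return page;
    }

    // Marking the cluster signals the collector that every page in it is reclaimable.
    public void mark() { marked = true; }

    // The collector releases the whole cluster in one step instead of page by page.
    public int reclaim() {
        if (!marked) return 0;
        int freed = pages.size();
        pages.clear();
        return freed;
    }
}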

310) Among the customizations for garbage collection, it is helpful to see which garbage collector is being worked the most. The garbage collector closer to the application has far more leverage over allocations and deallocations than one further downstream.

311) The SSD could be treated as a pool of fast storage that is common to all the processes. Since it is pluggable and external to the hard drives, it can be used dynamically as long as there is capacity available.

312) In this sense it is very similar to an L3 cache; however, it is not meant for dynamic partitioning that balances access speed, power consumption, and storage capacity. It is not as fast as cache, but it is more flexible than conventional storage and plays a vital role in managing inter-process communication. It is a simplified form of storage.

313) SSDs can make use of different storage media, including flash. There are at least two kinds of flash: NOR and NAND. NOR preceded NAND. It is very fast for reads but not as fast for writes, and was used where code is written once and read many times. NAND improved the writes and takes up a lot less space than NOR, which also makes it less expensive. NAND is more popular in SSDs than NOR.

Saturday, January 12, 2019

Today we continue discussing best practices from storage engineering:
306) The use of a garbage collector sometimes interferes with the performance of the storage device. The garbage collector has to be tuned for the kind of workload.
307) In Solid State Drives, as commonly described, there is a garbage collection process inside the SSD controller that resets dirty pages into renewed pages that can take new writes. It is important to know under which conditions this garbage collection might degrade performance. In the case of an SSD, a continuous workload of small random writes puts a lot of work on the garbage collector.
308) A garbage collector finds it easier to collect aged data by levels. If there are generations among the pages to be reclaimed, the work is split for the garbage collector so that application operations can continue to perform well.
309) Similarly, aggregated and large pages are easier for the garbage collector to collect than multiple spatially and temporally spread-out pages. If the pages can be bundled or allocated in clusters, the garbage collector can free them all at once when they are marked.
310) Among the customizations for garbage collection, it is helpful to see which garbage collector is being worked the most. The garbage collector closer to the application has far more leverage over allocations and deallocations than one further downstream.
311) File copying on an SSD is nearly instantaneous. Operations that are file-copy intensive gain significant performance on an SSD. Capacity is a secondary consideration because code that works well on an SSD can make use of it even when the drive is small.

Friday, January 11, 2019

We were discussing data transfers in and out of object storage. It is a virtually limitless store. It is used as-is in many deployments and scales to any workload with the help of monitoring over capacity. While it is easy to add a lot of disks when capacity runs low, it is not easy to reserve space for a high-priority workload in preference to others on the same disk space. Even if the workload is just backup, there is no differentiation of one backup from another. Also, if the workloads are not supposed to be throttled, this can be an administrator-only feature that enables them to classify storage pools for workloads. That said, the benefits of object storage are clear to all workloads, and we enumerate them below.
Object storage is zero-maintenance storage. There is no planning for capacity, and the elastic nature of its services may be taken for granted. The automation of asynchronous writes, flushes to object storage, and synchronization of data in the connectors with that in object storage is self-contained and packaged into the connectors.
The cloud services are elastic. Both the storage and the connectors, along with the compute necessary for these purposes, can remain in the cloud, extending the benefits not just to one application and client of an organization but universally across departments, organizations, applications, services, and workloads.
Object storage coupled with these connectors is also suited to dynamically addressing the needs of the client and the application, because the former may be a global store while the latter can determine the frequency of transfers depending on the workload. Different applications may tune the governing parameters to their requirements.
Performance increases dramatically when the resources are guaranteed rather than commissioned on demand. This has been one of the arguments for connectors in general.
Such a service is hard to mimic individually within each application or client. Moreover, optimizations only happen when the compute and storage are elastic, where the transfers can be studied and replayed independently of the applications.
The move from tertiary to secondary storage is not a straightforward shift from NAS storage to object storage without some form of chores over time. A dedicated product like this takes that concern out of the picture.

Thursday, January 10, 2019

Object storage has established itself as a “standard storage” in the enterprise and cloud. As it brings many storage best practices to provide durability, scalability, availability, and low cost to its users, it could expand its role from being a storage layer to one that facilitates data transfer. We focus on the use cases of object storage so that the onerous tasks of using it can become part of the storage itself. Object storage then transforms from a passive storage layer to one actively pulling data from different data sources simply by attaching connectors. The purpose of this document is to describe such connectors that facilitate intelligent automation of data transfers from data sources with minimal or no disruption to their operations.
File systems have long been the destination to capture traffic, and while the file system has evolved to stretch over clusters and not just remote servers, it remains inadequate as blob storage. The connectors enhance the use of object storage and do not compete with the usages of elastic file stores.
We describe the connectors not by the data source they serve but by the technology behind them. We enumerate synchronous send and receive over protocols, an asynchronous publisher-subscriber model, compressed data transfer, deduplicated data transfer, and incremental rsync-based backups, to name a few. We describe their advantages and disadvantages but do not prescribe one to the commercial world, which allows them to be used on a case-by-case basis and with flexible customizations. In this sense, the connectors can be presented in the form of a library shipped with the object storage.
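A minimal sketch of how such a connector library might be surfaced, with hypothetical names that mirror the techniques listed above:

import java.io.InputStream;

// Hypothetical connector abstraction: each implementation pulls data from a source
// and lands it in object storage using a different transfer technique.
public interface Connector {
    void transfer(InputStream source, String bucket, String objectKey) throws Exception;
}

// Technique-specific implementations could then be provided by the library, for example:
// class SynchronousProtocolConnector implements Connector { ... }
// class PubSubConnector implements Connector { ... }
// class CompressedTransferConnector implements Connector { ... }
// class DeduplicatedTransferConnector implements Connector { ... }
// class RsyncBackupConnector implements Connector { ... }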
While public cloud object storage solutions offer cloud-based services such as Amazon Simple Notification Service and Azure Notification Hubs, on-premise object storage has the ability to make the leap from a standalone storage offering to a veritable solution-integration package. Public clouds offer myriad robust features for their general-purpose cloud services, while the connectors specialize in the usages of the object storage.
The data that flows into object storage is often pushed from various senders. That usage continues to be mainstream for object storage. However, we add usages where the connectors pull the data from different data sources. Previously, such data transfers were handled by in-house scripts. Instead, we make them part of the standard storage and give users just the ability to configure them for their purpose.
The ability to lift the onerous routines of using object storage as a storage layer out of the layer above, across different data sources, enables a thinner upper layer and more convenience for the end user. The customizations in the upper layer are reduced, and the value additions bubble up the stack.
On-premise object storage is no longer standalone. The public cloud has moved towards embracing on-premise compute resources and their management via System Center integration, and the same applies to on-premise object storage. Consequently, the technologies behind the connectors are not only transparent but are also set up to be monitored and reported via dashboards and charts. This improves visibility into the data transfers while enabling cost accounting.
Object storage offers better features and cost management as it continues to stand out against most competitors in unstructured storage. The connectors lower the costs of usage so that the total cost of ownership is also lowered, making object storage a whole lot more profitable for end users.

Wednesday, January 9, 2019

Today we continue discussing best practices from storage engineering:

300) Disks compete not only with other disks but also with other forms of storage such as Solid-State Drives. Consequently, disks tend to become cheaper, more capable, and smarter in their operations, with added value in emerging and traditional usages. Cloud storage costs have been said to follow a trend that asymptotically approaches zero, with the current price at about 1 cent per gigabyte per month for cold storage. The average cost per gigabyte per drive came down by half, from about 4 cents per gigabyte, between 2013 and 2018.

301) Solid State Drives are sometimes considered replacements for memory or the L1 and L2 caches, with added benefits. This is not really the case. An SSD is true storage, even if it does wear out. Consequently, programs should be mindful of the reads and writes to data and, if the accesses are random, store those data structures on the SSD.

302) The use of sequential data structures is very common in storage engineering. While some components go to great lengths to make their read and write accesses sequential, other components may simplify their design by storing data on SSD.

303) Reads and writes are aligned on page size on Solid State Drives, while erasures operate at the block level. Consequently, data organized in data structures can leverage these criteria to read some or all of it at once. If we frequently write less than a page, we are not making good use of the SSD. We can use buffering to aggregate writes.
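A minimal sketch of such write aggregation, assuming a 4 KB page size and an underlying channel that the buffer flushes to:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;

public class PageAlignedWriter {
    private static final int PAGE_SIZE = 4096; // assumed page size
    private final WritableByteChannel channel;
    private final ByteBuffer buffer = ByteBuffer.allocate(PAGE_SIZE);

    public PageAlignedWriter(WritableByteChannel channel) { this.channel = channel; }

    // Small writes accumulate in the buffer; the device only sees full pages.
    public void write(byte[] data) throws IOException {
        int offset = 0;
        while (offset < data.length) {
            int n = Math.min(buffer.remaining(), data.length - offset);
            buffer.put(data, offset, n);
            offset += n;
            if (!buffer.hasRemaining()) {
                flushPage();
            }
        }
    }

    private void flushPage() throws IOException {
        buffer.flip();
        while (buffer.hasRemaining()) {
            channel.write(buffer);
        }
        buffer.clear();
    }
}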

304) The internal caching and read-ahead mechanisms in the SSD controller prefer long continuous reads and writes over multiple simultaneous small ones, and perform them in one large chunk. This means we unroll iterations and aggregate reads and writes so that they are done all together.
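A small sketch of aggregating many small reads into one large sequential read, assuming the record offsets are known and contiguous:

import java.io.IOException;
import java.io.RandomAccessFile;

public class ChunkedReader {
    // Instead of issuing many small reads, read the whole contiguous range once
    // and slice the records out of the in-memory chunk.
    public static byte[][] readRecords(RandomAccessFile file, long start, int recordSize, int count)
            throws IOException {
        byte[] chunk = new byte[recordSize * count];
        file.seek(start);
        file.readFully(chunk);

        byte[][] records = new byte[count][recordSize];
        for (int i = 0; i < count; i++) {
            System.arraycopy(chunk, i * recordSize, records[i], 0, recordSize);
        }
        return records;
    }
}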

305) Random writes perform just as well as sequential writes on an SSD as long as the data sizes are comparable. If the data size is small and the random writes are numerous, performance may suffer.