Sunday, November 11, 2018

We continue discussing the best practice from storage engineering:
31) Acceleration – Although network acceleration with the help of direct TCP connections is essentially a networking tier technique, it is equally applicable in the storage tier when that tier spans geographically distributed regions.
32) Datacenters and data stores – The choice of locations for datacenters and data stores plays heavily into the consolidation of storage technologies. When virtual machines are spun up on organizational assets, they are often provisioned in private datacenters. Many web services use a datastore for their storage, especially if they have no need for local storage. Therefore, storage offerings have to be mindful of the presence of large datastores.
33) Distributed hash table – In order to scale horizontally over commodity compute, the storage tier uses a distributed hash table to assign and delegate resources and tasks. This facilitates a large peer-to-peer network that works well for large-scale processing, including high-volume workloads.
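A common realization of such delegation is a consistent-hash ring: a key is assigned to the first node clockwise from the key's hash, so adding or removing a node only remaps keys in its arc of the ring. The following is a minimal sketch; the class and method names are illustrative, not a specific product API.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Minimal consistent-hash ring for delegating keys to nodes.
class HashRing {
    private final SortedMap<Integer, String> ring = new TreeMap<>();

    void addNode(String node) {
        ring.put(hash(node), node);
    }

    void removeNode(String node) {
        ring.remove(hash(node));
    }

    // Find the node responsible for the given key: the first node at or
    // after the key's position on the ring, wrapping around if needed.
    String nodeFor(String key) {
        if (ring.isEmpty()) return null;
        SortedMap<Integer, String> tail = ring.tailMap(hash(key));
        return tail.isEmpty() ? ring.get(ring.firstKey()) : tail.get(tail.firstKey());
    }

    private static int hash(String s) {
        return s.hashCode() & 0x7fffffff; // non-negative ring position
    }
}
```

Production rings additionally place several virtual nodes per physical node to even out the distribution, but the lookup logic stays the same.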
34) Cluster – This is another form of deployment, as opposed to a single-server deployment of storage servers. The advantages of using a cluster include horizontal scalability, fault tolerance and high availability. Cluster technology is now common practice and is widely adopted for server deployments.
35) Container – Docker containers are also immensely popular for deployment, and every server benefits from the portability because it makes the server resilient to the issues faced by the host.
36) Congestion control – When requests are queued to the storage server, it can use the same sliding window technique that is used for congestion control in a TCP connection. Fundamentally there is no difference between the two cases: each request is given a serial number and handled within the bounds of the received and the processed, at a rate that keeps those bounds manageable.
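The serial-number idea above can be sketched as a small admission window; this is an illustrative stand-in, not any particular server's implementation. A request is admitted only while the gap between received and processed serials stays within the window, much as TCP bounds unacknowledged segments.

```java
// Sliding-window admission control for queued requests.
class RequestWindow {
    private final int windowSize;
    private long nextToProcess = 0; // lowest unprocessed serial number
    private long nextSerial = 0;    // serial number for the next arrival

    RequestWindow(int windowSize) {
        this.windowSize = windowSize;
    }

    // Admit a request if the window has room; returns its serial, or -1
    // to signal congestion so the caller can back off.
    synchronized long tryAdmit() {
        if (nextSerial - nextToProcess >= windowSize) return -1;
        return nextSerial++;
    }

    // Mark the oldest outstanding request processed, sliding the window.
    synchronized void complete() {
        if (nextToProcess < nextSerial) nextToProcess++;
    }
}
```

A real server would also track out-of-order completions, but the bounded gap between the two counters is the essence of the technique.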

Saturday, November 10, 2018

We continue discussing the best practice from storage engineering:
26) Containers – Data is organized as per the units of organization of the storage device or appliance. These containers do not necessarily remain the same size, however, because a user dictates what is packed in any container. Therefore, when it comes to data transfer, we can transfer a larger or a smaller container at a time. Also, users often have to specify attributes of the container, and sometimes these can go wrong. Instead of correcting a container beyond salvage, it might be easier to create another and transfer the data into it.
27) Geographical location – Administrators often determine the sites where their data needs to be replicated. This involves choosing the locations with the least latency to the users. The choice of sites may be shared across data organizations and their owners, and customized where the shared choices are inadequate.
28) Backup – Although data backup has been cited earlier as a maintenance item, it is in fact prudent on the part of the owner or administrator to determine which data needs to be backed up. Tools like duplicity use the rsync protocol to determine incremental changes, and storage products may have a way to do this themselves or allow it to be externalized.
29) Aging – Generally, the older the data, the more amenable it is to backup. Data ages progressively on the timeline, so it is easier to label it as hot, warm or cold and then apply age-related treatments at the chosen cut-offs. Cost savings from cheaper storage were touted as the primary motivation earlier, but this has recently been challenged. That said, aged data lends itself to treatments such as deduplication.
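Labeling by age is straightforward once cut-offs are chosen; the 7-day and 90-day thresholds below are assumptions for illustration, not fixed rules.

```java
import java.time.Duration;
import java.time.Instant;

// Label data as hot, warm or cold from its last-access time.
class AgeLabeler {
    static String label(Instant lastAccess, Instant now) {
        long days = Duration.between(lastAccess, now).toDays();
        if (days <= 7) return "hot";    // actively read and written
        if (days <= 90) return "warm";  // occasionally accessed
        return "cold";                  // candidate for backup, dedup, etc.
    }
}
```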
30) Compression – Probably the hallmark of any efficient storage is the packing of the data. Most data files and directories can be archived. For example, a tar ball is a convenient way to make web sites and installables portable. When the data is viewed as binary, a long sequence of either 0s or 1s can be efficiently packed. When the binary sequence flips too often, it becomes efficient not to encode it and to leave it as such. That said, there are many efficient compression techniques available.
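Run-length encoding is the simplest illustration of both points above: long runs of the same bit pack well, while a rapidly alternating sequence actually grows when encoded, which is why such data is better left as is. This is a toy sketch, not a production codec.

```java
// Run-length encode a bit string as (bit, run-length) pairs.
class RunLength {
    static String encode(String bits) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < bits.length()) {
            char c = bits.charAt(i);
            int run = 0;
            while (i < bits.length() && bits.charAt(i) == c) {
                run++;
                i++;
            }
            out.append(c).append(run); // e.g. eight zeros -> "08"
        }
        return out.toString();
    }
}
```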

Friday, November 9, 2018

We continue discussing the best practice from storage engineering:
21) Maintenance – Every storage offering comes with a responsibility for administrators. Some excel at reducing this maintenance with the help of auto-tuning and automation of maintenance chores, while others present comprehensive dashboards and charts for detailed, interactive and involved maintenance. The managed services that moved technologies and stacks from on-premise to the cloud reduced the Total Cost of Ownership by centralizing and automating tasks such as scalability, high availability, backups, software updates and patches, host and server maintenance, rack and stack, and power and network redress.
22) Data transfer – The performance considerations of IO devices include throughput and latency in one form or another. Any storage offering may be robust and large but will remain inadequate if the data transfer speed is low. In addition, data transfer may need to happen across large geographical distances, and repeatedly so. Provisioning a dedicated network connection may not be feasible in all cases, so the baseline itself must be reasonable.
23) Gateway – Traditionally, gateways have been used to bridge across different storage providers, between on-premise and cloud, or even between two similar storage stacks of different origins. Gateways also help with load balancing, routing and proxy duties. Some storage providers are savvy enough to include this technology within their offering so that external gateways are not needed everywhere.
24) Cache – A cache enables requests to be served by providing the resource without looking it up in deeper layers. The technology can span across storage or be offered at many levels deep in the stack. Caches not only improve performance but also save costs.
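A bounded LRU cache captures the idea in very little code; in Java, `LinkedHashMap` in access order gives the eviction policy for free. This is a sketch of the technique, not any vendor's cache layer.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A small LRU cache: hot entries are served from memory instead of a
// deeper storage layer; the least recently used entry is evicted when
// capacity is exceeded.
class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    LruCache(int capacity) {
        super(16, 0.75f, true); // access-order gives LRU semantics
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;
    }
}
```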
25) Checksum – This is a simple way to check data integrity, and it suffices in places where encryption may not be easy, especially when the keys required to encrypt and decrypt cannot be secured. This simple technique is no match for the advantages of encryption, but it is often put to use in low-level message transfers and for data at rest.
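The standard library already ships a suitable checksum; the sketch below uses CRC32, which is cheap to compute on transfers and data at rest, though it detects corruption rather than tampering.

```java
import java.util.zip.CRC32;

// Verify data integrity with a CRC32 checksum.
class ChecksumCheck {
    static long checksum(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue();
    }

    static boolean verify(byte[] data, long expected) {
        return checksum(data) == expected;
    }
}
```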


Thursday, November 8, 2018

We continue listing the best practice from storage engineering:
15) Management – Storage is very much a resource. It can be created, updated and deleted. With software-defined technologies, the resource merely takes a gigantic form; otherwise it is the equivalent of a single data record for the user. Every such resource also has significant metadata. Consequently, we manage storage just the same way as we manage other resources.
16) Monitoring – Virtual large storage may be stretched across disks in one form or another. And the physical resources such as disks often have failures and run out of space. Therefore, monitoring becomes a crucial aspect.
17) Replication groups – Most storage organizations have to deal with copies of the data. This is generally handled with replication. There is no limit to the number of copies maintained, but if they span roots of storage organization, a replication group is created in which the different sites are automatically synchronized.
18) Storage organization – We referred earlier to hierarchical organization, which allows maximum flexibility to the user in terms of folders and depth. Here the organization includes replication groups, if any, as well as the ability to maintain simultaneous organizations, such as when the storage is file-system enabled.
19) Background tasks – Routine and periodic tasks can be delegated to background workers instead of executing them inline with data in and out. These can be added to a background task scheduler that invokes them as specified. Some of the metadata for the storage entities is improved with journaling and other such background operations.
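In Java, a `ScheduledExecutorService` is the idiomatic scheduler for such chores; the counter below stands in for a real maintenance task such as metadata journaling.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Delegate a routine chore to a background scheduler instead of
// running it on the data path.
class BackgroundTasks {
    static final AtomicInteger runs = new AtomicInteger();

    static ScheduledExecutorService start() {
        ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);
        scheduler.scheduleAtFixedRate(
                runs::incrementAndGet, // the periodic maintenance task
                0, 10, TimeUnit.MILLISECONDS);
        return scheduler;
    }
}
```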
20) Relays – Most interactions between components are in the form of requests and responses. These may have to traverse multiple layers before they are authoritatively handled by the node and partition. Relays help translate requests and responses between layers. They are necessary for making the request-processing logic modular and chained.

#codingexercise
to determine if an integer binary tree is a binary search tree, we can simply check whether the root lies within the (min, max) range, initially (INT_MIN, INT_MAX), and then recurse for each child with a narrowed range:
boolean isBstHelper(Node root, int min, int max)
{
    if (root == null) return true;
    if (root.data < min || root.data > max) return false;
    // left subtree must stay below root.data, right subtree above it
    return isBstHelper(root.left, min, root.data - 1) &&
           isBstHelper(root.right, root.data + 1, max);
}

Wednesday, November 7, 2018

We continue listing the best practice from storage engineering:
9) Data flow – Data flows into stores, and stores grow in size. Businesses and applications that generate data often find the data to be sticky once it accumulates. Consequently, a lot of attention is paid to early estimation of size and the kind of treatment to take.
10) Distributed activity – File systems and object storage have to take advantage of horizontal scalability with the help of clusters and nodes. Consequently, distributed processing, such as with the Paxos algorithm, becomes useful to take advantage of this strategy. Partitioning becomes useful in isolating activities.
11) Protocols – Nothing facilitates communication between peers, or between master and slave, like a protocol. Even a description of the payload and the generic operations of create, update, list and delete becomes sufficient to handle storage-relevant operations at all levels.
12) Layering – Finally, storage solutions have taught us that appliances can be stacked, services can be hierarchical and data may be tiered. A problem solved in one domain with a particular solution may be equally applicable to a similar problem in a different domain. This means that we can use layers in the overall solution.
13) Virtualization – Cloud computing has taught us the benefit of virtualization at all levels, where different entities may be spanned with a universal access pattern. Storage is no exception, and every storage product tends to take advantage of this strategy.
14) Security and compliance – Every regulatory agency around the globe looks for some kind of certification. Most storage providers have to demonstrate compliance with one or more of the following: PCI-DSS, HIPAA/HITECH, FedRAMP, the EU Data Protection Directive, FISMA and such others. Security is provided with the help of identity and access management, which comes in useful to secure individual storage artifacts.
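The point about protocols above can be made concrete with a generic create/update/list/delete surface over an opaque payload; the interface and the in-memory stand-in below are illustrative names, not a specific product API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A storage protocol reduced to generic CRUD-style operations.
interface StorageProtocol {
    void create(String key, byte[] payload);
    void update(String key, byte[] payload);
    List<String> list();
    void delete(String key);
}

// In-memory stand-in showing the same operations suffice at any level,
// whether the backing store is a blob service or a block device.
class InMemoryStore implements StorageProtocol {
    private final Map<String, byte[]> data = new HashMap<>();

    public void create(String key, byte[] payload) { data.put(key, payload); }
    public void update(String key, byte[] payload) { data.put(key, payload); }
    public List<String> list() { return new ArrayList<>(data.keySet()); }
    public void delete(String key) { data.remove(key); }
}
```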

Tuesday, November 6, 2018

We continue listing the best practice from storage engineering.
5) Seal your data – An append-only format of writing data is preferred for forms of data, such as events, that appear in a continuous stream. When we seal the data, we make all activities progressive on the timeline without loss of fidelity over time. If there are failures, seal the data. When data does not change, we can perform calculations that help us repair and recover.
6) Versions and policy – As with most libraries, append-only data facilitates versioning, and versions can be managed with policies. Data may be static, but policies can be dynamic. When the storage is viewed as a library, users can go back in time and track revisions.
7) Deduplication – As data ages, there is very little need to access it regularly. It can be packed and saved in a format that reduces space. When the data is no longer used by an application or a user, it can be viewed as segments; these delineations facilitate the study of redundancy in the data. Redundant segments may then simply be skipped from storage, which allows a more manageable form of accumulated raw data.
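A minimal sketch of segment-based deduplication, under the assumption of fixed-size segments keyed by their SHA-256 digest (real systems often use content-defined, variable-size segments):

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Split data into fixed-size segments and store each distinct segment
// only once, keyed by its digest; redundant segments are skipped.
class Dedup {
    private final Map<String, byte[]> segments = new HashMap<>();

    // Returns the number of segments actually written.
    int store(byte[] data, int segmentSize) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        int written = 0;
        for (int off = 0; off < data.length; off += segmentSize) {
            byte[] seg = Arrays.copyOfRange(
                    data, off, Math.min(off + segmentSize, data.length));
            String key = Arrays.toString(md.digest(seg));
            if (segments.putIfAbsent(key, seg) == null) written++;
        }
        return written;
    }
}
```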
8) Encryption – Encryption is probably the only technique to truly protect data when there can be unwanted or undesirable access. The scope of encryption may be limited to sensitive data if the raw data can be tolerated unencrypted.


Monday, November 5, 2018

Best practice from storage engineering:
Introduction: Storage is one of the three pillars of any commercial software. Together, the three concepts of compute, networking and storage are included directly as products to implement solutions, as components to make products, as perspectives for implementation details of a feature within a product, and so on. Every algorithm that is implemented pays attention to these three perspectives in order to be efficient and correct. We cannot think of distributed or parallel algorithms without networking, efficiency without storage, or convergence without compute. Therefore these disciplines bring certain best practices from the industry.

We list a few in this article from storage engineering perspective:
1) Not a singleton – Most storage vendors know that data is precious. It cannot be lost or corrupted. Therefore, storage industry vendors go to great lengths to make data safe at rest by not allowing a single point of failure such as a disk crash. If the data is written to a store, it is made available with copies or archived as backup.
2) Protection against loss – Data, when stored, may get corrupted. In order to make sure the data does not change, we need to keep additional information. This is called erasure coding, and with the additional information about the data, we can not only validate the existing data, we may even be able to recreate the original data by tolerating a certain loss. How we store the data and the erasure code also determines the level of redundancy we can use.
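The simplest erasure code is a single XOR parity block, which tolerates the loss of any one data block; production systems use stronger codes such as Reed-Solomon, but the parity sketch shows how extra information enables recovery.

```java
// XOR parity: the extra block lets us reconstruct any one lost block.
class XorParity {
    static byte[] parity(byte[][] blocks) {
        byte[] p = new byte[blocks[0].length];
        for (byte[] block : blocks)
            for (int i = 0; i < p.length; i++)
                p[i] ^= block[i];
        return p;
    }

    // Recover a missing block by XOR-ing the parity with the survivors.
    static byte[] recover(byte[][] surviving, byte[] parity) {
        byte[] missing = parity.clone();
        for (byte[] block : surviving)
            for (int i = 0; i < missing.length; i++)
                missing[i] ^= block[i];
        return missing;
    }
}
```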
3) Hot warm cold – Data differs in treatment based on the access. Hot data is one that is actively read and written. Warm and cold indicate progressive inactivity over the data. Each of these labels allows different leeway with the treatment to the data and the cost of storage.
4) Organizational unit of data – Data is often written in one of several units of organization depending on the producer. For example, we may have blobs, files and block level storage. These do not need to be handled the same way and each organizational unit even comes with its own software stack to facilitate the storage.


#codingexercise
// predicate to select positive integer sequences from enumerated combinations
List<List<Integer>> result = new ArrayList<>(Collections2.filter(combinations,
        new Predicate<List<Integer>>() {
            @Override
            public boolean apply(List<Integer> sequence) {
                return isPositive(sequence);
            }
        }));