Cluster computing

Friday, July 6, 2018

Namespace, Buckets, Objects and their use with Querying

A file system does not lend itself to the same querying capabilities and performance as database tables do. Directories and files in a file system are enumerated by iteration, whereas database tables have indexes that allow faster access than a sequential scan. Nothing works quite like a database for efficient querying, both for historical reasons and because of the physics: local data access is far more efficient than remote data access, especially when the data is organized at its finest granularity and managed with metadata and query caching. We have realized cloud databases where remote access does not affect the service level agreement for business transactions, but we have yet to realize database-like queries over an object store.

We are adding compute to storage. There is no limit to the possibilities once we build on the virtualization that some object stores enable. Even without compute, the storage offered by such object stores satisfies immense and diverse requirements. With compute and data processing capabilities offered by the platform, the operations expand beyond create, update and delete to standard query operations that can support a dashboard of charts and graphs, participate in streaming queries or improve the metadata of the objects. For example, the usage statistics of an object may now be part of its metadata.
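
As a minimal sketch of the last point, assuming a hypothetical in-memory object (the class name ObjectWithUsage and the "read-count" metadata key are made up for illustration and do not belong to any real object store API), the read path itself can keep a usage statistic in the object's metadata:

```python
class ObjectWithUsage:
    """Hypothetical object whose metadata carries its own usage statistics."""

    def __init__(self, data, metadata=None):
        self.data = data
        # Metadata values are kept as strings here; that is an assumption
        # of this sketch, not a requirement taken from the post.
        self.metadata = dict(metadata or {})
        self.metadata.setdefault("read-count", "0")

    def read(self):
        # Each read bumps the counter before returning the payload.
        self.metadata["read-count"] = str(int(self.metadata["read-count"]) + 1)
        return self.data


report = ObjectWithUsage(b"q2 numbers", {"owner": "finance"})
report.read()
report.read()
print(report.metadata["read-count"])
```

A query layer could then sort or filter objects by this counter without opening their contents.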

When we iterate over namespaces, buckets and objects, we often have to visit each one sequentially. Unlike indexes, there is no centralized data structure that speeds up these lookups, nor are the items organized in sorted order. These S3 artifacts may, however, be stored over a layer that provides data structures that speed up lookup. One example is the B-plus tree, a data structure that stores keys in sorted order and routes each lookup by key range. Another is the skip list, a data structure that links not only adjacent records but also records further apart, with the skip distance usually growing as a power of two. Such techniques improve lookup because they avoid having to visit every element one after the other.
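
The difference between a sequential visit and an ordered lookup can be sketched as follows. This is not a B-plus tree or a skip list; it swaps in a plain sorted list with binary search (Python's bisect module) as the simplest stand-in for an ordered index. The names scan_prefix and SortedKeyIndex and the sample object keys are hypothetical.

```python
import bisect


def scan_prefix(keys, prefix):
    """Sequential scan: every key is visited, whether it matches or not."""
    return [k for k in keys if k.startswith(prefix)]


class SortedKeyIndex:
    """Ordered index over object keys, kept as a sorted list."""

    def __init__(self, keys):
        self._keys = sorted(keys)

    def prefix(self, prefix):
        # Binary search jumps to the first key >= prefix; all keys sharing
        # the prefix are contiguous from there in sorted order.
        i = bisect.bisect_left(self._keys, prefix)
        matches = []
        while i < len(self._keys) and self._keys[i].startswith(prefix):
            matches.append(self._keys[i])
            i += 1
        return matches


keys = ["logs/2018/07/05.txt", "images/cat.png", "logs/2018/07/06.txt", "logs/2018/06/30.txt"]
index = SortedKeyIndex(keys)
print(index.prefix("logs/2018/07/"))
```

The scan touches every key on every query, while the index pays the sorting cost once and afterwards only touches the matching range plus one key beyond it.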
#codingexercise

Find count of common elements between two arrays before a mismatch

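// Requires: using System.Collections.Generic; using System.Diagnostics;
int GetCountMoves(List<int> A, List<int> B)
{
    Debug.Assert(A != null);
    Debug.Assert(B != null);
    Debug.Assert(A.Count == B.Count);
    // Sort both lists so that equal elements line up by position.
    // Note: this reorders the caller's lists in place.
    A.Sort();
    B.Sort();
    int result = 0;
    for (int i = 0; i < A.Count; i++)
    {
        // Count matching positions and stop at the first mismatch.
        if (A[i] == B[i]) result++;
        else break;
    }
    return result;
}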

Thursday, July 5, 2018

Namespace, Buckets, Objects and their use with Querying

How then do we use object storage as a data store, and why do we need to enable it for querying?

Object storage is immensely popular in the cloud, just as the file system was for stashing data. As a low-level primitive, it helped the evolution of data warehouses in the cloud. Structured and unstructured data make their way into object storage for many uses. However, aside from being available over HTTP, the data generally goes dark and opaque. Since object storage has none of the built-in data processing capabilities of relational or NoSQL databases, it does not become part of the software stack that serves higher business purposes and remains infrastructure.

Do databases need to be built on top of object storage? That may be an interesting concept, but the data management techniques within a database, namely locking, logging, indexes, catalog, caching, query plans and so on, are tightly coupled to the database and its central view of hierarchical metadata. Objects, on the other hand, carry their metadata with them. While the absence of an index at the object storage level may be compensated for by a query processing layer, stacked on top of the object storage, that creates new objects, such query processing is inherently different from a binary search on a clustered index.
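
One way to picture a query layer that compensates for the missing index by creating new objects is the sketch below. Everything in it is an assumption made for illustration: InMemoryObjectStore is a toy stand-in rather than any real S3 client, the "_index/" key prefix is an invented convention, and metadata values are assumed to be strings.

```python
import json


class InMemoryObjectStore:
    """Toy object store: each key maps to (data bytes, metadata dict)."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data, metadata=None):
        self._objects[key] = (data, dict(metadata or {}))

    def get(self, key):
        return self._objects[key][0]

    def head(self, key):
        return dict(self._objects[key][1])

    def list_keys(self):
        return list(self._objects)


INDEX_PREFIX = "_index/"  # hypothetical convention for index objects


def build_index(store, field):
    """Scan object metadata once and save the result as a new object."""
    index = {}
    for key in store.list_keys():
        if key.startswith(INDEX_PREFIX):
            continue  # do not index the index objects themselves
        value = store.head(key).get(field)
        if value is not None:
            index.setdefault(value, []).append(key)
    store.put(INDEX_PREFIX + field, json.dumps(index).encode("utf-8"))


def query(store, field, value):
    """Answer a metadata query from the index object, not from a scan."""
    index = json.loads(store.get(INDEX_PREFIX + field).decode("utf-8"))
    return index.get(value, [])


store = InMemoryObjectStore()
store.put("a.csv", b"1,2", {"content-type": "text/csv"})
store.put("b.json", b"{}", {"content-type": "application/json"})
build_index(store, "content-type")
print(query(store, "content-type", "text/csv"))
```

Note that the index is only as fresh as the last call to build_index, which is one of the ways this differs from an index maintained inside a database.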

Will the querying capabilities on the object storage increase its adoption?

Conclusion: A data processing kit specific to object storage may be a great library to include with an object storage solution.

#codingexercise

Find the number of ways we can tile a strip of length n if the tiles are one or two units long:

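int GetCount(uint n)
{
    // Base cases: lengths 1 and 2 have 1 and 2 tilings respectively.
    if (n == 0) return 0;
    if (n == 1) return 1;
    if (n == 2) return 2;
    // The last tile is either one unit (n-1 remain) or two units (n-2 remain).
    // Plain recursion takes exponential time; memoize for large n.
    return GetCount(n - 1) + GetCount(n - 2);
}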

Wednesday, July 4, 2018

We were discussing the storage as a network. In particular, I want to bring up a level of separation between storage and networking and show that moving this separation further into one domain opens up technologies that are vastly different from those we get if it is pushed into the other. For example, Peer-to-Peer (P2P) networking provides a good base for large-scale data sharing and application-level multicasting. Desirable features of P2P networks include selection of peers, redundant storage, efficient location of data, hierarchical namespaces, and authentication as well as anonymity of users. In terms of performance, P2P networks offer efficient routing, self-organization, massive scalability, robustness in deployment, fault tolerance, load balancing and explicit notions of locality. Perhaps the biggest takeaway is that P2P is an overlay network with no restriction on size, and that it comes in two classes: structured and unstructured. Structured P2P means that the network topology is tightly controlled and content is placed not on random peers but at specified locations, which makes subsequent queries more efficient. DHTs fall into this category: the location of each data object is deterministic and the keys are unique. Napster was probably the first system to realize the benefit of distributed file sharing, with the insight that requests for popular content do not need to be sent to a central server. P2P file sharing systems are self-scaling.
On the other hand, we have storage systems that propose a cluster-based file system, a universal S3 object store or a streaming store, each with its own benefits. Essentially, users may choose to see these as storage first or network first, and a solution may be recommended depending on their purpose.
To summarize, both storage and networking in their modern forms use some kind of distributed hashes, indexes, logging and coordination services. Ideally, however, this smartness over traditional block-level storage and connected networks need not be replicated in both domains. In fact, it may even belong in the networking layer rather than in storage. The real question is whether we want smarter storage in preference to smarter networking, and which gets to be on top in the layering. Neither a cluster with its network-on-top design nor a file system that spans clusters is a true all-purpose solution for everyone.
Reference:
Comparing the performance of distributed hash tables under churn, by Li, Stribling et al.
Comparison of peer-to-peer overlay network schemes, by Lua, Crowcroft et al.

Tuesday, July 3, 2018

We were discussing the storage as a network:

What used to be stored centrally is now expected to be stored in a distributed manner. Moreover, businesses require virtualization to automate deployments, ease migrations and enable fast setup and tear-down over existing resources. Any attempt at meeting these requirements is now also expected to be elastic with respect to growth and billing. While storage costs have been falling, the cloud is eagerly absorbing on-premise assets into its networked storage.

In particular, I want to bring up a level of separation between storage and networking and show that moving this separation further into one domain opens up technologies that are vastly different from those we get if it is pushed into the other. For example, Peer-to-Peer (P2P) networking provides a good base for large-scale data sharing and application-level multicasting. Desirable features of P2P networks include selection of peers, redundant storage, efficient location of data, hierarchical namespaces, and authentication as well as anonymity of users. In terms of performance, P2P networks offer efficient routing, self-organization, massive scalability, robustness in deployment, fault tolerance, load balancing and explicit notions of locality. Perhaps the biggest takeaway is that P2P is an overlay network with no restriction on size, and that it comes in two classes: structured and unstructured. Structured P2P means that the network topology is tightly controlled and content is placed not on random peers but at specified locations, which makes subsequent queries more efficient. DHTs fall into this category: the location of each data object is deterministic and the keys are unique. Napster was probably the first system to realize the benefit of distributed file sharing, with the insight that requests for popular content do not need to be sent to a central server. P2P file sharing systems are self-scaling.

On the other hand, we have storage systems that propose a cluster-based file system, a universal S3 object store or a streaming store, each with its own benefits. Essentially, users may choose to see these as storage first or network first, and a solution may be recommended depending on their purpose.
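
The deterministic placement that DHTs rely on can be sketched with a consistent-hashing ring. This is only an illustration under stated assumptions: the HashRing class, the peer names and the choice of SHA-1 are made up here, and real DHTs add routing tables, replication and handling of churn on top of this idea.

```python
import bisect
import hashlib


def ring_position(value):
    """Map a string to a point on the ring (SHA-1 is an arbitrary choice)."""
    return int.from_bytes(hashlib.sha1(value.encode("utf-8")).digest(), "big")


class HashRing:
    """Hypothetical ring: a key belongs to the first peer clockwise from it."""

    def __init__(self, peers):
        if not peers:
            raise ValueError("at least one peer is required")
        self._ring = sorted((ring_position(p), p) for p in peers)
        self._positions = [pos for pos, _ in self._ring]

    def locate(self, key):
        # Find the first peer position after the key, wrapping around.
        i = bisect.bisect_right(self._positions, ring_position(key))
        return self._ring[i % len(self._ring)][1]


ring = HashRing(["peer-a", "peer-b", "peer-c"])
# Any node that knows the peer list computes the same owner for a key.
print(ring.locate("videos/popular.mp4"))
```

Because a key's owner depends only on the key and the peer list, a lookup needs no central directory, and removing one peer only reassigns the keys that peer owned.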

#codingexercise

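In a candy store there are N different types of candies, each with its own price. Whenever we buy a single candy from the store, we can get at most k other types of candies for free. What is the minimum amount of money we need to spend to get all N candies?
Solution: sort the prices in ascending order. Buy the cheapest remaining candy and, for each purchase, take up to k of the most expensive remaining candies for free.
int GetMinimum(List<int> prices, int k)
{
    // Sort a copy in ascending order so the cheapest candies come first.
    var sorted = new List<int>(prices);
    sorted.Sort();
    int res = 0;
    // n is the number of candies still to be obtained. Signed ints are
    // used because n - k can drop below zero, which would wrap with uint.
    int n = sorted.Count;
    for (int i = 0; i < n; i++)
    {
        res += sorted[i]; // buy the cheapest remaining candy
        n -= k;           // up to k of the costliest remaining ones are free
    }
    return res;
}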

Monday, July 2, 2018

The Storage as the network or the network as the storage.

Digital storage is a necessity for organizations that want to store content for business and operational needs. The storage depends on the data and the needs around its processing. We referred to some of them in the article here. Since these needs have changed quite a bit, organizations have generally adopted a variety of forms of storage over the years, resulting in different appliances, hardware, stacks, solutions, virtualized data centers and a host of technologies, new and old. In any organization, storage is a turf war with vendors competing for a slice of the pie.

Networking, on the other hand, has always been about connectivity, bridging huge geographical distances or providing mobility. The rise of cloud computing is a testament to the ubiquity of networking and to the reliance of both compute and storage on it. In addition, networking lets us dictate terms and policies for transporting and securing data in transit. Tunneling is one such example: network packets are shipped with an additional label that allows them to be routed through a public network, which makes it convenient to set up strong security even over a public fabric. Authentication and encryption can also be governed for the transport.

Traditionally these have been two different concerns, just like compute and storage. Networking more or less meant point-to-point communication for end users with relays in between, and storage gave the assurance that the data that arrives over the network is not lost.

A lot of factors have changed in between: we added geographical regions for storage and datastores, replication, and data guarantees that evolved from those of databases to eventual consistency, and processing requirements changed from network-attached storage on one node to a large cluster of nodes. The domains of compute, network and storage have overlapped to form popular and notable technologies.

Sunday, July 1, 2018

#codingexercise
Problem: Find the minimum number of jumps needed to reach the end of the array, where each element gives the maximum number of steps that can be taken forward from that element.
Solution:
int GetMinJumps(List<int> A, int start, int end)
{
    // int.MaxValue means the end cannot be reached from start
    int min = int.MaxValue;

    if (start == end) return 0;
    if (A[start] == 0) return min;

    // try every position reachable from start and keep the min cost
    for (int i = start + 1; i <= end; i++)
    {
        if (i > start + A[start]) break; // beyond the reach of this jump
        int jumps = GetMinJumps(A, i, end);
        if (jumps != int.MaxValue && jumps + 1 < min)
        {
            min = jumps + 1;
        }
    }

    return min;
}