Cluster computing

Thursday, July 5, 2018

Namespace, Buckets, Objects and their use with Querying

A file system does not lend itself to the same querying capabilities and performance as database tables. Directories and files are enumerated by iteration, whereas database tables have indexes that allow faster access than a sequential scan. Nothing works quite like a database for efficient querying, both for historical reasons and because local data access is far more efficient than remote data access, especially when the data is organized at its finest granularity and managed with metadata and query caching. We have realized cloud databases where remote access does not affect the service level agreement for business transactions, but we have yet to realize database-like queries over an object store.
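To make the contrast between iteration and indexed access concrete, here is a minimal sketch in Python. It assumes a hypothetical in-memory list of file records; the names `records`, `find_by_scan` and `build_index` are illustrative and do not belong to any file system or database API. A dictionary stands in for the index here, whereas real databases typically use tree-based indexes.

```python
from collections import defaultdict

# Hypothetical file records as (path, owner) pairs. Illustrative data only.
records = [
    ("/data/a.log", "alice"),
    ("/data/b.log", "bob"),
    ("/data/c.log", "alice"),
]

def find_by_scan(records, owner):
    """Sequential scan: examine every record, as a directory walk would."""
    return [path for path, rec_owner in records if rec_owner == owner]

def build_index(records):
    """Build an owner -> paths index in a single pass over the records."""
    index = defaultdict(list)
    for path, owner in records:
        index[owner].append(path)
    return index

index = build_index(records)
scan_result = find_by_scan(records, "alice")   # touches every record
index_result = index.get("alice", [])          # one lookup, no scan
```

Both calls return the same paths, but the scan cost grows with the number of records on every query, while the index pays that cost once when it is built.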

How then do we use object storage as a data store, and why do we need to enable it for querying?

Object storage is immensely popular in the cloud, just as the file system was for stashing data. As a low-level primitive, it helped the evolution of data warehouses in the cloud. Many kinds of structured and unstructured data make their way to object storage. However, aside from being available over HTTP, the data generally goes dark and opaque. Because object storage has none of the built-in data processing capabilities of relational or NoSQL databases, it does not become part of the higher-level business software stack and remains infrastructure.

Do databases need to be built on top of object storage? That may be an interesting concept, but the data management techniques within a database, namely locking, logging, indexes, catalog, caching, query plans and so on, are tightly coupled to the database and its central view of hierarchical metadata. Objects, on the other hand, carry their metadata with them. The absence of an index at the object storage level may be compensated for by a query processing layer stacked on top of the object storage that creates new objects, but such query processing is inherently different from a binary search on a clustered index.
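The idea of a query layer that compensates for the missing index by creating new objects can be sketched as follows. This is a toy in-memory model, not any vendor's API: `ObjectStore`, `build_metadata_index`, `query` and the `_index/format` key are all hypothetical names chosen for illustration.

```python
import json

class ObjectStore:
    """Toy in-memory stand-in for a bucket: key -> (data, metadata)."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data, metadata=None):
        self._objects[key] = (data, dict(metadata or {}))

    def get(self, key):
        return self._objects[key][0]

    def head(self, key):
        """Return a copy of the metadata the object carries with it."""
        return dict(self._objects[key][1])

    def list_keys(self):
        return sorted(self._objects)

def build_metadata_index(store, field, index_key):
    """Scan all objects once and persist a value -> keys map as a new object."""
    index = {}
    for key in store.list_keys():
        meta = store.head(key)
        if meta.get("kind") == "index":
            continue  # do not index the index objects themselves
        if field in meta:
            index.setdefault(str(meta[field]), []).append(key)
    payload = json.dumps(index).encode("utf-8")
    store.put(index_key, payload, {"kind": "index", "field": field})

def query(store, index_key, value):
    """Answer a lookup by reading the index object instead of scanning."""
    index = json.loads(store.get(index_key).decode("utf-8"))
    return index.get(str(value), [])

store = ObjectStore()
store.put("logs/day1.csv", b"a,b\n1,2\n", {"format": "csv"})
store.put("logs/day2.csv", b"a,b\n3,4\n", {"format": "csv"})
store.put("img/logo.png", b"\x89PNG", {"format": "png"})
build_metadata_index(store, "format", "_index/format")
csv_keys = query(store, "_index/format", "csv")
```

The index here is just another object in the same store, so it goes stale when objects change and must be rebuilt by the query layer. A database engine, by contrast, maintains its clustered index as part of every write.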

Will the querying capabilities on the object storage increase its adoption?

We are adding compute to storage. There is no limit to the possibilities once we take advantage of the virtualization that some object stores enable. Even without the compute, the storage solution of such object stores satisfies immense and diverse requirements. With compute and data processing capabilities offered out of the platform, the operations expand beyond create, update and delete to standard query operations that can support a dashboard of charts and graphs, participate in streaming queries or improve the metadata of the objects. For example, the usage statistics of an object may now be part of its metadata.
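As a rough illustration of usage statistics becoming part of an object's metadata, the sketch below wraps a toy in-memory store so that every read updates a counter. The class `TrackedStore`, the `read_count` and `size` fields and the `most_read` query are assumptions made for this example, not features of any particular object store.

```python
class TrackedStore:
    """Toy object store that records usage statistics in object metadata."""

    def __init__(self):
        self._data = {}
        self._meta = {}

    def put(self, key, data):
        self._data[key] = data
        self._meta[key] = {"size": len(data), "read_count": 0}

    def get(self, key):
        # The compute step: each read enriches the object's metadata.
        self._meta[key]["read_count"] += 1
        return self._data[key]

    def metadata(self, key):
        return dict(self._meta[key])

    def most_read(self, n=1):
        """A simple query over metadata, e.g. to feed a dashboard."""
        ranked = sorted(
            self._meta,
            key=lambda k: (-self._meta[k]["read_count"], k),
        )
        return ranked[:n]

tracked = TrackedStore()
tracked.put("report.pdf", b"12345")
tracked.put("notes.txt", b"abc")
tracked.get("notes.txt")
tracked.get("notes.txt")
tracked.get("report.pdf")
top = tracked.most_read(1)
notes_meta = tracked.metadata("notes.txt")
```

Once statistics such as the read count live alongside the object, a chart of the most accessed objects becomes a query over metadata rather than a separate logging pipeline.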

Conclusion: A data processing kit specific to object storage may be a great library to include with an object storage solution.

#codingexercise

Find the number of ways to tile a strip of n units using tiles that are one or two units long:

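// Counts the ways to tile a strip of n units with tiles of length 1 and 2.
int GetCount(uint n)
{
    if (n == 0) return 0; // no tiles to place
    if (n == 1) return 1; // a single 1-unit tile
    if (n == 2) return 2; // 1+1, or one 2-unit tile
    // The last tile is either 1 unit (n-1 remain) or 2 units (n-2 remain).
    return GetCount(n - 1) + GetCount(n - 2);
}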
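The recursion above recomputes the same subproblems many times, so its running time grows exponentially with n. The same recurrence can be evaluated in linear time by carrying only the last two counts forward. The following is a sketch in Python rather than the language of the exercise; `get_count` is an illustrative name, and it keeps the original's convention of returning 0 for n = 0.

```python
def get_count(n):
    """Count tilings of a strip of n units with tiles of length 1 and 2."""
    if n == 0:
        return 0  # same convention as the recursive version
    prev, curr = 1, 1  # seed value and the count for n = 1
    for _ in range(n - 1):
        # count(k) = count(k - 1) + count(k - 2)
        prev, curr = curr, prev + curr
    return curr

counts = [get_count(n) for n in range(1, 6)]
```

For n from 1 to 5 this yields 1, 2, 3, 5 and 8, matching the recursive definition.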
