Friday, May 17, 2019


Object storage has established itself as “standard storage” in the enterprise and the cloud. Because it brings together many storage best practices to provide durability, scalability, availability and low cost to its users, it can go beyond tier 2 storage to become nearline storage for vectorized execution. Web-accessible storage has been important for vectorized execution. We suggest that some NoSQL stores can be overlaid on top of object storage, and we discuss an example with Column storage. We focus on the use case of columns because they are not relational and find many applications that are similar to the use cases of object storage. They also tend to become large, with significant read-write access. Object storage then transforms from a storage layer participating in vectorized execution into one that actively builds metadata, maintains organization, rebuilds indexes, and supports web access for those who don’t want to maintain local storage or who want to leverage easy data transfers from a stash. Object storage utilizes a queue layer and a cache layer to handle the processing of data for pipelines. We presented the notion of fragmented data transfer in an earlier document. Here we suggest that Columns are similar to fragmented data transfer, and we discuss how object storage can serve as both the source and the destination of Columns.
Column storage gained popularity because cells could be grouped in columns rather than rows. Reads and writes are over columns, enabling fast data access and aggregation. Its need for storage is not very different from that of applications requiring object storage. However, as object storage makes inroads into vectorized execution, data transfers become increasingly fragmented and continuous. At this juncture it is important to facilitate data transfer between objects and Columns.
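To make the grouping of cells by column concrete, here is a minimal sketch in Python. It assumes a dictionary-backed stand-in for an object store, one object per column, and JSON as the serialization; the class `InMemoryObjectStore`, the functions `rows_to_columns`, `write_table` and `column_sum`, and the `table/column` key layout are all hypothetical names chosen for illustration, not part of any real product.

```python
import json
from typing import Any, Dict, List


class InMemoryObjectStore:
    """Hypothetical stand-in for an object store: a flat namespace of key -> bytes."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def keys(self, prefix: str = "") -> List[str]:
        # Object stores typically list keys by prefix; we mimic that here.
        return sorted(k for k in self._objects if k.startswith(prefix))


def rows_to_columns(rows: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Pivot row-oriented records into one list of cells per column.

    Assumes every row carries the same field names as the first row.
    """
    if not rows:
        return {}
    names = list(rows[0])
    return {name: [row[name] for row in rows] for name in names}


def write_table(store: InMemoryObjectStore, table: str, rows: List[Dict[str, Any]]) -> None:
    """Store each column as its own object under the assumed key '<table>/<column>'."""
    for name, values in rows_to_columns(rows).items():
        store.put(f"{table}/{name}", json.dumps(values).encode("utf-8"))


def column_sum(store: InMemoryObjectStore, table: str, column: str) -> float:
    """Aggregate one column by reading only that column's object."""
    return sum(json.loads(store.get(f"{table}/{column}")))


if __name__ == "__main__":
    store = InMemoryObjectStore()
    write_table(store, "orders", [{"id": 1, "price": 2.5}, {"id": 2, "price": 4.0}])
    print(store.keys("orders/"))
    print(column_sum(store, "orders", "price"))
```

The aggregation touches a single object and never reads the other columns, which is the access pattern that makes a column layout attractive on top of a key-addressed store.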
File systems have long been the destination for storing artifacts on disk, and while the file system has evolved to stretch over clusters and not just remote servers, it remains inadequate as blob storage. Data writers have to self-organize and interpret their files while frequently relying on metadata stored separately from the files. Files also tend to become binaries with proprietary interpretations. Files can only be bundled in an archive, and there is no object-oriented design over the data. If the storage were to support organizational units in terms of objects, without requiring hierarchical declarations and while supporting is-a or has-a relationships, it would become more usable than files.
Since Column storage overlays on Tier 2 storage on top of blocks, files and blobs, it is already transferring data to object storage. However, the reverse is not as frequent, although objects in a storage class can continue to be serialized to Columns in a continuous manner. The arrangement is also symbiotic for the audiences of both kinds of storage.
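One way to picture objects being serialized to Columns continuously is as a stream of records cut into small per-column fragments, each of which could be written as its own object. The sketch below is only an illustration under stated assumptions: the function names `column_fragments` and `_flush`, the key layout `<column>/fragment-<sequence>`, the fixed batch size and the JSON payload are all hypothetical.

```python
import json
from typing import Any, Dict, Iterable, Iterator, List, Tuple


def _flush(batch: List[Dict[str, Any]], columns: List[str], seq: int) -> Iterator[Tuple[str, bytes]]:
    """Emit one fragment per column for the given batch of records."""
    for name in columns:
        # Missing fields become None so every fragment in a batch has equal length.
        values = [record.get(name) for record in batch]
        yield f"{name}/fragment-{seq:05d}", json.dumps(values).encode("utf-8")


def column_fragments(
    records: Iterable[Dict[str, Any]], columns: List[str], batch_size: int
) -> Iterator[Tuple[str, bytes]]:
    """Lazily turn a stream of records into (key, payload) column fragments.

    Each fragment holds at most `batch_size` cells of a single column, so a
    consumer can write fragments as objects while the stream is still arriving.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[Dict[str, Any]] = []
    seq = 0
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield from _flush(batch, columns, seq)
            batch = []
            seq += 1
    if batch:  # final, possibly short, fragment
        yield from _flush(batch, columns, seq)


if __name__ == "__main__":
    stream = ({"a": i, "b": i * 2} for i in range(5))
    for key, payload in column_fragments(stream, ["a", "b"], batch_size=2):
        print(key, payload.decode("utf-8"))
```

Because the function is a generator, it never holds more than one batch in memory, which fits the continuous, fragmented style of transfer discussed here.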
As compute, network and storage overlap to expand the possibilities on each frontier at cloud scale, message passing has become a ubiquitous functionality. While libraries like protocol buffers and solutions like RabbitMQ are becoming popular, flows and their queues can be given native support in unstructured storage. With data finding universal storage in object storage, we can now focus on making nodes objects and edges keys, so that the web-accessible node can participate in Column processing and keep that processing exclusively compute-oriented.
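The idea of nodes as objects and edges as keys can be sketched with a plain dictionary standing in for the object store. In this hypothetical layout, a node's payload lives under `nodes/<id>` and each edge is an empty object whose key `edges/<source>/<destination>` encodes the relationship, so neighbors are found by a prefix listing. The function names and key scheme are assumptions made for illustration, and node identifiers are assumed not to contain `/`.

```python
from typing import Dict, List


def put_node(store: Dict[str, bytes], node_id: str, payload: bytes) -> None:
    """Store the node's data as an object under the assumed key 'nodes/<id>'."""
    store[f"nodes/{node_id}"] = payload


def put_edge(store: Dict[str, bytes], src: str, dst: str) -> None:
    """Record an edge purely in the key; the object body stays empty."""
    store[f"edges/{src}/{dst}"] = b""


def neighbors(store: Dict[str, bytes], src: str) -> List[str]:
    """Find outgoing neighbors by listing keys that share the source's prefix."""
    prefix = f"edges/{src}/"
    return sorted(key[len(prefix):] for key in store if key.startswith(prefix))


if __name__ == "__main__":
    store: Dict[str, bytes] = {}
    for node_id in ("a", "b", "c"):
        put_node(store, node_id, f"payload-{node_id}".encode("utf-8"))
    put_edge(store, "a", "b")
    put_edge(store, "a", "c")
    put_edge(store, "b", "c")
    print(neighbors(store, "a"))
```

With this layout the store holds only data and keys, while traversal logic stays in the caller, which matches the goal of keeping the processing side exclusively compute-oriented.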
