Sunday, November 4, 2018

Data Virtualization over Object Storage:

Introduction:  
Object Storage is routinely used for backup and archival, but its content is also suitable for queries. Since the data is unstructured and is available in the form of copies, we merely have to locate it. Data virtualization attempts to do just that. 

Description: 
Queries usually target a single data source, but query execution, like any other application, retrieves data without requiring technical details about the data such as where it is located. Location usually depends on the organization that has a business justification for growing and maintaining the data. Not every organization wants to dip into a data lake right at the beginning of a venture. Most have to grapple with their technical needs and changing business priorities before the data and its management solution can be called stable.  
Object Storage, unlike databases, allows almost limitless storage with sufficient replication groups to cater to organizations that need their own copy. In addition, the namespace-bucket-object hierarchy provides the level of separation that organizations need.  
The role of Object Storage in data virtualization is limited to the physical storage level, where we determine which site or zone to get the data from. A logical data virtualization layer that knows which namespace or bucket to go to within an Object Storage does not come straight out of the Object Storage. A gateway would serve that purpose. Queries can then choose to run against a virtualized view of the data by querying the gateway, which in turn fetches the data from the corresponding location.   
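The gateway idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the `Location` and `Gateway` names, the in-memory dictionaries standing in for sites, and the sample view are all hypothetical and do not correspond to any particular Object Storage API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    site: str        # physical site/zone, decided at the storage level
    namespace: str   # logical separation handled by the gateway
    bucket: str


class Gateway:
    """Maps a logical view name to a namespace/bucket and fetches from it."""

    def __init__(self, views, stores):
        self.views = views      # view name -> Location
        self.stores = stores    # site -> {(namespace, bucket, key): bytes}

    def fetch(self, view, key):
        loc = self.views[view]           # logical resolution in the gateway
        store = self.stores[loc.site]    # physical resolution to a site
        return store[(loc.namespace, loc.bucket, key)]


# Hypothetical data: one site holding one object.
stores = {"site-a": {("finance", "invoices", "2018-11.csv"): b"id,amount\n1,40\n"}}
gateway = Gateway({"invoices": Location("site-a", "finance", "invoices")}, stores)
data = gateway.fetch("invoices", "2018-11.csv")
```

The caller names only the view and the key. The namespace, bucket and site stay behind the gateway.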
There are many levels of abstraction. First, the destination data source may be within a single Object Storage. Second, the destination data source may span different Object Storage instances. Third, the destination may span different kinds of storage, such as an Object Storage and a cluster file system. In all these cases, the virtualization logic resides external to the storage and can be written in different forms. Finally, the logic within the virtualization layer can be customized so that queries can make the most of it. The usages of data virtualization are described in a separate article, while here we discuss the role of Object Storage.  
Data Virtualization does not need to be a thin layer that is merely a mash-up of different Object Storage. Even if it were just a gateway, it could allow customizations from users or administrators for their workloads. However, it does not stop there. This layer can be intelligent enough to map data types to storage. Typically, queries specify the data location either as part of the query, with fully resolved data types, or as part of the connection over which the queries are made. In a stateless request-response mode, this can be part of the request. The layer resolving the entire request can determine the storage location. Therefore, the data virtualization layer can be an intelligent router.  
Notice that both the user and the system can determine the policies for data virtualization. If the data type exists in only one location, then the location is known. If the same data type is available in different locations, then the determination can be made by the user with the help of rules and configurations. If the user has not specified any, then the system can choose the location for the data type. The rules are not necessarily maintained by the system in the latter case. It can simply choose the first available store that has the data type.  
Many systems use the technique of reflection to find out whether a module has a particular data type. An Object Storage does not have any such mechanism. However, nothing prevents the data virtualization layer from maintaining a registry in the Object Storage, or from keeping the modules that define the data types there, so that those modules can be loaded and introspected to determine whether such a data type exists. This technique is common in many runtimes. Therefore, the intelligence added to the data virtualization layer can draw inspiration from these well-known examples. 
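The resolution policy described above can be written down as a small function. This is a minimal sketch, assuming a hypothetical `catalog` that maps store names to the data types they hold and an optional `user_rules` mapping. Neither name comes from an existing product.

```python
def resolve_location(data_type, catalog, user_rules=None):
    """Pick a store for data_type.

    catalog:    store name -> set of data types it holds (dict order = availability order)
    user_rules: optional mapping of data type -> preferred store name
    """
    candidates = [store for store, types in catalog.items() if data_type in types]
    if not candidates:
        raise LookupError(f"no store holds {data_type!r}")
    if len(candidates) == 1:
        return candidates[0]              # only one location, so it is known
    preferred = (user_rules or {}).get(data_type)
    if preferred in candidates:
        return preferred                  # user rule decides between locations
    return candidates[0]                  # system default: first available store


# Hypothetical catalog of three stores.
catalog = {
    "store-east": {"invoice", "order"},
    "store-west": {"order"},
    "cluster-fs": {"logfile"},
}
```

With this catalog, `resolve_location("order", catalog)` falls back to the first store that has the type, while passing `user_rules={"order": "store-west"}` lets the user override that choice.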
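As an illustration of the registry and introspection idea, the sketch below loads module source text and checks whether it defines a class with a given name. The `registry` dictionary stands in for objects kept in a bucket, and all names are hypothetical. Executing stored source is acceptable only for trusted content, so treat this as a sketch rather than a recommended design.

```python
import inspect
import types


def load_module(name, source):
    """Build a module object from source text fetched from storage."""
    module = types.ModuleType(name)
    exec(compile(source, name, "exec"), module.__dict__)
    return module


def defines_type(module, type_name):
    """Introspect the module for a class with the given name."""
    return inspect.isclass(getattr(module, type_name, None))


# Hypothetical registry: module name -> source text held in the Object Storage.
registry = {
    "orders": "class Order:\n    pass\n",
    "events": "class ClickEvent:\n    pass\n",
}


def find_modules(type_name):
    """Return the registered modules that define type_name."""
    return [name for name, source in registry.items()
            if defines_type(load_module(name, source), type_name)]
```

A lookup such as `find_modules("Order")` then tells the virtualization layer which registered module, and hence which location, knows about the data type.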
Architecture: 
 

The illustration shows query services that use languages specific to each store, such as document databases, graph databases, stream storage and object storage. 
The query languages differ from store to store. For example, streaming queries generally follow a five-step procedure:  
1)     Define events in terms of the payload, which is the data values of the event, and the shape, which is the lifetime of the event along the time axis.  
2)     Define the input streams of the event as a function of the event payload and shape. For example, this could be a simple enumerable over some time interval.  
3)     Based on the event definitions and the input stream, determine the output stream and express it as a query. In a way, this describes a flow chart for the query.  
4)     Bind the query to a consumer, which could be the console. For example:  
        var query = from win in inputStream.TumblingWindow(TimeSpan.FromMinutes(3)) select win.Count();  
5)     Run the query and evaluate it based on time. 
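The tumbling window count in step 4 can be mimicked outside a streaming engine. The following is a minimal Python sketch, not the streaming engine's own API: events are assumed to be `(timestamp, payload)` pairs, and windows are aligned to the Unix epoch, which is an assumption made for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta


def tumbling_window_counts(events, window):
    """Count events per fixed, non-overlapping window.

    events: iterable of (timestamp, payload) pairs
    window: timedelta giving the window length
    """
    epoch = datetime(1970, 1, 1)
    counts = Counter()
    for timestamp, _payload in events:
        index = (timestamp - epoch) // window   # which window the event falls in
        counts[epoch + index * window] += 1     # key by the window start time
    return sorted(counts.items())


# Hypothetical input stream: four events over a few minutes.
start = datetime(2018, 11, 4, 12, 0)
events = [(start + timedelta(minutes=m), {"value": m}) for m in (0, 1, 4, 7)]
result = tumbling_window_counts(events, timedelta(minutes=3))
```

Each entry in `result` pairs a window start time with the number of events that fell into that three-minute window.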
On the other hand, simple aggregations on BigData involve Map-Reduce algorithms. But these can be expressed  

It is probably most succinct in SQL where a windowing function can be written as  
SELECT COUNT(*)   
OVER ( PARTITION BY hash(u.timestamp DIV (60*60*24)) partitions 3 ) u1  
FROM graphupdate u; 

The above queries merely highlight the different rigor needed for destination stores that differ by type. 
However, the illustration also suggests that these query services are best kept close to the data store rather than combined in the virtualization layer, unless the stores are all of the same type. As a special example of multi-level virtualization, a single query service may retrieve data from more than one data storage. 
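One way to picture this arrangement is a virtualization layer that only dispatches, leaving the query language to the service next to each store. The sketch below is a hedged illustration: the service functions, the `SERVICES` table and `combined_service` are hypothetical stand-ins, and real services would speak each store's native query language.

```python
def document_service(query):
    # Stand-in for a document database's native query service.
    return f"document store ran: {query}"


def stream_service(query):
    # Stand-in for a stream storage's native query service.
    return f"stream store ran: {query}"


SERVICES = {"document": document_service, "stream": stream_service}


def run_query(store_type, query):
    """Hand the query, untouched, to the service that sits next to the store."""
    try:
        service = SERVICES[store_type]
    except KeyError:
        raise LookupError(f"no query service for store type {store_type!r}") from None
    return service(query)


def combined_service(predicate, storages):
    """Multi-level case: one service that reads from several storages."""
    return [record for storage in storages for record in storage if predicate(record)]
```

The layer itself never interprets the query. Only `combined_service` crosses storages, and it can do so because the records it reads are assumed to share one shape.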

Conclusion 
Data Virtualization as a technique is popular with databases. However, it is equally applicable to Object Storage.  
