Sunday, November 4, 2018

Data Virtualization over Object Storage:

Introduction:  
Object Storage is routinely put to use for backup and archival. The content in an Object Storage is also suitable for queries. Since the data is unstructured and available in the form of copies, we merely have to locate it. Data virtualization attempts to do just that. 

Description: 
Queries are usually destined for a single data source, but query execution, like any other application, retrieves data without requiring technical details about it, such as where it is located. Location usually depends on the organization that has a business justification for growing and maintaining the data. Not everybody likes to dip right into the data lake at the beginning of a venture, as most organizations do. They have to grapple with their technical needs and changing business priorities before the data and its management solution can be called stable.  
Object Storage, unlike databases, allows almost limitless storage, with sufficient replication groups to cater to organizations that need their own copy. In addition, the namespace-bucket-object hierarchy allows the level of separation that organizations need.  
The role of Object Storage in data virtualization is only at the physical storage level, where we determine which site/zone to get the data from. A logical data virtualization layer that knows which namespace or bucket to go to within an Object Storage does not come straight out of the Object Storage; a gateway would serve that purpose. Queries can then choose to run against a virtualized view of the data by querying the gateway, which in turn fetches the data from the corresponding location.   
There are many levels of abstraction. First, the destination data source may be within a single Object Storage. Second, the destination data sources may span different Object Storage instances. Third, the destinations may span different storage types, such as an Object Storage and a cluster file system. In all these cases, the virtualization logic resides external to the storage and can be written in different forms. Finally, the logic within the virtualization can be customized so that queries can make the most of it. The usages of data virtualization are described elsewhere; here we discuss the role of Object Storage.  
Data Virtualization does not need to be a thin layer that merely mashes up different Object Storage instances. Even as just a gateway, it could allow customizations from users or administrators for their workloads. However, as described in this write-up, it does not stop there. This layer can be intelligent enough to map data types to storage. Typically, queries specify the data location either as part of the query, with fully resolved data types, or as part of the connection in which the queries are made. In a stateless request-response mode, this can be part of the request. The layer resolving the entire request can determine the storage location. Therefore, data virtualization can act as an intelligent router.  
Notice that both the user and the system can determine the policies for data virtualization. If there is only one store for a data type, the location is known. If the same data type is available in different locations, the determination can be made by the user with the help of rules and configurations. If the user has not specified any, the system can choose the location for the data type. The rules are not necessarily maintained by the system in the latter case; it can simply choose the first available store that has the data type.  
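The routing policy above can be sketched in a few lines of Python. This is a minimal, hypothetical model: the store names, the rule format, and the fallback of picking the first available store are all assumptions made for illustration.

```python
# Hypothetical sketch of the data virtualization layer as an intelligent router.
# Store names, the rule format, and the resolve() fallback are illustrative.

class DataVirtualizationRouter:
    def __init__(self):
        self.stores = {}      # store name -> set of data types it holds
        self.user_rules = {}  # data type -> preferred store (user-specified)

    def register_store(self, name, data_types):
        self.stores[name] = set(data_types)

    def add_rule(self, data_type, store):
        self.user_rules[data_type] = store

    def resolve(self, data_type):
        # A user-specified rule wins when the type lives in several stores.
        preferred = self.user_rules.get(data_type)
        if preferred and data_type in self.stores.get(preferred, set()):
            return preferred
        # Otherwise the system picks the first available store with the type.
        for name, types in sorted(self.stores.items()):
            if data_type in types:
                return name
        raise LookupError(f"no store holds data type {data_type!r}")

router = DataVirtualizationRouter()
router.register_store("objectstore-east", ["invoice", "log"])
router.register_store("objectstore-west", ["invoice"])
router.add_rule("invoice", "objectstore-west")
print(router.resolve("invoice"))  # user rule routes to objectstore-west
print(router.resolve("log"))      # system falls back to the first store with the type
```

The point of the sketch is the precedence: user rules first, system default second, exactly as described above.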
Many systems use the technique of reflection to find out whether a module has a particular data type. An Object Storage has no such mechanism. However, nothing prevents the data virtualization layer from maintaining, in the Object Storage, a registry or the data-type-defining modules themselves, so that those modules can be loaded and introspected to determine whether such a data type exists. This technique is common in many run-times. Therefore, the intelligence added to the data virtualization layer can draw inspiration from these well-known examples. 
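The registry-plus-introspection idea can be demonstrated with Python's own reflection machinery. Here the in-memory dictionary stands in for a registry object kept in the Object Storage; the type-to-module entries are illustrative.

```python
# Minimal sketch of the registry-plus-introspection idea using Python's own
# reflection. The registry dict stands in for an object kept in Object Storage
# under a well-known bucket/key; its entries are illustrative.
import importlib

# Registry: maps a data type name to the module said to define it.
type_registry = {"OrderedDict": "collections", "Path": "pathlib"}

def locate_data_type(type_name):
    """Load the registered module and introspect it for the data type."""
    module_name = type_registry.get(type_name)
    if module_name is None:
        return None
    module = importlib.import_module(module_name)
    return module_name if hasattr(module, type_name) else None

print(locate_data_type("OrderedDict"))  # found in collections
print(locate_data_type("Missing"))      # not registered -> None
```

The same pattern works when the "modules" are objects fetched from storage: translate the type name to a location, load, introspect.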
Architecture: 
 

The above illustration is for utilizing query services via languages specific to each store, such as document databases, graph databases, stream storage, and object storage. 
The query languages differ from store to store. For example, the streaming queries generally follow a five-step procedure:  
1)     Define events in terms of the payload as the data values of the event and the shape as the lifetime of the event along the time axis.  
2)     Define the input streams of the event as a function of the event payload and shape. For example, this could be a simple enumerable over some time interval.  
3)     Based on the event definitions and the input stream, determine the output stream and express it as a query. In a way, this describes a flow chart for the query.  
4)     Bind the query to a consumer. This could be a console. For example:  
        var query = from win in inputStream.TumblingWindow(TimeSpan.FromMinutes(3)) select win.Count();  
5)     Run the query and evaluate it based on time. 
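The five steps above can be sketched in plain Python. This is a hedged approximation, not a streaming runtime: events are reduced to (timestamp, payload) pairs, the "shape" collapses to a point in time, and the window length and sample stream are assumptions.

```python
# Hedged sketch of the five-step streaming procedure in plain Python: events
# with a payload and a timestamp, a tumbling window of 3 minutes, and a count
# per window bound to a console consumer. The sample stream is illustrative.
from collections import Counter

WINDOW_SECONDS = 3 * 60  # tumbling window of 3 minutes

# Steps 1 and 2: define events and the input stream as (timestamp_seconds, payload).
input_stream = [(10, "a"), (70, "b"), (200, "c"), (400, "d")]

# Step 3: express the output stream as a query: count of events per window.
def tumbling_window_counts(stream, window):
    counts = Counter()
    for timestamp, _payload in stream:
        counts[timestamp // window] += 1
    return dict(counts)

# Steps 4 and 5: bind to a consumer (the console) and run.
for window_index, count in sorted(tumbling_window_counts(input_stream, WINDOW_SECONDS).items()):
    print(f"window {window_index}: {count} events")
```

Each tumbling window is identified by integer division of the timestamp, which is also how the SQL variant below buckets time.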
On the other hand, simple aggregations on Big Data involve Map-Reduce algorithms, but these can be expressed in higher-level query languages as well.  
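For comparison, the Map-Reduce formulation of a count-per-day aggregation can be sketched as follows. The records, the per-day key, and the single-process "shuffle" are assumptions for illustration, not a real Map-Reduce runtime.

```python
# Illustrative Map-Reduce formulation of a count-per-day aggregation. The data
# and the in-process shuffle/sort stand in for a real MapReduce runtime.
from itertools import groupby

SECONDS_PER_DAY = 60 * 60 * 24

def map_phase(record):
    # Emit (day, 1) for each update, bucketing the timestamp by day.
    return (record["timestamp"] // SECONDS_PER_DAY, 1)

def reduce_phase(key, values):
    return (key, sum(values))

records = [{"timestamp": t} for t in (100, 200, 90000, 90001, 180000)]
mapped = sorted(map_phase(r) for r in records)          # map + shuffle/sort
reduced = [reduce_phase(k, [v for _, v in group])
           for k, group in groupby(mapped, key=lambda kv: kv[0])]
print(reduced)
```

The per-day key here plays the same role as the `DIV (60*60*24)` expression in the SQL windowing example.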

It is probably most succinct in SQL where a windowing function can be written as  
SELECT COUNT(*)   
OVER ( PARTITION BY hash(u.timestamp DIV (60*60*24)) partitions 3 ) u1  
FROM graphupdate u; 

The above queries merely highlight the different rigor needed for destination stores that differ by type. 
However, the illustration also suggests that these query services are best kept closer to the data store rather than combined in the virtualization layer, unless the stores are all of the same type. As a special example of multi-level virtualization, a single query service above may retrieve data from more than one data storage. 

Conclusion 
Data Virtualization as a technique is popular with databases. However, it is equally applicable to Object Storage.  

Saturday, November 3, 2018


We were discussing the use of Object Storage to stash state for each worker from application services and clusters, and its implementation in the form of distributed services over object storage.
A lease is supposed to help track resource usage. If callers have to send requests to the same single point of contention, it is not very useful. When requests are served from separate leasing services, performance generally improves when there are load balancers. That is why different proxy servers behind a lease could maintain their own leasing service. Shared services may offer a service level agreement on par with an individual service for requests. Since a lease will not see performance degradation whether sending to a proxy server or to a shared dedicated leasing service, it works in both cases. Replacing a shared dedicated leasing service with individual dedicated leasing services is entirely the call of the consumers.  

The policies for a lease need not be a mere regex translation of worker parameters to the object hierarchy. We are dealing with objects on a worker-by-worker basis, where we can use all parts of the hierarchical namespace-bucket-object names to differentiate the resource usages for workers. For that matter, names may be translated so that the caller needs only a lease id to enumerate all storage resources within the object storage. We are not just putting the lease on steroids; we are also making it smart by allowing more interpretations and actions on the resource usages. These rules can be authored in the form of expressions and statements, much like a program with lots of if-then conditions ordered by their execution sequence. The leases facilitate a lookup of objects without sacrificing performance and without restrictions on the organization of objects within local or distributed stores, because the lease enumerates them.
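Such ordered if-then lease policies can be sketched as follows. The worker fields, the rule predicates, and the namespace/bucket/object naming scheme are all illustrative assumptions.

```python
# Sketch of lease policies as ordered if-then rules, going beyond plain regex
# translation. Rule predicates, worker fields, and the naming scheme are
# illustrative assumptions.

class LeasePolicy:
    def __init__(self):
        self.rules = []  # (predicate, path_builder), evaluated in order

    def add_rule(self, predicate, path_builder):
        self.rules.append((predicate, path_builder))

    def resolve(self, worker):
        for predicate, path_builder in self.rules:
            if predicate(worker):
                return path_builder(worker)
        # Default: namespace/bucket/object derived from the worker itself.
        return f"{worker['tenant']}/{worker['pool']}/{worker['id']}"

policy = LeasePolicy()
# Example rule: archive-tier workers get their state under a dedicated bucket.
policy.add_rule(lambda w: w.get("tier") == "archive",
                lambda w: f"{w['tenant']}/archive/{w['id']}")

print(policy.resolve({"tenant": "acme", "pool": "batch", "id": "w1", "tier": "archive"}))
print(policy.resolve({"tenant": "acme", "pool": "batch", "id": "w2"}))
```

Because the rules run in execution-sequence order, earlier rules act as overrides, which is what lets the lease "interpret" resource usages rather than merely translate names.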

Friday, November 2, 2018

We were discussing the use of Object Storage to stash state for each worker from application services and clusters, and its implementation in the form of distributed services over object storage. The nodes in a storage pool assigned to the VDC may have a fully qualified name and a public IP address. Although these names and IP addresses are not shared with anyone, they serve to represent the physical location of the fragments of an object. Generally, an object is written across three such nodes. The storage engine gets a request to write an object. It writes the object to one chunk, but the chunk may be physically located on three separate nodes, and the writes to these three nodes may even happen in parallel. The object location index of the chunk and the disk locations corresponding to the chunk are also artifacts that need to be written; for this purpose, too, three separate nodes may be chosen. The storage engine records the disk locations of the chunk in a chunk location index written to three different disks/nodes, and the index locations are chosen independently of the object chunk locations. Therefore, we already have a mechanism to store locations. When these locations have representations for the node and the site, a copy of an object served over the web has a physical internal location. Even when objects are geo-replicated, the object and the location information are updated together. Tracking site-specific locations for an object is then a matter of merely maintaining a registry of locations, the same way we look up the chunks for an object. We just need more information in the location part of the object, and the replication group automatically takes care of keeping locations and objects updated as they are copied.    
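The bookkeeping described above can be sketched as three small indexes. The node names and the round-robin placement are assumptions; the point is that chunk locations and index locations are chosen independently.

```python
# Illustrative sketch of the chunk location bookkeeping described above: an
# object maps to a chunk, the chunk maps to three disk/node locations, and the
# index entry itself is replicated to three independently chosen nodes.
# Node names and the round-robin placement are assumptions.
import itertools

NODES = ["node1.vdc1", "node2.vdc1", "node3.vdc1", "node4.vdc1", "node5.vdc1"]
node_cycle = itertools.cycle(NODES)

object_index = {}    # object name -> chunk id
chunk_index = {}     # chunk id -> three node locations holding the fragments
index_replicas = {}  # chunk id -> three nodes holding the index entry itself

def write_object(name, chunk_id):
    object_index[name] = chunk_id
    chunk_index[chunk_id] = [next(node_cycle) for _ in range(3)]
    # Index locations are chosen independently of the chunk locations.
    index_replicas[chunk_id] = [next(node_cycle) for _ in range(3)]

def locate(name):
    """Resolve an object to the physical nodes holding its fragments."""
    return chunk_index[object_index[name]]

write_object("bucket1/photo.jpg", "chunk-42")
print(locate("bucket1/photo.jpg"))
```

Adding a site identifier to each entry in `chunk_index` would give exactly the registry of site-specific locations the paragraph describes.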

Thursday, November 1, 2018

We were discussing the use of Object Storage to stash state for each worker from application services and clusters. We referred to the use of leases in the form of tickets. The tickets need not be issued from a singleton service. Tickets can be distributed, just as the objects tracked by the tickets can belong to different object storage. We discuss these alternate forms of issuing tickets:
The ticketing service can be chained across object storage. If the current object storage does not meet the need of a worker pool, it is possible to pass on the ticketing to another similar stack. The ticketing layer merely needs to forward the requests that it cannot answer to a default pre-registered outbound destination. In a distributed ticketing service, the handlers can make sense of the requests simply with the help of the object-storage-namespace-bucket-object hierarchy and decide whether a request can be handled or forwarded. If a service cannot handle a request, it simply forwards it to another ticketing-service-cum-object-storage stack. This is somewhat different from the original notion that the ticketing service is bound to an object storage where the resources for the workers are maintained. The linked ticketing service does not even need to take time to resolve the object location to see if it exists; it can merely translate the hierarchical naming to know whether the resources belong to it. This shallow lookup means a request can be forwarded faster to another linked object storage and ultimately to where it may be guaranteed to be found. The linked storage has no criteria that the object stores be similar; as long as the forwarding logic is enabled, any implementation can exist in each of the storages for translation, lookup, and return. Another way to distribute tickets is with hashes, where the destination is determined based on a hash table. Whether we use routing tables or a static hash table, the networking over the object storage can be its own layer, facilitating request resolution at the appropriate object storage and ticketing layer stack.
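The chained forwarding with a shallow lookup can be sketched as follows. The service names, the namespace prefixes, and the ticket format are illustrative assumptions.

```python
# Sketch of a chained ticketing service: each service answers requests for the
# namespaces it owns and forwards everything else to a pre-registered default
# outbound destination, using only a shallow name translation (no object
# lookup). Names, prefixes, and the ticket format are illustrative.

class TicketingService:
    def __init__(self, name, owned_namespaces, next_hop=None):
        self.name = name
        self.owned = set(owned_namespaces)
        self.next_hop = next_hop  # default pre-registered outbound destination

    def issue_ticket(self, object_path):
        # Shallow lookup: decide ownership from the namespace alone, without
        # resolving whether the object actually exists.
        namespace = object_path.split("/", 1)[0]
        if namespace in self.owned:
            return f"ticket:{self.name}:{object_path}"
        if self.next_hop is None:
            raise LookupError(f"no service owns namespace {namespace!r}")
        return self.next_hop.issue_ticket(object_path)

west = TicketingService("west", ["ns-west"])
east = TicketingService("east", ["ns-east"], next_hop=west)

print(east.issue_ticket("ns-east/bucket1/obj1"))  # handled locally
print(east.issue_ticket("ns-west/bucket2/obj2"))  # forwarded to west
```

Swapping the `next_hop` chain for a hash of the namespace would give the static hash table variant mentioned above.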