Wednesday, October 31, 2018

We were discussing the use of object storage to stash state across workers from applications and cluster nodes on a lease basis in the previous post.
This section is not exclusive to the Ticketing layer and is mentioned here only for convenience. We suggested earlier that tickets may be issued by any existing planning and tracking software in the ticket layer over object storage. There is no dearth of choices for such software in the market; almost any tracking software that can issue tickets will suffice. This allows for easy integration with existing software, so that applications may leverage their favorite API to create, update and resolve tickets.
In addition, each ticket may carry rich metadata that can be annotated by parties other than the workers who create, update and resolve the tickets. Such metadata may carry any number of fields that improve the tracking information associated with the objects in the object storage. It could also assist with updating the metadata on those objects. Metadata of this kind brings the users many perspectives other than that of the initiator.
As long as the tickets are open, they can also maintain links to IDs of resources in systems other than the ticket-issuing software. This allows the ticket layer to integrate with other production software and helps navigation from a ticket representing the lease to information in other systems that may also hold additional state for the workers that opened the tickets. This gives a holistic view of everything relevant to the workers merely from their read-write data path.
The schema for arranging integrations this way with existing tickets is similar to a snowflake pattern. Each link represents a dimension, and therefore the ticket is a representation of all the linked information, not just its own. This pattern also keeps the ticket independent of any other system that can go down: the links are only used if they can be reached. Since the ticket itself holds enough information locally for resource tracking, any external links are nice to have, not mandatory.
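As a concrete illustration, a ticket in this snowflake arrangement might be modeled as below. This is a minimal sketch; the field names are hypothetical and not tied to any particular ticketing product:

from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class LeaseTicket:
    ticket_id: str                       # the one ID the worker tracks
    worker_id: str                       # worker that opened the lease
    resource: str                        # leased object-storage prefix
    status: str = "open"
    # Each link is a dimension in the snowflake: an optional pointer to an
    # ID in another system (monitoring, billing, inventory and so on).
    links: Dict[str, str] = field(default_factory=dict)

    def resolve_link(self, system: str) -> Optional[str]:
        # Links are nice to have, never mandatory; the ticket itself holds
        # enough local information even when a linked system is down.
        return self.links.get(system)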

This kind of extensibility allows the ticket layer to grow without disrupting the path above or below it. Since the workers remain unaffected, there is no practical limit to the extensibility of this layer.

Tuesday, October 30, 2018

We were discussing the use of object storage to stash state across workers from applications and cluster nodes on a lease basis in the previous post. The object storage acts as an aggregator across workers in this design. It brings elasticity from the virtualized storage, customized to individual workers. There is no need for capacity planning by any organization, or for separate payment for file-system, block or other forms of storage, since the object storage not only unifies such storage but also supports billing with the help of detailed study of worker usages. The performance trade-off for workers is small: they are only required to replace their use of conventional storage with S3 access. They become facilitators for moving the compute layer beyond virtual machines and disks to separate compute and storage platforms, the likes of which can generate a new wave of commodity hardware suitable for both on-premise deployments and datacenters.
The ticketing and leasing system proposed in this post need not serve just applications and clusters. It can also serve other local platform providers that want to help ease the migration of workloads from local and unmanaged storage.
The use of a leasing service facilitates object lifecycle management as well as it could be done individually by workers. It can be offloaded to the workers, but the notion is that maintenance activities move from organization-owned to self-managed with this solution. Tickets may be implemented with any software as long as they map workers to resources in the object storage, as sketched below.
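A minimal sketch of such a leasing service follows, assuming simple time-to-live semantics; the method names and the in-memory map are illustrative only:

import time
from typing import Dict, List

class LeaseService:
    """Hypothetical leasing service: maps workers to object-storage
    prefixes for a bounded duration so that maintenance becomes
    self-managed rather than organization-owned."""
    def __init__(self, default_ttl_seconds: int = 3600):
        self.default_ttl = default_ttl_seconds
        self.leases: Dict[str, dict] = {}        # ticket_id -> lease record

    def acquire(self, ticket_id: str, worker_id: str, prefix: str) -> dict:
        lease = {"worker": worker_id,
                 "prefix": prefix,                # e.g. "workers/42/"
                 "expires": time.time() + self.default_ttl}
        self.leases[ticket_id] = lease
        return lease

    def renew(self, ticket_id: str) -> bool:
        lease = self.leases.get(ticket_id)
        if lease is None or lease["expires"] < time.time():
            return False                          # lapsed leases do not renew
        lease["expires"] = time.time() + self.default_ttl
        return True

    def expired(self) -> List[str]:
        # Objects under these prefixes become candidates for lifecycle
        # actions (archive or delete) once the lease lapses.
        now = time.time()
        return [t for t, l in self.leases.items() if l["expires"] < now]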
Moreover, not all requests need to reach the object storage. In some cases, a web ticket may use temporary storage from hybrid choices. The benefits of using a web ticket include saving bandwidth, reducing server load, and improving request-response time. If a dedicated content store is required, the ticket service and the store are typically encapsulated into a content server. This is quite the opposite paradigm from using object storage and replicated objects to serve the content directly from the store. The distinction here is that there are two layers of functions: the first is the ticket layer, which solves the life cycle management of storage resources; the second covers the storage concerns of the actual content, which we mitigate with the help of object storage. We will call this the storage engine and will get to it shortly.
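Continuing the LeaseService sketch above, a minimal illustration of the two layers might look like this; the in-memory engine is only a stand-in for the object storage that a real storage engine would delegate to:

import time

class InMemoryEngine:
    """Stand-in for the storage engine layer."""
    def __init__(self):
        self._objects = {}
    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data
    def get(self, key: str) -> bytes:
        return self._objects[key]

class TicketLayer:
    """First layer: life cycle management. It stores no content itself;
    it validates the lease and delegates the I/O to the engine below."""
    def __init__(self, engine: InMemoryEngine, leases: "LeaseService"):
        self.engine = engine
        self.leases = leases
    def write(self, ticket_id: str, key: str, data: bytes) -> None:
        lease = self.leases.leases.get(ticket_id)
        if lease is None or lease["expires"] < time.time():
            raise PermissionError(f"lease {ticket_id} is not active")
        # Writes land under the leased prefix, keeping workers shared-nothing.
        self.engine.put(lease["prefix"] + key, data)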

The ticket would then be the only ID that the worker needs to track all of its storage resources.

Monday, October 29, 2018

We were discussing the use of object storage to stash state across workers from applications and cluster nodes on a lease basis in the previous post. The object storage acts as an aggregator across workers in this design. It brings elasticity from the virtualized storage, customized to individual workers. There is no need for capacity planning by any organization, or for separate payment for file-system, block or other forms of storage, since the object storage not only unifies such storage but also supports billing with the help of detailed study of worker usages. The performance trade-off for workers is small: they are only required to replace their use of conventional storage with S3 access. They become facilitators for moving the compute layer beyond virtual machines and disks to separate compute and storage platforms, the likes of which can generate a new wave of commodity hardware suitable for both on-premise deployments and datacenters.
The ticket service works closely with issue management software, and traditionally both have been long-standing products in the marketplace. Atlassian Jira is planning and tracking software that generates tickets for managing and tracking work items. The ticket layer is well positioned to use such existing software directly. This document does not repeat the internals of tracking management software and instead focuses on its use with object storage so that newer workloads can use a storage platform rather than their disks and local storage.
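For instance, a lease could be opened as a Jira issue through its REST API. This is only a sketch; the instance URL, credentials and project key "STORE" are placeholders:

import requests

JIRA_URL = "https://jira.example.com"        # hypothetical Jira instance
AUTH = ("svc-account", "api-token")          # placeholder credentials

def open_lease_ticket(worker_id: str, prefix: str) -> str:
    """Create a Jira issue representing a storage lease for a worker."""
    payload = {"fields": {
        "project": {"key": "STORE"},
        "summary": f"Storage lease for worker {worker_id}",
        "description": f"Leased object-storage prefix: {prefix}",
        "issuetype": {"name": "Task"}}}
    resp = requests.post(f"{JIRA_URL}/rest/api/2/issue",
                         json=payload, auth=AUTH)
    resp.raise_for_status()
    return resp.json()["key"]                # e.g. "STORE-123"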
The ticketing and leasing system proposed in this document need not serve just applications and clusters. It can also serve other local platform providers that want to help ease the migration of workloads from local and unmanaged storage.

Sunday, October 28, 2018

Object storage is perceived as backup and tertiary storage. However, in the next few posts we argue that object storage can serve as persistent thread-local storage for all workers in any system, so that there is never any loss of state when a worker disappears. While files have traditionally been the storage of choice for workers, we argue that this is just a matter of access, and that file protocols or http access serve just the same against file-system-enabled object storage. Moreover, not all data needs to be written deep into the object storage at once. With the help of a cache layer discussed earlier, we can even give workers higher performance than they would get working directly with remote storage. The requirements for the object storage need not even change while the reads and writes from the workers are handled.
Many applications maintain concurrent activities. Large-scale query processing on big data requires intermediate storage for its computations. Similarly, cluster computing requires storage for the processing on its nodes. Traditionally, these nodes have been sharing volumes or maintaining local file storage. However, most disk operations are already written off as expensive as computing moves more into memory. A large number of workers are commodity workers; they do not demand performance that object storage cannot step in and provide. Moreover, each worker or node gets its own set of objects in the object storage, which can be considered as shared-nothing as its own disks would be. All they need is for their storage to be in the cloud and managed, so that they are never limited by their disks.
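A minimal sketch of a worker persisting its state to its own prefix follows, using boto3 against S3-compatible storage; the bucket name and key layout are assumptions:

import json
import boto3

s3 = boto3.client("s3")
BUCKET = "worker-state"                      # hypothetical bucket

def save_state(worker_id: str, state: dict) -> None:
    # Each worker writes under its own prefix, keeping the arrangement as
    # shared-nothing as per-worker disks would be.
    s3.put_object(Bucket=BUCKET,
                  Key=f"workers/{worker_id}/state.json",
                  Body=json.dumps(state).encode("utf-8"))

def load_state(worker_id: str) -> dict:
    obj = s3.get_object(Bucket=BUCKET,
                        Key=f"workers/{worker_id}/state.json")
    return json.loads(obj["Body"].read())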
That is how this Ticket Layer positions itself. It offers tracking and leasing of universal storage that does away with disks for nodes in a cluster and workers in an application. Several forms of data access can be supported in addition to file-system protocols and S3 access in an object storage. The ticketing system is a planning and tracking software management system that generates tickets and leases for remote storage on a workload-by-workload basis, so that the cloud's elasticity can be brought to individual workers even within the same node or application.

Saturday, October 27, 2018

We were discussing query over object storage. We now discuss query rewrites. A query describing the selection of entries with the help of predicates does not necessarily have to be bound to structured or unstructured query languages. Yet the convenience and universal appeal of one language may dominate another. Therefore, whether the query language is agnostic or predominantly biased, it can be modified or rewritten to suit the needs of the storage stacks described earlier. This implies that not only the data types but also the query commands may be translated to suit the delegations to the storage stacks. If the commands can be honored, they will return results, and since the resolver in the virtualization layer may check all of the registered storage stacks, we will eventually get the result if the data matches.
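As an illustration, a single storage-agnostic predicate might be rewritten per stack as below; the stack names and output dialects are assumptions for the sketch:

def rewrite(predicate: dict, stack: str) -> str:
    """Rewrite a predicate of the form {field, op, value} into the dialect
    a particular storage stack expects."""
    f, op, v = predicate["field"], predicate["op"], predicate["value"]
    if stack == "relational":
        return f"SELECT * FROM events WHERE {f} {op} '{v}'"
    if stack == "s3select":
        # Amazon S3 Select accepts a restricted SQL over a single object.
        return f"SELECT * FROM S3Object s WHERE s.{f} {op} '{v}'"
    if stack == "text":
        # Unstructured stack: fall back to a scan for the value.
        return f"grep '{v}' events.log"
    raise ValueError(f"unregistered stack: {stack}")

# The resolver may try each registered stack until the data matches:
# rewrite({"field": "level", "op": "=", "value": "ERROR"}, "s3select")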
Delegation does not have to be the only criterion for the virtualization layer. Both the administrator and the system may maintain rules and configurations with which to locate the store for the data. More importantly, the rules can be both static and dynamic. The former refers to rules that are declared ahead of the launch of the service, which the service merely loads in. The latter refers to evaluations that dynamically assign queries to stores based on classifiers and connection attributes. There is no limit to the attributes with which queries are assigned to stores, and the evaluation is done with logic in a module that can be loaded and executed.
Query assignment logic may be directed towards the store that has the relevant data, but this is not its only purpose. Since there can be many stores with the same data for availability, the dynamic assignment can also function as a load balancer. In some cases, we can go beyond load balancing to resource pools that have been earmarked with different levels of resources, and queries may be assigned to these pools. Usually resource pools refer to compute, and certain queries may require more resources for computation than others even when the data is available. Moreover, it is not just about resource-intensive queries; it is also about priority, as some queries may have higher priority than others. Therefore, the logic for dynamic assignment of queries can serve more than one purpose, as the sketch below shows.
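A sketch of such multi-purpose assignment logic, with static rules, replica load balancing and priority-based resource pools; all the names and thresholds are illustrative:

import itertools

class QueryAssigner:
    def __init__(self, static_rules: dict, replicas: list, pools: dict):
        self.static_rules = static_rules            # declared before launch
        self.replicas = itertools.cycle(replicas)   # same data, many stores
        self.pools = pools                          # e.g. {"high": 16, "low": 2}

    def assign(self, query: dict) -> tuple:
        # 1. Static rules are loaded in ahead of the service launch.
        store = self.static_rules.get(query.get("dataset"))
        if store is None:
            # 2. Dynamic assignment doubles as a load balancer across
            #    replicas that hold the same data.
            store = next(self.replicas)
        # 3. Priority selects the earmarked resource pool for compute.
        pool = "high" if query.get("priority", 0) > 5 else "low"
        return store, pool

assigner = QueryAssigner({"billing": "warehouse"},
                         ["store-a", "store-b"], {"high": 16, "low": 2})
print(assigner.assign({"dataset": "clicks", "priority": 8}))  # replica, "high"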

Friday, October 26, 2018

The notion that there can be hybrid storage for data virtualization does not jibe with the positioning of object storage as a limitless single storage. However, customers choose on-premise or cloud object storage for different reasons. There is no question that some data is better suited to be in the cloud while other data can remain on-premise. Even if the data were within the same object storage but in different zones, it would have different addresses, and the resolver would need to resolve to one of them. Therefore, whether the resolution is between zones, between storages or between storage types, the resolver has to deliberate over the choices.
The nature of the query language determines the kind of resolving that the data virtualization needs to do. In addition, the type of storage that the virtualization layer spans also depends on the query language. LogParser was able to enumerate unstructured data, but it is definitely not mainstream for unstructured data. Commands and scripts have been used for unstructured data in most places, and they have worked well. Solutions like SaltStack enabled the same commands and scripts to run over files in different storages even if they were on different operating systems. SaltStack used ZeroMQ for messaging, and the commands would be executed by agents on those computers. The master in SaltStack resolved the minions on which the commands and scripts would run. This is no different from a data virtualization layer that needs to resolve types to different storages.
In order to explain the difference between data virtualization over structured and unstructured storage types, we look at metadata in structured storage. All data types used are registered. Whether they are system built-in types or user-defined types, the catalog helps with the resolution. The notion that json documents do not have a rigid type and that the document model can be extensible is not lost on the resolver. It merely asks for an inverted list of documents corresponding to a template that can be looked up. Furthermore, the logic to resolve json fields to documents does not necessarily have to live within the resolver. The inverted list and the lookup can be provided by the index layer while the resolver merely delegates to it, as sketched below. Therefore, the resolver can span different storage types as long as the resolution is delegated to those storage types. The query language for structured and unstructured data may or may not be unified under a universal query language, but that is beside the consideration that we can treat different storage types as different stacks and yet have the data virtualization over them.
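A minimal sketch of the delegated lookup, with the index layer owning the inverted list keyed by field-name templates; the structure is illustrative:

from collections import defaultdict

class JsonIndexLayer:
    """Owns the inverted list: template (set of field names) -> doc ids.
    The resolver merely delegates lookups here."""
    def __init__(self):
        self.inverted = defaultdict(set)

    def add(self, doc_id: str, doc: dict) -> None:
        # Extensible documents simply register richer templates.
        self.inverted[frozenset(doc)].add(doc_id)

    def lookup(self, template: list) -> set:
        wanted = frozenset(template)
        matches = set()
        for fields, ids in self.inverted.items():
            if wanted <= fields:        # document covers the template
                matches |= ids
        return matches

index = JsonIndexLayer()
index.add("d1", {"name": "a", "size": 1})
index.add("d2", {"name": "b"})
print(index.lookup(["name", "size"]))   # {'d1'}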
Query rewrites have not been covered in the topic above. A query describing the selection of entries with the help of predicates does not necessarily have to be bound to structured or unstructured query languages. Yet the convenience and universal appeal of one language may dominate another. Therefore, whether the query language is agnostic or predominantly biased, it can be modified or rewritten to suit the needs of the storage stacks described earlier. This implies that not only the data types but also the query commands may be translated to suit the delegations to the storage stacks. If the commands can be honored, they will return results, and since the resolver in the virtualization layer may check all of the registered storage stacks, we will eventually get the result if the data matches.

Thursday, October 25, 2018

Many systems use the technique of reflection to find out whether a module has a particular data type. An object storage does not have any such mechanism. However, there is nothing preventing the data virtualization layer from maintaining a registry in the object storage, or the data-type-defining modules themselves, so that those modules can be loaded and introspected to determine whether such a data type exists. This technique is common in many run-times. Therefore, the intelligence added to the data virtualization layer can draw inspiration from these well-known examples.
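A sketch of this registry-plus-introspection approach in Python; the registry contents and module name are hypothetical, and the registry itself could be stored as an object in the object storage:

import importlib

# data type name -> module that defines it (entries are illustrative)
TYPE_REGISTRY = {"Point": "geometry_types"}

def has_data_type(type_name: str) -> bool:
    """Emulate reflection: load the registered module and introspect it
    for the requested data type."""
    module_name = TYPE_REGISTRY.get(type_name)
    if module_name is None:
        return False
    module = importlib.import_module(module_name)
    return hasattr(module, type_name)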
Most implementations of a layer come in the form of a micro-service. This helps modular design as well as testability. Moreover, the microservice provides http access for other services, which has proven helpful from the point of view of production support. The services that depend on the data virtualization layer can also go directly to the object storage via http. Therefore, the data virtualization layer merely acts like a proxy. The proxy can be transparent, thin or fat. It can even play the man in the middle, and therefore it requires very little change from the caller.
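A thin transparent proxy can be sketched with nothing but the standard library; the upstream endpoint is a placeholder, and a fuller resolver would rewrite the path against the registered storage stacks:

from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

OBJECT_STORE = "http://objectstore.local:9000"   # hypothetical endpoint

class VirtualizationProxy(BaseHTTPRequestHandler):
    def do_GET(self):
        # Forward the request unchanged; callers need little or no change,
        # and they can bypass the proxy to reach the store directly.
        with urlopen(OBJECT_STORE + self.path) as upstream:
            body = upstream.read()
        self.send_response(200)
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), VirtualizationProxy).serve_forever()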
The implementation of a service for data virtualization is not uncommon, both on-premise and in the cloud. In general, most applications benefit from being modular. Service-oriented architecture and micro-services play important roles in making applications modular. This is in fact an embrace of cloud technologies, as the modules are hosted on their own virtual machines and containers. With the move towards the cloud, we get the added benefits of elasticity and billing on a pay-per-use model that we otherwise do not have. The benefits of managed instances, hosts and containers are never lost on application modularity. Therefore, data virtualization can also be implemented as a service that spans both on-premise and cloud storage.
The data virtualization service is also opt-in, because the consumers can go directly to the storage. The callers will either use the proxy or they won't, so long as they take care of resolving their queries with fully qualified namespaces and data types. It is merely a convenience, and very much suited to encapsulation in a module. The testability of the data virtualization also improves significantly.
Unlike other services, data virtualization depends exclusively on the nature of data retrieval from the query layer. While data transformation services such as Extract-Transform-Load prepare data in forms that are more suitable for consumer services, there is no read-write operation here. The purpose of fully resolving data types so that queries can be executed depends exclusively on the queries. This is why the language of the queries is important. If it were something standard like the SQL query language, then it becomes helpful to bridge that language over unstructured data. LogParser came very close to that by viewing the data as an enumerable, but the reason the language for LogParser became highly restrictive was that it needed to be LogParser-friendly. If the types mentioned in LogParser queries did not match the data from the Component Object Model (COM) interfaces, LogParser would not be able to understand the query. If there were a data virtualization layer within LogParser that mashed up the data type with the data set by using one or more interfaces, it would have expanded the types of queries that could be written with LogParser. Here, we utilize the same concept for data virtualization over object storage.
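To make the mash-up idea concrete, here is a sketch that binds a data type to an unstructured log so that a SQL-like layer could enumerate typed records; the three-column line format is an assumption about the log:

from dataclasses import dataclass
from typing import Iterator

@dataclass
class LogRecord:              # the data type mashed up with the data set
    timestamp: str
    level: str
    message: str

def as_enumerable(path: str) -> Iterator[LogRecord]:
    """Present unstructured rows as a typed enumerable, the way a data
    virtualization layer inside LogParser might have."""
    with open(path) as f:
        for line in f:
            parts = line.rstrip("\n").split(" ", 2)
            if len(parts) == 3:
                yield LogRecord(*parts)

# e.g. [r.message for r in as_enumerable("app.log") if r.level == "ERROR"]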