Friday, October 26, 2018

The notion that there can be hybrid storage for data virtualization does not jibe with the positioning of object storage as a limitless single store. However, customers choose on-premise or cloud object storage for different reasons. There is no question that some data is better suited to the cloud while other data can remain on-premise. Even if the data were within the same object storage but in different zones, the copies would have different addresses and the resolver would need to resolve to one of them. Therefore, whether the resolution is between zones, between storages or between storage types, the resolver has to deliberate over the choices.
The nature of the query language determines the kind of resolving that the data virtualization needs to do. In addition, the type of storage that the virtualization layer spans also depends on the query language. LogParser was able to enumerate unstructured data, but it is definitely not mainstream for unstructured data. Commands and scripts have been used for unstructured data in most places, and they have worked well. Solutions like SaltStack enabled the same commands and scripts to run over files in different storages even if they were on different operating systems. SaltStack used ZeroMQ for messaging, and the commands would be executed by agents on those computers. The master in SaltStack resolved the minions on which the commands and scripts would be run. This is no different from a data virtualization layer that needs to resolve the types to different storages.
In order to explain the difference between data virtualization over structured and unstructured storage types, we look at metadata in structured storage. All data types used are registered. Whether they are system builtin types or user defined types, the catalog helps with the resolution. The notion that json documents don't have a rigid type and that the document model can be extensible is not lost on the resolver. It merely asks for an inverted list of documents corresponding to a template that can be looked up. Furthermore, the logic to resolve json fields to documents does not necessarily have to be within the resolver. The inverted list and the lookup can be provided by the index layer while the resolver merely delegates to it. Therefore, the resolver can span different storage types as long as the resolution is delegated to the storage types. The query language for structured and unstructured data may or may not be unified with a universal query language, but that is beside the consideration that we can treat different storage types as different stacks and yet have the data virtualization over them.
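As a minimal sketch of this delegation, assuming a hypothetical in-memory index layer, the resolver need not inspect documents at all; it simply hands a json field template to the index and returns the inverted list it gets back:

```python
# Hypothetical sketch: the resolver delegates json template lookups to an
# index layer that returns an inverted list of matching documents.
index = {
    ("email", "name"): ["doc1", "doc3"],   # documents matching this template
    ("name", "phone"): ["doc2"],
}

def resolve_template(fields):
    # The resolver does not parse documents; it merely delegates the
    # lookup to the index layer and passes the inverted list along.
    return index.get(tuple(sorted(fields)), [])

print(resolve_template(["name", "email"]))  # ['doc1', 'doc3']
```

The index contents and template keys here are illustrative; a real index layer would be populated by the storage stack itself.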
Query rewrites have not been covered in the topic above. A query describing the selection of entries with the help of predicates does not necessarily have to be bound to structured or unstructured query languages. Yet the convenience and universal appeal of one language may dominate another. Therefore, whether the query language is agnostic or predominantly biased, it can be modified or rewritten to suit the needs of the storage stacks described earlier. This implies that not only the data types but also the query commands may be translated to suit the delegations to the storage stacks. If the commands can be honored, then they will return results, and since the resolver in the virtualization layer may check all of the registered storage stacks, we will eventually get the result if the data matches.
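A toy illustration of such a rewrite, with entirely made-up stack names and target syntax: one predicate becomes SQL for the structured stack and a grep-style filter for the unstructured one.

```python
# Hypothetical sketch: one predicate rewritten per storage stack, so the
# same selection runs as SQL against structured storage and as a
# grep-style filter against unstructured storage.
def rewrite(field, value, stack):
    if stack == "structured":
        return f"SELECT * FROM entries WHERE {field} = '{value}'"
    if stack == "unstructured":
        return f"grep '\"{field}\": \"{value}\"'"
    raise ValueError(f"unknown stack: {stack}")

print(rewrite("level", "error", "structured"))
print(rewrite("level", "error", "unstructured"))
```

A real rewriter would of course parse the query rather than format strings, but the delegation shape is the same.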

Thursday, October 25, 2018

Many systems use the technique of reflection to find out whether a module has a particular data type. An object storage does not have any such mechanism. However, there is nothing preventing the data virtualization layer from maintaining a registry in the object storage, or even the data type defining modules themselves, so that those modules can be loaded and introspected to determine if there is such a data type. This technique is common in many runtimes. Therefore, the intelligence to add to the data virtualization layer can draw inspiration from these well-known examples.
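A small sketch of the registry-plus-introspection idea, using Python's own runtime reflection as the stand-in (the registry contents are illustrative; in practice the registry itself would live in the object storage):

```python
# Hypothetical sketch: a registry maps data type names to their defining
# modules; the module is loaded and introspected, much like reflection.
import importlib

registry = {"Decimal": "decimal", "OrderedDict": "collections"}

def has_data_type(type_name):
    module_name = registry.get(type_name)
    if module_name is None:
        return False
    module = importlib.import_module(module_name)  # load the defining module
    return hasattr(module, type_name)              # introspect for the type

print(has_data_type("Decimal"))   # True
print(has_data_type("Unknown"))   # False
```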
Most implementations of such a layer come in the form of a microservice. This helps modular design as well as testability. Moreover, the microservice provides http access for other services. This has proven to be helpful from the point of view of production support. The services that depend on the data virtualization being available can also go directly to the object storage via http. Therefore, the data virtualization layer merely acts like a proxy. The proxy can be transparent, thin or fat. It can even play the man in the middle, and therefore it requires very little change from the caller.
The implementation of a service for data virtualization is not uncommon, both on-premise and in the cloud. In general, most applications benefit from being modular. Service-oriented architecture and microservices play important roles in making applications modular. This is in fact an embrace of cloud technologies, as the modules are hosted on their own virtual machines and containers. With the move towards the cloud, we get the added benefits of elasticity and billing on a pay-per-use model that we otherwise would not have. The benefits of managed instances, hosts and containers are never lost on application modularity. Therefore, data virtualization can also be implemented as a service that spans both on-premise and cloud storage.
The data virtualization service is also an opt-in, because the consumers can go directly to the storage. For example, the callers will either use the proxy or they won't, so long as they take care of resolving their queries with fully qualified namespaces and data types. It is merely a convenience and very much suited to encapsulation in a module. The testability of the data virtualization also improves significantly.
Unlike other services, data virtualization depends exclusively on the nature of data retrieval from the query layer. While data transformation services such as Extract-Transform-Load prepare data in forms that are more suitable for consumer services, there is no read-write operation here. The purpose of fully resolving data types so that queries can be executed depends exclusively on the queries. This is why the language of the queries is important. If it were something standard like the SQL query language, then it becomes helpful to bridge that language over unstructured data. LogParser came very close to that by viewing the data as an enumerable, but the language for LogParser became highly restrictive because it needed to be LogParser friendly. If the types mentioned in LogParser queries did not match the data from the Component Object Model (COM) interfaces, LogParser would not be able to understand the query. If there were a data virtualization layer within LogParser that mashed up the data type to the data set by using one or more interfaces, it would have expanded the kinds of queries that could be written with LogParser. Here, we utilize the same concept for data virtualization over object storage.

Wednesday, October 24, 2018

We continue discussing data virtualization over object storage. Data virtualization does not need to be a thin layer acting as a mashup of different object storages. Even if it were just a gateway, it could allow customizations from users or administrators for workloads. However, it doesn't stop there, as described in this write-up: https://1drv.ms/w/s!Ashlm-Nw-wnWt0XqQrQQTQBtuKYh. This layer can be intelligent enough to map data types to storage. Typically, queries specify the data location either as part of the query with fully resolved data types or as part of the connection in which the queries are made. In a stateless request-response mode, this can be part of the request. The layer resolving all of the request can determine the storage location. Therefore, the data virtualization can be an intelligent router.
Notice that both the user and the system can determine the policies for the data virtualization. If the data type exists in only one location, then the location is known. If the same data type is available in different locations, then the determination can be made by the user with the help of rules and configurations. If the user has not specified a location, then the system can choose to determine the location for the data type. The rules are not necessarily maintained by the system in the latter case. It can simply choose the first available store that has the data type.
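The resolution order just described can be sketched in a few lines, with made-up store and type names: user rules take precedence, and the system falls back to the first available store that has the data type.

```python
# Hypothetical sketch: user rules first, then the system default of the
# first available store holding the data type. Names are illustrative.
user_rules = {"invoice": "cloud-bucket"}              # user configuration
available = {"on-prem-bucket": {"log", "invoice"},
             "cloud-bucket": {"invoice", "image"}}

def locate(data_type):
    if data_type in user_rules:                       # user-determined policy
        return user_rules[data_type]
    for store, types in available.items():            # system fallback
        if data_type in types:
            return store                              # first available store
    return None

print(locate("invoice"))  # cloud-bucket (user rule wins)
print(locate("log"))      # on-prem-bucket (system fallback)
```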
Many systems use the technique of reflection to find out whether a module has a particular data type. An object storage does not have any such mechanism. However, there is nothing preventing the data virtualization layer from maintaining a registry in the object storage, or even the data type defining modules themselves, so that those modules can be loaded and introspected to determine if there is such a data type. This technique is common in many runtimes. Therefore, the intelligence to add to the data virtualization layer is pretty much known.

Tuesday, October 23, 2018

Let us now discuss data virtualization for queries over object storage. The destination of a query is usually a single data source, but query execution, like any other application, retrieves data without requiring technical details about the data such as where it is located. Location usually depends on the organization that has a business justification for growing and maintaining the data. Not everybody likes to dip right into the data lake at the beginning of a venture, as most organizations do. They have to grapple with their technical needs and changing business priorities before the data and its management solution can be called stable.
Object storage, unlike databases, allows for incredible, almost limitless storage, with sufficient replication groups to cater to organizations that need their own copy. In addition, the namespace-bucket-object hierarchy allows the level of separation that organizations need.
The role of object storage in data virtualization is only at the physical storage level, where we determine which site/zone to get the data from. A logical data virtualization layer that knows which namespace or bucket to go to within an object storage does not come straight out of the object storage. A gateway would serve that purpose. The queries can then choose to run against a virtualized view of the data by querying the gateway, which in turn would fetch the data from the corresponding location.
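A bare-bones sketch of such a gateway, with hypothetical view and bucket names: it resolves a logical view to a (namespace, bucket) pair and fetches on the caller's behalf.

```python
# Hypothetical sketch of the gateway: a logical view name resolves to a
# (namespace, bucket) pair, and the fetch is delegated to object storage.
views = {"sales": ("acme", "sales-2018"), "logs": ("acme", "app-logs")}

def query_view(view, key, fetch):
    namespace, bucket = views[view]       # logical-to-physical resolution
    return fetch(namespace, bucket, key)  # delegate to the object storage

# A stand-in fetch; a real one would issue an http GET to the storage.
def fake_fetch(namespace, bucket, key):
    return f"{namespace}/{bucket}/{key}"

print(query_view("sales", "q3.csv", fake_fetch))  # acme/sales-2018/q3.csv
```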
There are many levels of abstraction. First, the destination data source may be within one object storage. Second, the destination data source may be from different object storages. Third, the destination may be from different storages altogether, such as an object storage and a cluster file system. In all these cases, the virtualization logic resides external to the storage and can be written in different forms. Commercial products of this kind, such as Denodo, are proof that this is a requirement for businesses. Finally, the logic within the virtualization can be customized so that queries can make the most of it. We refer to this article to describe the usages of data virtualization, while here we discuss the role of object storage.

Monday, October 22, 2018

We were discussing search with object storage. The language for the query has traditionally been SQL. Tools like LogParser allow SQL queries to be executed over enumerables. SQL has supported user defined operators for a while now. These user defined operators help with additional computations that are not present as builtins. In the case of relational data, these have generally been user defined functions or user defined aggregates. With the enumerable data set, the SQL is somewhat limited for LogParser. Any implementation of a query execution layer over the object storage could choose to allow or disallow user defined operators. These enable computation on, say, user defined data types that are not restricted by the system defined types. Such types have been useful with, say, spatial coordinates or geographical data for easier abstraction and simpler expression of computational logic. For example, vector addition can be done with user defined data types and user defined operators.
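The vector addition example can be sketched directly: a user defined data type carrying a user defined operator, of the kind a query layer could register beyond its system builtin types.

```python
# Hypothetical sketch: a user defined data type with a user defined
# operator (vector addition), beyond the system builtin types.
class Vector:
    def __init__(self, x, y):
        self.x, self.y = x, y

    def __add__(self, other):             # the user defined operator
        return Vector(self.x + other.x, self.y + other.y)

    def __repr__(self):
        return f"Vector({self.x}, {self.y})"

print(Vector(1, 2) + Vector(3, 4))  # Vector(4, 6)
```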
Aggregation operations have had the benefit that they can support both batch and streaming mode operations. These operations can therefore work on large datasets because they view only a portion at a time. Furthermore, the batch operation can be parallelized over a number of processors. This has generally been the case with Big Data and cluster mode operations. Until recently, streaming mode operations were not so common with big data. However, streaming conversions of summation form processing over batch data are now facilitated directly out of the box by streaming algorithm packages from some public cloud providers. This means that both batch and stream processing can operate on unstructured data. Some other forms of processing are included here.
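The summation form conversion can be illustrated in a few lines: the same aggregation run once over the whole dataset in batch mode, and incrementally one item at a time in streaming mode, arriving at the same total.

```python
# Hypothetical sketch: the same summation form aggregation in batch mode
# (whole dataset) and streaming mode (one portion at a time).
def batch_sum(values):
    return sum(values)                    # sees the whole dataset at once

def streaming_sum(stream):
    total = 0
    for value in stream:                  # sees one item at a time
        total += value
        yield total                       # running aggregate after each item

data = [3, 1, 4, 1, 5]
print(batch_sum(data))                 # 14
print(list(streaming_sum(data))[-1])   # 14, the same result incrementally
```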
The language may support search expressions as well as user defined operators. Structured query language works for structured data. Unstructured documents are best served by search operators and expressions. Examples include the search expressions and piped operations used with logs. With the broader umbrella inclusion of statistical and machine learning packages, a universal query language is now trying to broaden the breadth of existing query languages and standardize them.

Sunday, October 21, 2018



There is also another benefit to the full-text search. We are not restricted in importing the results into any form of storage. Object storage can serve as the source for all databases, including graph databases. There is generally a lot of preparation when data is exported from relational tables and imported into graph databases, even though theoretically all the relations in the relational tables are merely edges to the nodes representing the entities. Graph databases are called natural databases because the relationships can be enumerated and persisted as edges, but it is this enumeration that takes some iterations. Data extract, transform and load operations have rigorous packages in the relational world, largely relying on consistency checks, but they are not the same in the graph database. Therefore, each operation requires validation, more so when an organization is importing the data into a graph database without precedent. The indexer documents overcome the import because the data does not need to be collected. The inverted lists of documents are easy to compare for intersection and left and right differences, and they add to edge weights directly when the terms are treated as nodes. The ease with which data can be viewed as nodes and edges makes the import easier. In this way, the object storage for the indexer provides convenience to destinations such as a graph database, where the inverted list of documents may be used in graph algorithms.
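The set comparisons mentioned above can be sketched with illustrative inverted lists: the intersection of two terms' document lists gives an edge weight directly, and the left difference falls out of the same representation.

```python
# Hypothetical sketch: terms as nodes, with the intersection of their
# inverted document lists giving the edge weight directly.
inverted = {"error": {"d1", "d2", "d4"}, "storage": {"d2", "d3", "d4"}}

def edge_weight(a, b):
    return len(inverted[a] & inverted[b])   # shared documents

def left_difference(a, b):
    return inverted[a] - inverted[b]        # documents only in a

print(edge_weight("error", "storage"))              # 2
print(sorted(left_difference("error", "storage")))  # ['d1']
```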
Full-text search is not the only stack. Another search stack can be added to this object storage. For example, an iterator-based .NET style standard query operator may also be provided over this object storage. Even query tools like LogParser, which opens up a COM interface to the objects in the storage, can be used. Finally, a comprehensive and dedicated query engine that studies, caches and replays queries is possible, since the object storage does not restrict the search layer.
There are a few techniques that can improve query execution. The degree of parallelism helps the query execute faster by partitioning the data and invoking multiple threads. While increasing these parallel activities, we should be careful not to have too many, otherwise the system can get into a thread-thrashing mode. The rule of thumb for increasing the degree of parallelism is that the number of threads is one more than the number of processors, and this refers to operating system threads. There are no such limits for lightweight workers that do not have contention.
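A small sketch of that rule of thumb: partition the records, cap the worker count at one more than the processor count, and combine the partial results.

```python
# Hypothetical sketch: partition the data and cap the degree of
# parallelism at one more than the number of processors.
import os
from concurrent.futures import ThreadPoolExecutor

def parallel_count(records, predicate):
    dop = (os.cpu_count() or 1) + 1               # processors + 1
    size = max(1, len(records) // dop)            # partition the data
    chunks = [records[i:i + size] for i in range(0, len(records), size)]
    with ThreadPoolExecutor(max_workers=dop) as pool:
        partials = pool.map(lambda c: sum(predicate(r) for r in c), chunks)
    return sum(partials)                          # combine partial results

print(parallel_count(list(range(100)), lambda r: r % 2 == 0))  # 50
```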
Caching is another benefit to query execution. If a query repeats itself over and over again, we need not perform the same calculations to determine the least cost of serving the query. We can cache the plan and the costs, which we can reuse.
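The plan cache can be sketched as a simple lookup keyed by the query text, with a stand-in planner (the plan and cost values here are purely illustrative):

```python
# Hypothetical sketch: cache the chosen plan and its cost per query so a
# repeating query skips the plan-selection work.
plan_cache = {}

def get_plan(query, plan_for):
    if query not in plan_cache:
        plan_cache[query] = plan_for(query)   # compute once, reuse after
    return plan_cache[query]

calls = []
def costly_planner(query):
    calls.append(query)                       # track how often we plan
    return ("full-scan", len(query))          # stand-in plan and cost

get_plan("SELECT * FROM logs", costly_planner)
get_plan("SELECT * FROM logs", costly_planner)
print(len(calls))  # 1, the second call was served from the cache
```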


Saturday, October 20, 2018

We were discussing full-text search with object storage. Lucene indexes are inverted indexes. An inverted index lists the documents that contain a term. It stores statistics about terms in order to make term-based search more efficient. While Lucene itself is available in various programming languages, there is no restriction on taking the inverted index from Lucene and using it in any way that is appropriate.
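The core structure is easy to sketch: for each term, the set of documents containing it, from which a term-based statistic such as document frequency falls out directly. The documents here are made up.

```python
# Hypothetical sketch of an inverted index: term -> documents containing
# it, with document frequency as a term statistic.
from collections import defaultdict

docs = {"d1": "error in storage layer",
        "d2": "storage gateway error",
        "d3": "query layer"}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

print(sorted(index["error"]))   # ['d1', 'd2']
print(len(index["storage"]))    # document frequency of 'storage': 2
```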
The inverted indexes over object storage may not be as performant as query execution over relational tables in a SQL database, but they fill the space for enabling search over the storage. Spotlight on macOS and Google's PageRank algorithm on internet documents also use tokens and lookups. Moreover, by recognizing the generic organization of inverted indexes, we can apply any grouping, ranking and sorting algorithm we like. Those algorithms are then independent of the organization of the index in the object storage, and each one can use the same index.
For example, the PageRank algorithm can be selectively used to filter the results. The nodes are the terms, and the edges are the inverted lists of documents that contain two nodes. Since we already calculate the marginals, both for the nodes and for the edges, we already have a graph to calculate the PageRank on. PageRank can be found as a sum of two components. The first component is represented in the form of a damping factor. The second component is in the summation form of the page ranks of the adjacent vertices, each weighted by the inverse of the out-degree of that vertex. This is said to correspond to the principal eigenvector of the normalized inverted document list matrix.
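The two-component iteration described above can be sketched directly, on a small made-up term graph: a damping term plus the sum of the adjacent vertices' ranks, each weighted by the inverse of its out-degree.

```python
# Hypothetical sketch of the PageRank iteration: a damping term plus the
# out-degree-weighted sum over adjacent vertices, on a toy term graph.
def pagerank(adjacency, damping=0.85, iterations=50):
    n = len(adjacency)
    rank = {v: 1.0 / n for v in adjacency}
    for _ in range(iterations):
        new_rank = {}
        for v in adjacency:
            # Sum over vertices u linking to v, weighted by 1/out-degree(u).
            incoming = sum(rank[u] / len(adjacency[u])
                           for u in adjacency if v in adjacency[u])
            new_rank[v] = (1 - damping) / n + damping * incoming
        rank = new_rank
    return rank

# Terms as nodes; edges where the inverted lists share documents.
graph = {"error": {"storage"}, "storage": {"error", "query"}, "query": {"error"}}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))  # 'error' gets the highest rank here
```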
Full-text search facilitates text mining just the same way a corpus does. While documents are viewed as a bag of words, the indexer represents a collection of already selected keywords for each indexed document. Both are input to the text mining algorithms. The neural nets will calculate the mutual information between terms regardless of the source and classify them with a softmax classifier. This implies that the indexer document can allow user input to be added or collected as fields in the index document, which can then be treated the same way as the corpus documents.