Wednesday, October 24, 2018

We continue discussing data virtualization over object storage. Data virtualization does not need to be a thin layer that merely mashes up different object stores. Even as just a gateway, it could allow customizations from users or administrators for their workloads. However, it doesn't stop there, as described in this write-up: https://1drv.ms/w/s!Ashlm-Nw-wnWt0XqQrQQTQBtuKYh. This layer can be intelligent enough to map data types to storage. Typically, queries specify the data location either as part of the query with fully resolved data types or as part of the connection over which the queries are made. In a stateless request-response mode, this can be part of the request. The layer that resolves the request can then determine the storage location. Therefore, the data virtualization layer can act as an intelligent router.
Notice that both the user and the system can determine the policies for data virtualization. If there is only one location for a data type, then the location is known. If the same data type is available in different locations, then the determination can be made by the user with the help of rules and configurations. If the user has not specified a location, then the system can choose one for the data type. In the latter case, the system does not necessarily need to maintain rules; it can simply choose the first available store that has the data type.
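As a minimal sketch of this resolution logic, the following assumes hypothetical store names, data types and a user rule table; it is one way the precedence described above could look, not a definitive implementation:

```python
def resolve_location(data_type, user_rules, stores):
    """Resolve the storage location for a data type.

    user_rules: mapping of data type -> preferred store name.
    stores: ordered mapping of store name -> set of data types it holds.
    """
    # User-specified rules take precedence.
    if data_type in user_rules:
        return user_rules[data_type]
    # Otherwise the system simply picks the first available store with the type.
    for name, types in stores.items():
        if data_type in types:
            return name
    raise LookupError("no store holds data type %r" % data_type)

stores = {"east": {"logs", "images"}, "west": {"logs", "metrics"}}
rules = {"logs": "west"}
resolve_location("logs", rules, stores)   # user rule wins -> "west"
resolve_location("metrics", {}, stores)   # first available store -> "west"
```

The fallback relies only on store enumeration order, which mirrors the point that no rules need to be maintained by the system in that case.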
Many systems use the technique of reflection to find out whether a module has a particular data type. Object storage does not have any such mechanism. However, there is nothing preventing the data virtualization layer from maintaining, in the object storage, a registry or the data-type-defining modules themselves, so that those modules can be loaded and introspected to determine whether such a data type exists. This technique is common in many run-times. Therefore, the intelligence to add to the data virtualization layer is pretty much known.
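A small sketch of this registry-plus-reflection idea, using Python's module loading as the run-time analogy; the registry contents here are hypothetical and would in practice be persisted in the object storage itself:

```python
import importlib

# Hypothetical registry, which could itself be persisted in the object storage:
# it maps a data type name to the module said to define it.
REGISTRY = {"Decimal": "decimal", "OrderedDict": "collections"}

def has_data_type(type_name):
    """Load the registered module and introspect it for the data type."""
    module_name = REGISTRY.get(type_name)
    if module_name is None:
        return False
    module = importlib.import_module(module_name)
    # Reflection over the loaded module stands in for run-time introspection.
    return hasattr(module, type_name)

has_data_type("Decimal")   # -> True, the decimal module defines Decimal
has_data_type("Matrix")    # -> False, nothing registered for this type
```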
Most implementations of such a layer take the form of a micro-service. This helps modular design as well as testability. Moreover, the microservice provides HTTP access for other services, which has proven helpful from the point of view of production support. The services that depend on the data virtualization layer being up can also go directly to the object storage via HTTP. Therefore, the data virtualization layer merely acts like a proxy. The proxy can be transparent, thin or fat. It can even play the man-in-the-middle, and therefore it requires very little change from the caller.
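To illustrate the proxy role, here is a toy sketch with in-memory stand-ins for the HTTP-reachable object stores; the class names and the pass-through behavior are assumptions for illustration only:

```python
class ObjectStore:
    """Stand-in for an object storage endpoint reachable over HTTP."""
    def __init__(self, objects):
        self.objects = objects

    def get(self, key):
        return self.objects.get(key)

class VirtualizationProxy:
    """Transparent proxy: forwards the request unchanged to whichever
    store can answer it, so the caller needs no change."""
    def __init__(self, stores):
        self.stores = stores

    def get(self, key):
        for store in self.stores:
            value = store.get(key)
            if value is not None:
                return value
        return None

proxy = VirtualizationProxy([ObjectStore({"a": 1}), ObjectStore({"b": 2})])
proxy.get("b")  # -> 2, and the caller is unaware of which store answered
```

A caller could also bypass the proxy and call a store's get() directly, matching the fallback described above.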

Tuesday, October 23, 2018

Let us now discuss data virtualization for queries over object storage. The destination of a query is usually a single data source, but query execution, like any other application, retrieves data without requiring technical details about the data, such as where it is located. Location usually depends on the organization that has a business justification for growing and maintaining the data. Not everybody likes to dip right into the data lake at the beginning of a venture, as most organizations do. They have to grapple with their technical needs and changing business priorities before the data and its management solution can be called stable.
Object storage, unlike databases, allows almost limitless storage, with sufficient replication groups to cater to organizations that need their own copy. In addition, the namespace-bucket-object hierarchy allows the levels of separation that organizations need.
The role of object storage in data virtualization is only at the physical storage level, where we determine which site or zone to get the data from. A logical data virtualization layer that knows which namespace or bucket to go to within an object storage does not come straight out of the object storage. A gateway would serve that purpose. The queries can then choose to run against a virtualized view of the data by querying the gateway, which in turn would fetch the data from the corresponding location.
There are many levels of abstraction. First, the destination data source may be within a single object storage. Second, the destination data sources may be in different object stores. Third, the destinations may be in different kinds of storage, such as an object storage and a cluster file system. In all these cases, the virtualization logic resides external to the storage and can be written in different forms. Commercial products of this kind, such as Denodo, are proof that this is a requirement for businesses. Finally, the logic within the virtualization layer can be customized so that queries can make the most of it. We refer to this article to describe the usages of data virtualization, while here we discuss the role of object storage.

Monday, October 22, 2018

We were discussing search with object storage. The language for the query has traditionally been SQL. Tools like LogParser allow SQL queries to be executed over enumerables. SQL has been supporting user defined operators for a while now. These user defined operators help with additional computations that are not present as builtins. In the case of relational data, these have generally been user defined functions or user defined aggregates. With the enumerable data set, the SQL supported by LogParser is somewhat limited. Any implementation of a query execution layer over object storage could choose to allow or disallow user defined operators. These enable computation on, say, user defined data types that are not restricted to the system defined types. Such types have been useful with, say, spatial co-ordinates or geographical data for easier abstraction and simpler expression of computational logic. For example, vector addition can be done with user defined data types and user defined operators.
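The vector addition example above can be sketched with operator overloading standing in for a user defined operator; the Vector type here is a hypothetical user defined data type, not part of any query engine:

```python
class Vector:
    """A user defined data type for spatial co-ordinates."""
    def __init__(self, x, y):
        self.x, self.y = x, y

    def __add__(self, other):
        # A user defined operator: component-wise vector addition.
        return Vector(self.x + other.x, self.y + other.y)

    def __eq__(self, other):
        return (self.x, self.y) == (other.x, other.y)

Vector(1, 2) + Vector(3, 4) == Vector(4, 6)  # -> True
```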
Aggregation operations have the benefit that they can support both batch and streaming modes of operation. These operations can therefore work on large datasets because they view only a portion at a time. Furthermore, the batch operation can be parallelized across a number of processors. This has generally been the case with Big Data and cluster mode operations. Until recently, streaming mode operations were not so common with big data. However, streaming conversions of summation-form batch processing are now facilitated directly out of the box by streaming algorithm packages from some public cloud providers. This means that both batch and stream processing can operate on unstructured data. Some other forms of processing are included here.
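A summation-form aggregation can be written in both modes, as sketched below; the point is only that the streaming version sees one value at a time yet converges to the batch result:

```python
def batch_sum(values):
    """Batch mode: the whole dataset is visible at once."""
    return sum(values)

def streaming_sum(stream):
    """Streaming mode: keep a running total, seeing one value at a time."""
    total = 0
    for value in stream:
        total += value
        yield total

data = [3, 1, 4, 1, 5]
batch_sum(data)            # -> 14
list(streaming_sum(data))  # -> [3, 4, 8, 9, 14]; the last equals the batch result
```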
The language may support search expressions as well as user defined operators. Structured query language works for structured data. Unstructured documents are best served by search operators and expressions; examples include the search expressions and piped operations used with logs. With the broader inclusion of statistical and machine learning packages, a universal query language is now trying to extend existing query languages and standardize them.

Sunday, October 21, 2018



There is also another benefit to the full-text search: we are not restricted in importing it into any form of storage. Object storage can serve as the source for all databases, including graph databases. There is generally a lot of preparation when data is exported from relational tables and imported into graph databases, even though theoretically all the relations in the relational tables are merely edges to the nodes representing the entities. Graph databases are called natural databases because the relationships can be enumerated and persisted as edges, but it is this enumeration that takes some iterations. Data extract, transform and load operations have rigorous packages in the relational world, relying largely on consistency checks, but the same is not true of graph databases. Therefore, each operation requires validation, and more so when an organization is importing the data into a graph database without precedent. The indexer documents ease the import because the data does not need to be collected. The inverted lists of documents are easy to compare for intersection and for left and right differences, and they contribute to edge weights directly when the terms are treated as nodes. The ease with which data can be viewed as nodes and edges makes the import easier. In this way, the object storage for the indexer provides convenience to destinations such as a graph database, where the inverted lists of documents may be used in graph algorithms.
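The set operations on inverted lists mentioned above can be sketched as follows; the terms, document ids and the choice of intersection size as the edge weight are illustrative assumptions:

```python
# Hypothetical inverted lists: term -> set of documents containing it.
index = {
    "storage": {"d1", "d2", "d3"},
    "search":  {"d2", "d3", "d4"},
}

def edge_weight(index, term_a, term_b):
    """Terms are nodes; the edge weight is the number of shared documents."""
    return len(index[term_a] & index[term_b])

common = index["storage"] & index["search"]  # intersection -> {"d2", "d3"}
left   = index["storage"] - index["search"]  # left difference -> {"d1"}
right  = index["search"] - index["storage"]  # right difference -> {"d4"}
edge_weight(index, "storage", "search")      # -> 2
```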
Full-text search is not the only stack. Another search stack can be added over this object storage. For example, an iterator-based, .NET-style standard query operator may also be provided over this object storage. Even query tools like LogParser, which opens up a COM interface to the objects in the storage, can be used. Finally, a comprehensive and dedicated query engine that studies, caches and replays queries is possible, since the object storage does not restrict the search layer.
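The iterator-based standard query operators can be mimicked with generators, as in this sketch; the operator names and the object records are assumptions in the style of where/select:

```python
def where(source, predicate):
    """Deferred filter, in the style of a standard query operator."""
    return (item for item in source if predicate(item))

def select(source, projector):
    """Deferred projection composed over whatever where() yields."""
    return (projector(item) for item in source)

objects = [
    {"key": "a", "size": 10},
    {"key": "b", "size": 300},
    {"key": "c", "size": 25},
]
result = list(select(where(objects, lambda o: o["size"] > 20),
                     lambda o: o["key"]))
# -> ["b", "c"]
```

Because both operators are lazy, nothing is evaluated until the final list() enumerates the pipeline, which is the deferred-execution trait of this style.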
There are a few techniques that can improve query execution. The degree of parallelism helps the query execute faster by partitioning the data and invoking multiple threads. While increasing these parallel activities, we should be careful not to have too many, otherwise the system can get into a thread-thrashing mode. The rule of thumb for increasing the degree of parallelism is that the number of threads is one more than the number of processors, and this refers to operating system threads. There are no such limits for lightweight workers that do not have contention.
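The rule of thumb above can be sketched with a thread pool; the partitioning of the data and the per-partition worker are hypothetical stand-ins for query work:

```python
import os
from concurrent.futures import ThreadPoolExecutor

def parallel_query(partitions, worker):
    """Run the per-partition worker with DoP = number of processors + 1."""
    dop = (os.cpu_count() or 1) + 1
    with ThreadPoolExecutor(max_workers=dop) as pool:
        # map preserves partition order while fanning work out to the pool.
        return list(pool.map(worker, partitions))

partitions = [[1, 2], [3, 4], [5, 6]]
parallel_query(partitions, sum)  # -> [3, 7, 11]
```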
Caching is another benefit to query execution. If a query repeats itself over and over again, we need not perform the same calculations to determine the least cost of serving the query. We can cache the plan and the costs for reuse.
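A plan cache in its simplest form is memoization keyed on the query text, as in this sketch; the planner here is a hypothetical stand-in for the expensive cost-based planning step:

```python
class PlanCache:
    """Memoize the least-cost plan so a repeated query skips planning."""
    def __init__(self, planner):
        self.planner = planner  # expensive cost-based planning function
        self.cache = {}

    def get_plan(self, query):
        if query not in self.cache:
            self.cache[query] = self.planner(query)
        return self.cache[query]

calls = []
def planner(query):
    calls.append(query)         # records how often planning actually runs
    return "plan-for-" + query

cache = PlanCache(planner)
cache.get_plan("q1")
cache.get_plan("q1")            # second call is served from the cache
len(calls)                      # -> 1, the planner ran only once
```

A real cache would also invalidate plans when statistics or schema change, which this sketch omits.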


Saturday, October 20, 2018

We were discussing full-text search with object storage. Lucene indexes are inverted indexes: they list the documents that contain a term, and they store statistics about terms in order to make term-based search more efficient. While Lucene itself is available in various programming languages, there is no restriction on taking the inverted index from Lucene and using it in any way as appropriate.
The inverted indexes over object storage may not be as performant as query execution over relational tables in a SQL database, but they fill the space for enabling search over the storage. Spotlight on macOS and Google's PageRank algorithm on internet documents also use tokens and lookups. Moreover, by recognizing the generic organization of inverted indexes, we can apply any grouping, ranking and sorting algorithm we like. Those algorithms are then independent of the organization of the index in the object storage, and each one can use the same index.
For example, the PageRank algorithm can be selectively used to filter the results. The nodes are the terms, and the edges are given by the inverted lists of documents shared between two nodes. Since we already calculate the marginals, both for the nodes and for the edges, we already have a graph to calculate PageRank on. PageRank can be found as a sum of two components. The first component is represented in the form of a damping factor. The second component is in the summation form of the page ranks of the adjacent vertices, each weighted by the inverse of the out-degree of that vertex. This is said to correspond to the principal eigenvector of the normalized inverted-document-list matrix.
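The two-component form can be sketched as an iterative computation; the toy graph of terms below is an assumption, and a production version would use sparse matrices:

```python
def pagerank(graph, damping=0.85, iterations=50):
    """graph: node -> list of nodes it links to (here, terms linked when
    their inverted lists intersect). Returns the PageRank of each node."""
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new = {}
        for v in nodes:
            # Second component: sum over in-neighbors, each weighted by the
            # inverse of that neighbor's out-degree.
            incoming = sum(rank[u] / len(graph[u])
                           for u in nodes if v in graph[u])
            # First component: the damping term.
            new[v] = (1 - damping) / n + damping * incoming
        rank = new
    return rank

ranks = pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]})
# "a" is linked from both "b" and "c", so it ends up with the highest rank
```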
Full-text search facilitates text mining just the same way a corpus does. While documents are viewed as a bag of words, the indexer represents a collection of already selected keywords for each indexed document. Both are input to text mining algorithms. A neural net can calculate the mutual information between terms regardless of the source and classify them with a softmax classifier. This implies that the index document can allow user input to be added or collected as fields, which can then be treated the same way as the corpus documents.

Friday, October 19, 2018

We were discussing full-text search with object storage. When users want to search, security goes out the window. The Lucene index documents are not secured via access control lists. The user is really looking to cover the entire haystack and not get bogged down by disparate collections and the need to repeat the query on different indexes.
Although S3 supports adding access control descriptions to the objects, those are for securing the objects from other users, not from the system. Search is a system-wide operation. Blacklisting any of the hierarchy artifacts is possible, but it leaves the onus on the user.
This pervasive full-text search has an unintended consequence: users with sensitive information in their objects may divulge it to search users, because those documents will be indexed and will match search queries. This has been noticed in many document libraries outside object storage. There the solution did not involve blacklisting. Instead, it involved informing the users that the library is not the place to save sensitive information. We are merely following the same practice here.
The use of object storage with Lucene as a full-text solution for unstructured data also comes with many benefits other than search. For example, the fields extracted from the raw data together form the input for other analysis.
The tags generated on the metadata via supervised or unsupervised learning also form useful information for subsequent queries. When index documents are manually classified, there is no limit to the number of tags that can be added. Since the index documents and the tags are utilized by the queries, the user gets more and more predicates to use.
The contents of the object storage do not always represent text. They can come in different formats and file types. Even when they do represent text, they may not always be clean. Consequently, a text pre-processing stage is needed prior to indexing. Libraries that help extract text from different file types may be used, as may stemmers and term canonicalizers.
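A toy sketch of such a pre-processing stage follows; the tokenization pattern and the deliberately naive suffix stemmer are illustrative assumptions, and a real pipeline would use a proper stemmer such as Porter's:

```python
import re

def canonicalize(term):
    """Naive canonicalizer: lowercase plus a toy suffix stemmer."""
    term = term.lower()
    for suffix in ("ing", "es", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term

def preprocess(raw_text):
    """Tokenize and canonicalize raw text before it reaches the indexer."""
    return [canonicalize(t) for t in re.findall(r"[A-Za-z]+", raw_text)]

preprocess("Indexing documents, searching terms")
# -> ["index", "document", "search", "term"]
```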
The index documents are also unstructured storage. They can be saved in and exported from object storage. The contents of the index documents and the fields they retain are readable using the packages with which they were created. They are not proprietary per se, since we can read the fields in the index documents and store them directly in object storage as JSON documents. Most of the fields in an index document are enumerated by the doc.toString() method. It is easy to take the string collection and save it as text files if we want to make the terms available to other applications. The information in the various file extensions of the Lucene index, such as term infos, term infos index, term vector index, term vector documents and term vector fields, can be converted to any form we like. Consequently, we are not limited to using one form of search over the metadata.
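The export step can be sketched as below; the field names and values are hypothetical stand-ins for whatever a Lucene index document would enumerate:

```python
import json

# Hypothetical field/value pairs as enumerated from an index document,
# e.g. via something like doc.toString() in Lucene.
index_doc_fields = {
    "path": "bucket/obj1",
    "title": "report",
    "terms": ["storage", "search"],
}

def export_as_json(fields):
    """Render the readable fields as a JSON document that can be stored
    directly as an object in the object storage."""
    return json.dumps(fields, sort_keys=True)

exported = export_as_json(index_doc_fields)
json.loads(exported)["title"]  # round-trips -> "report"
```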
A Lucene index contains a sequence of documents, each of which is a sequence of fields. The fields are named sequences of terms, and each term is a string from the original text that was indexed. The same term may appear in different fields but have different names. The index may also have partitions called segments. Each segment is a fully independent index which can be searched separately. New segments may be created, or existing segments may be merged. This organization is re-usable in all contexts of using inverted indexes. Any external format for exporting these indexes may also use a similar organization.

Thursday, October 18, 2018

We were discussing full-text search on object storage. When users want to search, security goes out the window. The Lucene index documents are not secured via access control lists. They don't even record any information regarding the user in the fields of the document. The user is really looking to cover the entire haystack and not get bogged down by disparate collections and the need to repeat the query on different indexes.
Even the documents in the object storage are not secured so as to hide them from the indexer. Although S3 supports adding access control descriptions to the objects, those are for securing the objects from other users, not from the system. Similarly, the buckets and namespaces can be secured from unwanted access, and this is part of the hierarchy in object storage. However, the indexer cannot choose to ignore any part of the hierarchy, because that would put the burden on the user to decide what to index. This is technically possible by blacklisting any of the hierarchy artifacts, but it is not a good business policy unless the customer needs advanced controls.
This has an unintended consequence: users with sensitive information in their objects may divulge it to search users, because those documents will be indexed and will match search queries. This has been noticed in many document libraries outside object storage. There the solution did not involve blacklisting. Instead, it involved informing the users that the library is not the place to save sensitive information. We are merely following the same practice here.
Finally, we mention that users are not required to use different search queries for different document collections. All queries can target the same index, generated for all document collections, at its different locations.