Sunday, October 21, 2018



There is also another benefit to the full-text search: we are not restricted in where the index may be imported. Object storage can serve as the source for all databases, including graph databases. There is generally a lot of preparation when data is exported from relational tables and imported into a graph database, even though, in theory, the relations between relational tables are merely edges between the nodes representing the entities. Graph databases are called natural databases because the relationships can be enumerated and persisted as edges, but it is this enumeration that takes some iterations. Extract-transform-load operations have rigorous packages in the relational world that rely largely on consistency checks, but the same is not true for graph databases. Therefore, each operation requires validation, all the more so when an organization is importing data into a graph database without precedent. The indexer documents simplify this import because the data does not need to be collected again. Inverted lists of documents are easy to compare for intersections and left and right differences, and these contribute directly to edge weights when the terms are treated as nodes. The ease with which the data can be viewed as nodes and edges makes the import easier. In this way, the object storage behind the indexer provides convenience to destinations such as graph databases, where the inverted lists of documents may be used in graph algorithms.
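The set comparisons mentioned above can be sketched in a few lines. This is a minimal illustration, assuming hypothetical posting lists; a real index would supply them.

```python
# Sketch: treating terms as nodes and shared documents as edge weights.
# The posting lists below are hypothetical, for illustration only.
postings = {
    "storage": {"doc1", "doc2", "doc3"},
    "index":   {"doc2", "doc3", "doc4"},
}

def edge_weight(term_a, term_b, postings):
    """Edge weight = number of documents containing both terms (intersection)."""
    return len(postings[term_a] & postings[term_b])

def left_difference(term_a, term_b, postings):
    """Documents containing term_a but not term_b."""
    return postings[term_a] - postings[term_b]

weight = edge_weight("storage", "index", postings)            # doc2 and doc3
only_storage = left_difference("storage", "index", postings)  # doc1 only
```

The right difference is just the left difference with the arguments swapped, so the three set operations together describe how two term nodes relate.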
Full-text search is not the only possible stack. Another search stack can be added over this object storage. For example, an iterator-based .NET-style standard query operator may also be provided over it. Even query tools like LogParser, which opens a COM interface to the objects in the storage, can be used. Finally, a comprehensive, dedicated query engine that studies, caches and replays queries is possible, since the object storage does not restrict the search layer.
There are a few techniques that can improve query execution. Degree of parallelism (DoP) helps a query execute faster by partitioning the data and invoking multiple threads. While increasing these parallel activities, we should be careful not to add too many, otherwise the system can get into thread-thrashing mode. The rule of thumb for increasing DoP is that the number of threads is one more than the number of processors, and this refers to operating-system threads. There is no such limit for lightweight workers that do not have contention.
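The partitioned execution described above can be sketched as follows. This is a minimal example, not a query engine; the predicate and data are made up, and the thread count follows the processors-plus-one rule of thumb.

```python
# Sketch: partition the data and scan partitions on a small thread pool.
import os
from concurrent.futures import ThreadPoolExecutor

def scan_partition(partition, predicate):
    # Each worker scans only its own partition.
    return [row for row in partition if predicate(row)]

def parallel_scan(rows, predicate, dop=None):
    # Rule of thumb: one more thread than the number of processors.
    dop = dop or (os.cpu_count() or 1) + 1
    size = max(1, len(rows) // dop)
    partitions = [rows[i:i + size] for i in range(0, len(rows), size)]
    with ThreadPoolExecutor(max_workers=dop) as pool:
        results = pool.map(scan_partition, partitions,
                           [predicate] * len(partitions))
    return [row for part in results for row in part]

matches = parallel_scan(list(range(100)), lambda n: n % 10 == 0)
```

Because `map` preserves partition order, the merged result comes back in the original order without extra sorting.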
Caching is another benefit to query execution. If a query repeats itself over and over, we need not perform the same calculations to determine the least cost of serving it. We can cache the plan and its costs and reuse them.
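A plan cache of this kind can be sketched with a dictionary keyed by the query text. The candidate plans and their cost formulas below are invented stand-ins for a real optimizer's costing.

```python
# Sketch: cache the least-cost plan per query so repeats skip costing.
plan_cache = {}
cost_calls = 0

def estimate_costs(query):
    # Stands in for expensive cost estimation; counts its invocations.
    global cost_calls
    cost_calls += 1
    candidates = {"full_scan": len(query) * 10, "index_seek": len(query)}
    return min(candidates, key=candidates.get)

def get_plan(query):
    if query not in plan_cache:       # compute and cache on first sight
        plan_cache[query] = estimate_costs(query)
    return plan_cache[query]

first = get_plan("term:storage")
second = get_plan("term:storage")     # served from the cache
```

A real engine would also invalidate cached plans when the underlying statistics change, which this sketch omits.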


Saturday, October 20, 2018

We were discussing full-text search with object storage. Lucene indexes are inverted indexes: for each term, the index lists the documents that contain it. It also stores statistics about terms in order to make term-based search more efficient. While Lucene itself is available in various programming languages, there is no restriction on taking the inverted index from Lucene and using it in any way as appropriate.
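The core idea of an inverted index can be shown in a few lines. This is a generic sketch, not Lucene's on-disk format; the documents are illustrative.

```python
# Sketch: build a term -> set-of-documents mapping, plus a per-term
# document frequency, the basic statistic behind term-based search.
from collections import defaultdict

def build_inverted_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

docs = {"d1": "object storage search", "d2": "full text search"}
index = build_inverted_index(docs)
doc_freq = {term: len(ids) for term, ids in index.items()}
```

Looking up a term is then a dictionary access, and multi-term queries reduce to set operations on the posting lists.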
The inverted indexes over object storage may not be as performant as query execution over relational tables in a SQL database, but they fill the space of enabling search over the storage. Spotlight on macOS and Google's PageRank algorithm over internet documents also use tokens and lookups. Moreover, by recognizing the generic organization of inverted indexes, we can apply any grouping, ranking and sorting algorithm we like. Those algorithms are then independent of the organization of the index in the object storage, and each one can use the same index.
For example, the PageRank algorithm can be selectively used to filter the results. The nodes are the terms, and the edges are the inverted lists of documents shared by two nodes. Since we already calculate the marginals, both for the nodes and for the edges, we already have a graph on which to calculate PageRank. The PageRank of a vertex can be found as a sum of two components. The first component comes from the damping factor. The second is the summation of the page ranks of the adjacent vertices, each weighted by the inverse of the out-degree of that vertex. This is said to correspond to the principal eigenvector of the normalized inverted-document-list matrix.
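The two components above can be sketched as a power iteration. The tiny term graph is illustrative; in the setting described here its edges would come from shared inverted lists.

```python
# Sketch: power-iteration PageRank over a term graph.
def pagerank(graph, damping=0.85, iterations=50):
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {}
        for node in nodes:
            # Sum of adjacent ranks, each weighted by inverse out-degree.
            incoming = sum(rank[v] / len(graph[v])
                           for v in nodes if node in graph[v])
            # Damping-factor component plus the weighted summation.
            new_rank[node] = (1 - damping) / n + damping * incoming
        rank = new_rank
    return rank

graph = {"storage": ["index"],
         "index": ["storage", "search"],
         "search": ["storage"]}
rank = pagerank(graph)
```

With no dangling nodes, the ranks stay normalized, which is the discrete reflection of the principal-eigenvector interpretation.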
Full-text search facilitates text mining the same way a corpus does. While corpus documents are viewed as bags of words, the indexer represents a collection of already selected keywords for each indexed document. Both are valid inputs to text-mining algorithms. Neural nets can calculate the mutual information between terms regardless of the source and classify them with a softmax classifier. This implies that the indexer document can allow user input to be added or collected as fields in the index document, which can then be treated the same way as corpus documents.
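One concrete association signal of this kind is pointwise mutual information between two terms, which can be computed from document counts whether they come from a corpus or from inverted lists. The counts below are made up for illustration.

```python
# Sketch: pointwise mutual information (PMI) between two terms from
# document counts. PMI > 0 means the terms co-occur more than chance.
import math

def pmi(docs_with_a, docs_with_b, docs_with_both, total_docs):
    p_a = docs_with_a / total_docs
    p_b = docs_with_b / total_docs
    p_ab = docs_with_both / total_docs
    return math.log(p_ab / (p_a * p_b))

# Two terms that always appear together in 10 of 100 documents.
score = pmi(docs_with_a=10, docs_with_b=10, docs_with_both=10, total_docs=100)
```

Independent terms would score near zero, so PMI serves as a simple filter before heavier classification.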

Friday, October 19, 2018

We were discussing full-text search with object storage. When users want to search, security goes out the window. The Lucene index documents are not secured via Access Control Lists. The user is really looking to cover the entire haystack and not get bogged down by disparate collections and the need to repeat the query on different indexes.
Although S3 supports adding access control descriptions to objects, those are for securing the objects from other users, not from the system. Search is a system-wide operation. Blacklisting any of the hierarchy artifacts is possible, but it leaves the onus on the user.
This pervasive full-text search has an unintended consequence: users with sensitive information in their objects may divulge it to search users, because those documents will be indexed and will match the search query. This has been noticed in many document libraries outside object storage. There, the solution did not involve blacklisting. Instead, it involved informing users that the library is not the place to save sensitive information. We are merely following the same practice here.
The use of object storage with Lucene as a full-text solution for unstructured data also comes with many benefits beyond search. For example, the fields extracted from the raw data together form the input for other analysis.
The tags generated on the metadata via supervised or unsupervised learning also form useful information for subsequent queries. When index documents are manually classified, there is no limit to the number of tags that can be added. Since both the index documents and the tags are utilized by the queries, the user gets more and more predicates to use.
The contents of the object storage do not always represent text. They can come in different formats and file types. Even when they do represent text, they may not always be clean. Consequently, a text pre-processing stage is needed prior to indexing. Libraries that help extract text from different file types may be used, as may stemmers and term canonicalizers.
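A minimal pre-processing pipeline along these lines might look as follows. The stop-word list is arbitrary, and the suffix-stripping stemmer is a naive stand-in for a real one such as Porter's.

```python
# Sketch: lowercase, tokenize, drop stop words, and apply a naive
# suffix-stripping stemmer before terms reach the indexer.
import re

STOP_WORDS = {"the", "a", "of", "and"}

def naive_stem(token):
    # Strip a common suffix if enough of the token remains.
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

terms = preprocess("The indexing of the stored documents")
```

The output terms are what would actually be written into the inverted index, so queries must pass through the same pipeline to match.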
The index documents are also unstructured storage. They can be saved to and exported from object storage. The contents of the index documents and the fields they retain are readable using the packages with which they were created. They are not proprietary per se, since we can read the fields in the index documents and store them directly in object storage as JSON documents. Most of the fields in an index document are enumerated by the doc.toString() method. It is easy to take the string collection and save it as text files if we want to make the terms available to other applications. The information in the various file extensions of the Lucene index, such as term infos, term infos index, term vector index, term vector documents and term vector fields, can be converted to any form we like. Consequently, we are not limited to one form of search over the metadata.
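Exporting the fields as JSON is then a straightforward round trip. The field names below are hypothetical; a real exporter would read them from the Lucene documents rather than hard-coding them.

```python
# Sketch: serialize index-document fields to JSON so they can be saved
# as plain objects, then read back without any proprietary reader.
import json

index_documents = [
    {"path": "bucket/obj1", "terms": ["storage", "index"]},
    {"path": "bucket/obj2", "terms": ["search"]},
]

exported = json.dumps(index_documents, sort_keys=True)
restored = json.loads(exported)
```

Because the round trip is lossless for these field types, any downstream application can consume the exported objects with an ordinary JSON parser.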
A Lucene index contains a sequence of documents, each of which is a sequence of fields. The fields are named sequences of terms, and each term is a string from the original text that was indexed. The same term may appear in different fields under different names. The index may also have partitions called segments. Each segment is a fully independent index that can be searched separately. New segments may be created, or existing segments may be merged. This organization is reusable in all contexts where inverted indexes are used. Any external format for exporting these indexes may use a similar organization.
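Segment merging can be sketched generically: since each segment is an independent inverted index, merging is a union of posting lists. The segments below are invented examples, not Lucene's file format.

```python
# Sketch: merge two independently searchable segments (term -> doc-id
# sets) into one, the way segment merges consolidate an index.
def merge_segments(seg_a, seg_b):
    merged = {term: set(ids) for term, ids in seg_a.items()}
    for term, ids in seg_b.items():
        merged.setdefault(term, set()).update(ids)
    return merged

seg_a = {"storage": {"d1"}, "index": {"d1", "d2"}}
seg_b = {"index": {"d3"}, "search": {"d3"}}
merged = merge_segments(seg_a, seg_b)
```

Before the merge, a query would have to consult both segments and union the results; after it, one lookup suffices.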

Thursday, October 18, 2018

We were discussing full-text search on object storage. When users want to search, security goes out the window. The Lucene index documents are not secured via Access Control Lists. They do not even record any information about the user in the fields of the document. The user is really looking to cover the entire haystack and not get bogged down by disparate collections and the need to repeat the query on different indexes.
Even the documents in the object storage are not secured so as to hide them from the indexer. Although S3 supports adding access control descriptions to objects, those are for securing the objects from other users, not from the system. Similarly, the buckets and namespaces can be secured from unwanted access, and this is part of the hierarchy in object storage. However, the indexer cannot choose to ignore any part of the hierarchy, because that would put the burden on the user to decide what to index. This is technically possible by blacklisting any of the hierarchy artifacts, but it is not a good business policy unless the customer needs advanced controls.
Finally, we mention that users are not required to issue different search queries against different document collections. All queries can target the same index, which is generated over the entire document collection, even when it is stored at different locations.

Wednesday, October 17, 2018

We were discussing full-text search over object storage. As we enumerate some of the advantages of separating the object index from the object data, we realize that the metadata is what we choose to keep from the Lucene-generated fields. By enhancing the fields of the documents added to the index, we improve not only the queries but also the collection on which we can perform correlations. If we started out with only a few fields, the model for statistical analysis has only a few parameters. The more fields we add, the better the correlation. This is true not just for queries on the data and analysis over its historical accumulation, but also for data mining and machine learning methods.
We will elaborate on each of these. Consider the time- and space-dimension queries that are generally required in dashboards, charts and graphs. These queries need to search over accumulated data that might be quite large, often exceeding terabytes. As the data grows, the metadata becomes all the more important, and its organization can now be tailored to the queries instead of relying on the organization of the data. If there is a need to separate online ad hoc queries on current metadata from more analytical and intensive background queries, then we can choose a different organization of the information for each category so that each serves its queries better.
Let us also look at data mining techniques. These include clustering techniques and rely largely on adding tags to existing information. Even though we are searching Lucene index documents that may already have fields, there is nothing preventing these techniques from classifying the documents and coming up with newer labels, which can be persisted as fields. Therefore, unlike the read-only queries mentioned earlier, these are techniques where one stage of processing may benefit from the read-write of another. Data mining algorithms are computationally heavy compared to regular queries for grouping, sorting and ranking. Even computing the similarity between an index document and a cluster might not be cheap. That is why it helps to have the results of one stage of processing benefit another, especially if the documents have not changed.
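The read-write pattern above can be sketched as follows: assign each index document to the nearest seed cluster by term overlap, then persist the label as an extra field. The seed clusters, similarity measure (Jaccard) and document are all illustrative choices, not a prescribed algorithm.

```python
# Sketch: label an index document with its nearest cluster and persist
# the label as a field for later queries to reuse.
def jaccard(a, b):
    # Set similarity: shared terms over all terms.
    return len(a & b) / len(a | b)

clusters = {"infra": {"storage", "bucket"}, "query": {"search", "rank"}}

def tag_document(doc):
    label = max(clusters,
                key=lambda name: jaccard(doc["terms"], clusters[name]))
    doc["fields"] = {"cluster": label}   # read-write: persisted as a field
    return doc

doc = tag_document({"id": "d1", "terms": {"storage", "bucket", "search"}})
```

Once the label is stored, later queries can filter on the cluster field instead of recomputing the similarity, which is the reuse the paragraph describes.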
Now let us take a look at machine learning, which is by far the most computationally involved of all of the above. Here, again, we benefit from having more and more data to process. Since machine learning methods are implemented in packages from different sources, there is more emphasis on long-running tasks written in environments different from the data source. Hence, making all of the data available in different compute environments becomes more important. If it helps performance in these cases, the object storage can keep replications of the data in different zones.

Tuesday, October 16, 2018

In addition to serving as a regular time-series database, object storage can participate in text-based search, both as full-text search and via text-mining techniques. One example of full-text search with Lucene is demonstrated here: https://github.com/ravibeta/csharpexamples/blob/master/SourceSearch/SourceSearch/SourceSearch/Program.cs; this is better written in Java against a file-system-enabled object storage. An example is available at https://github.com/ravibeta/JavaSamples/blob/master/FullTextSearchOnS3.java
Text-mining techniques, including machine learning techniques, are demonstrated with scikit Python packages; sample code is at https://github.com/ravibeta/pythonexamples/blob/master/
The use of indexable key-values in full-text search deserves special mention. On one hand, Lucene has ways to populate meta keys and meta values, which it calls fields, in the indexer documents. On the other hand, each of the objects in a bucket can store not only the raw document but also the meta keys and meta values. This suggests keeping the raw data and the indexer fields together in the same object. When we search over the objects enumerated from the bucket, we no longer use the actual object and thus avoid searching through large objects. Instead, we search the metadata and list only those objects whose metadata has the relevant terms. However, we make an improvement to this model by separating the index objects from the raw objects. The raw objects no longer need to be touched when the metadata changes. Similarly, the indexer objects can be deleted and recreated independently of the raw objects, so that we can re-index at different sites. Also, keeping the indexer documents as key-value entries reduces space and keeps them together, so that a greater range of objects can be searched. This technique has been quite popular with many indexes.
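The raw/index separation can be sketched with a simple key-naming convention. The dict stands in for a bucket, and the `raw/` and `index/` prefixes are invented for illustration; an S3 implementation would use object keys the same way.

```python
# Sketch: keep raw objects and index objects under separate key
# prefixes so metadata can be rewritten without touching raw data.
store = {}

def put_raw(key, data):
    store["raw/" + key] = data

def put_index(key, metadata):
    store["index/" + key] = metadata     # rebuildable independently

def search(term):
    # Only index objects are scanned; raw objects are never read.
    return [k.split("/", 1)[1] for k, meta in store.items()
            if k.startswith("index/") and term in meta["terms"]]

put_raw("obj1", b"...large payload...")
put_index("obj1", {"terms": ["storage", "search"]})
hits = search("search")
```

Deleting everything under `index/` and rewriting it is the re-indexing step; the `raw/` keys are untouched throughout.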

Monday, October 15, 2018

The object storage as a time-series data-store
Introduction: Niche products like log indexes and time-series databases often rely on unstructured storage. While they are primarily focused on their respective purposes, they have very little need to focus on storage. Object storage serves as a limitless, no-maintenance store that not only replaces the file system as the traditional store for these products but also brings along storage best practices. As a storage tier, object storage is suited not only to take the place of other storage used by these products but also to bring those product features into the storage tier. This article examines the premise of using object storage to participate directly as the base tier of log index products.
Description: We begin by examining the requirements for a log index store.
First, log indexes are maintained as a time-series database. The data takes the form of cold, warm and hot buckets, which denote the progression of entries made to the store. Each index is maintained from a continuous rolling input of data that fills one bucket after another. Each entry is given a timestamp if one cannot be interpreted from the data, either by parsing or from tags specified by the user. This is called the raw entry, and it is parsed for fields that can be extracted and used with the index. Object storage also enables key-values to be written on its objects. Indexes on raw entries can be maintained together with the data or in separately named objects. The organization of time-series artifacts is agreeable to the namespace-bucket-object hierarchy in object storage.
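The hot-to-warm-to-cold progression can be sketched as a rollover policy. The class name, the capacity of two entries per hot bucket, and the explicit `freeze` step are all simplifications for illustration.

```python
# Sketch: roll a hot bucket to warm when it fills, and warm buckets to
# cold on demand, mirroring the time-series bucket progression.
from collections import deque

class TimeSeriesStore:
    def __init__(self, hot_capacity=2):
        self.hot, self.warm, self.cold = [], deque(), []
        self.hot_capacity = hot_capacity

    def append(self, timestamp, entry):
        if len(self.hot) >= self.hot_capacity:   # roll hot -> warm
            self.warm.append(self.hot)
            self.hot = []
        self.hot.append((timestamp, entry))

    def freeze(self):
        while self.warm:                          # roll warm -> cold
            self.cold.append(self.warm.popleft())

store = TimeSeriesStore()
for ts in range(5):
    store.append(ts, "entry-%d" % ts)
store.freeze()
```

In an object-storage mapping, each rolled bucket would become a separately named object, which is exactly the organization the paragraph says fits the namespace-bucket-object hierarchy.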
Second, most log stores accept input directly over HTTP via their own proprietary application calls. Object storage already has the well-known S3 APIs that facilitate data reads and writes. This makes the convention of calling the API uniform, not just for the log-indexing service but also for its callers. HTTP is not the only way to input data. A low-level connector that can take file-system protocols for data input and output may also improve access for these time-series databases. A file-system-backed object storage that supports file-system protocols may serve this lower-level data path. A log store's existing file system may be replicated directly into object storage; this becomes a tighter integration of the log index service as native to the object storage if connectors are available to accept data from sockets, files, queues and other sources.
Third, the data input does not need to be collected from data sources. In fact, log appenders are known to send data to multiple destinations, and object storage is merely another destination for them. For example, there is an S3-API-based log appender at http://s3appender.codeplex.com/ that can be used directly in many applications and services because they use log4j. The only refinement we mention here is that the log4j appenders need not go over HTTP and can use low-level protocols such as file-system open and close.
Fourth, object storage brings the best in storage practice in terms of durability, redundancy, availability and organization, and these can be leveraged for simultaneous log-analytics activities that were generally limited on a single log index server.
Conclusion: The object storage is well suited for saving, parsing, searching and reporting from logs.
Reference: https://1drv.ms/w/s!Ashlm-Nw-wnWt2h4_zHbC-u_MKIn