Saturday, October 20, 2018

We were discussing full-text search with object storage. A Lucene index is an inverted index: for each term, it lists the documents that contain that term. It also stores statistics about terms to make term-based search more efficient. Although Lucene itself is available in several programming languages, nothing prevents us from taking the inverted index that Lucene produces and using it in whatever way is appropriate.
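To make the shape of an inverted index concrete, here is a minimal sketch in plain Python. It is not Lucene's on-disk format or API; the function names, the whitespace tokenizer and the sample documents are all assumptions made for illustration.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumption: lowercase + whitespace split stands in for a real analyzer.
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

def document_frequency(index, term):
    """A simple per-term statistic: how many documents contain the term."""
    return len(index.get(term, []))

# Hypothetical documents, keyed by an object-storage style id.
docs = {
    "d1": "object storage holds documents",
    "d2": "lucene builds an inverted index",
    "d3": "an inverted index over object storage",
}
index = build_inverted_index(docs)
```

A lookup such as `index["inverted"]` returns the posting list for that term, and `document_frequency(index, "index")` gives the kind of term statistic that ranking functions build on.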
Inverted indexes over object storage may not be as performant as query execution over relational tables in a SQL database, but they fill the gap by enabling search over the storage. Spotlight on macOS and Google's PageRank algorithm over internet documents also rely on tokens and lookups. Moreover, by recognizing the generic organization of inverted indexes, we can apply any grouping, ranking and sorting algorithm we like. Those algorithms are then independent of how the index is organized in the object storage, and each one can use the same index.
For example, the PageRank algorithm can be used selectively to filter the results. The nodes are the terms, and the edge between two nodes is the inverted list of documents that contain both terms. Since we already calculate the marginals for both the nodes and the edges, we already have a graph on which to calculate PageRank. The PageRank of a vertex can be written as the sum of two components. The first component is a constant term determined by the damping factor. The second component is a sum over the PageRanks of the adjacent vertices, each weighted by the inverse of the out-degree of that adjacent vertex. This is said to correspond to the principal eigenvector of the normalized inverted document list matrix.
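The two-component form described above can be sketched as a short power iteration. This is only an illustration under stated assumptions: the term graph is undirected and unweighted (two terms are linked when they share at least one document), the damping value 0.85 is the commonly quoted default rather than anything fixed by this post, and the names `term_graph`, `pagerank` and the sample index are hypothetical.

```python
from itertools import combinations

def term_graph(index):
    """Link two terms when their posting lists share at least one document."""
    graph = {term: set() for term in index}
    for a, b in combinations(index, 2):
        if set(index[a]) & set(index[b]):
            graph[a].add(b)
            graph[b].add(a)
    return graph

def pagerank(graph, damping=0.85, iterations=50):
    """Iterate PR(v) = (1 - d)/N + d * sum(PR(u) / degree(u)) over neighbours u."""
    n = len(graph)
    rank = {node: 1.0 / n for node in graph}
    for _ in range(iterations):
        new_rank = {}
        for node in graph:
            # Second component: neighbours' ranks, each divided by its out-degree.
            incoming = sum(rank[nb] / len(graph[nb]) for nb in graph[node])
            # First component: the constant term set by the damping factor.
            new_rank[node] = (1.0 - damping) / n + damping * incoming
        rank = new_rank
    return rank

# Hypothetical inverted index: term -> list of document ids.
index = {
    "object": ["d1", "d3"],
    "storage": ["d1", "d3"],
    "lucene": ["d2"],
    "index": ["d2", "d3"],
}
ranks = pagerank(term_graph(index))
```

In this toy graph the term `index` shares documents with every other term, so it ends up with the highest rank. A term with no neighbours would keep only the constant component; handling such dangling nodes properly is left out of the sketch.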
Full-text search facilitates text mining in the same way a corpus does. While documents are viewed as bags of words, the indexer holds a collection of already selected keywords for each indexed document. Both are inputs to text mining algorithms. Neural nets will calculate the mutual information between terms regardless of the source and classify them with a softmax classifier. This implies that the index document can allow user input to be added or collected as fields, which can then be treated the same way as the corpus documents.
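As a rough illustration of how term statistics from the index can feed such algorithms, the sketch below computes pointwise mutual information (PMI) between two terms from their posting lists. This is a simple count-based stand-in, not the neural approach mentioned above; the function name, the sample index and the choice of document-level probabilities are assumptions.

```python
import math

def pointwise_mutual_information(index, term_a, term_b, total_docs):
    """PMI = log( P(a, b) / (P(a) * P(b)) ), with document-level probabilities."""
    docs_a = set(index.get(term_a, []))
    docs_b = set(index.get(term_b, []))
    joint = len(docs_a & docs_b)
    if joint == 0 or not docs_a or not docs_b:
        # The terms never co-occur in the indexed documents.
        return float("-inf")
    p_a = len(docs_a) / total_docs
    p_b = len(docs_b) / total_docs
    p_ab = joint / total_docs
    return math.log(p_ab / (p_a * p_b))

# Hypothetical inverted index over three documents.
index = {
    "object": ["d1", "d3"],
    "storage": ["d1", "d3"],
    "lucene": ["d2"],
    "index": ["d2", "d3"],
}
pmi_close = pointwise_mutual_information(index, "object", "storage", total_docs=3)
pmi_never = pointwise_mutual_information(index, "object", "lucene", total_docs=3)
```

A positive value means the two terms co-occur more often than independence would predict, which is the kind of signal a downstream classifier could consume.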
There is another benefit to full-text search. We are not restricted to importing the data into any particular form of storage. Object storage can serve as the source for all databases, including graph databases. There is generally a lot of preparation when data is exported from relational tables and imported into a graph database, even though in theory all the relations in the relational tables are merely edges between the nodes that represent the entities. Graph databases are called natural databases because the relationships can be enumerated and persisted as edges, but it is this enumeration that takes some iterations. Extract, transform and load operations have rigorous packages in the relational world and rely largely on consistency checks, but the same is not true for graph databases. Therefore, each operation requires validation, and more so when an organization is importing data into a graph database without precedent. The indexer documents sidestep the import because the data does not need to be collected again. The inverted lists of documents are easy to compare for intersection and for left and right differences, and they contribute to edge weights directly when the terms are treated as nodes. The ease with which the data can be viewed as nodes and edges makes the import easier. In this way, the object storage for the indexer provides convenience to destinations such as graph databases, where the inverted lists of documents may be used in graph algorithms.
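The comparison of inverted lists described above can be sketched with plain set operations. This is a minimal illustration, assuming the index is available as a mapping from term to document ids and that the edge weight is simply the number of shared documents; the function names and sample data are hypothetical, and no particular graph database API is implied.

```python
from itertools import combinations

def compare_postings(index, term_a, term_b):
    """Intersection plus left and right differences of two posting lists."""
    a = set(index.get(term_a, []))
    b = set(index.get(term_b, []))
    return {
        "both": sorted(a & b),    # documents containing both terms
        "only_a": sorted(a - b),  # left difference
        "only_b": sorted(b - a),  # right difference
    }

def edge_weights(index):
    """Treat terms as nodes; weight each edge by the number of shared documents."""
    weights = {}
    for a, b in combinations(sorted(index), 2):
        shared = len(set(index[a]) & set(index[b]))
        if shared:
            weights[(a, b)] = shared
    return weights

# Hypothetical inverted index: term -> list of document ids.
index = {
    "object": ["d1", "d3"],
    "storage": ["d1", "d3"],
    "lucene": ["d2"],
    "index": ["d2", "d3"],
}
comparison = compare_postings(index, "object", "index")
weights = edge_weights(index)
```

The resulting `weights` dictionary is already an edge list, so it could be handed to a graph loader as (source, target, weight) rows without a separate extraction step.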
