Tuesday, October 16, 2018

In addition to regular time-series database,  object storage can participate in text based search both as full text search as well as text mining techniques. One example of using full-text based search is demonstrated with Lucene here: https://github.com/ravibeta/csharpexamples/blob/master/SourceSearch/SourceSearch/SourceSearch/Program.cs which is better written in Java and File system enabled object storage. An example is available at https://github.com/ravibeta/JavaSamples/blob/master/FullTextSearchOnS3.java
The text mining techniques including machine learning techniques is demonstrated with scikit python packages and sample code at https://github.com/ravibeta/pythonexamples/blob/master/
The use of indexable key values in full text search deserves special mention. On one hand, Lucene has ways to populate meta keys and meta values which it calls fields in the indexer documents. On the other hand each of the objects in the bucket can not only store the raw document but also the meta keys and meta values. This calls for keeping the raw data and the indexer fields together in the same object. When we search over the objects enumerated from the bucket we no longer use the actual object and thus avoid searching through large objects. Instead we search the metadata and we lost only those objects where the metadata has the relevant terms. However, we make an improvement to this model by separating the index objects from the raw objects. The raw objects are no longer required to be touched when the metadata changes. Similarly the indexer objects can be deleted and recreated regardless of the objects so that we can re-index at different sites. Also keeping the indexer documents as key value entries reduced space and keeps them together so that a greater range of objects can be searched . This technique has been quite popular with many indexes.

No comments:

Post a Comment