We were discussing full-text search with object storage. When users want to search, security goes out the window. The Lucene index documents are not secured via Access Control Lists. The user is really looking to cover the entire haystack and does not want to get bogged down by disparate collections or the need to repeat the query on different indexes.
Although S3 supports adding access control descriptions to objects, those secure the objects from other users and not from the system. Search is a system-wide operation. Blacklisting any of the hierarchy artifacts is possible, but it leaves the onus on the user.
This pervasive full-text search has an unintended consequence: users with sensitive information in their objects may divulge it to search users, because those objects will be indexed and will match search queries. This has been noticed in many document libraries outside object storage. There the solution did not involve blacklisting. Instead, users were informed that the library is not the place to save sensitive information. We are merely following the same practice here.
The use of object storage with Lucene as a full-text solution for unstructured data also comes with many benefits other than search. For example, the fields extracted from the raw data together also form the input for other analysis.
The tags generated on the metadata via supervised or unsupervised learning also form useful information for subsequent queries. When index documents are manually classified, there is no limit to the number of tags that can be added. Since both the index documents and the tags are used by the queries, the user gets more and more predicates to use.
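As a rough illustration of tags acting as extra query predicates, here is a minimal Python sketch. The `DOCS` structure, the object keys, and the `add_tag` and `query` helpers are all hypothetical stand-ins for real index documents; this is not Lucene's API.

```python
# Hypothetical index documents: each carries its terms and a growing set of tags.
DOCS = {
    "invoices/2019-01.pdf": {"terms": {"invoice", "total"}, "tags": {"finance"}},
    "notes/meeting.txt": {"terms": {"invoice", "agenda"}, "tags": set()},
}

def add_tag(key, tag):
    """Attach one more tag to an index document; there is no upper limit."""
    DOCS[key]["tags"].add(tag)

def query(term, required_tags=()):
    """Return keys whose document contains the term and carries every required tag."""
    wanted = set(required_tags)
    return sorted(
        key
        for key, doc in DOCS.items()
        if term in doc["terms"] and wanted <= doc["tags"]
    )
```

Each tag added by a classifier or by hand becomes one more predicate that a later `query` call can filter on.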
The contents of the object storage do not always represent text. They can come in different formats and file types. Even when they do represent text, they may not always be clean. Consequently, a text pre-processing stage is needed prior to indexing. Libraries that help extract text from different file types may be used. Stemmers and term canonicalizers may also be used.
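The pre-processing stage might look something like the following Python sketch. It assumes the object has already been reduced to bytes of text by some extraction library. The stop-word list and `naive_stem` are deliberately simplistic placeholders and not a real stemmer such as Porter's.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny stop-word list; real deployments use larger ones.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}

def canonicalize(text: str) -> str:
    """Normalize Unicode, strip accents, and lowercase."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return without_accents.lower()

def naive_stem(token: str) -> str:
    """Strip a few common English suffixes. A placeholder, not a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(raw: bytes, encoding: str = "utf-8") -> list[str]:
    """Decode object bytes, canonicalize, tokenize, drop stop words, and stem."""
    text = raw.decode(encoding, errors="replace")  # tolerate unclean input
    tokens = re.findall(r"[a-z0-9]+", canonicalize(text))
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]
```

For example, `preprocess("Indexing the Café objects".encode("utf-8"))` yields `["index", "cafe", "object"]`, which is the token list that would then be handed to the indexer.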
The index documents are themselves unstructured storage. They can be saved in and exported from object storage. The contents of an index document and the fields it retains are readable using the packages with which they were created. They are not proprietary per se, since we can read the fields in the index documents and store them directly in object storage as JSON documents. Most of the fields in an index document are enumerated by the doc.toString() method. It is easy to take the resulting strings and save them as text files if we want to make the terms available to other applications. The information in the various Lucene index files, such as term infos, term infos index, term vector index, term vector documents and term vector fields, can be converted to any form we like. Consequently, we are not limited to one form of search over the metadata.
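A minimal Python sketch of such an export is shown here. It assumes the fields have already been read out of the index document by whatever package created it. The `export_index_document` helper, the `index-export/` key prefix, and the JSON layout are assumptions made for illustration only.

```python
import json

def export_index_document(
    doc_id: str,
    fields: dict[str, list[str]],
    prefix: str = "index-export/",  # hypothetical key prefix
) -> tuple[str, bytes]:
    """Return the object key and JSON body under which the fields could be stored."""
    key = f"{prefix}{doc_id}.json"
    body = json.dumps(
        {"id": doc_id, "fields": fields}, sort_keys=True, indent=2
    ).encode("utf-8")
    return key, body
```

The returned key and body can be handed to any object storage client as an ordinary upload, after which other applications can read the terms without Lucene.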
Lucene indexes are inverted indexes. An inverted index lists, for each term, the documents that contain it. It also stores statistics about terms in order to make term-based search more efficient. While Lucene itself is available in various programming languages, nothing prevents us from taking the inverted index from Lucene and using it in any way we see fit.
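To make the idea concrete, here is a small Python sketch of an inverted index that keeps per-term statistics. The `InvertedIndex` class and its method names are invented for this illustration; it is not Lucene's data structure or on-disk format.

```python
from collections import defaultdict

class InvertedIndex:
    def __init__(self):
        # term -> {doc_id: number of times the term occurs in that document}
        self.postings = defaultdict(dict)
        self.doc_count = 0

    def add(self, doc_id, tokens):
        """Index one document given its already-tokenized text."""
        self.doc_count += 1
        for token in tokens:
            self.postings[token][doc_id] = self.postings[token].get(doc_id, 0) + 1

    def docs_for(self, term):
        """Documents that contain the term."""
        return sorted(self.postings.get(term, {}))

    def doc_freq(self, term):
        """Statistic: how many documents contain the term."""
        return len(self.postings.get(term, {}))

    def term_freq(self, term, doc_id):
        """Statistic: how often the term occurs in one document."""
        return self.postings.get(term, {}).get(doc_id, 0)
```

A lookup touches only the postings of the queried term, and the stored frequencies are the kind of statistics a scoring function can use later.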
A Lucene index contains a sequence of documents, each of which is a sequence of fields. Each field is a named sequence of terms, and each term is a string from the original text that was indexed. The same string appearing in two different fields is treated as two different terms, because a term is identified by its field name together with its text. The index may also have partitions called segments. Each segment is a fully independent index which could be searched separately. New segments may be created, and existing segments may be merged. This organization is reusable in all contexts that use inverted indexes. Any external format for exporting these indexes may also use a similar organization.
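This organization can be sketched in a few lines of Python. The code below is a simplification made for illustration: terms are modeled as (field, text) pairs and segments as plain dictionaries, and document ids are assumed to be unique across segments, which glosses over how Lucene actually numbers documents. The function names are hypothetical.

```python
from collections import defaultdict

Term = tuple[str, str]  # (field name, term text)

def build_segment(docs: dict[int, dict[str, list[str]]]) -> dict[Term, set[int]]:
    """Build one independent segment from {doc_id: {field: tokens}}."""
    segment = defaultdict(set)
    for doc_id, fields in docs.items():
        for field, tokens in fields.items():
            for token in tokens:
                segment[(field, token)].add(doc_id)
    return dict(segment)

def search(segments: list[dict[Term, set[int]]], term: Term) -> list[int]:
    """Search every segment separately and combine the hits."""
    hits = set()
    for segment in segments:
        hits |= segment.get(term, set())
    return sorted(hits)

def merge(segments: list[dict[Term, set[int]]]) -> dict[Term, set[int]]:
    """Merge several segments into one by taking the union of their postings."""
    merged = defaultdict(set)
    for segment in segments:
        for term, doc_ids in segment.items():
            merged[term] |= doc_ids
    return dict(merged)
```

Searching the list of segments and searching the merged segment give the same hits, which is why an export format can keep the same layout.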
Inverted indexes over object storage may not be as performant as query execution over relational tables in a SQL database, but they fill the gap by enabling search over the storage. Spotlight on macOS and Google search over internet documents also rely on tokens and index lookups, with ranking such as PageRank applied on top. Moreover, by recognizing the generic organization of inverted indexes, we can apply any grouping, ranking and sorting algorithm we like. Those algorithms are now independent of the organization of the index in the object storage, and each one can use the same index.
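As a sketch of how ranking can stay independent of the index, the Python below runs two interchangeable scoring functions over the same postings. The `POSTINGS` data, the object keys, and the function names are made up for the example, and the TF-IDF variant is a textbook form rather than Lucene's scoring.

```python
import math

# Hypothetical postings: term -> {object key: term frequency}
POSTINGS = {
    "report": {"a.txt": 5, "b.txt": 1, "c.txt": 1},
    "lucene": {"b.txt": 2},
}
TOTAL_DOCS = 3

def score_by_count(term, doc_id):
    """Score a hit by the raw number of occurrences."""
    return POSTINGS[term][doc_id]

def score_tf_idf(term, doc_id):
    """Score a hit by term frequency, discounted for terms found in many documents."""
    tf = POSTINGS[term][doc_id]
    idf = math.log(TOTAL_DOCS / len(POSTINGS[term]))
    return tf * idf

def rank(query_terms, scorer):
    """Sum per-term scores per document; order by score, then by key."""
    totals = {}
    for term in query_terms:
        for doc_id in POSTINGS.get(term, {}):
            totals[doc_id] = totals.get(doc_id, 0.0) + scorer(term, doc_id)
    return sorted(totals, key=lambda d: (-totals[d], d))
```

With these postings, `rank(["report", "lucene"], score_by_count)` puts `a.txt` first, while `score_tf_idf` puts `b.txt` first because "report" occurs in every document and so carries no weight. The index is the same in both cases; only the scorer changed.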