Wednesday, October 17, 2018

We were discussing full-text search over object storage. As we enumerate the advantages of separating the object index from the object data, we realize that the metadata is whatever we choose to keep from the Lucene-generated fields. By enhancing the fields of the documents added to the index, we improve not only the queries but also the collection over which we can perform correlations. If we start out with only a few fields, the model for statistical analysis has only a few parameters. The more fields we add, the more variables we have to correlate. This holds not just for queries on the data and analysis over the historical accumulation of data, but also for data mining and machine learning methods.
We will elaborate on each of these. Consider the time- and space-dimension queries that dashboards, charts and graphs generally require. These queries search over accumulated data that can be quite large, often exceeding terabytes. As the data grows, the metadata becomes all the more important, and its organization can be tailored to the queries instead of relying on the organization of the data. If there is a need to separate online ad hoc queries on current metadata from more analytical, intensive background queries, then we can choose a different organization of the information for each category so that each serves its queries better.
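To make the idea of tailoring the metadata to the queries a little more concrete, here is a minimal sketch in Python. The record layout, the field names (key, region, size, created) and the two helper functions are all assumptions made up for illustration; they are not Lucene or object storage APIs. The same metadata is organized once by hour for a time-series chart and once by region for a capacity dashboard.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical metadata records, kept separately from the object data.
METADATA = [
    {"key": "logs/a.txt", "region": "us-east", "size": 120,
     "created": "2018-10-17T09:15:00+00:00"},
    {"key": "logs/b.txt", "region": "us-west", "size": 300,
     "created": "2018-10-17T09:40:00+00:00"},
    {"key": "img/c.png", "region": "us-east", "size": 80,
     "created": "2018-10-17T10:05:00+00:00"},
]


def hour_bucket(iso_timestamp):
    """Truncate an ISO 8601 timestamp to its UTC hour."""
    ts = datetime.fromisoformat(iso_timestamp)
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:00Z")


def build_time_index(records):
    """Organization for time-dimension queries: hour bucket -> object keys."""
    index = defaultdict(list)
    for record in records:
        index[hour_bucket(record["created"])].append(record["key"])
    return dict(index)


def build_region_rollup(records):
    """Organization for space-dimension queries: region -> total bytes."""
    totals = defaultdict(int)
    for record in records:
        totals[record["region"]] += record["size"]
    return dict(totals)


if __name__ == "__main__":
    print(build_time_index(METADATA))
    print(build_region_rollup(METADATA))
```

Neither organization touches the objects themselves, so the background rollup could be rebuilt on its own schedule without disturbing the online lookups.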
Let us also look at data mining techniques. These include clustering techniques, and they rely largely on adding tags to existing information. Even though we are searching Lucene index documents that may already have fields, nothing prevents these techniques from classifying the documents and coming up with new labels that can be persisted as fields. Therefore, unlike the read-only nature of the queries mentioned earlier, these are techniques where one stage of processing may benefit from what another stage has read and written. Data mining algorithms are computationally heavy compared to regular queries for grouping, sorting and ranking. Even computing the similarity between an index document and a cluster might not be cheap. That is why it helps to let the results of one processing stage benefit another, especially if the documents have not changed.
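The reuse of one stage's results by another can be sketched as follows. This is only an illustration under stated assumptions: the documents are plain Python dictionaries rather than Lucene documents, the field names (body, version, cluster, cluster_version) are hypothetical, and the labeling step is a simple nearest-centroid assignment by cosine similarity rather than a full clustering algorithm. The point is that the label is persisted as a field and recomputed only when the document version changes.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words term frequencies for a piece of text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def label_documents(docs, centroids):
    """Persist the nearest centroid's name as a 'cluster' field on each doc.

    Documents whose label was already computed for their current version
    are skipped, so an earlier stage's work is reused. Returns the number
    of documents that actually had to be (re)labeled.
    """
    computed = 0
    for doc in docs:
        if doc.get("cluster_version") == doc["version"]:
            continue  # unchanged since the last run: reuse the stored label
        vec = term_vector(doc["body"])
        doc["cluster"] = max(
            centroids, key=lambda name: cosine(vec, centroids[name]))
        doc["cluster_version"] = doc["version"]
        computed += 1
    return computed


if __name__ == "__main__":
    centroids = {
        "storage": term_vector("object storage bucket replication"),
        "search": term_vector("index query search lucene"),
    }
    docs = [
        {"id": 1, "version": 1, "body": "lucene index search over fields"},
        {"id": 2, "version": 1, "body": "bucket replication across zones"},
    ]
    print(label_documents(docs, centroids))  # first pass labels everything
    print(label_documents(docs, centroids))  # second pass has nothing to do
    print([(d["id"], d["cluster"]) for d in docs])
```

A later stage that groups or ranks by the cluster field then reads the stored label instead of paying for the similarity computation again.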
Now let us take a look at machine learning, which is by far the most computationally involved of all of the above. Here again, we benefit from having more and more data to process. Since machine learning methods are implemented in packages from different sources, there is more emphasis on long-running tasks written in environments different from the data source. Hence making all of the data available in different compute environments becomes more important. If it helps performance in these cases, the object storage can keep replicas of the data in different zones.
