Sunday, May 6, 2018

Full text search and text summarization from relational stores
Many databases in the 1990s began offering full text search, a retrieval technique that matches query terms against all of the words in every document stored in a dedicated full text index. Full text search is a single, fixed strategy: it is not something we can tune or swap out.
Text analysis models, by contrast, are a manifestation of strategy: we can train and test different models on the same text. With the migration of existing databases to the cloud, we can now import the data we want analyzed into, say, Azure Machine Learning Studio, and then try one or more strategies to evaluate what works best for us.
This solves two problems: first, we no longer rely on a one-size-fits-all strategy, and second, we can continuously improve the model we train on our data by tuning its parameters.
The difference in techniques between full text search and NLP lies largely in the data structures used. While the former uses inverted document lists and indexes, the latter makes use of word vectors. Word vectors can be used to build similarity graphs and feed inference techniques, and they also work with a variety of data mining techniques. With the availability of managed database services in the public clouds, we are now empowered to create NLP databases in the cloud. A sketch of the two data structures follows.
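To make the contrast concrete, here is a minimal Python sketch (not from the post) of the two data structures: a toy inverted index for exact term lookup, and cosine similarity over word vectors. The vectors below are hand-made for illustration; in practice they would come from a trained embedding model.

```python
from collections import defaultdict
import math

# --- Full text search: a toy inverted index mapping word -> document ids ---
docs = {1: "the quick brown fox", 2: "the lazy dog", 3: "quick dogs and foxes"}
inverted_index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.split():
        inverted_index[word].add(doc_id)

print(inverted_index["quick"])  # documents 1 and 3 contain the literal term

# --- NLP: cosine similarity between word vectors (hand-made toy vectors) ---
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

vectors = {"fox": [0.9, 0.1, 0.3], "dog": [0.8, 0.2, 0.4], "lazy": [0.1, 0.9, 0.2]}
print(cosine(vectors["fox"], vectors["dog"]))   # high: semantically related
print(cosine(vectors["fox"], vectors["lazy"]))  # lower: less related
```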
If we are going to operationalize any of these algorithms, we can benefit from using the strategy design pattern in implementing our service. We spend a lot of time in data science coming up with the right model and parameters, but it is always good to keep full text search as a fallback option, especially if the applications of our text analysis keep changing. A sketch of such a service appears below.
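A minimal sketch of that strategy pattern, assuming an inverted index like the one above and a hypothetical model object exposing a most_similar method (both names are illustrative, not from the post):

```python
from abc import ABC, abstractmethod

class TextSearchStrategy(ABC):
    """Common interface so the service can swap search strategies at runtime."""
    @abstractmethod
    def search(self, query: str) -> list:
        ...

class FullTextStrategy(TextSearchStrategy):
    """Fallback strategy: exact term lookup over an inverted index."""
    def __init__(self, inverted_index):
        self.index = inverted_index

    def search(self, query):
        # Intersect the posting sets of every query term.
        postings = [self.index.get(word, set()) for word in query.split()]
        return sorted(set.intersection(*postings)) if postings else []

class WordVectorStrategy(TextSearchStrategy):
    """Model-backed strategy; 'model' is any object with a most_similar method."""
    def __init__(self, model):
        self.model = model

    def search(self, query):
        return self.model.most_similar(query)

class TextAnalysisService:
    """Uses the tuned model when it works, falls back to full text search."""
    def __init__(self, strategy, fallback):
        self.strategy = strategy
        self.fallback = fallback

    def search(self, query):
        try:
            return self.strategy.search(query)
        except Exception:
            return self.fallback.search(query)

# Usage: service = TextAnalysisService(WordVectorStrategy(model),
#                                      FullTextStrategy(inverted_index))
```

Keeping both strategies behind one interface means the application can retrain, retune, or replace the model without touching the callers.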
The MicrosoftML package is used to push algorithms such as multiclass logistic regression into the database server. This package is available out of the box from R as well as Microsoft SQL Server, and one of its applications is text classification. In this application the Newsgroup20 corpus is used to train the model. Newsgroup20 stores the subject and the body text separately, so when the word vectors are computed they are sourced from the subject and the text separately. The trained model is saved to SQL Server, where it can then be scored against test data; a rough, runnable approximation of this pipeline follows below.

This kind of analysis works well with on-premise data. If the data lives in the cloud, such as in a Cosmos DB collection, it must be imported into Azure Machine Learning Studio for use in an experiment. All we need is the Database ID for the name of the database, the DocumentDB key for the access key, and the Collection ID for the name of the collection; a SQL query and its parameters can then be used to filter the data from the database.
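The pipeline above is built on the MicrosoftML package inside SQL Server. As a stand-in that anyone can run, here is the same idea sketched with scikit-learn in Python (a substitution, not the post's actual stack): multiclass logistic regression over Newsgroup20, with the subject and body featurized separately and their features concatenated.

```python
from scipy.sparse import hstack
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def split_subject_body(raw_messages):
    """Separate the Subject: header from the message body, mirroring the
    post's point that subject and text are featurized separately."""
    subjects, bodies = [], []
    for msg in raw_messages:
        header, _, body = msg.partition("\n\n")
        subject = next((line[len("Subject:"):] for line in header.splitlines()
                        if line.startswith("Subject:")), "")
        subjects.append(subject)
        bodies.append(body)
    return subjects, bodies

train = fetch_20newsgroups(subset="train")
test = fetch_20newsgroups(subset="test")

sub_vec, body_vec = TfidfVectorizer(), TfidfVectorizer()
s_train, b_train = split_subject_body(train.data)
s_test, b_test = split_subject_body(test.data)

# Two separate featurizers, one for subjects and one for bodies,
# concatenated into a single sparse feature matrix.
X_train = hstack([sub_vec.fit_transform(s_train), body_vec.fit_transform(b_train)])
X_test = hstack([sub_vec.transform(s_test), body_vec.transform(b_test)])

clf = LogisticRegression(max_iter=1000)  # multinomial over the 20 classes
clf.fit(X_train, train.target)
print("test accuracy:", clf.score(X_test, test.target))
```

In the post's actual setup the equivalent model object would be serialized to a SQL Server table instead of held in memory, and scored there against the test data.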
# codingexercise: added a file uploader to http://shrink-text.westus2.cloudapp.azure.com:8668/add
