Sunday, June 30, 2019

Today we discuss a data structure for Thesaurus. We have referred to a thesaurus as a source of information for natural language processing. There will be heavy use of the thesaurus when words from the incoming text are looked up for possible similarity. We will need a data structure that makes the lookups faster. A B-plus tree data structure to hold the words and their similarities will be helpful because they will make the lookups logarithmically faster while allowing the data to scale.  
A thesaurus can be viewed as a table of words with references to other words. We form hierarchical representations with group id where the id points back to a word. A hierarchical representation of words is easy to explore with a recursive common table expression. By relating words to their groups as the next level, we can define the recursion using a level increment for each group and a terminal condition for the recursion. Although this table of words is standalone in this case, we can view it as a relational table and use a recursive CTE to query the words in the table.  
The number of synonyms can vary for each word. A list of synonyms for each word can be easier to maintain but an arbitrary big table with a large number of columns will also suffice because a word will usually have a finite number of synonyms and their order does not matter.  Synonym1, Synonym2, Synonym3, … can be the names of the columns in this case.  Each synonym is a word. Each word is represented by the identifier associated with it in the table. A word and its list of synonyms can be represented with a column each where the second column stores varying number of word identifiers depending on the number of synonyms the word has. This might be efficient for storage purpose but it is better to store the words in a variety of words. Since the rows in the table of synonyms will be listed in alphabetical order of the words, a clustered index could prove helpful to make lookups faster. 
A table for the synonyms facilitates SQL queries over the entries. This facilitates application development over the thesaurus because now the built-in operators and features of the well-known language come useful. Also, it helps with the object-to-relational mapping of the entities and thus makes it easier for the development of services using the language native features. The separation of querying from the storage also helps keep create-update-deletes to be independent from read-only operations. 
The storage of synonyms does not need to be kept in a table. Alternative forms of persistence such as graphs with words as nodes and weighted edges as synonyms can also be used. However, the query language for graphs may not be the same as that for tables. Other than that, both forms of storage can be used to store the thesaurus. 

No comments:

Post a Comment