I came across a paper on Automatic Discovery of Similar Words by Senellart and Blondel. It describes algorithms for extracting similar words from a large corpus of documents. One of the approaches it refers to, taken from earlier work, is based on a partial syntactic analysis, which I found interesting. The method is called SEXTANT (Semantic Extraction from Text via Analyzed Network of Terms) and uses the steps below:
Lexical analysis:
Words in the corpus are separated using a simple lexical analysis, and proper names are recognized. Each word is then looked up in a lexicon and assigned a part of speech. If a word has several possible parts of speech, a disambiguator is used to choose the most probable one.
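Here is a minimal sketch of this step in Python. The toy LEXICON, the tag names, and the treatment of unknown capitalized words as proper names are my own assumptions for illustration; the paper relies on a full lexicon and a statistical disambiguator.

import re

# Hypothetical lexicon: each word maps to its possible parts of speech,
# ordered from most to least probable (an assumption for this sketch).
LEXICON = {
    "the": ["DET"],
    "table": ["NOUN", "VERB"],
    "shook": ["VERB"],
    "civil": ["ADJ"],
    "unrest": ["NOUN"],
}

def tokenize(text):
    # Separate words with a simple lexical analysis (letters only here).
    return re.findall(r"[A-Za-z]+", text)

def tag(tokens):
    # Look each word up in the lexicon and pick the most probable part of speech.
    tagged = []
    for tok in tokens:
        word = tok.lower()
        if word not in LEXICON:
            # Unknown capitalized words are treated as proper names (an assumption).
            tagged.append((tok, "PROPN" if tok[0].isupper() else "UNK"))
            continue
        # Stand-in for the disambiguator: take the first (most probable) tag.
        tagged.append((tok, LEXICON[word][0]))
    return tagged

print(tag(tokenize("The table shook")))
# [('The', 'DET'), ('table', 'NOUN'), ('shook', 'VERB')]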
Noun and verb phrase bracketing:
Noun and verb phrases are then detected in the sentences of the corpus. This is done with starting, ending and continuation rules, for example: a determiner can start a noun phrase, a noun can follow a determiner in a noun phrase, and an adjective cannot start, end, or follow any kind of word in a verb phrase.
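A rough sketch of such rule-based bracketing for noun phrases, under an assumed (and much smaller) rule set than the one SEXTANT actually uses:

NP_START = {"DET", "ADJ", "NOUN"}      # tags that may start a noun phrase
NP_CONTINUE = {                        # tag -> tags allowed to follow it inside a noun phrase
    "DET": {"ADJ", "NOUN"},
    "ADJ": {"ADJ", "NOUN"},
    "NOUN": {"NOUN"},
}

def bracket_noun_phrases(tagged):
    # Group (word, tag) pairs into noun-phrase brackets using the rules above.
    phrases, current = [], []
    for word, tag in tagged:
        if current and tag in NP_CONTINUE.get(current[-1][1], set()):
            current.append((word, tag))
        else:
            if current:
                phrases.append(current)
            current = [(word, tag)] if tag in NP_START else []
    if current:
        phrases.append(current)
    return phrases

tagged = [("the", "DET"), ("civil", "ADJ"), ("unrest", "NOUN"),
          ("shook", "VERB"), ("the", "DET"), ("table", "NOUN")]
print(bracket_noun_phrases(tagged))
# [[('the', 'DET'), ('civil', 'ADJ'), ('unrest', 'NOUN')], [('the', 'DET'), ('table', 'NOUN')]]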
Using multiple passes over the text, several syntactic relations (or contexts) are then extracted from the bracketed phrases. These are denoted as follows:
ADJ: an adjective modifies a noun (e.g. civil unrest)
NN: a noun modifies a noun (e.g. animal rights)
NNPREP: a noun that is the object of a preposition modifies a preceding noun (e.g. measurements along the chest)
SUBJ: a noun is the subject of a verb (e.g. the table shook)
DOBJ: a noun is the direct object of a verb (e.g. shook the table)
IOBJ: a noun in a prepositional phrase modifies a verb (e.g. the book was placed on the table)
This kind of parsing was computationally intensive, and the multiple passes over a large corpus added to the cost, so only the few relations above were used.
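As a toy illustration of this extraction pass, the sketch below walks one bracketed noun phrase and emits (noun, relation, modifier) triples for a few of the context types above. The patterns, the helper function, and the example verb are my own assumptions, not the paper's exact rules.

def extract_relations(noun_phrase, verb=None, is_subject=True):
    # Emit (head noun, relation, modifier) triples from one bracketed noun phrase.
    relations = []
    head = noun_phrase[-1][0]                    # assume the last noun is the head
    for word, tag in noun_phrase[:-1]:
        if tag == "ADJ":
            relations.append((head, "ADJ", word))    # e.g. civil unrest
        elif tag == "NOUN":
            relations.append((head, "NN", word))     # e.g. animal rights
    if verb:
        rel = "SUBJ" if is_subject else "DOBJ"
        relations.append((head, rel, verb))          # e.g. the table shook
    return relations

np = [("the", "DET"), ("civil", "ADJ"), ("unrest", "NOUN")]
print(extract_relations(np, verb="spread", is_subject=True))
# [('unrest', 'ADJ', 'civil'), ('unrest', 'SUBJ', 'spread')]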
The paper focuses only on the similarity between nouns; other parts of speech are not discussed. After the parsing step, each noun has a number of attributes: all the words that modify it, together with the kind of syntactic relation involved. This leads to a high-dimensional representation; for example, a noun that appears 83 times can have 67 unique attributes.
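Assuming triples like the ones produced above, a noun's attributes could be collected as follows; the relation:word naming of attributes is my own convention for this sketch.

from collections import Counter

triples = [
    ("table", "SUBJ", "shook"),
    ("table", "DOBJ", "shook"),
    ("table", "ADJ", "wooden"),
    ("table", "SUBJ", "shook"),
]

attributes = {}
for noun, relation, word in triples:
    # Each attribute is a modifying word together with its syntactic relation.
    attributes.setdefault(noun, Counter())[relation + ":" + word] += 1

print(attributes["table"])
# Counter({'SUBJ:shook': 2, 'DOBJ:shook': 1, 'ADJ:wooden': 1})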
Each attribute is assigned a weight based on the probability of that attribute among all the attributes observed for the noun.
These attributes are then used with a similarity measure, such as a weighted Jaccard similarity, to find similar nouns.
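A minimal sketch of the weighting and the weighted Jaccard similarity, assuming the weights are simply the relative frequencies of a noun's attributes as described above; the paper's exact weighting scheme may differ, and the example counts are made up.

from collections import Counter

def weights(attr_counts):
    # Weight of an attribute = its probability among all of the noun's attributes.
    total = sum(attr_counts.values())
    return {a: c / total for a, c in attr_counts.items()}

def weighted_jaccard(w1, w2):
    # Weighted Jaccard: sum of minimum weights over sum of maximum weights.
    attrs = set(w1) | set(w2)
    num = sum(min(w1.get(a, 0.0), w2.get(a, 0.0)) for a in attrs)
    den = sum(max(w1.get(a, 0.0), w2.get(a, 0.0)) for a in attrs)
    return num / den if den else 0.0

table = weights(Counter({"SUBJ:shook": 2, "ADJ:wooden": 1, "NNPREP:leg": 1}))
desk = weights(Counter({"ADJ:wooden": 2, "NNPREP:leg": 1, "DOBJ:bought": 1}))
print(round(weighted_jaccard(table, desk), 3))   # 0.333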
An interesting observation about such similarity measures is that the corpus has a great impact on which sense of a word gets captured, and therefore on which similar words are selected.