Cluster computing: Book Review : Key Words and Corpus Analysis, In Language Education.

Wednesday, March 27, 2013

Title: Textual Patterns: Key Words and Corpus Analysis in Language Education.

Authors: Mike Scott & Christopher Tribble (2006)

Summary of Chapters 4 and 5, as relevant to keyword extraction.

In Chapter 4, the authors introduce a method for identifying keywords in a text. They propose that a keyword list yields two main kinds of output: words that indicate aboutness and words that indicate style.

The authors describe keyness as a quality that words may have in a given text or set of texts; it is a textual property. They cite the example of a sentence that can be broken into a dozen pieces of trivia, each of which has some keyness. Which of these we select can depend on a threshold, which is the third factor in this analysis. Once we have a list of words and their frequencies, we can compare them with those from a reference corpus to eliminate common, frequently occurring words so that only the outstanding ones remain. The contrast between the reference corpus and the target text is handled statistically: tests of probability compare an observed frequency with the frequency that would be expected, and express the difference as a number.
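The comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' procedure: the summary does not say which statistical test the book uses, so the log-likelihood (G2) statistic here is an assumption, as are the function names and the default threshold of 3.84 (the 5% critical value of the chi-square distribution with one degree of freedom).

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """G2 for a word seen a times in a target text of c tokens
    and b times in a reference corpus of d tokens."""
    e1 = c * (a + b) / (c + d)  # expected count in the target text
    e2 = d * (a + b) / (c + d)  # expected count in the reference corpus
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2.0 * g2


def keywords(target_tokens, reference_tokens, threshold=3.84):
    """Return (word, score) pairs that are unusually frequent in the target."""
    target = Counter(target_tokens)
    reference = Counter(reference_tokens)
    c, d = len(target_tokens), len(reference_tokens)
    result = []
    for word, a in target.items():
        b = reference[word]  # Counter returns 0 for unseen words
        score = log_likelihood(a, b, c, d)
        # Keep only words that pass the threshold and are relatively
        # more frequent in the target than in the reference.
        if score >= threshold and a / c > b / d:
            result.append((word, score))
    return sorted(result, key=lambda pair: pair[1], reverse=True)
```

Calling `keywords(text.split(), reference.split())` on a short text about bananas and a general reference sample would rank "banana" high and drop words such as "the", because their relative frequencies are similar in both.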

But statistically significant contrast words need not reflect importance and aboutness; they may instead represent “style”. Aboutness can be gathered from part-of-speech analysis. Usually there is a target that references a “genid0” that ties it to the various parts of speech.

This often depends on context, which can be temporal and spatial. Context also has many levels: collocations, the same sentence, the same paragraph, the same section or chapter, the same story or whole text, a set of texts, or the context of culture.

There are therefore several issues with detecting style and aboutness. The first is whether to select a text section, a text, a corpus or a sub-corpus. The second is the statistical issue of what can be claimed. The third is how to choose a reference corpus. The fourth is how to handle related forms such as antonyms. The fifth is the status of the keywords one may identify and what is to be done with them.

The choice of reference corpus can be tackled with a moderately sized, mixed-bag corpus. The keyword procedure described earlier is fairly robust to this choice, because even an unrelated reference corpus can detect words that indicate aboutness. And aboutness may not be one thing but a combination of several different ones.

Related forms can be handled with a tool such as WordSmith by using stemming and ignoring semantic relations.
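As a rough illustration of folding related forms together before counting, the sketch below strips a few common English suffixes. It is not how WordSmith works; the suffix list and the names `crude_stem` and `merged_counts` are hypothetical, and a real stemmer would be far more careful. In line with the point above, it handles only surface forms and knows nothing about semantic relations such as antonymy.

```python
from collections import Counter

# Illustrative suffix list only; this is not a real stemming algorithm.
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def merged_counts(tokens):
    """Count tokens after lowercasing and crude stemming."""
    return Counter(crude_stem(token.lower()) for token in tokens)
```

With this, "weapon", "weapons" and "Weapons" are counted as one entry, so their combined frequency is what gets compared against the reference corpus.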

The status of a keyword can be determined with the help of context or purpose, since it is not evident from the text itself. For example, "banana" and "weapons of mass destruction" may not have status without context. But status can be a pointer to a specific textual aboutness and/or style, or even a pattern. Alternatively, the keywords may have been arrived at statistically but not yet established. In this way a word's candidacy as a keyword can be determined.

Keyword clusters, or key phrases, are also important for describing the aboutness and style of a given text or set of texts. When such groups of co-occurring words are selected, how many are positively or negatively key, and are there any patterns in these two types? These are further lines of research.
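One way to explore positively and negatively key clusters is to apply the same frequency comparison to word n-grams instead of single words. The sketch below assumes clusters are contiguous n-grams and reuses a signed log-likelihood score; both choices, along with the function names and the 3.84 threshold, are assumptions for illustration and are not taken from the book.

```python
import math
from collections import Counter


def ngrams(tokens, n=2):
    """Return contiguous n-word clusters as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def signed_keyness(a, b, c, d):
    """G2, negated when the cluster is relatively rarer in the target."""
    e1 = c * (a + b) / (c + d)  # expected count in the target
    e2 = d * (a + b) / (c + d)  # expected count in the reference
    g2 = 2.0 * sum(o * math.log(o / e) for o, e in ((a, e1), (b, e2)) if o > 0)
    return g2 if a / c >= b / d else -g2


def key_clusters(target_tokens, reference_tokens, n=2, threshold=3.84):
    """Split clusters into positively and negatively key lists."""
    target = Counter(ngrams(target_tokens, n))
    reference = Counter(ngrams(reference_tokens, n))
    c, d = sum(target.values()), sum(reference.values())
    if c == 0 or d == 0:  # a text is too short to form any n-gram
        return [], []
    positive, negative = [], []
    for cluster in set(target) | set(reference):
        score = signed_keyness(target[cluster], reference[cluster], c, d)
        if score >= threshold:
            positive.append((cluster, score))
        elif score <= -threshold:
            negative.append((cluster, score))
    positive.sort(key=lambda pair: pair[1], reverse=True)
    negative.sort(key=lambda pair: pair[1])
    return positive, negative
```

Positively key clusters are those over-represented in the target text; negatively key clusters are common in the reference corpus but unusually rare in the target.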

The authors thus conclude that keyness is a pointer to importance, which can be sub-textual, textual or intertextual.
