Title: Textual Patterns: Key Words and Corpus Analysis in Language Education
Authors: Mike Scott & Christopher Tribble (2006)
Summary of Chapters 4 and 5, as relevant to keyword extraction.
In chapter 4 of the book, the authors introduce a method for
identifying keywords in a text. They propose that a keyword list yields two
main kinds of output: aboutness and style.
The authors say keyness is a quality that words may have in a given text
or set of texts; it is a textual property. They
cite an example of a sentence that can be broken into a dozen
trivia, each of which has some keyness. Which of these we select can be based
on a threshold, which is the third factor in this analysis. Once we have a list
of words and their frequencies, we can compare them with those from a known
reference corpus to eliminate common, frequently occurring words so that only
the outstanding ones remain. The contrast between the reference corpus and the
target text is handled statistically: tests of probability compare an observed
frequency with what would be expected and express the difference as a number.
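The comparison described above can be sketched in a few lines of Python. This is only a minimal illustration, assuming the log-likelihood statistic as the test of probability (one common choice for keyness; the summary itself does not say which test is used). The function names, the `min_freq` cut-off and the default threshold of 3.84 (the chi-square critical value for p < 0.05 with one degree of freedom) are assumptions made for this example.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Log-likelihood (G2) for a word seen a times in a target text of
    c tokens and b times in a reference corpus of d tokens."""
    expected_target = c * (a + b) / (c + d)
    expected_reference = d * (a + b) / (c + d)
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / expected_target)
    if b > 0:
        g2 += b * math.log(b / expected_reference)
    return 2 * g2


def keywords(target_tokens, reference_tokens, threshold=3.84, min_freq=2):
    """Return (word, score) pairs that are unusually frequent in the target.

    threshold and min_freq are illustrative defaults, not values from the book.
    """
    target = Counter(target_tokens)
    reference = Counter(reference_tokens)
    c, d = sum(target.values()), sum(reference.values())
    results = []
    for word, a in target.items():
        if a < min_freq:
            continue
        b = reference.get(word, 0)
        score = log_likelihood(a, b, c, d)
        # Keep only words that are relatively MORE frequent in the target.
        if score >= threshold and a / c > b / d:
            results.append((word, score))
    return sorted(results, key=lambda pair: -pair[1])


target = "the banana is ripe the banana is yellow banana".split()
reference = "the cat sat on the mat the dog is here and the sky is blue".split()
print(keywords(target, reference))
```

With these toy inputs, frequent function words such as "the" and "is" score close to zero and drop out, which is the elimination of usual words that the paragraph describes.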
But words that contrast in a statistically significant way need not
reflect importance and aboutness. They
can instead be considered to represent “style”. Aboutness can be gathered from
part-of-speech analysis. Usually there is a target that references a “genid0”
that ties it to the various parts of speech.
This often depends on context, which can be temporal and spatial.
Context also has many levels: collocations, the same sentence, the same
paragraph, the same section or chapter, the same story or whole text, a set of
texts, or the context of culture.
Therefore there are several issues in detecting style and
aboutness. The first is whether to analyse a text section, a text, a
sub-corpus or a corpus. The second is the statistical issue of what can be
claimed. The third is how to choose a
reference corpus. The fourth is handling related forms such as antonyms. The
fifth is the status of the keywords one identifies and what is to be done
with them.
The choice of reference corpus can be tackled with a moderately
sized, mixed-bag corpus. The keyword procedure using such a reference corpus is,
as mentioned earlier, fairly robust: even with an unrelated reference corpus
it can detect words that indicate aboutness.
And aboutness may not be one thing but a combination of several different
ones.
Related forms can be handled with a tool such as
WordSmith by using stemming and ignoring semantic relations.
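As a rough illustration of grouping related forms before counting, the sketch below merges inflected forms under one headword using a small lookup table. The table `LEMMA_MAP` and the function `merged_frequencies` are hypothetical names invented for this example; this is not WordSmith's own mechanism, and semantic relations such as antonymy are deliberately left untouched.

```python
from collections import Counter

# Hypothetical, hand-made table mapping inflected forms to a headword.
LEMMA_MAP = {
    "bananas": "banana",
    "weapons": "weapon",
    "went": "go",
    "goes": "go",
}


def merged_frequencies(tokens, lemma_map=LEMMA_MAP):
    """Count tokens after lower-casing and merging forms found in lemma_map."""
    return Counter(lemma_map.get(t.lower(), t.lower()) for t in tokens)


print(merged_frequencies(["Bananas", "banana", "went", "goes", "go"]))
```

The merged counts can then be fed into whatever keyness comparison is used, so that "banana" and "bananas" compete as a single candidate keyword.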
The status of a keyword can be determined with the help of
context or purpose, since it is not evident from the text itself. As an example,
“banana” and “weapons of mass destruction” may not have status without
context. But status can be a pointer to
a specific textual aboutness and/or style, or even a pattern. Or the keywords
may have been arrived at statistically but not yet established. In this way
the candidacy of a keyword can be determined.
Keyword clusters, or key phrases, are also important for
describing the aboutness and style of a given text or set of texts. When such
groups of co-occurring words are selected, how many are positively or negatively
key, and are there any patterns in these two types? These are
further lines of research.
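To make the idea of positively and negatively key clusters concrete, here is a small sketch that treats clusters as contiguous word n-grams and gives each a signed score: positive when the cluster is relatively more frequent in the target, negative when it is relatively more frequent in the reference. Both the n-gram definition of a cluster and the use of log-likelihood are assumptions made for illustration; the names `ngrams`, `signed_keyness` and `key_clusters` are invented here.

```python
import math
from collections import Counter


def ngrams(tokens, n=2):
    """Contiguous n-word sequences, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def signed_keyness(a, b, c, d):
    """Log-likelihood score, negated when the item is under-used in the target."""
    expected = ((a, c * (a + b) / (c + d)), (b, d * (a + b) / (c + d)))
    g2 = 2 * sum(obs * math.log(obs / exp) for obs, exp in expected if obs > 0)
    return g2 if a / c >= b / d else -g2


def key_clusters(target_tokens, reference_tokens, n=2, threshold=3.84):
    """Split clusters into positively and negatively key lists."""
    target = Counter(ngrams(target_tokens, n))
    reference = Counter(ngrams(reference_tokens, n))
    c, d = sum(target.values()), sum(reference.values())
    positive, negative = [], []
    for cluster in set(target) | set(reference):
        score = signed_keyness(target[cluster], reference[cluster], c, d)
        if score >= threshold:
            positive.append((cluster, score))
        elif score <= -threshold:
            negative.append((cluster, score))
    positive.sort(key=lambda pair: -pair[1])
    negative.sort(key=lambda pair: pair[1])
    return positive, negative


target = ("weapons of mass destruction " * 5).split()
reference = ("the cat sat on the mat " * 5).split()
positive, negative = key_clusters(target, reference)
print(positive)
print(negative)
```

Comparing the two lists side by side is one simple way to start looking for the patterns the paragraph asks about.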
Thus the authors conclude that keyness is a pointer to
importance, which can be sub-textual, textual or intertextual.