There are test-driven, non-academic approaches that deserve a mention. In the section below, I present two such approaches that are not covered elsewhere in the literature. They come from the recognition among NLP testers that prevalent methods rely on either a corpus or a training set, and that neither of these is truly universal. Just as a compression algorithm does not work universally for high-entropy data, a keyword-detection algorithm is unlikely to be universal. These approaches help bring a solution closer to a "good enough" bar.
First approach: any given text can be divided into three kinds of sections: 1) sections with a low number of salient keywords, 2) sections with a high number of salient keywords, and 3) sections with a mixed number of non-salient keywords.
Of these, if an algorithm works well for 1) and 2), then the sections in 3) can be omitted altogether and the acceptance criteria mentioned can still be met. If the algorithm needs to be different for 1) and for 2), then the common subset of keywords between the two results will likely be a better outcome than either result on its own.
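The first approach can be sketched roughly as follows. This is only a minimal illustration under my own assumptions: the salient set is supplied by the caller, the density thresholds (`low`, `high`) are arbitrary, and the frequency-based `by_frequency` extractor is a stand-in for whatever algorithm actually suits each kind of section. All function names here are hypothetical.

```python
import re
from collections import Counter

# Assumption: a tiny stopword list, just enough for the illustration.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on", "with"}

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def classify_section(tokens, salient, low=0.2, high=0.5):
    """Label a section 'low', 'high' or 'mixed' by its share of salient tokens.
    The thresholds are assumptions, not values from any reference."""
    if not tokens:
        return "mixed"
    density = sum(t in salient for t in tokens) / len(tokens)
    if density <= low:
        return "low"
    if density >= high:
        return "high"
    return "mixed"

def by_frequency(sections, k):
    """Stand-in extractor: the k most frequent tokens across the given sections."""
    counts = Counter(t for tokens in sections for t in tokens)
    return {w for w, _ in counts.most_common(k)}

def common_keywords(sections, salient, k=5,
                    extract_low=by_frequency, extract_high=by_frequency):
    """Run one extractor on low-density sections and one on high-density
    sections, drop the mixed sections, and keep the keywords both agree on."""
    buckets = {"low": [], "high": [], "mixed": []}
    for section in sections:
        tokens = tokenize(section)
        buckets[classify_section(tokens, salient)].append(tokens)
    return extract_low(buckets["low"], k) & extract_high(buckets["high"], k)

salient = {"keyword", "clustering"}
sections = [
    "keyword clustering keyword clustering extraction",        # high density
    "weather today seems pleasant keyword extraction anyway",  # low density
    "keyword weather clustering today sunny",                  # mixed, omitted
]
print(sorted(common_keywords(sections, salient, k=10)))
# ['extraction', 'keyword']
```

Passing different callables for `extract_low` and `extract_high` covers the case where the two kinds of sections need different algorithms; the intersection at the end is the "common subset" described above.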
Second approach: Treat the data set as a unit to be run through several different clusterers. Each clusterer can use any approach to vector representation and may, for example:
• involve different metrics such as mutual information, or themes such as syntax, semantics, location, statistics, latent semantics, or word embeddings,
• require multiple passes over the same text,
• apply multiple levels of analysis,
• adopt newer approaches, including the dynamic grouping approach, in which each different selection of keywords is treated as a clusterer in itself and the groups representing salient topics serve as pseudo-keywords, and
• define clusters as having a set of terms as centroids.
The keywords that these clusterers detect in common then make the outcome of this approach a better representation of the sample.
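A rough sketch of the second approach is given below. The two clusterers are deliberately simplistic stand-ins of my own (one groups documents by their most frequent term, the other by location, using the first term), and each describes a cluster by a set of centroid terms. The `consensus_keywords` helper and its `min_votes` parameter are hypothetical names; any real clusterer that returns term sets per cluster could be plugged in instead.

```python
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def cluster_by_top_term(docs, n_terms=2):
    """Stand-in clusterer: group documents by their most frequent term and
    describe each group by its n_terms most frequent terms (its centroid)."""
    groups = defaultdict(Counter)
    for doc in docs:
        counts = Counter(tokenize(doc))
        if counts:
            label = counts.most_common(1)[0][0]
            groups[label].update(counts)
    return [{t for t, _ in c.most_common(n_terms)} for c in groups.values()]

def cluster_by_first_term(docs, n_terms=2):
    """Stand-in clusterer based on location: group documents by first term."""
    groups = defaultdict(Counter)
    for doc in docs:
        tokens = tokenize(doc)
        if tokens:
            groups[tokens[0]].update(tokens)
    return [{t for t, _ in c.most_common(n_terms)} for c in groups.values()]

def consensus_keywords(docs, clusterers, min_votes=None):
    """Keep the centroid terms reported by at least min_votes clusterers
    (by default, by every clusterer)."""
    min_votes = len(clusterers) if min_votes is None else min_votes
    votes = Counter()
    for clusterer in clusterers:
        # Flatten this clusterer's centroids so each term gets one vote.
        votes.update(set().union(*clusterer(docs)))
    return {term for term, count in votes.items() if count >= min_votes}

docs = [
    "cats cats cats purr purr softly",
    "cats cats purr loudly",
    "dogs dogs dogs bark bark",
    "bark dogs dogs dogs now now",
]
clusterers = [cluster_by_top_term, cluster_by_first_term]
print(sorted(consensus_keywords(docs, clusterers)))
# ['bark', 'cats', 'dogs', 'purr']
```

Here "now" is reported only by the location-based clusterer, so it is dropped from the consensus. Lowering `min_votes` relaxes the agreement requirement when many clusterers are in play.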