Thursday, June 13, 2013

Slide review of text analytics user perspectives on solutions and providers by Seth Grimes (continued)
Text analysis involves statistical methods for computing a relative measure of the significance of words, first for individual words and then for sentences. Vector space models are used to represent documents for information retrieval, classification, and other tasks. The text content of a document is treated as an unordered bag of words, and weighting measures such as TF-IDF (term frequency-inverse document frequency) position the documents in the vector space so that distances between them can be compared. Additional analytic techniques, such as clustering, are then used to group the texts and identify the salient topics.
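To make the bag-of-words and TF-IDF idea concrete, here is a minimal sketch in Python. The toy corpus, whitespace tokenization, and the cosine similarity comparison are illustrative assumptions on my part, not taken from the slides; production systems typically use a library vectorizer instead.

```python
import math
from collections import Counter

# Illustrative corpus: each document is treated as an unordered bag of words.
docs = [
    "the brand launched a new product",
    "customers discussed the brand on the forum",
    "the survey measured customer satisfaction",
]

# Tokenize by whitespace and count term frequencies per document.
bags = [Counter(doc.split()) for doc in docs]
vocabulary = sorted({term for bag in bags for term in bag})

def tf_idf(term, bag, bags):
    """TF-IDF weight of a term in one document's bag of words."""
    tf = bag[term] / sum(bag.values())              # term frequency in this document
    df = sum(1 for b in bags if term in b)          # number of documents containing the term
    idf = math.log(len(bags) / df)                  # inverse document frequency
    return tf * idf

# Represent each document as a TF-IDF vector over the shared vocabulary.
vectors = [[tf_idf(t, bag, bags) for t in vocabulary] for bag in bags]

def cosine_similarity(u, v):
    """Closeness of two documents in the vector space."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# The first two documents share the term "brand", so they score higher together.
print(cosine_similarity(vectors[0], vectors[1]))
print(cosine_similarity(vectors[0], vectors[2]))
```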
However, statistical methods have a hard time making sense of nuanced human language. Hence natural language processing is applied, in which one step or a sequence (pipeline) of resolving steps is run over the text (a rough sketch of such a pipeline follows the list). These include:
Tokenization - identification of distinct elements (words, punctuation)
Stemming - identifying variants of word bases
Lemmatization - stemming combined with analysis of context and parts of speech
Entity recognition - lookup against lexicons and gazetteers and use of pattern matching
Tagging - XML markup of the distinct elements
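The sketch below strings several of these steps together using NLTK. It is only an assumed, minimal illustration of a pipeline of this shape; the sample sentence is invented, and the NLTK model names downloaded here may differ across library versions.

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time downloads of the resources this pipeline relies on.
for pkg in ("punkt", "averaged_perceptron_tagger", "wordnet",
            "maxent_ne_chunker", "words"):
    nltk.download(pkg, quiet=True)

text = "Seth Grimes presented the survey results in New York."

# Tokenization: identify distinct elements (words, punctuation).
tokens = nltk.word_tokenize(text)

# Stemming: strip affixes to reach a word base, without using context.
stems = [PorterStemmer().stem(t) for t in tokens]

# Part-of-speech tagging supplies the context that lemmatization needs.
tagged = nltk.pos_tag(tokens)

# Lemmatization: map tokens to dictionary forms (verbs handled explicitly here).
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(t, pos="v") if tag.startswith("VB")
          else lemmatizer.lemmatize(t)
          for t, tag in tagged]

# Entity recognition: chunk the tagged tokens into named entities
# such as people and geographic locations.
entities = nltk.ne_chunk(tagged)

print(stems)
print(lemmas)
print(entities)
```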
Software using the above approaches has found applications in business, scientific, and research problems. The application domains include brand management, competitive intelligence, content management, customer service, e-discovery, financial services, compliance, insurance, law enforcement, life sciences, product/service design, research, and voice of the customer. Text analytics solution providers range from young software vendors to mature vendors and software giants. Early adopters have very high expectations for the return on investment from text analytics. A survey of adopters yielded further findings:
The bulk of text analytics users have been using it for four years or more.
Primary uses include brand management, competitive intelligence, customer experience, voice of the customer, and research; together they represent more than 50% of all applications.
Textual information sources are primarily blogs, news articles, e-mails, forums, surveys and technical literature.
Return on investment is measured by increased sales to existing customers, higher satisfaction ratings, new-customer acquisition, and higher customer retention.
A third of all spenders had budgets below $50,000, and a quarter used open-source software.
Users' likes and dislikes centered primarily on flexibility, effectiveness, accuracy, and ease of use.
More than 70% of the users wanted the ability to extend the analytics to named entities such as people, companies, geographic locations, brands, ticker symbols, etc.
The ability to use specialized dictionaries, taxonomies, or extraction rules was considered more important than other capabilities.
This study was conducted in 2009.
 
