Wednesday, June 12, 2013

Slide review of text analytics user perspectives on solutions and providers, by Seth Grimes:
Text analytics is seeing active market growth. The technology - text mining and related visualization and analytical software - continues to deliver unmatched capabilities, fueling that growth. There are applications in media and publishing, financial services and insurance, travel and hospitality, and consumer products and retail. The slides claim that no single organization or approach dominates the market. The findings reported are:
1) Top business applications of text analytics for respondents are a) brand management, b) competitive intelligence, and c) voice of the customer.
2) These applications take data from online sources such as blogs, articles, and forums.
3) Experienced users of these applications prefer specialized dictionaries, taxonomies, or extraction rules, and they often favor open source.
Text analytics describes software and transformational steps that discover business value in "unstructured" text.
These steps are:
1) compose, publish, manage, and archive
2) index and search
3) categorize and classify according to metadata and contents
4) summarize and extract information


 
A quick recap of design patterns:
Creational patterns:
Abstract Factory: Provide an interface for creating families of related or dependent objects without specifying their concrete classes.
Builder: Separate the construction of a complex object from its representation so that the same construction process can create different representations.
Factory Method: Define an interface for creating an object, but let subclasses decide which class to instantiate.
Prototype: Specify the kinds of objects to create using a prototypical instance, and create new objects by copying this prototype.
Singleton: Ensure a class has only one instance, and provide a global point of access to it.
Structural Patterns:
Adapter: Convert the interface of a class into another interface clients expect.
Bridge: Decouple an abstraction from its implementation so that the two can vary independently.
Composite: Compose objects into tree structures to represent part-whole hierarchies.
Decorator: Attach additional responsibilities to an object dynamically.
Facade: Provide a unified interface to a set of interfaces in a subsystem.
Flyweight: Use sharing to support large numbers of fine-grained objects efficiently.
Proxy: Provide a placeholder for another object to control access to it.
Behavioural Patterns:
Chain of Responsibility: Avoid coupling the sender of a request to its receiver by giving more than one object a chance to handle the request.
Command: Encapsulate a request as an object, so that requests can be queued, logged, and undone.
Interpreter: Given a language, define a representation for its grammar and use it to interpret sentences in the language.
Iterator: Provide a way to access the elements of an aggregate object sequentially without exposing its underlying representation.
Mediator: Define an object that encapsulates how a set of objects interact.
Memento: Without violating encapsulation, capture and externalize an object's state so that the object can be restored to that state later.
Observer: Define a one-to-many dependency between objects so that when one object changes state, all its dependents are notified and updated automatically.
State: Allow an object to alter its behavior when its internal state changes.
Strategy: Define a family of algorithms, encapsulate each one, and make them interchangeable.
Template Method: Define the skeleton of an algorithm in an operation, deferring some steps to subclasses.
Visitor: Represent an operation to be performed on the elements of an object structure, letting new operations be defined without changing the classes of the elements on which they operate.
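As a quick illustration of the last pattern, here is a minimal Visitor sketch in Python; the Num/Add expression classes are made up for the example:

```python
# Minimal Visitor: new operations on an expression tree (evaluation, printing)
# without changing the node classes themselves.
class Num:
    def __init__(self, value): self.value = value
    def accept(self, visitor): return visitor.visit_num(self)

class Add:
    def __init__(self, left, right): self.left, self.right = left, right
    def accept(self, visitor): return visitor.visit_add(self)

class Evaluator:
    def visit_num(self, node): return node.value
    def visit_add(self, node): return node.left.accept(self) + node.right.accept(self)

class Printer:
    def visit_num(self, node): return str(node.value)
    def visit_add(self, node):
        return f"({node.left.accept(self)} + {node.right.accept(self)})"

expr = Add(Num(1), Add(Num(2), Num(3)))
print(expr.accept(Evaluator()))  # 6
print(expr.accept(Printer()))    # (1 + (2 + 3))
```

Adding a third operation (say, counting nodes) means writing one new visitor class; the Num and Add classes stay untouched.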
 

Tuesday, June 11, 2013

In the Resource Governor feature of SQL Server, there is a clear separation between the classification rules and the resource plans. The classification rules are dynamic in nature and can change how connections are assigned to different groups, where groups of connections share the same pool of resources. The server only needs to keep track of the resource plans. These plans are determined by the administrator and are actively looked up by the server when assigning resources to workloads. To the server, individual requests do not matter beyond the group they belong to, but the resources do matter, since the server must account for all resources and divide them between groups. A group is a label for a set of connections and denotes how many resources can be guaranteed to those connections. The rules apply to connections, and connections are transient; in comparison, the resource plans are stable. Second, connections can have different characteristics, so a classification based on connection properties could change with the next connection. This is why a separate classification logic is needed. This logic is a simple user-defined function that assigns each incoming connection to a group, evaluating the connection's properties in program order, in the form of a decision tree. This function, called the classifier, can be maintained independently from the resource plans. By its nature the classifier is code, whereas the resource plans are data. Furthermore, the resource plan data is constantly read when assigning resources and requires server reconfiguration after each change, since it affects resource throttling for incoming connections. The server, however, need not know anything about the connections or persist any connection properties, since these have already been evaluated to a group.
The group is a tag that tells the server: these incoming connections are subject to the policy defined by the resource plan this group is assigned to. Groups can also be hierarchical, whereas the resource plans are discrete and flat. The resource plans must also tally up to the full server capacity; therefore they are owned and enforced by the server. The user connections, on the other hand, are mapped only once to different pools. In addition, the classifier is written like any other user-defined function. Although in practice it is maintained by the administrator and changing it requires server reconfiguration, since the server needs to know that memberships in the groups have changed, the classifier is connection-facing and qualifies as just another user-defined function. The decision to reconfigure the server after every classifier change was an important one. It is not sufficient for a classifier change to affect merely the next incoming connection; the server must know that group memberships are being redefined. This means that connections that previously went to one group might now be classified into another group, whose plan may deny resources to the new connection. The server treats the classifier and the plan definitions as together constituting the resource management policy, so if either of them changes, the server's resource management behavior changes. In a way, reconfiguration is how the administrator tells the server that the policy has changed, and it is intended as a control for the administrator. Lastly, the classifier and the plans differ in that checks are placed on the plans, whereas the classifier logic is arbitrary and has no particular relevance to the server.
The checks on the plan, however, determine whether the next quantum of processing or the next memory allocation changes. The calculations by the scheduler or the memory manager depend on the plan information, and this state is persisted so that the server can automatically pick it up between restarts. Thus the resource policies and plans are treated differently.
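The separation described above - classifier as code, plans as data - can be sketched in Python. The pool names, percentages, and connection properties below are illustrative only, not SQL Server's actual syntax or API:

```python
# Resource plans are data owned by the server; the shares must tally up to
# the full server capacity.
resource_plans = {"admin_pool": 30, "reports_pool": 20, "default_pool": 50}
assert sum(resource_plans.values()) == 100

# Group -> pool assignment is also data, defined by the administrator.
group_to_pool = {"admins": "admin_pool",
                 "reporting": "reports_pool",
                 "default": "default_pool"}

# The classifier is code: a user-defined function evaluated once per incoming
# connection, mapping connection properties to a group (a decision tree in
# program order).
def classify(connection):
    if connection.get("login") == "sa":
        return "admins"
    if connection.get("app", "").startswith("ReportServer"):
        return "reporting"
    return "default"

# The server never persists connection properties; it only sees the group.
conn = {"login": "analyst1", "app": "ReportServer-Web"}
pool = group_to_pool[classify(conn)]
print(pool, resource_plans[pool])  # reports_pool 20
```

Changing `classify` redefines group memberships, which is why (as described above) the real feature requires explicit reconfiguration before a new classifier takes effect.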

classifier implementation

Let us consider an implementation of a machine learning classifier. Say we want the classifier to tell us whether a given name is male or female.
The classifier we design has the following methods:
1) buildClassifier(Instances): builds the classifier from scratch with the given dataset.
2) toString(): returns information about the built classifier, e.g. the decision tree.
3) distributionForInstance(Instance): returns a double array containing a percentage for each class label. This is a prediction method.
4) classifyInstance(Instance): returns the predicted label.
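A Python rendering of this four-method interface might look like the following sketch; the majority-class subclass is only there to make the interface concrete:

```python
from abc import ABC, abstractmethod

class Classifier(ABC):
    """Sketch of the four-method classifier interface described above."""

    @abstractmethod
    def build_classifier(self, instances):
        """Build the classifier from scratch with the given dataset."""

    @abstractmethod
    def distribution_for_instance(self, instance):
        """Return a list with one probability per class label (prediction)."""

    def classify_instance(self, instance):
        # Default: the label is the class with the highest predicted probability.
        dist = self.distribution_for_instance(instance)
        return max(range(len(dist)), key=dist.__getitem__)

# A trivial concrete classifier: always predicts the majority class.
class MajorityClassifier(Classifier):
    def build_classifier(self, instances):
        labels = [label for _, label in instances]
        self.dist = [labels.count(c) / len(labels)
                     for c in range(max(labels) + 1)]

    def distribution_for_instance(self, instance):
        return self.dist

    def __str__(self):
        return f"MajorityClassifier(dist={self.dist})"

clf = MajorityClassifier()
clf.build_classifier([("ann", 1), ("bob", 0), ("eve", 1)])
print(clf.classify_instance("sam"))  # class 1, since 2 of 3 examples are class 1
```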
Let us say we start with training data and test data. In the training data we have tokenized the words, and we have a hash table of words and their labels.
Next we form a decision tree based on rules we model from the data.
We base this on features. Features are like attributes of the names. They could be, for example, the start letter, the end letter, the count of characters, the presence of a particular letter, the number of syllables, a prefix, a suffix, etc.
We start with all the possible hypotheses for gender classification. While we can work with simple and obvious features, we should check each feature to see if it's helpful. This doesn't mean we build the model to include all the features; that is called overfitting, and it doesn't work well because the model will have a higher chance of relying on quirks of the training data and will not generalize to newer data.
The features can be evaluated with error analysis. Having divided the corpus into appropriate data sets, we can separate out a dev-test set from the training set and use it for error analysis. Keeping a data set for error analysis separate from the training set is important for the reason mentioned above.
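A hypothetical feature extractor and corpus split for the name-gender task could look like this; the particular feature set (last letter, first letter, length) is illustrative, not a claim about which features survive error analysis:

```python
import random

# Illustrative features: attributes derived from the name itself.
def gender_features(name):
    name = name.lower()
    return {"last_letter": name[-1],
            "first_letter": name[0],
            "length": len(name)}

# Divide the labeled corpus: a training set to build the model and a dev-test
# set held aside for error analysis (a final test set is omitted here).
labeled = [("Kathryn", "female"), ("John", "male"),
           ("Carolyn", "female"), ("Stephen", "male")]
random.seed(0)
random.shuffle(labeled)
train_set = [(gender_features(n), g) for n, g in labeled[:2]]
devtest_set = [(gender_features(n), g) for n, g in labeled[2:]]
```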
The classify method is different from the accuracy measure, and we need a frequency distribution to calculate the latter.
The implementation of the classifier can involve building a decision tree. The nodes of the decision tree are operations on the features. When one or more features are used to assign a label, the decision tree is an efficient data structure for running each tuple of a large data set through. The decision tree is built by inserting expressions into the tree. This involves cloning the existing tree and adding the expressions.
The expressions are converted and expanded using a visitor pattern. The visitor pattern is useful in traversing the expression tree and adding the new expressions. A wide variety of expressions can be supported, such as binary operations, logical operations, method call expressions, etc. These expressions can be serialized so that they can be imported and exported. The expressions also support different operands, such as strings, numerals, etc. In the case of the name classifier, some common examples are that names ending with the 'yn' suffix are generally female and those ending with 'hn' are usually male. Rules of this kind have an error set which we can examine to tune the model. The implementation of the APIs is based on rules that are evaluated by iterating over the data. When the accuracy of the prediction is to be measured, we look up a frequency distribution that the classifier has populated from the training data.
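A minimal sketch of such a rule-style classifier in Python, learning suffix-to-label frequencies from training data; the names and the two-letter-suffix rule are illustrative:

```python
from collections import Counter

# Suffix-rule classifier backed by a frequency distribution populated at
# training time, as described above.
class NameClassifier:
    def __init__(self):
        self.suffix_freq = {}  # two-letter suffix -> Counter of labels

    def build_classifier(self, labeled_names):
        for name, label in labeled_names:
            suffix = name.lower()[-2:]
            self.suffix_freq.setdefault(suffix, Counter())[label] += 1

    def classify_instance(self, name):
        counts = self.suffix_freq.get(name.lower()[-2:])
        if counts:
            return counts.most_common(1)[0][0]  # majority label for the suffix
        return "unknown"

clf = NameClassifier()
clf.build_classifier([("Kathryn", "female"), ("Carolyn", "female"),
                      ("John", "male")])
print(clf.classify_instance("Evelyn"))  # female ('yn' suffix seen as female)
print(clf.classify_instance("Shaun"))   # unknown (suffix 'un' never seen)
```

Names that end in a known suffix but carry the wrong label form the error set mentioned above; inspecting them suggests which new features or rules to add.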

Monday, June 10, 2013

knowledge discovery

Knowledge extraction requires the reuse of existing formal knowledge (reusing identifiers or ontologies) or the generation of a schema based on the source data. It is similar to NLP and ETL but involves representing the results in a format called RDF, the Resource Description Framework. RDF is a model of information (metadata) such as what is implemented for web resources. The RDF data model makes use of subject-predicate-object expressions. RDF is an abstract model and has several serialization formats, which are machine-readable and machine-interpretable. A collection of RDF statements intrinsically represents a labeled directed multigraph. As such, an RDF-based data model is often persisted in relational stores in triple format or similar. The reverse, mapping relational databases to RDF, is also common because that data can then be made available to the semantic web.
The process of information extraction uses traditional methods from ETL, which transform the data into structured formats. Knowledge extraction can be categorized based on:
1) source, such as whether it is text, DB, XML, CSV, etc.
2) exposition, such as with an ontology file or a semantic database.
3) synchronization, such as whether the extraction runs once or multiple times and how it is synced.
4) reuse of vocabularies, such as whether the tool is able to map table columns to resources.
5) automation, i.e. the degree to which the extraction is assisted or manual.
6) requirement for a domain ontology, such as a pre-existing one.
Mapping from RDB tables/views to RDF entities proceeds by converting each column to a predicate, each column value to an object, each row key to a subject, and each row to a collection of triples with a common subject.
Similarly, the reverse mapping involves creating an RDFS class for each table, converting all primary and foreign keys into IRIs, assigning a predicate IRI to each column, and assigning an rdf:type predicate to each row, linking it to the RDFS class IRI corresponding to the table.
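The column-to-predicate, value-to-object, key-to-subject mapping can be sketched in a few lines of Python; the table and column names below are made up for the example:

```python
# Map one relational row to a collection of RDF-style triples sharing a common
# subject: each column becomes a predicate, each column value an object, and
# the row key the subject.
def row_to_triples(table, key_column, row):
    subject = f"{table}/{row[key_column]}"
    return [(subject, column, value)
            for column, value in row.items() if column != key_column]

row = {"id": 7, "name": "Ada", "dept": "Research"}
triples = row_to_triples("employee", "id", row)
print(triples)  # [('employee/7', 'name', 'Ada'), ('employee/7', 'dept', 'Research')]
```

A real mapping (e.g. W3C's direct mapping) would mint full IRIs and handle foreign keys, but the subject/predicate/object shape is the same.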

machine learning

Unsupervised learning is clustering: the class labels of the training data are unknown, and given a set of measurements, the data is clustered.
Supervised learning is classification: the training data has both measurements and labels indicating the class of observations. New data is classified based on the training data.
Classification constructs a model. A model is a set of rules of the form: if this condition, then this label. The classification model is used to predict categorical class labels, and its accuracy can be estimated.
Prediction is computation, such as a formula over the attributes of the data, and models continuous-valued functions.
Classification tasks include induction from the training set and deduction on the test set. A learning algorithm is used to build the learning model; once the model is trained, it can be used for deduction on the test set.
The speed of a classifier is the time it takes to construct the model plus the time to use it for classification.
Classification models are evaluated based on accuracy, speed, robustness, scalability, interpretability, and other such measures.
Classification algorithms are called supervised learning because a supervisor prepares a set of examples of a target concept, and the task is to find a hypothesis that explains the target concept; performance is measured by how accurately the hypothesis explains the examples.
If we define the problem domain as a set X, then classification is a function c that maps X to a result set D which is to be analyzed.
If we consider a set of <x,d> pairs, with x belonging to X and d belonging to D, then this set, called the experience E, is explained by a hypothesis h, and the set of all such hypotheses is H. The goodness of h is the percentage of examples that it correctly explains.
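The goodness measure just defined can be written down directly; the experience E and the two hypotheses below are made-up toy examples:

```python
# Goodness of a hypothesis h over experience E = fraction of <x, d> pairs
# for which h(x) agrees with d.
def goodness(h, E):
    return sum(1 for x, d in E if h(x) == d) / len(E)

E = [(1, True), (2, False), (3, True), (4, False)]  # odd numbers are positive
h1 = lambda x: x % 2 == 1   # explains every example
h2 = lambda x: x > 2        # explains only half of them
print(goodness(h1, E))  # 1.0
print(goodness(h2, E))  # 0.5
```

A supervised learner is then a search over H for the hypothesis maximizing this score on E.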
Given the examples in E, supervised learning algorithms search the hypothesis space H for the one that best explains the examples in E. The type of hypothesis required influences the search algorithm: the more complex the representation, the more complex the search. Search can go from general to specific or from specific to general. It's easier to work with single dimensions and boolean d, i.e. concepts.
Inductive bias is the set of assumptions that, together with the training data, deductively justify the classifications assigned by the learner to future instances. There can be a number of hypotheses consistent with the training data, and each learning algorithm has an inductive bias that affects its hypothesis selection. Inductive bias can be based on language syntax, heuristic semantics, rank-based preference, or restriction of the search space.
[Courtesy : Prof. Pier Luca Lanzi lecture notes]

Sunday, June 9, 2013

Text mining targets unstructured text, as opposed to the structured text of web mining, the databases of data mining, and the patterns of natural language processing. Data mining and natural language processing both find patterns, and information retrieval answers queries, while text mining finds nuggets. That said, text mining has overlap with all of the above.
The processing stages in text mining are: text storage; text preprocessing; text transformation, or attribute generation; attribute selection; data mining, or pattern discovery; and interpretation, or evaluation. Text doesn't necessarily need to be stored in raw form only; it can also be stored as document clusters. Text is characterized by one or more of the following:
source: human input or automated input, in different languages and formats
context: words and phrases create context
ambiguity: words and sentences are disambiguated based on an ontology
noise: erroneous data, misspelled data, stop words, etc.
format: normal speech, interactive chat, etc.
sparseness: i.e. the percentage of document density in a typical document
Text processing involves cleanup, tokenization, part-of-speech tagging, word sense disambiguation, and semantic structures. Text cleanup involves removing junk characters, binary formats, tables, figures, and formulas. Tokenization is splitting the text up into a set of tokens. Part-of-speech tagging associates words with parts of speech, and can be grammar-based or statistical. Word sense disambiguation determines which of a word's distinct senses is used in a given sentence. Semantic processing involves chunking, which produces syntactic constructs like noun phrases and verb phrases, or full parsing, which yields a parse tree. Chunking is more common.
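The first stages of that pipeline - cleanup, tokenization, and noise removal - can be sketched in a few lines; the stop-word list here is illustrative, not a standard one:

```python
import re

STOP_WORDS = {"the", "a", "an", "is", "of", "and"}  # illustrative subset

def preprocess(text):
    text = re.sub(r"[^A-Za-z\s]", " ", text)           # cleanup: strip junk characters
    tokens = text.lower().split()                      # tokenization
    return [t for t in tokens if t not in STOP_WORDS]  # noise (stop word) removal

print(preprocess("Text mining, a survey of the field!"))
# ['text', 'mining', 'survey', 'field']
```

POS tagging, disambiguation, and chunking would follow on these tokens.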
Text transformation involves text representation and feature selection, which characterize the document. A classifier can be used to automatically generate labels (attributes) from the features fed into it.
Feature selection is based on two approaches: one is to select the features before using them in a classifier, which requires a feature ranking method; the other is to select the features based on how well they work in a classifier. In the latter case the classifier is part of the feature selection method, and the process is often iterative: the classifier needs to be trained, the evaluation is based on actual use, and the features are evaluated repeatedly. In the former case there are many more choices for feature selection, since it is independent of the classifier, and each feature is evaluated only once, lowering computational cost. The attributes generated are the labels of the classes automatically produced by the classifier on the selected features.
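The first, classifier-independent approach can be sketched as a one-pass feature ranking; the scoring rule here (how well a feature's value predicts the label by majority vote) is illustrative only:

```python
# Rank each feature once, independently of any classifier, then keep the top k.
def rank_features(dataset, feature_names):
    # dataset: list of (features_dict, label) pairs.
    scores = {}
    for fname in feature_names:
        by_value = {}
        for feats, label in dataset:
            by_value.setdefault(feats[fname], []).append(label)
        # Score = examples correctly predicted by majority vote per value.
        correct = sum(max(vals.count(l) for l in set(vals))
                      for vals in by_value.values())
        scores[fname] = correct / len(dataset)
    return sorted(feature_names, key=lambda f: -scores[f])

data = [({"last": "n", "len": 7}, "female"),
        ({"last": "n", "len": 4}, "male"),
        ({"last": "a", "len": 5}, "female")]
# Note: a feature with a unique value per row scores perfectly here - exactly
# the overfitting risk such rankings must guard against.
print(rank_features(data, ["last", "len"]))  # ['len', 'last']
```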
Attribute selection is done because high dimensionality causes issues for machine learners; hence irrelevant features are removed.
After the attributes are selected, patterns can be found with data mining techniques and the results interpreted.

Review of text mining slides from CSE634 presentation by Chiwara, Al-Ayyoub, Hossain and Gupta