Tuesday, June 11, 2013

classifier implementation

Let us consider an implementation of a machine learning classifier. Say we want the classifier to tell us whether a given name is male or female.
Then the classifier we design has the following methods:
1) buildClassifier(Instances) This builds the classifier from scratch with the given dataset.
2) toString() This returns information about the built classifier, e.g. a printout of the decision tree.
3) distributionForInstance(Instance) returns a double array containing a probability for each class label. This is a prediction method.
4) classifyInstance(Instance) returns the predicted label.
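A minimal sketch of this contract in Java might look like the following. The Instance and Instances types here are hypothetical placeholders for a single labeled name and a dataset of such names, not a specific library's classes.

// Placeholder data holders for this sketch: a single labeled name and a dataset.
class Instance {
    String name;
    String label;
}

class Instances extends java.util.ArrayList<Instance> {
}

// Sketch of the classifier contract described above.
public interface GenderClassifier {

    // 1) Build the classifier from scratch using the given training dataset.
    void buildClassifier(Instances trainingData) throws Exception;

    // 2) Return a human-readable description of the learned model,
    //    e.g. a printout of the decision tree.
    String toString();

    // 3) Return one probability per class label, e.g. {P(male), P(female)}.
    double[] distributionForInstance(Instance instance);

    // 4) Return the index of the predicted class label.
    double classifyInstance(Instance instance);
}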
Let us say we start with training data and test data. In the training data we have tokenized the words, and we have a hash table of words and their labels.
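As an illustration, that hash table could simply map each name to its gender label. The comma-separated file format assumed here is only for the sketch.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class TrainingDataLoader {
    // Reads lines of the form "name,label" (e.g. "Sophia,female") into a map.
    public static Map<String, String> load(String path) throws IOException {
        Map<String, String> labels = new HashMap<>();
        for (String line : Files.readAllLines(Paths.get(path))) {
            String[] parts = line.split(",");
            if (parts.length == 2) {
                labels.put(parts[0].trim().toLowerCase(), parts[1].trim());
            }
        }
        return labels;
    }
}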
Next we form a decision tree based on rules we model from the data.
These rules are based on features. Features are like attributes of the names. They could be, for example, the start letter, the end letter, the character count, whether a particular letter occurs, the number of syllables, the prefix, the suffix, and so on.
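A hedged sketch of a feature extractor along these lines is shown below; the exact feature set is illustrative rather than the definitive choice, and the vowel-group count is only a crude stand-in for syllables.

import java.util.HashMap;
import java.util.Map;

public class NameFeatures {
    // Extracts simple surface features from a name (assumes a non-empty name).
    public static Map<String, Object> extract(String name) {
        String n = name.toLowerCase();
        Map<String, Object> features = new HashMap<>();
        features.put("startLetter", n.charAt(0));
        features.put("endLetter", n.charAt(n.length() - 1));
        features.put("length", n.length());
        features.put("prefix2", n.length() >= 2 ? n.substring(0, 2) : n);
        features.put("suffix2", n.length() >= 2 ? n.substring(n.length() - 2) : n);
        // Very rough syllable estimate: count runs of vowels.
        features.put("vowelGroups", n.split("[aeiouy]+", -1).length - 1);
        return features;
    }
}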
We start with all the possible hypotheses for gender classification. While we can work with simple and obvious features, we should check each feature to see whether it is actually helpful. This does not mean we build the model to include every feature: doing so invites overfitting, and the model then has a higher chance of relying on quirks of the training data instead of generalizing to new data.
The features can be evaluated through error analysis. Having divided the corpus into appropriate data sets, we separate a dev-test set out of the training data and use it to perform error analysis. Keeping a separate data set for error analysis is important for the reason mentioned above: errors found only on the training data would encourage overfitting.
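One way to carry out this error analysis, sketched here with plain maps and a prediction function rather than any particular API, is to collect the dev-test names the current model gets wrong and study them when designing new features.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class ErrorAnalysis {
    // Given dev-test data (name -> "male"/"female") and a prediction function,
    // return the names that were misclassified; these drive feature design.
    public static List<String> misclassified(Map<String, String> devTest,
                                             Function<String, String> predict) {
        List<String> errors = new ArrayList<>();
        for (Map.Entry<String, String> e : devTest.entrySet()) {
            if (!predict.apply(e.getKey()).equals(e.getValue())) {
                errors.add(e.getKey());
            }
        }
        return errors;
    }
}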
The classify method is different from measuring accuracy; to compute accuracy we need a frequency distribution of how often the predicted labels match the true labels.
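For example, accuracy can be computed by tallying a small frequency distribution of correct versus incorrect predictions over held-out data, as in this sketch under the same assumptions as above.

import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class Accuracy {
    // Tallies outcomes into a frequency distribution ("correct"/"incorrect")
    // and returns the fraction of correct predictions on the held-out data.
    public static double measure(Map<String, String> heldOut,
                                 Function<String, String> predict) {
        Map<String, Integer> freq = new HashMap<>();
        for (Map.Entry<String, String> e : heldOut.entrySet()) {
            String outcome = predict.apply(e.getKey()).equals(e.getValue())
                    ? "correct" : "incorrect";
            freq.merge(outcome, 1, Integer::sum);
        }
        int correct = freq.getOrDefault("correct", 0);
        return heldOut.isEmpty() ? 0.0 : (double) correct / heldOut.size();
    }
}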
The implementation of the classifier can involve building a decision tree. The nodes of the decision tree are operations on the features. When one or more features are used to assign a label, the decision tree is an efficient data structure for running each tuple of a large data set through the rules. The decision tree is built by inserting expressions into the tree, which involves cloning the existing tree and adding the new expressions.
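A small sketch of such a tree is shown below: each internal node applies a test to the feature map and leaves carry a class label. The structure is illustrative, not the definitive implementation.

import java.util.Map;
import java.util.function.Predicate;

public class DecisionNode {
    private final Predicate<Map<String, Object>> test;  // operation on the features
    private final DecisionNode ifTrue;
    private final DecisionNode ifFalse;
    private final String leafLabel;                      // non-null only at leaves

    // Internal node: route the instance left or right based on a feature test.
    public DecisionNode(Predicate<Map<String, Object>> test,
                        DecisionNode ifTrue, DecisionNode ifFalse) {
        this.test = test;
        this.ifTrue = ifTrue;
        this.ifFalse = ifFalse;
        this.leafLabel = null;
    }

    // Leaf node: emit a class label.
    public DecisionNode(String leafLabel) {
        this.test = null;
        this.ifTrue = null;
        this.ifFalse = null;
        this.leafLabel = leafLabel;
    }

    public String classify(Map<String, Object> features) {
        if (leafLabel != null) {
            return leafLabel;
        }
        return (test.test(features) ? ifTrue : ifFalse).classify(features);
    }
}

A root node whose test checks, say, features.get("suffix2").equals("yn") and routes to a "female" leaf would encode the kind of suffix rule discussed below.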
The expressions are converted and expanded using a visitor pattern. The visitor pattern is useful for traversing the expression tree and adding new expressions. A wide variety of expressions can be supported, such as binary operations, logical operations, method call expressions, and so on. These expressions can be serialized so that they can be imported and exported, and they support different operand types such as strings and numerals. In the case of the name classifier, some common examples are that names ending with the suffix 'yn' are generally female and those ending with 'hn' are usually male. These kinds of rules have an error set that we can find and use to tune the model further. The implementation of the APIs is based on rules that are evaluated by iterating over the data. When the accuracy of the prediction is to be measured, we look up a frequency distribution that the classifier has populated from the training data.
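The visitor idea can be sketched with a tiny expression hierarchy of our own; the class names here are illustrative and not tied to any specific library's expression API. The serializing visitor shows how one traversal can export the rules, and a mirror-image routine could import them.

// Minimal expression hierarchy with a visitor, as a sketch of the idea above.
interface Expression {
    <R> R accept(ExpressionVisitor<R> visitor);
}

interface ExpressionVisitor<R> {
    R visitConstant(Constant c);
    R visitBinary(BinaryExpression b);
}

final class Constant implements Expression {
    final Object value;                       // operand: string, numeral, etc.
    Constant(Object value) { this.value = value; }
    public <R> R accept(ExpressionVisitor<R> v) { return v.visitConstant(this); }
}

final class BinaryExpression implements Expression {
    final String operator;                    // e.g. "equals", "and"
    final Expression left, right;
    BinaryExpression(String operator, Expression left, Expression right) {
        this.operator = operator;
        this.left = left;
        this.right = right;
    }
    public <R> R accept(ExpressionVisitor<R> v) { return v.visitBinary(this); }
}

// A visitor that serializes the tree to a simple prefix notation, showing how
// a traversal can walk every node and produce an exportable form of the rules.
final class SerializingVisitor implements ExpressionVisitor<String> {
    public String visitConstant(Constant c) { return String.valueOf(c.value); }
    public String visitBinary(BinaryExpression b) {
        return "(" + b.operator + " " + b.left.accept(this)
                   + " " + b.right.accept(this) + ")";
    }
}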
