Thursday, October 16, 2014

In this article, we investigate kernel methods in continuation of our readings on the use of different machine learning techniques in web applications. We can think of these as advanced classifiers in the set of classifiers that includes decision trees, Bayesian classifiers and neural networks. A typical example of a dataset for which this kind of classifier is used is matchmaking people on a dating site. There are many variables, both numerical and nominal, and many non-linear relationships. One of the things we should realize is that a dataset and a classifier cannot be arbitrarily paired. We have to preprocess the data and select the best algorithm to get good results.

In the matchmaking example we want to build a classifier on, we recognize several different variables such as age, habits, interests, wants children, location etc. Instantly we see how complex this dataset is and how much we simplify it with whatever we choose to model. Let us say we write a function that can take the population and generate a dataset with values assigned against the variables. This could simply be reading a preprocessed data file. The next step is to take these variables in pairs and make a scatter plot to see how correlated they are. For example, age difference may be more tolerated as we grow older. If we build a decision tree classifier on this scatter plot, we will have split the data on numerical boundaries. Such a boundary that splits the data into different categories is called a decision boundary, and it helps to visualize how divided the data is. Next, we can build a linear classifier that finds the average of all the data in each class to get a center point, and then classifies new points by the nearest center point. By drawing a line through the averages in the scatter plot, we can take new data for matching based on this variable and predict the match based on which average it is closest to. While closeness of a point to the average was measured with Euclidean distance earlier, we can now improve it with vectors and the dot-product. The dot-product is a scalar quantity and can be positive or negative based on the cosine of the angle between the vectors. With this way to find closeness, we now have a linear classifier, as sketched below.
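To make this concrete, here is a minimal sketch of the averages-based linear classifier in Python. The row format, a list of (features, label) pairs with a 0/1 match label, and the function names are assumptions made for illustration, not a fixed API.

def dot_product(v1, v2):
    # scalar product of two equal-length feature vectors
    return sum(a * b for a, b in zip(v1, v2))

def class_averages(rows):
    # rows: list of (features, label) pairs, label is 0 or 1
    averages = {}
    for label in (0, 1):
        members = [f for f, l in rows if l == label]
        averages[label] = [sum(col) / len(members) for col in zip(*members)]
    return averages

def dp_classify(point, averages):
    # Assign the point to whichever class average it is closer to,
    # expressed with dot-products rather than Euclidean distance.
    m0, m1 = averages[0], averages[1]
    b = (dot_product(m1, m1) - dot_product(m0, m0)) / 2
    y = dot_product(point, m0) - dot_product(point, m1) + b
    return 0 if y > 0 else 1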

However, the data may not always admit a dividing line for a linear classifier to work effectively. In such a case, we use kernel methods to extend the classifier to non-linear classification.

But before we jump into the kernel methods, let us quickly see how to work with variables that are not as simple as age to use in a scatter plot. For example, interests could be diverse and we may end up creating a new variable for every interest. To overcome this, we represent the interests in a hierarchy and then use the structure to compute the closeness of two people's interests. By counting the close interests this way, we get a single, much smaller variable for the dataset instead of one per interest. Similarly, location could be represented by many different things such as street, city, state, country etc. To normalize location, we could use a geocoding API that translates locations into latitudes and longitudes and then compute the distance in miles. Lastly, we can scale the dataset to a higher or lower resolution to enable applying a linear classifier, as in the sketch below.
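The scaling step can be sketched as follows: every numeric column is rescaled to the 0-1 range so that a variable measured in miles does not drown out a yes/no flag. The (features, label) row format is the same assumption as in the earlier sketch.

def scale_data(rows):
    # rows: list of (features, label); returns scaled rows and a function
    # that applies the same scaling to new points before classification.
    columns = list(zip(*[f for f, _ in rows]))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]

    def scale(features):
        return [(f - lo) / (hi - lo) if hi > lo else 0.0
                for f, lo, hi in zip(features, lows, highs)]

    return [(scale(f), label) for f, label in rows], scale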

Nevertheless, we may have a complex dataset that does not yield to a linear classifier. For example, if the data were divided by a circle into two regions sharing the same center, it is not easy to apply a linear classifier because there is no dividing line. Instead we use kernel methods. For the same example of concentric regions in a scatter plot of X and Y, we could take the squares of the corresponding variables; a scatter plot of the squares shows the inner region's data points moving toward a corner and all the others moving away from it. In such a case, it is easy to draw a dividing line separating the corner from the rest. Transforming the data for a couple of variables is easy. But transforming the data for vectors with a large number of variables or dimensions is not. Instead we apply what is called the kernel trick, which involves replacing the dot-product function with a new function that returns what the dot-product would have been if the data had first been transformed to a higher dimensional space using a mapping function. One such function is the radial-basis function, which is like a dot-product in that it takes two vectors and returns a value, but it also takes an additional parameter that can be adjusted to get the best linear separation. And instead of calculating the dot-product of the new point to classify against the average point, we now find the radial-basis function between the point and every other point in the class and average them. The weight adjustment that helps with categorization, called the offset, is also different in the transformed space. It is usually pre-calculated for the dataset and passed in when classifying new points. This tremendously improves on the linear classifier for, say, the impact of age difference in matchmaking.
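Below is a hedged sketch of the same classifier with the kernel trick applied: the radial-basis function stands in for the dot-product, the new point is compared against every member of each class, and a pre-calculated offset replaces the one computed from the class averages. The gamma parameter is the adjustable knob mentioned above; its default here is arbitrary and would need tuning for a real dataset.

import math

def rbf(v1, v2, gamma=10):
    # radial-basis function: behaves like a dot-product in a transformed space
    squared_distance = sum((a - b) ** 2 for a, b in zip(v1, v2))
    return math.exp(-gamma * squared_distance)

def get_offset(rows, gamma=10):
    # offset in the transformed space, pre-calculated once per dataset
    l0 = [f for f, label in rows if label == 0]
    l1 = [f for f, label in rows if label == 1]
    sum0 = sum(rbf(a, b, gamma) for a in l0 for b in l0)
    sum1 = sum(rbf(a, b, gamma) for a in l1 for b in l1)
    return (sum1 / len(l1) ** 2 - sum0 / len(l0) ** 2) / 2

def nl_classify(point, rows, offset, gamma=10):
    # Average the kernel value between the point and every member of each
    # class, then compare, mirroring dp_classify above.
    k0 = [rbf(point, f, gamma) for f, label in rows if label == 0]
    k1 = [rbf(point, f, gamma) for f, label in rows if label == 1]
    y = sum(k0) / len(k0) - sum(k1) / len(k1) + offset
    return 0 if y > 0 else 1

With scaled rows from scale_data, a new pair would then be classified with nl_classify(point, rows, get_offset(rows)).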

#codingexercise
