In continuation of our discussion on classifiers such as kernel methods, we will now cover Support Vector Machines (SVMs). These are probably the most sophisticated of the classifiers we have seen. An SVM finds the line that separates the data most cleanly, which means the line that is at the greatest possible distance from the points near the dividing boundary. The only points necessary to determine where the line should be are the points closest to it, and hence they are called support vectors. Once the line has been found, there is no need to go back through the training data to classify new points.
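As a quick illustration (a minimal sketch assuming scikit-learn, with made-up toy points that are not part of the original discussion), a fitted linear SVM exposes exactly those few boundary-determining points:

import numpy as np
from sklearn.svm import SVC

# Two small, linearly separable point clouds (toy data).
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
              [4.0, 4.0], [4.5, 5.0], [5.0, 4.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6)   # a very large C approximates the hard-margin case
clf.fit(X, y)

print(clf.support_vectors_)                   # the few points that fix the dividing line
print(clf.predict([[1.0, 2.0], [5.0, 5.0]]))  # new points classified without revisiting the data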
As opposed to a linear classifier that fits the line to the averages of the data points in each class, and may therefore misclassify the points closest to the line, the SVM chooses the dividing line so that it is as far as possible from the nearest points of each class, and those nearest points are the only ones needed to make that determination. The trick is to find just these support vectors and then draw the dividing line. Recall that kernel methods performed non-linear classification through the kernel trick, where the dot product used for comparisons is replaced by a kernel function. SVMs are likewise built on dot products, so the same kernels let them perform non-linear classification. Typical SVM datasets are complex, high dimensional and data intensive; facial expressions, for example, can be classified with SVMs. Another important feature for applications is that, through kernels, they can naturally accept input data that are not in the form of vectors, such as strings, trees and images.
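Because the classifier only ever needs pairwise comparisons k(x, x'), we can hand it a precomputed kernel (Gram) matrix instead of raw vectors; the same pattern carries over to strings or trees once a suitable kernel between them is defined. A minimal sketch, assuming scikit-learn and using the RBF kernel as a stand-in for any valid kernel:

import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X_train = rng.normal(size=(40, 2))
# A non-linear ground truth: label by whether the point lies outside the unit circle.
y_train = np.where(X_train[:, 0] ** 2 + X_train[:, 1] ** 2 > 1.0, 1, -1)
X_test = rng.normal(size=(5, 2))

K_train = rbf_kernel(X_train, X_train)   # comparisons among training points
K_test = rbf_kernel(X_test, X_train)     # comparisons of new points against training points

clf = SVC(kernel="precomputed")          # the SVM sees only kernel values, never the vectors
clf.fit(K_train, y_train)
print(clf.predict(K_test))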
Mathematically, the underlying idea is that a hyperplane that is far from every observed data point should minimize the risk of making wrong decisions when classifying new data. The hyperplane that satisfies this criterion is called the optimal separating hyperplane. This hyperplane, w'x + b = 0, can be found as the solution to an equivalent optimization problem:
min over w, b of (1/2) ||w||^2 subject to y_i (w'x_i + b) >= 1 for every training point x_i with label y_i in {+1, -1}. Notice that only a small subset of the data determines the solution; these points are the support vectors, and they carry all the information necessary to derive the decision rule.
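To make the primal problem concrete, here is a minimal sketch that feeds it to a general-purpose constrained solver (NumPy and SciPy assumed; the toy points are made up for illustration and are not from the original discussion):

import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data, labels in {+1, -1}.
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.5],
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.5]])
y = np.array([-1.0, -1.0, -1.0, 1.0, 1.0, 1.0])

# Pack the variables as z = [w_1, w_2, b] and minimize (1/2)||w||^2 ...
objective = lambda z: 0.5 * np.dot(z[:2], z[:2])

# ... subject to one constraint per point: y_i (w'x_i + b) - 1 >= 0.
constraints = [{"type": "ineq",
                "fun": lambda z, xi=xi, yi=yi: yi * (xi @ z[:2] + z[2]) - 1.0}
               for xi, yi in zip(X, y)]

res = minimize(objective, np.zeros(3), constraints=constraints)
w, b = res.x[:2], res.x[2]
print("w =", w, "b =", b)
print("margins:", y * (X @ w + b))   # support vectors sit at (approximately) 1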
However, when w has very high dimension this optimization becomes intractable. Instead we solve an equivalent dual problem, obtained by forming the Lagrangian of the primal and eliminating w and b: find the alpha that minimizes (1/2) alpha'Q alpha - sum_i alpha_i subject to alpha_i >= 0 and y'alpha = 0, where Q_ij = y_i y_j x_i'x_j. Q is positive semidefinite, so this is a convex quadratic program, and it has a unique solution when Q is positive definite.
Once we have solved for alpha, we can recover
w = sum_i alpha_i y_i x_i
b = -1/2 ( w'x(+) + w'x(-) ), where x(+) and x(-) are two support vectors lying on the margin and belonging to opposite classes. New points are then classified by the sign of w'x + b.
Courtesy: Wolfram Media.
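For completeness, here is a minimal numerical sketch of the dual route on the same toy data used above (NumPy and SciPy assumed; a general-purpose solver stands in for a dedicated QP or SMO routine). It solves the dual program and then recovers w and b exactly as in the formulas:

import numpy as np
from scipy.optimize import minimize

# Same toy linearly separable data as in the primal sketch.
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.5],
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.5]])
y = np.array([-1.0, -1.0, -1.0, 1.0, 1.0, 1.0])
n = len(y)

Q = (y[:, None] * y[None, :]) * (X @ X.T)   # Q_ij = y_i y_j x_i . x_j

def objective(a):
    return 0.5 * a @ Q @ a - a.sum()        # (1/2) a'Qa - sum_i a_i

constraints = [{"type": "eq", "fun": lambda a: y @ a}]   # y'alpha = 0
bounds = [(0.0, None)] * n                               # alpha_i >= 0

alpha = minimize(objective, np.zeros(n), bounds=bounds, constraints=constraints).x

w = (alpha * y) @ X                          # w = sum_i alpha_i y_i x_i
sv = alpha > 1e-6                            # support vectors have alpha_i > 0
x_pos = X[sv & (y > 0)][0]                   # one support vector from each class
x_neg = X[sv & (y < 0)][0]
b = -0.5 * (w @ x_pos + w @ x_neg)

print("w =", w, "b =", b)
print("decisions:", np.sign(X @ w + b))      # should reproduce the labels y

The recovered w and b should match the primal sketch, and the handful of non-zero alpha_i makes the role of the support vectors explicit; in practice a purpose-built SVM solver would be used instead.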