We were discussing the MicrosoftML rxFastTrees algorithm.
The gradient boosting algorithm behind rxFastTrees is described by Friedman in his paper, and it can be instantiated with several loss functions, including squared loss. Squared loss is especially useful as a "reality check": it comes closest to the familiar stagewise strategy of iteratively fitting the current residuals that underlies any gradient descent method.
The least-squares regression version of the algorithm can be written as:
1. Set the initial approximation to a constant (for squared loss, the mean of the targets).
2. For each of M successive increments, or "boosts", each based on the preceding iterations:
3. Calculate the new residuals.
4. Fit the base learner to the residuals and find the line of search by minimizing the aggregated loss.
5. Perform the boost along the line of search.
6. Repeat steps 3-5 for each iteration of step 2.
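The steps above can be sketched in a few lines of NumPy. This is an illustrative toy implementation, not the rxFastTrees internals: the function names `ls_boost` and `fit_stump` are hypothetical, and the base learner is a depth-1 regression stump rather than a full tree.

```python
import numpy as np

def fit_stump(X, r):
    """Best single-feature threshold split minimizing squared error on r."""
    best, best_sse = (0, 0.0, r.mean(), r.mean()), np.inf
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:
            mask = X[:, j] <= t
            left, right = r[mask].mean(), r[~mask].mean()
            sse = ((r[mask] - left) ** 2).sum() + ((r[~mask] - right) ** 2).sum()
            if sse < best_sse:
                best, best_sse = (j, t, left, right), sse
    return best

def ls_boost(X, y, M=100, nu=0.1):
    """Toy sketch of least-squares gradient boosting with stumps."""
    F = np.full(len(y), y.mean())   # 1. initial approximation: mean minimizes squared loss
    stumps = []
    for m in range(M):              # 2. successive boosts
        r = y - F                   # 3. residuals = negative gradient of squared loss
        j, t, left, right = fit_stump(X, r)  # 4. fit base learner to residuals;
        # for squared loss the region means are exactly the minimizing line search
        F = F + nu * np.where(X[:, j] <= t, left, right)  # 5. boost along that direction
        stumps.append((j, t, left, right))
    return F, stumps
```

The shrinkage factor `nu` scales each boost, trading more iterations for better generalization.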
Given the joint distribution of all output and input variables, the goal is to obtain an estimate of the mapping function that minimizes the expected value of the squared loss. This was the original non-parametric formulation, which requires numerical optimization in a function space. The evaluation of the mapping function at each data point is itself a "parameter" over which we minimize. In a function space there are infinitely many such parameters, but a dataset involves only a finite number of them. We therefore write the approximation as a sum of incremental functions indexed 1 to M; each increment in this sum is called a "step" or a "boost" in the overall optimization method. In the steepest descent method, each incremental function is written as a multiplier times the negative gradient. The multiplier is given by the line search: it is the step size along the steepest-descent direction that minimizes the expected value of the loss function.
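In symbols, the steepest-descent step just described can be sketched as follows (notation in the style of Friedman's paper: $F$ is the approximation, $L$ the loss, $g_m$ the gradient, and $\rho_m$ the line-search multiplier):

```latex
g_m(\mathbf{x}) = \left[\frac{\partial\, E_y\!\left[L(y, F(\mathbf{x}))\mid \mathbf{x}\right]}{\partial F(\mathbf{x})}\right]_{F(\mathbf{x})=F_{m-1}(\mathbf{x})},
\qquad
\rho_m = \arg\min_{\rho}\; E_{y,\mathbf{x}}\, L\!\big(y,\; F_{m-1}(\mathbf{x}) - \rho\, g_m(\mathbf{x})\big),
\qquad
F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) - \rho_m\, g_m(\mathbf{x}).
```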
However, Friedman notes in his paper that this non-parametric approach breaks down when the joint distribution of (y, x) is estimated from a finite data sample. With finite data, the expected value cannot be estimated accurately at each individual data point, and the gradient is defined only at the training points themselves. The estimate can therefore only be improved by borrowing strength from nearby data points, either by imposing smoothness, for example with a Gaussian kernel, or by assuming a parameterized form for the mapping function and optimizing its parameters one stage at a time over this set.
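The "borrowing strength from nearby points" idea can be illustrated with a one-line Gaussian kernel smoother. This is a generic Nadaraya-Watson-style sketch, not anything from Friedman's paper or rxFastTrees; the function name and bandwidth are made up for illustration.

```python
import numpy as np

def gaussian_smooth(x_train, g_train, x_query, bandwidth=0.2):
    """Estimate a pointwise quantity g at x_query by Gaussian-weighting
    the values observed at nearby training points."""
    w = np.exp(-0.5 * ((x_query - x_train) / bandwidth) ** 2)
    return (w * g_train).sum() / w.sum()
```

The bandwidth controls how far the estimate "borrows": small values trust only immediate neighbors, large values average broadly.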
If even the parameterized form is not feasible, Friedman recommends a "greedy stagewise" approach. The stagewise approach differs from a stepwise approach in that previously entered terms are not readjusted when new ones are added. This stagewise strategy is also called "matching pursuit" by other researchers, where the parameterized function set is taken from a wavelet-like dictionary. In machine learning the base learner is a classification or regression tree, and the approach is called "boosting". Since each stage takes the best greedy step toward the data-based estimate of the approximation objective, the approach is called "greedy stagewise". It happens that when we fit one function at a time from the parameterized set, the greedy step amounts to steepest descent under the constraint that the descent direction belong to that set: the chosen member is the one most parallel to the unconstrained negative gradient. The approximation is then updated along the line of search. In this way we abandon the smoothness-constraint technique in favor of the constraint applied through fitting the parameterized class to what can now be called "pseudoresponses". Putting it all together as forward stagewise additive modeling, we obtain the overall gradient boosting algorithm.
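To see how pseudoresponses generalize the residuals of the least-squares case, here is a toy sketch of the same stagewise loop for absolute (LAD) loss, where the pseudoresponses are sign(y - F) and the per-region line search is a median. All names here (`gradient_boost_lad`, `best_split`) are hypothetical, and a single-split stump stands in for a full tree; this is an illustration of the scheme, not Friedman's exact TreeBoost.

```python
import numpy as np

def best_split(X, r):
    """Single split (feature, threshold) minimizing squared error of r."""
    best, best_sse = (0, X[0, 0]), np.inf
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:
            mask = X[:, j] <= t
            sse = ((r[mask] - r[mask].mean()) ** 2).sum() \
                + ((r[~mask] - r[~mask].mean()) ** 2).sum()
            if sse < best_sse:
                best, best_sse = (j, t), sse
    return best

def gradient_boost_lad(X, y, M=50, nu=0.1):
    """Toy gradient boosting for absolute loss via pseudoresponses."""
    F = np.full(len(y), np.median(y))    # initial approximation for L1 loss
    for m in range(M):
        pseudo = np.sign(y - F)          # pseudoresponses: negative gradient of |y - F|
        j, t = best_split(X, pseudo)     # fit the base learner to the pseudoresponses
        mask = X[:, j] <= t
        # line search within each region: the median residual minimizes L1 loss
        gamma = np.where(mask,
                         np.median(y[mask] - F[mask]),
                         np.median(y[~mask] - F[~mask]))
        F = F + nu * gamma
    return F
```

Only two pieces change between loss functions: the pseudoresponse formula and the per-region line search; the stagewise skeleton is identical.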