Stochastic gradient boosting

https://doi.org/10.1016/S0167-9473(01)00065-2

Abstract

Gradient boosting constructs additive regression models by sequentially fitting a simple parameterized function (base learner) to current “pseudo”-residuals by least squares at each iteration. The pseudo-residuals are the negative gradient of the loss functional being minimized, with respect to the model values at each training data point, evaluated at the current step. It is shown that both the approximation accuracy and execution speed of gradient boosting can be substantially improved by incorporating randomization into the procedure. Specifically, at each iteration a subsample of the training data is drawn at random (without replacement) from the full training data set. This randomly selected subsample is then used in place of the full sample to fit the base learner and compute the model update for the current iteration. This randomized approach also increases robustness against overcapacity of the base learner.
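The procedure described in the abstract can be sketched in a few lines. The following is a minimal illustrative implementation for squared-error loss, using a one-feature regression stump as the base learner; the function names and the settings (shrinkage `nu`, subsample fraction `f`) are this sketch's assumptions, not values from the paper.

```python
import numpy as np

def fit_stump(x, r):
    """Least-squares regression stump: find the split point that minimizes
    the squared error of the two leaf means. Assumes the subsample contains
    at least two distinct x values."""
    best = (np.inf, None, 0.0, 0.0)
    for s in np.unique(x)[:-1]:
        left, right = r[x <= s], r[x > s]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best[0]:
            best = (sse, s, left.mean(), right.mean())
    return best[1], best[2], best[3]

def stochastic_gradient_boost(x, y, n_iter=200, nu=0.1, f=0.5, seed=0):
    """Stochastic gradient boosting for L2 loss: at each iteration a random
    subsample of fraction f, drawn WITHOUT replacement, is used in place of
    the full sample to fit the base learner to the current pseudo-residuals."""
    rng = np.random.default_rng(seed)
    N = len(y)
    F = np.full(N, y.mean())              # F_0: best constant fit
    stumps = []
    for _ in range(n_iter):
        r = y - F                         # pseudo-residuals (= residuals for L2)
        idx = rng.choice(N, size=int(f * N), replace=False)
        s, cl, cr = fit_stump(x[idx], r[idx])   # base learner on the subsample
        F += nu * np.where(x <= s, cl, cr)      # shrunken update on all points
        stumps.append((s, cl, cr))
    return F, stumps
```

With `f = 1.0` this reduces to deterministic gradient boosting; the paper's point is that `f < 1` both cuts per-iteration cost and tends to improve accuracy.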

Section snippets

Gradient boosting

In the function estimation problem one has a system consisting of a random “output” or “response” variable y and a set of random “input” or “explanatory” variables x = {x_1, …, x_n}. Given a “training” sample {y_i, x_i}_1^N of known (y, x) values, the goal is to find a function F*(x) that maps x to y such that, over the joint distribution of all (y, x) values, the expected value of some specified loss function Ψ(y, F(x)) is minimized:

F*(x) = arg min_{F(x)} E_{y,x} Ψ(y, F(x)).   (1)

Boosting approximates F*(x) by an “additive” …
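The generic step that connects the loss function to the fitting target is the computation of the pseudo-residuals, the negative gradient of Ψ with respect to the current model values. A small sketch for two common losses (the helper name and its interface are illustrative, not from the paper):

```python
import numpy as np

def pseudo_residuals(y, F, loss="l2"):
    """Negative gradient of the loss at the current model values F.
    For squared error this is the ordinary residual y - F; for
    absolute error it is sign(y - F)."""
    if loss == "l2":   # Psi = (y - F)^2 / 2  ->  -dPsi/dF = y - F
        return y - F
    if loss == "l1":   # Psi = |y - F|        ->  -dPsi/dF = sign(y - F)
        return np.sign(y - F)
    raise ValueError(f"unknown loss: {loss}")
```

The base learner is then fit to these pseudo-residuals by least squares at each iteration, whatever the original loss Ψ.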

Stochastic gradient boosting

With his “bagging” procedure, Breiman (1996) introduced the notion that injecting randomness into function estimation procedures could improve their performance. Early implementations of AdaBoost (Freund and Schapire, 1996) also employed random sampling, but this was considered an approximation to deterministic weighting when the implementation of the base learner did not support observation weights, rather than an essential ingredient. Recently, Breiman (1999) proposed a hybrid bagging …

Simulation studies

The effect of randomization on gradient TreeBoost procedures will likely depend on the particular problem at hand. Important characteristics of a problem that affect performance include the training sample size N, the true underlying “target” function F*(x) (1), and the distribution of the departures, ε, of y|x from F*(x). In order to gauge the value of any estimation method it is necessary to accurately evaluate its performance over many different situations. This is most conveniently accomplished …
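A simulation design of this kind can be sketched as follows: draw inputs x, evaluate a known target, and add i.i.d. noise, so estimation error against the target can be measured exactly. The paper evaluates over many randomly generated target functions; the fixed target below is only a simplified stand-in for illustration.

```python
import numpy as np

def make_regression_sample(N, sigma=1.0, seed=0):
    """Generate a training sample y = F*(x) + eps with known target F* and
    Gaussian departures eps ~ N(0, sigma^2). The specific F* here is a
    hypothetical choice, not one of the paper's randomly generated targets."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, size=(N, 3))
    F_true = 10 * np.sin(np.pi * x[:, 0] * x[:, 1]) + 5 * x[:, 2]
    y = F_true + rng.normal(0.0, sigma, N)
    return x, y, F_true
```

Because F_true is known, the estimation error of a fitted model can be computed directly against F*(x) rather than against the noisy y, which is what makes simulation the convenient way to compare methods across sample sizes and noise distributions.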

Discussion

The results of the previous section indicate that the accuracy of gradient boosting can be substantially improved by introducing randomization through the simple expedient of training the base learner on a different randomly selected data subset at each iteration. The degree of improvement is seen to depend on the particular problem at hand: the training sample size N, the true underlying target function F*(x) (1), the distribution of y|x (Gaussian, slash, Bernoulli), and the capacity …

Acknowledgements

Helpful discussions with Leo Breiman are gratefully acknowledged. This work was partially supported by CSIRO Mathematical and Information Sciences, Australia, the Department of Energy under contract DE-AC03-76SF00515, and by Grant DMS9764431 of the National Science Foundation.

References (5)

  • Breiman, L., 1996. Bagging predictors. Mach. Learning.
  • Breiman, L., 1999. Using adaptive bagging to debias regressions. Technical Report, Department of Statistics, University...
There are more references available in the full text version of this article.
