The Annals of Statistics

A new perspective on boosting in linear regression via subgradient optimization and relatives

Abstract

We analyze boosting algorithms [Ann. Statist. 29 (2001) 1189–1232; Ann. Statist. 28 (2000) 337–407; Ann. Statist. 32 (2004) 407–499] in linear regression from a new perspective: that of modern first-order methods in convex optimization. We show that classic boosting algorithms in linear regression, namely the incremental forward stagewise algorithm ($\text{FS}_{\varepsilon}$) and least squares boosting [LS-BOOST$(\varepsilon)$], can be viewed as subgradient descent to minimize the loss function defined as the maximum absolute correlation between the features and residuals. We also propose a minor modification of $\text{FS}_{\varepsilon}$ that yields an algorithm for the LASSO, and that may be easily extended to an algorithm that computes the LASSO path for different values of the regularization parameter. Furthermore, we show that these new algorithms for the LASSO may also be interpreted as the same master algorithm (subgradient descent), applied to a regularized version of the maximum absolute correlation loss function. We derive novel, comprehensive computational guarantees for several boosting algorithms in linear regression (including LS-BOOST$(\varepsilon)$ and $\text{FS}_{\varepsilon}$) by using techniques of first-order methods in convex optimization. Our computational guarantees inform us about the statistical properties of boosting algorithms. In particular, they provide, for the first time, a precise theoretical description of the amount of data-fidelity and regularization imparted by running a boosting algorithm with a prespecified learning rate for a fixed but arbitrary number of iterations, for any dataset.
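The incremental forward stagewise algorithm ($\text{FS}_{\varepsilon}$) described in the abstract can be sketched in a few lines: at each iteration, find the feature most correlated (in absolute value) with the current residual, and nudge its coefficient by a small learning rate $\varepsilon$ in the direction of that correlation. The following is a minimal illustrative sketch, not the authors' implementation; it assumes a dense NumPy design matrix and (as is conventional) that the features have been standardized.

```python
import numpy as np

def forward_stagewise(X, y, eps=0.01, n_iter=1000):
    """Incremental forward stagewise regression (FS_eps), sketched from
    the abstract: repeatedly pick the feature most correlated in absolute
    value with the current residual and move its coefficient by eps."""
    n, p = X.shape
    beta = np.zeros(p)
    r = y.astype(float).copy()        # residual starts at the response y
    for _ in range(n_iter):
        corr = X.T @ r                # inner products with the residual
        j = np.argmax(np.abs(corr))   # index of most correlated feature
        step = eps * np.sign(corr[j])
        beta[j] += step               # small coefficient update
        r -= step * X[:, j]           # keep the residual in sync
    return beta
```

Under the paper's perspective, each such update is a subgradient step on the maximum absolute correlation loss $\max_j |X_j^{\top} r|$; running more iterations trades regularization for data fidelity, which is exactly the trade-off the paper's computational guarantees quantify.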

Article information

Source
Ann. Statist., Volume 45, Number 6 (2017), 2328–2364.

Dates
Revised: August 2016
First available in Project Euclid: 15 December 2017

https://projecteuclid.org/euclid.aos/1513328575

Digital Object Identifier
doi:10.1214/16-AOS1505

Mathematical Reviews number (MathSciNet)
MR3737894

Zentralblatt MATH identifier
06838135

Subjects
Primary: 62J05: Linear regression 62J07: Ridge regression; shrinkage estimators
Secondary: 90C25: Convex programming

Citation

Freund, Robert M.; Grigas, Paul; Mazumder, Rahul. A new perspective on boosting in linear regression via subgradient optimization and relatives. Ann. Statist. 45 (2017), no. 6, 2328--2364. doi:10.1214/16-AOS1505. https://projecteuclid.org/euclid.aos/1513328575

References

• [1] Bach, F. (2015). Duality between subgradient and conditional gradient methods. SIAM J. Optim. 25 115–129.
• [2] Bertsekas, D. (1999). Nonlinear Programming. Athena Scientific, Belmont, MA.
• [3] Boyd, S. and Vandenberghe, L. (2004). Convex Optimization. Cambridge Univ. Press, Cambridge.
• [4] Breiman, L. (1998). Arcing classifiers. Ann. Statist. 26 801–849.
• [5] Breiman, L. (1999). Prediction games and arcing algorithms. Neural Comput. 11 1493–1517.
• [6] Bühlmann, P. (2006). Boosting for high-dimensional linear models. Ann. Statist. 34 559–583.
• [7] Bühlmann, P. and Hothorn, T. (2007). Boosting algorithms: Regularization, prediction and model fitting. Statist. Sci. 22 477–505.
• [8] Bühlmann, P. and Yu, B. (2003). Boosting with the L2 loss: Regression and classification. J. Amer. Statist. Assoc. 98 324–339.
• [9] Bühlmann, P. and Yu, B. (2006). Sparse boosting. J. Mach. Learn. Res. 7 1001–1024.
• [10] Dettling, M. and Bühlmann, P. (2003). Boosting for tumor classification with gene expression data. Bioinformatics 19 1061–1069.
• [11] Duchi, J. and Singer, Y. (2009). Boosting with structural sparsity. In Proceedings of the 26th Annual International Conference on Machine Learning 297–304. ACM, New York.
• [12] Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least angle regression (with discussion). Ann. Statist. 32 407–499.
• [13] Frank, M. and Wolfe, P. (1956). An algorithm for quadratic programming. Nav. Res. Logist. Q. 3 95–110.
• [14] Freund, R., Grigas, P. and Mazumder, R. (2017). Supplement to “A new perspective on boosting in linear regression via subgradient optimization and relatives.” DOI:10.1214/16-AOS1505SUPP.
• [15] Freund, R. M. and Grigas, P. (2014). New analysis and results for the Frank–Wolfe method. Math. Program. To appear.
• [16] Freund, R. M., Grigas, P. and Mazumder, R. (2013). AdaBoost and forward stagewise regression are first-order convex optimization methods. Preprint. Available at arXiv:1307.1192.
• [17] Freund, Y. (1995). Boosting a weak learning algorithm by majority. Inform. and Comput. 121 256–285.
• [18] Freund, Y. and Schapire, R. (1996). Experiments with a new boosting algorithm. In Machine Learning: Proceedings of the Thirteenth International Conference 148–156. Morgan Kauffman, San Francisco.
• [19] Friedman, J. (2008). Fast sparse regression and classification. Technical Report, Dept. Statistics, Stanford Univ.
• [20] Friedman, J., Hastie, T., Hoefling, H. and Tibshirani, R. (2007). Pathwise coordinate optimization. Ann. Appl. Stat. 2 302–332.
• [21] Friedman, J., Hastie, T. and Tibshirani, R. (2000). Additive logistic regression: A statistical view of boosting. Ann. Statist. 28 337–407.
• [22] Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. Ann. Statist. 29 1189–1232.
• [24] Friedman, J. H. (2012). Fast sparse regression and classification. Int. J. Forecast. 28 722–738.
• [25] Friedman, J. H. and Popescu, B. E. (2003). Importance sampled learning ensembles. Technical Report, Dept. Statistics, Stanford Univ.
• [26] Gärtner, B., Jaggi, M. and Maria, C. (2012). An exponential lower bound on the complexity of regularization paths. J. Comput. Geom. 3 168–195.
• [27] Giesen, J., Jaggi, M. and Laue, S. (2012). Optimizing over the growing spectrahedron. In Algorithms—ESA 2012: 20th Annual European Symposium, Ljubljana, Slovenia, September 10–12, 2012. Proceedings 503–514. Springer, Berlin.
• [28] Giesen, J., Jaggi, M. and Laue, S. (2012). Approximating parameterized convex optimization problems. ACM Trans. Algorithms 9 Art. 10, 17.
• [29] Giesen, J., Mueller, J., Laue, S. and Swiercy, S. (2012). Approximating concavely parameterized optimization problems. In Advances in Neural Information Processing Systems 25 (F. Pereira, C. J. C. Burges, L. Bottou and K. Q. Weinberger, eds.) 2105–2113. Curran Associates, Red Hook, NY.
• [30] Hastie, T., Taylor, J., Tibshirani, R. and Walther, G. (2007). Forward stagewise regression and the monotone lasso. Electron. J. Stat. 1 1–29.
• [31] Hastie, T., Tibshirani, R. and Friedman, J. (2009). Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York.
• [32] Hastie, T. J. and Tibshirani, R. J. (1990). Generalized Additive Models. Monographs on Statistics and Applied Probability 43. Chapman & Hall, London.
• [33] Jaggi, M. (2013). Revisiting Frank–Wolfe: Projection-free sparse convex optimization. In Proceedings of the 30th International Conference on Machine Learning (ICML-13) 427–435.
• [34] Mallat, S. G. and Zhang, Z. (1993). Matching pursuits with time-frequency dictionaries. IEEE Trans. Signal Process. 41 3397–3415.
• [35] Mason, L., Baxter, J., Bartlett, P. and Frean, M. (2000). Boosting algorithms as gradient descent. In Advances in Neural Information Processing Systems 12 512–518. MIT Press, Cambridge, MA.
• [36] Miller, A. (2002). Subset Selection in Regression. CRC Press, Boca Raton, FL.
• [37] Nesterov, Y. E. (2003). Introductory Lectures on Convex Optimization: A Basic Course. Applied Optimization 87. Kluwer Academic, Boston, MA.
• [38] Pati, Y. C., Rezaiifar, R. and Krishnaprasad, P. S. (1993). Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition. In Conference Record of the Twenty-Seventh Asilomar Conference on Signals, Systems and Computers, 1993 40–44. IEEE, New York.
• [39] Polyak, B. (1987). Introduction to Optimization. Optimization Software, New York.
• [40] Rätsch, G., Onoda, T. and Müller, K.-R. (2001). Soft margins for AdaBoost. Mach. Learn. 42 287–320.
• [41] Rosset, S., Swirszcz, G., Srebro, N. and Zhu, J. (2007). $\ell_{1}$ regularization in infinite dimensional feature spaces. In Conference on Learning Theory 544–558. Springer, Berlin.
• [42] Rosset, S., Zhu, J. and Hastie, T. (2003/2004). Boosting as a regularized path to a maximum margin classifier. J. Mach. Learn. Res. 5 941–973.
• [43] Schapire, R. (1990). The strength of weak learnability. Mach. Learn. 5 197–227.
• [44] Schapire, R. E. and Freund, Y. (2012). Boosting: Foundations and Algorithms. Adaptive Computation and Machine Learning. MIT Press, Cambridge, MA.
• [45] Shor, N. Z. (1985). Minimization Methods for Nondifferentiable Functions. Springer Series in Computational Mathematics 3. Springer, Berlin.
• [46] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B 58 267–288.
• [47] Tibshirani, R. J. (2015). A general framework for fast stagewise algorithms. J. Mach. Learn. Res. 16 2543–2588.
• [48] Tukey, J. (1977). Exploratory Data Analysis. Addison-Wesley, Boston, MA.
• [49] Vershynin, R. (2012). Introduction to the non-asymptotic analysis of random matrices. In Compressed Sensing 210–268. Cambridge Univ. Press, Cambridge.
• [50] Weisberg, S. (1980). Applied Linear Regression. Wiley, New York.
• [51] Zhao, P. and Yu, B. (2007). Stagewise lasso. J. Mach. Learn. Res. 8 2701–2726.

Supplemental materials

• Supplement to “A new perspective on boosting in linear regression via subgradient optimization and relatives”. Additional proofs, technical details, figures and tables are provided in the Supplementary Section.