## The Annals of Statistics

### Aggregation for Gaussian regression

#### Abstract

This paper studies statistical aggregation procedures in the regression setting. A motivating factor is the existence of many different methods of estimation, leading to possibly competing estimators. We consider here three different types of aggregation: model selection (MS) aggregation, convex (C) aggregation and linear (L) aggregation. The objective of (MS) is to select the optimal single estimator from the list; that of (C) is to select the optimal convex combination of the given estimators; and that of (L) is to select the optimal linear combination of the given estimators. We are interested in evaluating the rates of convergence of the excess risks of the estimators obtained by these procedures. Our approach is motivated by recently published minimax results [Nemirovski, A. (2000). Topics in non-parametric statistics. Lectures on Probability Theory and Statistics (Saint-Flour, 1998). Lecture Notes in Math. 1738 85–277. Springer, Berlin; Tsybakov, A. B. (2003). Optimal rates of aggregation. Learning Theory and Kernel Machines. Lecture Notes in Artificial Intelligence 2777 303–313. Springer, Heidelberg]. There exist competing aggregation procedures achieving optimal convergence rates for each of the (MS), (C) and (L) cases separately. Since these procedures are not directly comparable with each other, we suggest an alternative solution. We prove that all three optimal rates, as well as those for the newly introduced (S) aggregation (subset selection), are nearly achieved via a single “universal” aggregation procedure. The procedure consists of mixing the initial estimators with weights obtained by penalized least squares. Two different penalties are considered: one is of BIC type, the other a data-dependent $\ell_1$-type penalty.
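The $\ell_1$-penalized aggregation step described above can be illustrated as a Lasso fit over the predictions of the initial estimators. The sketch below is a minimal toy version, not the paper's exact procedure: the paper's penalty is data-dependent, whereas `lam` here is a fixed constant, and the function name `aggregate_l1`, the simulated data and the tuning value are all invented for this illustration.

```python
import numpy as np

def aggregate_l1(X, y, lam, n_iter=200):
    """Aggregation weights by l1-penalized least squares (Lasso),
    computed with cyclic coordinate descent and soft thresholding.

    X   : (n, M) matrix whose M columns hold the predictions of the
          M initial estimators at the n design points.
    y   : (n,) vector of observed responses.
    lam : penalty level (a fixed constant in this sketch).
    """
    n, M = X.shape
    theta = np.zeros(M)
    col_sq = (X ** 2).sum(axis=0)          # squared norms of the columns
    for _ in range(n_iter):
        for j in range(M):
            # partial residual, with estimator j's contribution removed
            r_j = y - X @ theta + X[:, j] * theta[j]
            rho = X[:, j] @ r_j
            # soft-thresholding update for coordinate j
            theta[j] = np.sign(rho) * max(abs(rho) - n * lam, 0.0) / col_sq[j]
    return theta

# Toy example: the true regression function equals the second estimator.
rng = np.random.default_rng(0)
n, M = 100, 5
X = rng.standard_normal((n, M))            # predictions of 5 estimators
y = X[:, 1] + 0.1 * rng.standard_normal(n)
w = aggregate_l1(X, y, lam=0.1)
```

On this toy data the weight vector `w` concentrates on the second estimator and shrinks the irrelevant ones toward zero, which is the sparsity behavior that makes the $\ell_1$ penalty suitable for (MS)- and (S)-type aggregation.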

#### Article information

**Source**
Ann. Statist., Volume 35, Number 4 (2007), 1674–1697.

**Dates**
First available in Project Euclid: 29 August 2007

**Permanent link**
https://projecteuclid.org/euclid.aos/1188405626

**Digital Object Identifier**
doi:10.1214/009053606000001587

**Mathematical Reviews number (MathSciNet)**
MR2351101

**Zentralblatt MATH identifier**
1209.62065

Subjects
Primary: 62G08: Nonparametric regression
Secondary: 62C20: Minimax procedures 62G05: Estimation 62G20: Asymptotic properties

#### Citation

Bunea, Florentina; Tsybakov, Alexandre B.; Wegkamp, Marten H. Aggregation for Gaussian regression. Ann. Statist. 35 (2007), no. 4, 1674–1697. doi:10.1214/009053606000001587. https://projecteuclid.org/euclid.aos/1188405626

#### References

• Akaike, H. (1974). A new look at the statistical model identification. IEEE Trans. Automat. Control 19 716–723.
• Antoniadis, A. and Fan, J. (2001). Regularization of wavelet approximations (with discussion). J. Amer. Statist. Assoc. 96 939–967.
• Audibert, J.-Y. (2004). Aggregated estimators and empirical complexity for least square regression. Ann. Inst. H. Poincaré Probab. Statist. 40 685–736.
• Baraud, Y. (2000). Model selection for regression on a fixed design. Probab. Theory Related Fields 117 467–493.
• Baraud, Y. (2002). Model selection for regression on a random design. ESAIM Probab. Statist. 6 127–146.
• Barron, A. R. (1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans. Inform. Theory 39 930–945.
• Barron, A., Birgé, L. and Massart, P. (1999). Risk bounds for model selection via penalization. Probab. Theory Related Fields 113 301–413.
• Bartlett, P. L., Boucheron, S. and Lugosi, G. (2000). Model selection and error estimation. In Proc. 13th Annual Conference on Computational Learning Theory 286–297. Morgan Kaufmann, San Francisco.
• Birgé, L. (2006). Model selection via testing: An alternative to (penalized) maximum likelihood estimators. Ann. Inst. H. Poincaré Probab. Statist. 42 273–325.
• Birgé, L. and Massart, P. (2001). Gaussian model selection. J. Eur. Math. Soc. 3 203–268.
• Birgé, L. and Massart, P. (2001). A generalized $C_p$ criterion for Gaussian model selection. Prépublication 647, Laboratoire de Probabilités et Modèles Aléatoires, Univ. Paris 6 and Paris 7. Available at www.proba.jussieu.fr/mathdoc/preprints/index.html#2001.
• Bunea, F. (2004). Consistent covariate selection and postmodel selection inference in semiparametric regression. Ann. Statist. 32 898–927.
• Bunea, F. and Nobel, A. B. (2005). Sequential procedures for aggregating arbitrary estimators of a conditional mean. Technical Report M984, Dept. Statistics, Florida State Univ.
• Bunea, F., Tsybakov, A. and Wegkamp, M. H. (2004). Aggregation for regression learning. Available at www.arxiv.org/abs/math/0410214. Prépublication 948, Laboratoire de Probabilités et Modèles Aléatoires, Univ. Paris 6 and Paris 7. Available at hal.ccsd.cnrs.fr/ccsd-00003205.
• Catoni, O. (2004). Statistical Learning Theory and Stochastic Optimization. École d'Été de Probabilités de Saint-Flour 2001. Lecture Notes in Math. 1851. Springer, Berlin.
• Cavalier, L., Golubev, G. K., Picard, D. and Tsybakov, A. B. (2002). Oracle inequalities for inverse problems. Ann. Statist. 30 843–874.
• Chen, S., Donoho, D. and Saunders, M. (2001). Atomic decomposition by basis pursuit. SIAM Rev. 43 129–159.
• Devroye, L., Györfi, L. and Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition. Springer, New York.
• Donoho, D. L., Elad, M. and Temlyakov, V. (2006). Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Trans. Inform. Theory 52 6–18.
• Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least angle regression (with discussion). Ann. Statist. 32 407–499.
• Foster, D. and George, E. (1994). The risk inflation criterion for multiple regression. Ann. Statist. 22 1947–1975.
• Gilbert, E. N. (1952). A comparison of signalling alphabets. Bell System Tech. J. 31 504–522.
• Györfi, L., Kohler, M., Krzyżak, A. and Walk, H. (2002). A Distribution-Free Theory of Nonparametric Regression. Springer, New York.
• Härdle, W., Kerkyacharian, G., Picard, D. and Tsybakov, A. (1998). Wavelets, Approximation and Statistical Applications. Lecture Notes in Statist. 129. Springer, New York.
• Juditsky, A., Nazin, A., Tsybakov, A. and Vayatis, N. (2005). Recursive aggregation of estimators by the mirror descent method with averaging. Problems Inform. Transmission 41 368–384.
• Juditsky, A. and Nemirovski, A. (2000). Functional aggregation for nonparametric regression. Ann. Statist. 28 681–712.
• Juditsky, A., Rigollet, P. and Tsybakov, A. (2005). Learning by mirror averaging. Prépublication du Laboratoire de Probabilités et Modèles Aléatoires, Univ. Paris 6 and Paris 7. Available at hal.ccsd.cnrs.fr/ccsd-00014097.
• Kneip, A. (1994). Ordered linear smoothers. Ann. Statist. 22 835–866.
• Koltchinskii, V. (2006). Local Rademacher complexities and oracle inequalities in risk minimization (with discussion). Ann. Statist. 34 2593–2706.
• Leung, G. and Barron, A. R. (2006). Information theory and mixing least-squares regressions. IEEE Trans. Inform. Theory 52 3396–3410.
• Loubes, J.-M. and van de Geer, S. A. (2002). Adaptive estimation with soft thresholding penalties. Statist. Neerlandica 56 454–479.
• Lugosi, G. and Nobel, A. (1999). Adaptive model selection using empirical complexities. Ann. Statist. 27 1830–1864.
• Mallows, C. L. (1973). Some comments on $C_P$. Technometrics 15 661–675.
• Nemirovski, A. (2000). Topics in non-parametric statistics. Lectures on Probability Theory and Statistics (Saint-Flour, 1998). Lecture Notes in Math. 1738 85–277. Springer, Berlin.
• Osborne, M., Presnell, B. and Turlach, B. (2000). On the LASSO and its dual. J. Comput. Graph. Statist. 9 319–337.
• Rao, C. R. and Wu, Y. (2001). On model selection (with discussion). In Model Selection (P. Lahiri, ed.) 1–64. IMS, Beachwood, OH.
• Schapire, R. E. (1990). The strength of weak learnability. Machine Learning 5 197–227.
• Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statist. 6 461–464.
• Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B 58 267–288.
• Tsybakov, A. B. (2003). Optimal rates of aggregation. In Learning Theory and Kernel Machines. Lecture Notes in Artificial Intelligence 2777 303–313. Springer, Heidelberg.
• Tsybakov, A. B. (2004). Introduction à l'estimation non-paramétrique. Springer, Berlin.
• Wegkamp, M. H. (2003). Model selection in nonparametric regression. Ann. Statist. 31 252–273.
• Yang, Y. (2000). Combining different procedures for adaptive regression. J. Multivariate Anal. 74 135–161.
• Yang, Y. (2001). Adaptive regression by mixing. J. Amer. Statist. Assoc. 96 574–588.
• Yang, Y. (2004). Aggregating regression procedures to improve performance. Bernoulli 10 25–47.