Bernoulli

Volume 19, Number 5B (2013), 2153–2166.

Empirical risk minimization is optimal for the convex aggregation problem

Guillaume Lecué

Abstract

Let $F$ be a finite model of cardinality $M$ and denote by $\operatorname{conv}(F)$ its convex hull. The problem of convex aggregation is to construct a procedure having a risk as close as possible to the minimal risk over $\operatorname{conv}(F)$. Consider the bounded regression model with respect to the squared risk denoted by $R(\cdot)$. If $\widehat{f}_{n}^{\mathit{ERM}\mbox{-}C}$ denotes the empirical risk minimization procedure over $\operatorname{conv}(F)$, then we prove that for any $x>0$, with probability greater than $1-4\exp(-x)$,

$R(\widehat{f}_{n}^{\mathit{ERM}\mbox{-}C})\leq\min_{f\in\operatorname{conv}(F)}R(f)+c_{0}\max\biggl(\psi_{n}^{(C)}(M),\frac{x}{n}\biggr),$

where $c_{0}>0$ is an absolute constant and $\psi_{n}^{(C)}(M)$ is the optimal rate of convex aggregation, defined by Tsybakov (In Computational Learning Theory and Kernel Machines (COLT-2003) (2003) 303–313, Springer) as $\psi_{n}^{(C)}(M)=M/n$ when $M\leq\sqrt{n}$ and $\psi_{n}^{(C)}(M)=\sqrt{\log(\mathrm{e}M/\sqrt{n})/n}$ when $M>\sqrt{n}$.
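The two objects in the abstract can be made concrete numerically. The sketch below (an illustration, not the paper's own code) computes the piecewise rate $\psi_{n}^{(C)}(M)$ and implements the ERM-C procedure for a finite dictionary: when the functions in $F$ are represented by their values on the sample as the columns of a matrix, minimizing the empirical squared risk over $\operatorname{conv}(F)$ is a quadratic program over the probability simplex, which is solved here by projected gradient descent (one of several standard solvers for this problem; the function names are ours).

```python
import numpy as np

def psi_convex(n, M):
    """Optimal convex-aggregation rate psi_n^(C)(M) from the abstract:
    M/n when M <= sqrt(n), sqrt(log(e*M/sqrt(n))/n) when M > sqrt(n)."""
    if M <= np.sqrt(n):
        return M / n
    return np.sqrt(np.log(np.e * M / np.sqrt(n)) / n)

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex
    (standard sort-based algorithm)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css)[0][-1]
    theta = css[rho] / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def erm_convex(Phi, y, n_iter=2000):
    """ERM over conv(F): minimize the empirical squared risk
    (1/n) * ||y - Phi @ lam||^2 over weights lam in the simplex,
    by projected gradient descent with step 1/L."""
    n, M = Phi.shape
    lam = np.full(M, 1.0 / M)                     # start at the barycenter
    L = 2.0 * np.linalg.norm(Phi, 2) ** 2 / n     # Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = (2.0 / n) * Phi.T @ (Phi @ lam - y)
        lam = project_simplex(lam - grad / L)
    return lam
```

Running `erm_convex` on data whose regression function lies in $\operatorname{conv}(F)$ drives the empirical risk toward zero, and the returned weights always form a convex combination, which is exactly the constraint distinguishing convex aggregation from unrestricted linear aggregation.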

Article information

Source
Bernoulli, Volume 19, Number 5B (2013), 2153–2166.

Dates
First available in Project Euclid: 3 December 2013

Permanent link to this document
https://projecteuclid.org/euclid.bj/1386078598

Digital Object Identifier
doi:10.3150/12-BEJ447

Mathematical Reviews number (MathSciNet)
MR3160549

Zentralblatt MATH identifier
06254557

Citation

Lecué, Guillaume. Empirical risk minimization is optimal for the convex aggregation problem. Bernoulli 19 (2013), no. 5B, 2153–2166. doi:10.3150/12-BEJ447. https://projecteuclid.org/euclid.bj/1386078598

References

• [1] Audibert, J.Y. (2004). Aggregated estimators and empirical complexity for least square regression. Ann. Inst. Henri Poincaré Probab. Stat. 40 685–736.
• [2] Audibert, J.Y. (2007). Progressive mixture rules are deviation suboptimal. In Adv. Neural Inf. Process. Syst. 20 41–48. Cambridge: MIT Press.
• [3] Audibert, J.Y. (2009). Fast learning rates in statistical inference through aggregation. Ann. Statist. 37 1591–1646.
• [4] Audibert, J.Y. and Catoni, O. (2011). Robust linear least squares regression. Ann. Statist. 39 2766–2794.
• [5] Bartlett, P.L. and Mendelson, S. (2006). Empirical minimization. Probab. Theory Related Fields 135 311–334.
• [6] Bartlett, P.L., Mendelson, S. and Neeman, J. (2012). $\ell_{1}$-regularized linear regression: Persistence and oracle inequalities. Probab. Theory Related Fields 154 193–224.
• [7] Birgé, L. (2006). Model selection via testing: An alternative to (penalized) maximum likelihood estimators. Ann. Inst. Henri Poincaré Probab. Stat. 42 273–325.
• [8] Bousquet, O. (2002). A Bennett concentration inequality and its application to suprema of empirical processes. C. R. Math. Acad. Sci. Paris 334 495–500.
• [9] Bousquet, O., Koltchinskii, V. and Panchenko, D. (2002). Some local measures of complexity of convex hulls and generalization bounds. In Computational Learning Theory (Sydney, 2002). Lecture Notes in Computer Science 2375 59–73. Berlin: Springer.
• [10] Bühlmann, P. and Hothorn, T. (2007). Boosting algorithms: Regularization, prediction and model fitting. Statist. Sci. 22 477–505.
• [11] Bunea, F. and Nobel, A. (2008). Sequential procedures for aggregating arbitrary estimators of a conditional mean. IEEE Trans. Inform. Theory 54 1725–1735.
• [12] Bunea, F., Tsybakov, A.B. and Wegkamp, M.H. (2007). Aggregation for Gaussian regression. Ann. Statist. 35 1674–1697.
• [13] Carl, B. (1985). Inequalities of Bernstein-Jackson-type and the degree of compactness of operators in Banach spaces. Ann. Inst. Fourier (Grenoble) 35 79–118.
• [14] Catoni, O. (2004). Statistical Learning Theory and Stochastic Optimization. Lecture Notes in Math. 1851. Berlin: Springer. Lecture notes from the 31st Summer School on Probability Theory held in Saint-Flour, July 8–25, 2001.
• [15] Dalalyan, A.S. and Tsybakov, A.B. (2008). Aggregation by exponential weighting, sharp PAC-Bayesian bounds and sparsity. Machine Learning 72 39–61.
• [16] Devroye, L., Györfi, L. and Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition. Applications of Mathematics (New York) 31. New York: Springer.
• [17] Emery, M., Nemirovski, A. and Voiculescu, D. (2000). Lectures on Probability Theory and Statistics (P. Bernard, ed.). Lecture Notes in Math. 1738. Berlin: Springer. Lectures from the 28th Summer School on Probability Theory held in Saint-Flour, August 17–September 3, 1998.
• [18] Juditsky, A. and Nemirovski, A. (2000). Functional aggregation for nonparametric regression. Ann. Statist. 28 681–712.
• [19] Juditsky, A., Rigollet, P. and Tsybakov, A.B. (2008). Learning by mirror averaging. Ann. Statist. 36 2183–2206.
• [20] Klein, T. and Rio, E. (2005). Concentration around the mean for maxima of empirical processes. Ann. Probab. 33 1060–1077.
• [21] Koltchinskii, V. (2006). Local Rademacher complexities and oracle inequalities in risk minimization. Ann. Statist. 34 2593–2656.
• [22] Lecué, G. (2011). Interplay between concentration, complexity and geometry in learning theory with applications to high dimensional data analysis. Habilitation à diriger des recherches.
• [23] Lecué, G. and Mendelson, S. (2009). Aggregation via empirical risk minimization. Probab. Theory Related Fields 145 591–613.
• [24] Lecué, G. and Mendelson, S. (2010). Sharper lower bounds on the performance of the empirical risk minimization algorithm. Bernoulli 16 605–613.
• [25] Lecué, G. and Mendelson, S. (2013). On the optimality of the empirical risk minimization procedure for the Convex aggregation problem. Ann. Inst. Henri Poincaré Probab. Stat. 49 288–306.
• [26] Ledoux, M. and Talagrand, M. (1991). Probability in Banach Spaces: Isoperimetry and Processes. Ergebnisse der Mathematik und Ihrer Grenzgebiete (3) [Results in Mathematics and Related Areas (3)] 23. Berlin: Springer.
• [27] Lounici, K. (2007). Generalized mirror averaging and $D$-convex aggregation. Math. Methods Statist. 16 246–259.
• [28] Massart, P. (2007). Concentration Inequalities and Model Selection. Lecture Notes in Math. 1896. Berlin: Springer. Lectures from the 33rd Summer School on Probability Theory held in Saint-Flour, July 6–23, 2003, with a foreword by Jean Picard.
• [29] Massart, P. and Nédélec, É. (2006). Risk bounds for statistical learning. Ann. Statist. 34 2326–2366.
• [30] Mendelson, S. (2008). Lower bounds for the empirical minimization algorithm. IEEE Trans. Inform. Theory 54 3797–3803.
• [31] Pisier, G. (1981). Remarques sur un résultat non publié de B. Maurey. In Seminar on Functional Analysis, 1980–1981, Exp. No. V, 13. Palaiseau: École Polytech.
• [32] Rigollet, P. (2012). Kullback–Leibler aggregation and misspecified generalized linear models. Ann. Statist. 40 639–665.
• [33] Rigollet, P. and Tsybakov, A. (2011). Exponential screening and optimal rates of sparse estimation. Ann. Statist. 39 731–771.
• [34] Schapire, R.E., Freund, Y., Bartlett, P. and Lee, W.S. (1998). Boosting the margin: A new explanation for the effectiveness of voting methods. Ann. Statist. 26 1651–1686.
• [35] Talagrand, M. (1995). Concentration of measure and isoperimetric inequalities in product spaces. Inst. Hautes Études Sci. Publ. Math. 81 73–205.
• [36] Talagrand, M. (1996). New concentration inequalities in product spaces. Invent. Math. 126 505–563.
• [37] Tsybakov, A.B. (2003). Optimal rate of aggregation. In Computational Learning Theory and Kernel Machines (COLT-2003). Lecture Notes in Artificial Intelligence 2777 303–313. Heidelberg: Springer.
• [38] Tsybakov, A.B. (2004). Optimal aggregation of classifiers in statistical learning. Ann. Statist. 32 135–166.
• [39] Vapnik, V.N. (1998). Statistical Learning Theory. Adaptive and Learning Systems for Signal Processing, Communications, and Control. New York: Wiley.
• [40] Wang, Z., Paterlini, S., Gao, F. and Yang, Y. (2012). Adaptive minimax estimation over sparse $\ell_{q}$-hulls. Technical report. Available at arXiv:1108.1961.
• [41] Yang, Y. (2000). Combining different procedures for adaptive regression. J. Multivariate Anal. 74 135–161.
• [42] Yang, Y. (2000). Mixing strategies for density estimation. Ann. Statist. 28 75–87.
• [43] Yang, Y. (2004). Aggregating regression procedures to improve performance. Bernoulli 10 25–47.