Bernoulli

Empirical risk minimization is optimal for the convex aggregation problem

Guillaume Lecué

Abstract

Let $F$ be a finite model of cardinality $M$ and denote by $\operatorname{conv}(F)$ its convex hull. The problem of convex aggregation is to construct a procedure whose risk is as close as possible to the minimal risk over $\operatorname{conv}(F)$. Consider the bounded regression model with respect to the squared risk, denoted by $R(\cdot)$. If $\widehat{f}_{n}^{\mathit{ERM}\mbox{-}C}$ denotes the empirical risk minimization procedure over $\operatorname{conv}(F)$, then we prove that for any $x>0$, with probability greater than $1-4\exp(-x)$,

\[R(\widehat{f}_{n}^{\mathit{ERM}\mbox{-}C})\leq\min_{f\in\operatorname{conv}(F)}R(f)+c_{0}\max\biggl(\psi_{n}^{(C)}(M),\frac{x}{n}\biggr),\]

where $c_{0}>0$ is an absolute constant and $\psi_{n}^{(C)}(M)$ is the optimal rate of convex aggregation introduced by Tsybakov (In Computational Learning Theory and Kernel Machines (COLT-2003) (2003) 303–313. Heidelberg: Springer), namely $\psi_{n}^{(C)}(M)=M/n$ when $M\leq\sqrt{n}$ and $\psi_{n}^{(C)}(M)=\sqrt{\log(\mathrm{e}M/\sqrt{n})/n}$ when $M>\sqrt{n}$.
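
Illustrative computation

The abstract specifies the estimator but not an algorithm for computing it. Since every element of $\operatorname{conv}(F)$ has the form $\sum_{j=1}^{M}w_{j}f_{j}$ with $w$ in the probability simplex, computing $\widehat{f}_{n}^{\mathit{ERM}\mbox{-}C}$ reduces to a simplex-constrained least-squares problem over the weights. The Python sketch below is only an illustration of that reduction under assumed names (psi_C, project_simplex, erm_convex_hull); projected gradient descent is one standard solver among many and is not a method taken from the paper.

    import numpy as np

    def psi_C(M, n):
        # Optimal convex aggregation rate psi_n^{(C)}(M) from the abstract:
        # M/n when M <= sqrt(n), else sqrt(log(e*M/sqrt(n)) / n).
        if M <= np.sqrt(n):
            return M / n
        return np.sqrt(np.log(np.e * M / np.sqrt(n)) / n)

    def project_simplex(v):
        # Euclidean projection onto the probability simplex (standard sort-based method).
        u = np.sort(v)[::-1]
        css = np.cumsum(u)
        idx = np.arange(1, v.size + 1)
        rho = np.nonzero(u * idx > css - 1.0)[0][-1]
        theta = (css[rho] - 1.0) / (rho + 1.0)
        return np.maximum(v - theta, 0.0)

    def erm_convex_hull(F_pred, y, n_iter=2000):
        # ERM over conv(F): minimize the empirical squared risk (1/n)||F w - y||^2
        # over weight vectors w in the simplex, by projected gradient descent.
        n, M = F_pred.shape
        w = np.full(M, 1.0 / M)
        # Step size 1/L, where L = 2*sigma_max(F)^2/n bounds the gradient's Lipschitz constant.
        L = 2.0 * np.linalg.norm(F_pred, 2) ** 2 / n
        for _ in range(n_iter):
            grad = 2.0 * F_pred.T @ (F_pred @ w - y) / n
            w = project_simplex(w - grad / L)
        return w

    # Toy usage: column j of F_pred holds the predictions of the j-th function in F.
    rng = np.random.default_rng(0)
    n, M = 200, 50
    F_pred = rng.uniform(-1.0, 1.0, size=(n, M))
    y = F_pred[:, :3] @ np.array([0.5, 0.3, 0.2]) + 0.05 * rng.standard_normal(n)
    w_hat = erm_convex_hull(F_pred, y)
    print("empirical risk:", np.mean((F_pred @ w_hat - y) ** 2))
    print("rate psi_n^(C)(M):", psi_C(M, n))

For such data the oracle inequality above guarantees that the risk of the computed aggregate exceeds $\min_{f\in\operatorname{conv}(F)}R(f)$ by at most $c_{0}\max(\psi_{n}^{(C)}(M),x/n)$ with probability at least $1-4\exp(-x)$; here $M=50>\sqrt{200}\approx14.1$, so the relevant rate is $\psi_{n}^{(C)}(M)=\sqrt{\log(50\mathrm{e}/\sqrt{200})/200}$.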

Article information

Source
Bernoulli, Volume 19, Number 5B (2013), 2153-2166.

Dates
First available in Project Euclid: 3 December 2013

Permanent link to this document
https://projecteuclid.org/euclid.bj/1386078598

Digital Object Identifier
doi:10.3150/12-BEJ447

Mathematical Reviews number (MathSciNet)
MR3160549

Zentralblatt MATH identifier
06254557

Keywords
aggregation; empirical processes theory; empirical risk minimization; learning theory

Citation

Lecué, Guillaume. Empirical risk minimization is optimal for the convex aggregation problem. Bernoulli 19 (2013), no. 5B, 2153–2166. doi:10.3150/12-BEJ447. https://projecteuclid.org/euclid.bj/1386078598


References

  • [1] Audibert, J.Y. (2004). Aggregated estimators and empirical complexity for least square regression. Ann. Inst. Henri Poincaré Probab. Stat. 40 685–736.
  • [2] Audibert, J.Y. (2007). Progressive mixture rules are deviation suboptimal. In Adv. Neural Inf. Process. Syst. 20 41–48. Cambridge: MIT Press.
  • [3] Audibert, J.Y. (2009). Fast learning rates in statistical inference through aggregation. Ann. Statist. 37 1591–1646.
  • [4] Audibert, J.Y. and Catoni, O. (2011). Robust linear least squares regression. Ann. Statist. 39 2766–2794.
  • [5] Bartlett, P.L. and Mendelson, S. (2006). Empirical minimization. Probab. Theory Related Fields 135 311–334.
  • [6] Bartlett, P.L., Mendelson, S. and Neeman, J. (2012). $\ell_{1}$-regularized linear regression: Persistence and oracle inequalities. Probab. Theory Related Fields 154 193–224.
  • [7] Birgé, L. (2006). Model selection via testing: An alternative to (penalized) maximum likelihood estimators. Ann. Inst. Henri Poincaré Probab. Stat. 42 273–325.
  • [8] Bousquet, O. (2002). A Bennett concentration inequality and its application to suprema of empirical processes. C. R. Math. Acad. Sci. Paris 334 495–500.
  • [9] Bousquet, O., Koltchinskii, V. and Panchenko, D. (2002). Some local measures of complexity of convex hulls and generalization bounds. In Computational Learning Theory (Sydney, 2002). Lecture Notes in Computer Science 2375 59–73. Berlin: Springer.
  • [10] Bühlmann, P. and Hothorn, T. (2007). Boosting algorithms: Regularization, prediction and model fitting. Statist. Sci. 22 477–505.
  • [11] Bunea, F. and Nobel, A. (2008). Sequential procedures for aggregating arbitrary estimators of a conditional mean. IEEE Trans. Inform. Theory 54 1725–1735.
  • [12] Bunea, F., Tsybakov, A.B. and Wegkamp, M.H. (2007). Aggregation for Gaussian regression. Ann. Statist. 35 1674–1697.
  • [13] Carl, B. (1985). Inequalities of Bernstein-Jackson-type and the degree of compactness of operators in Banach spaces. Ann. Inst. Fourier (Grenoble) 35 79–118.
  • [14] Catoni, O. (2004). Statistical Learning Theory and Stochastic Optimization. Lecture Notes in Math. 1851. Berlin: Springer. Lecture notes from the 31st Summer School on Probability Theory held in Saint-Flour, July 8–25, 2001.
  • [15] Dalalyan, A.S. and Tsybakov, A.B. (2008). Aggregation by exponential weighting, sharp PAC-Bayesian bounds and sparsity. Machine Learning 72 39–61.
  • [16] Devroye, L., Györfi, L. and Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition. Applications of Mathematics (New York) 31. New York: Springer.
  • [17] Emery, M., Nemirovski, A. and Voiculescu, D. (2000). Lectures on Probability Theory and Statistics (P. Bernard, ed.). Lecture Notes in Math. 1738. Berlin: Springer. Lectures from the 28th Summer School on Probability Theory held in Saint-Flour, August 17–September 3, 1998.
  • [18] Juditsky, A. and Nemirovski, A. (2000). Functional aggregation for nonparametric regression. Ann. Statist. 28 681–712.
  • [19] Juditsky, A., Rigollet, P. and Tsybakov, A.B. (2008). Learning by mirror averaging. Ann. Statist. 36 2183–2206.
  • [20] Klein, T. and Rio, E. (2005). Concentration around the mean for maxima of empirical processes. Ann. Probab. 33 1060–1077.
  • [21] Koltchinskii, V. (2006). Local Rademacher complexities and oracle inequalities in risk minimization. Ann. Statist. 34 2593–2656.
  • [22] Lecué, G. (2011). Interplay between concentration, complexity and geometry in learning theory with applications to high dimensional data analysis. Habilitation à diriger des recherches.
  • [23] Lecué, G. and Mendelson, S. (2009). Aggregation via empirical risk minimization. Probab. Theory Related Fields 145 591–613.
  • [24] Lecué, G. and Mendelson, S. (2010). Sharper lower bounds on the performance of the empirical risk minimization algorithm. Bernoulli 16 605–613.
  • [25] Lecué, G. and Mendelson, S. (2013). On the optimality of the empirical risk minimization procedure for the convex aggregation problem. Ann. Inst. Henri Poincaré Probab. Stat. 49 288–306.
  • [26] Ledoux, M. and Talagrand, M. (1991). Probability in Banach Spaces: Isoperimetry and Processes. Ergebnisse der Mathematik und Ihrer Grenzgebiete (3) [Results in Mathematics and Related Areas (3)] 23. Berlin: Springer.
  • [27] Lounici, K. (2007). Generalized mirror averaging and $D$-convex aggregation. Math. Methods Statist. 16 246–259.
  • [28] Massart, P. (2007). Concentration Inequalities and Model Selection. Lecture Notes in Math. 1896. Berlin: Springer. Lectures from the 33rd Summer School on Probability Theory held in Saint-Flour, July 6–23, 2003, with a foreword by Jean Picard.
  • [29] Massart, P. and Nédélec, É. (2006). Risk bounds for statistical learning. Ann. Statist. 34 2326–2366.
  • [30] Mendelson, S. (2008). Lower bounds for the empirical minimization algorithm. IEEE Trans. Inform. Theory 54 3797–3803.
  • [31] Pisier, G. (1981). Remarques sur un résultat non publié de B. Maurey. In Seminar on Functional Analysis, 1980–1981, Exp. No. V, 13 pp. Palaiseau: École Polytech.
  • [32] Rigollet, P. (2012). Kullback–Leibler aggregation and misspecified generalized linear models. Ann. Statist. 40 639–665.
  • [33] Rigollet, P. and Tsybakov, A. (2011). Exponential screening and optimal rates of sparse estimation. Ann. Statist. 39 731–771.
  • [34] Schapire, R.E., Freund, Y., Bartlett, P. and Lee, W.S. (1998). Boosting the margin: A new explanation for the effectiveness of voting methods. Ann. Statist. 26 1651–1686.
  • [35] Talagrand, M. (1995). Concentration of measure and isoperimetric inequalities in product spaces. Inst. Hautes Études Sci. Publ. Math. 81 73–205.
  • [36] Talagrand, M. (1996). New concentration inequalities in product spaces. Invent. Math. 126 505–563.
  • [37] Tsybakov, A.B. (2003). Optimal rate of aggregation. In Computational Learning Theory and Kernel Machines (COLT-2003). Lecture Notes in Artificial Intelligence 2777 303–313. Heidelberg: Springer.
  • [38] Tsybakov, A.B. (2004). Optimal aggregation of classifiers in statistical learning. Ann. Statist. 32 135–166.
  • [39] Vapnik, V.N. (1998). Statistical Learning Theory. Adaptive and Learning Systems for Signal Processing, Communications, and Control. New York: Wiley.
  • [40] Wang, Z., Paterlini, S., Gao, F. and Yang, Y. (2012). Adaptive minimax estimation over sparse $\ell_{q}$-hulls. Technical report. Available at arXiv:1108.1961.
  • [41] Yang, Y. (2000). Combining different procedures for adaptive regression. J. Multivariate Anal. 74 135–161.
  • [42] Yang, Y. (2000). Mixing strategies for density estimation. Ann. Statist. 28 75–87.
  • [43] Yang, Y. (2004). Aggregating regression procedures to improve performance. Bernoulli 10 25–47.