The Annals of Statistics

Local Rademacher complexities and oracle inequalities in risk minimization

Abstract

Let ℱ be a class of measurable functions f:S↦[0, 1] defined on a probability space (S, $\mathcal{A}$, P). Given a sample (X1, …, Xn) of i.i.d. random variables taking values in S with common distribution P, let Pn denote the empirical measure based on (X1, …, Xn). We study an empirical risk minimization problem Pnf→min , f∈ℱ. Given a solution n of this problem, the goal is to obtain very general upper bounds on its excess risk $$\mathcal{E}_{P}(\hat{f}_{n}):=P\hat{f}_{n}-\inf_{f\in \mathcal{F}}Pf,$$ expressed in terms of relevant geometric parameters of the class ℱ. Using concentration inequalities and other empirical processes tools, we obtain both distribution-dependent and data-dependent upper bounds on the excess risk that are of asymptotically correct order in many examples. The bounds involve localized sup-norms of empirical and Rademacher processes indexed by functions from the class. We use these bounds to develop model selection techniques in abstract risk minimization problems that can be applied to more specialized frameworks of regression and classification.

Article information

Source
Ann. Statist., Volume 34, Number 6 (2006), 2593-2656.

Dates
First available in Project Euclid: 23 May 2007

https://projecteuclid.org/euclid.aos/1179935055

Digital Object Identifier
doi:10.1214/009053606000001019

Mathematical Reviews number (MathSciNet)
MR2329442

Zentralblatt MATH identifier
1118.62065

Citation

Koltchinskii, Vladimir. Local Rademacher complexities and oracle inequalities in risk minimization. Ann. Statist. 34 (2006), no. 6, 2593--2656. doi:10.1214/009053606000001019. https://projecteuclid.org/euclid.aos/1179935055

References

• Anthony, M. and Bartlett, P. (1999). Neural Network Learning: Theoretical Foundations. Cambridge Univ. Press.
• Baraud, Y. (2002). Model selection for regression on a random design. ESAIM Probab. Statist. 6 127–146.
• Barron, A., Birgé, L. and Massart, P. (1999). Risk bounds for model selection via penalization. Probab. Theory Related Fields 113 301–413.
• Bartlett, P., Boucheron, S. and Lugosi, G. (2002). Model selection and error estimation. Machine Learning 48 85–113.
• Bartlett, P., Bousquet, O. and Mendelson, S. (2005). Local Rademacher complexities. Ann. Statist. 33 1497–1537.
• Bartlett, P., Jordan, M. and McAuliffe, J. (2006). Convexity, classification and risk bounds. J. Amer. Statist. Assoc. 101 138–156.
• Bartlett, P. and Mendelson, S. (2006). Empirical minimization. Probab. Theory Related Fields 135 311–334.
• Birgé, L. and Massart, P. (1997). From model selection to adaptive estimation. In Festschrift for Lucien Le Cam. Research Papers in Probability and Statistics (D. Pollard, E. Torgersen and G. Yang, eds.) 55–87. Springer, New York.
• Blanchard, G., Bousquet, O. and Massart, P. (2003). Statistical performance of support vector machines. Preprint.
• Blanchard, G., Lugosi, G. and Vayatis, N. (2003). On the rate of convergence of regularized boosting classifiers. J. Mach. Learn. Res. 4 861–894.
• Boucheron, S., Bousquet, O., Lugosi, G. and Massart, P. (2005). Moment inequalities for functions of independent random variables. Ann. Probab. 33 514–560.
• Boucheron, S., Lugosi, G. and Massart, P. (2000). A sharp concentration inequality with applications. Random Structures Algorithms 16 277–292.
• Bousquet, O. (2002). A Bennett concentration inequality and its application to suprema of empirical processes. C. R. Acad. Sci. Paris Ser. I Math. 334 495–500.
• Bousquet, O., Koltchinskii, V. and Panchenko, D. (2002). Some local measures of complexity of convex hulls and generalization bounds. In Computational Learning Theory. Lecture Notes in Comput. Sci. 2375 59–73. Springer, Berlin.
• Devroye, L., Györfi, G. and Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition. Springer, New York.
• Dudley, R. M. (1999). Uniform Central Limit Theorems. Cambridge Univ. Press.
• Einmahl, U. and Mason, D. (2000). An empirical processes approach to the uniform consistency of kernel type function estimators. J. Theoret. Probab. 13 1–37.
• Giné, E. and Guillou, A. (2001). On consistency of kernel density estimators for randomly censored data: Rates holding uniformly over adaptive intervals. Ann. Inst. H. Poincaré Probab. Statist. 37 503–522.
• Giné, E. and Koltchinskii, V. (2006). Concentration inequalities and asymptotic results for ratio type empirical processes. Ann. Probab. 34 1143–1216.
• Giné, E., Koltchinskii, V. and Wellner, J. (2003). Ratio limit theorems for empirical processes. In Stochastic Inequalities and Applications (E. Giné, C. Houdré and D. Nualart, eds.) 249–278. Birkhäuser, Basel.
• Giné, E. and Zinn, J. (1984). Some limit theorems for empirical processes (with discussion). Ann. Probab. 12 929–998.
• Györfi, L., Kohler, M., Krzyzak, A. and Walk, H. (2002). A Distribution-Free Theory of Nonparametric Regression. Springer, New York.
• Johnstone, I. M. (1998). Oracle inequalities and nonparametric function estimation. In Proc. International Congress of Mathematicians 3 267–278.
• Klein, T. (2002). Une inégalité de concentration à gauche pour les processus empiriques. C. R. Acad. Sci. Paris Ser. I Math. 334 501–504.
• Kohler, M. (2000). Inequalities for uniform deviations of averages from expectations with applications to nonparametric regression. J. Statist. Plann. Inference 89 1–23.
• Koltchinskii, V. (2001). Rademacher penalties and structural risk minimization. IEEE Trans. Inform. Theory 47 1902–1914.
• Koltchinskii, V. and Panchenko, D. (2000). Rademacher processes and bounding the risk of function learning. In High Dimensional Probability II (E. Giné, D. Mason and J. Wellner, eds.) 443–457. Birkhäuser, Boston.
• Koltchinskii, V. and Panchenko, D. (2002). Empirical margin distributions and bounding the generalization error of combined classifiers. Ann. Statist. 30 1–50.
• Koltchinskii, V. and Panchenko, D. (2005). Complexities of convex combinations and bounding the generalization error in classification. Ann. Statist. 33 1455–1496.
• Koltchinskii, V., Panchenko, D. and Lozano, F. (2003). Bounding the generalization error of convex combinations of classifiers: balancing the dimensionality and the margins. Ann. Appl. Probab. 13 213–252.
• Ledoux, M. and Talagrand, M. (1991). Probability in Banach Spaces. Isoperimetry and Processes. Springer, New York.
• Lee, W. S., Bartlett, P. and Williamson, R. C. (1996). Efficient agnostic learning of neural networks with bounded fan-in. IEEE Trans. Inform. Theory 42 2118–2132.
• Lugosi, G. and Vayatis, N. (2004). On the Bayes-risk consistency of regularized boosting methods. Ann. Statist. 32 30–55.
• Lugosi, G. and Wegkamp, M. (2004). Complexity regularization via localized random penalties. Ann. Statist. 32 1679–1697.
• Mammen, E. and Tsybakov, A. (1999). Smooth discrimination analysis. Ann. Statist. 27 1808–1829.
• Massart, P. (2000). Some applications of concentration inequalities to statistics. Ann. Fac. Sci. Toulouse Math. (6) 9 245–303.
• Mendelson, S. (2002). Improving the sample complexity using global data. IEEE Trans. Inform. Theory 48 1977–1991.
• Mendelson, S. (2002). Geometric parameters of kernel machines. In Computational Learning Theory. Lecture Notes in Comput. Sci. 2375 29–43. Springer, Berlin.
• Shen, X. and Wong, W.H. (1994). Convergence rate of sieve estimates. Ann. Statist. 22 580–615.
• Steinwart, I. (2005). Consistency of support vector machines and other regularized kernel machines. IEEE Trans. Inform. Theory 51 128–142.
• Talagrand, M. (1994). Sharper bounds for Gaussian and empirical processes. Ann. Probab. 22 28–76.
• Talagrand, M. (1996). A new look at independence. Ann. Probab. 24 1–34.
• Talagrand, M. (1996). New concentration inequalities in product spaces. Invent. Math. 126 505–563.
• Tsybakov, A. (2004). Optimal aggregation of classifiers in statistical learning. Ann. Statist. 32 135–166.
• Tsybakov, A. and van de Geer, S. (2005). Square root penalty: Adaptation to the margin in classification and in edge estimation. Ann. Statist. 33 1203–1224.
• van de Geer, S. (2000). Empirical Processes in M-Estimation. Cambridge Univ. Press.
• van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes. With Applications to Statistics. Springer, New York.
• Vapnik, V. (1998). Statistical Learning Theory. Wiley, New York.
• Vapnik, V. and Chervonenkis, A. (1974). Theory of Pattern Recognition. Nauka, Moscow (in Russian).
• Zhang, T. (2004). Statistical behavior and consistency of classification methods based on convex risk minimization (with discussion). Ann. Statist. 32 56–117, 128–134.