The Annals of Statistics

Estimation bounds and sharp oracle inequalities of regularized procedures with Lipschitz loss functions

Pierre Alquier, Vincent Cottet, and Guillaume Lecué



We obtain estimation error rates and sharp oracle inequalities for regularization procedures of the form \begin{equation*}\hat{f}\in\mathop{\operatorname{argmin}}_{f\in F}\Bigg(\frac{1}{N}\sum_{i=1}^{N}\ell_{f}(X_{i},Y_{i})+\lambda \Vert f\Vert \Bigg)\end{equation*} when $\Vert \cdot \Vert $ is any norm, $F$ is a convex class of functions and $\ell$ is a Lipschitz loss function satisfying a Bernstein condition over $F$. We explore both the bounded and the sub-Gaussian stochastic frameworks for the distribution of the $f(X_{i})$’s, with no assumption on the distribution of the $Y_{i}$’s. The general results rely on two main objects: a complexity function and a sparsity equation, which depend on the specific setting at hand (the loss $\ell$ and the norm $\Vert \cdot \Vert $).
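As a concrete illustration (not the authors' code), the penalized procedure above can be sketched in the special case of the logistic loss with an $\ell_1$ penalty, i.e. the logistic LASSO, using proximal gradient descent. The function names, step size and toy data below are illustrative assumptions, not part of the paper:

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (coordinatewise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def logistic_lasso(X, y, lam, step=0.1, n_iter=500):
    """Minimize (1/N) sum_i log(1 + exp(-y_i <X_i, w>)) + lam * ||w||_1
    by proximal gradient descent; labels y are in {-1, +1}."""
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        margins = y * (X @ w)
        # gradient of the empirical logistic risk at w
        grad = -(X.T @ (y / (1.0 + np.exp(margins)))) / N
        # gradient step on the risk, proximal step on the penalty
        w = soft_threshold(w - step * grad, step * lam)
    return w

# Toy data: a 3-sparse signal observed through a logistic model.
rng = np.random.default_rng(0)
N, d = 200, 20
X = rng.standard_normal((N, d))
w_star = np.zeros(d)
w_star[:3] = 2.0
p = 1.0 / (1.0 + np.exp(-X @ w_star))
y = np.where(rng.random(N) < p, 1.0, -1.0)
w_hat = logistic_lasso(X, y, lam=0.05)
```

The soft-thresholding step is the proximal map of the $\ell_1$ norm; for other norms (e.g. SLOPE or the nuclear norm) it would be replaced by the corresponding proximal operator.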

As a proof of concept, we obtain minimax rates of convergence in the following problems: (1) matrix completion with any Lipschitz loss function, including the hinge and logistic losses for the so-called 1-bit matrix completion instance of the problem, and quantile losses for the general case, which makes it possible to estimate any quantile of the entries of the matrix; (2) the logistic LASSO and variants such as the logistic SLOPE, as well as shape-constrained logistic regression; (3) kernel methods, where the loss is the hinge loss and the regularization function is the RKHS norm.
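For the 1-bit matrix completion example, a minimal sketch (again not the authors' algorithm) combines the hinge loss with a nuclear-norm penalty, handled through singular value thresholding inside a proximal subgradient loop. All names, the step size, and the toy rank-one data are illustrative assumptions:

```python
import numpy as np

def svt(M, t):
    """Singular value thresholding: proximal operator of t * (nuclear norm)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - t, 0.0)) @ Vt

def onebit_completion(Y, mask, lam, step=1.0, n_iter=300):
    """Hinge-loss matrix completion:
    minimize (1/N) * sum over observed (i,j) of max(0, 1 - Y_ij * M_ij)
             + lam * ||M||_*  (nuclear norm),
    by proximal subgradient descent; observed entries of Y are in {-1, +1}."""
    N = mask.sum()
    M = np.zeros_like(Y, dtype=float)
    for _ in range(n_iter):
        # a subgradient of the empirical hinge risk at M
        G = np.where(mask & (Y * M < 1.0), -Y, 0.0) / N
        M = svt(M - step * G, step * lam)
    return M

# Toy data: a rank-one sign matrix with roughly half of the entries observed.
rng = np.random.default_rng(1)
u = rng.choice([-1.0, 1.0], size=10)
v = rng.choice([-1.0, 1.0], size=10)
Y = np.outer(u, v)
mask = rng.random((10, 10)) < 0.5
M_hat = onebit_completion(Y, mask, lam=0.01)
```

Replacing the hinge subgradient with a subgradient of the pinball loss would give the quantile variant mentioned in the abstract.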

Article information

Ann. Statist., Volume 47, Number 4 (2019), 2117-2144.

Received: January 2018
Revised: June 2018
First available in Project Euclid: 21 May 2019


Primary: 60K35: Interacting random processes; statistical mechanics type models; percolation theory [See also 82B43, 82C43] 62G08: Nonparametric regression
Secondary: 62C20: Minimax procedures 62G05: Estimation 62G20: Asymptotic properties

Keywords: empirical processes; high-dimensional statistics


Alquier, Pierre; Cottet, Vincent; Lecué, Guillaume. Estimation bounds and sharp oracle inequalities of regularized procedures with Lipschitz loss functions. Ann. Statist. 47 (2019), no. 4, 2117--2144. doi:10.1214/18-AOS1742.




Supplemental materials

  • Supplementary material to “Estimation bounds and sharp oracle inequalities of regularized procedures with Lipschitz loss functions”. The supplementary material contains a simulation study of the procedures introduced for matrix completion and develops the example of kernel estimation. All the proofs are gathered there, and it also provides a brief study of ERM without penalization.