The Annals of Statistics

Estimation bounds and sharp oracle inequalities of regularized procedures with Lipschitz loss functions

Abstract

We obtain estimation error rates and sharp oracle inequalities for regularization procedures of the form \begin{equation*}\hat{f}\in\mathop{\operatorname{argmin}}_{f\in F}\Bigg(\frac{1}{N}\sum_{i=1}^{N}\ell_{f}(X_{i},Y_{i})+\lambda \Vert f\Vert \Bigg)\end{equation*} when $\Vert \cdot \Vert$ is any norm, $F$ is a convex class of functions and $\ell$ is a Lipschitz loss function satisfying a Bernstein condition over $F$. We explore both the bounded and sub-Gaussian stochastic frameworks for the distribution of the $f(X_{i})$’s, with no assumption on the distribution of the $Y_{i}$’s. The general results rely on two main objects: a complexity function and a sparsity equation that depend on the specific setting at hand (loss $\ell$ and norm $\Vert \cdot \Vert$).
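The penalized procedure above can be sketched numerically. The following is a minimal illustration (not the authors' code) instantiating the general estimator with the logistic loss and the $\ell_1$ norm, solved by proximal gradient descent; all function names and parameter choices are illustrative.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1: coordinatewise soft-thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def logistic_lasso(X, y, lam, step=0.1, n_iter=500):
    """Approximately minimise (1/N) sum_i log(1 + exp(-y_i <x_i, b>)) + lam * ||b||_1
    by proximal gradient descent (ISTA). Labels y are in {-1, +1}."""
    N, d = X.shape
    b = np.zeros(d)
    for _ in range(n_iter):
        margins = y * (X @ b)
        # gradient of the average logistic loss at b
        grad = -(X.T @ (y / (1.0 + np.exp(margins)))) / N
        # gradient step on the loss, proximal step on the penalty
        b = soft_threshold(b - step * grad, step * lam)
    return b
```

The same scheme applies to any Lipschitz loss with a computable (sub)gradient and any norm with a computable proximal operator; only those two components change across the examples treated in the paper.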

As a proof of concept, we obtain minimax rates of convergence in the following problems: (1) matrix completion with any Lipschitz loss function, including the hinge and logistic loss for the so-called 1-bit matrix completion instance of the problem, and quantile losses for the general case, which makes it possible to estimate any quantile of the entries of the matrix; (2) the logistic LASSO and variants such as the logistic SLOPE, as well as shape-constrained logistic regression; (3) kernel methods, where the loss is the hinge loss and the regularization function is the RKHS norm.
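For the 1-bit matrix completion instance, the same penalized scheme pairs the logistic loss on observed signs with the nuclear norm, whose proximal operator soft-thresholds singular values. The sketch below is illustrative only (not the authors' implementation), with hypothetical names and parameters.

```python
import numpy as np

def svd_soft_threshold(A, t):
    """Proximal operator of t * ||.||_* : soft-threshold the singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ np.diag(np.maximum(s - t, 0.0)) @ Vt

def one_bit_completion(Y, mask, lam, step=1.0, n_iter=200):
    """Approximately minimise the average logistic loss over observed entries
    (mask == True, labels Y in {-1, +1}) plus lam * ||M||_*, by proximal
    gradient descent with singular-value soft-thresholding."""
    M = np.zeros_like(Y, dtype=float)
    n_obs = mask.sum()
    for _ in range(n_iter):
        G = np.zeros_like(M)
        # gradient of the logistic loss, supported on observed entries
        G[mask] = -Y[mask] / (1.0 + np.exp(Y[mask] * M[mask])) / n_obs
        M = svd_soft_threshold(M - step * G, step * lam)
    return M
```

Replacing the logistic gradient with a quantile-loss subgradient yields, in the same way, an estimator of any quantile of the entries.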

Article information

Source
Ann. Statist., Volume 47, Number 4 (2019), 2117-2144.

Dates
Revised: June 2018
First available in Project Euclid: 21 May 2019

https://projecteuclid.org/euclid.aos/1558425641

Digital Object Identifier
doi:10.1214/18-AOS1742

Mathematical Reviews number (MathSciNet)
MR3953446

Zentralblatt MATH identifier
07082281

Citation

Alquier, Pierre; Cottet, Vincent; Lecué, Guillaume. Estimation bounds and sharp oracle inequalities of regularized procedures with Lipschitz loss functions. Ann. Statist. 47 (2019), no. 4, 2117--2144. doi:10.1214/18-AOS1742. https://projecteuclid.org/euclid.aos/1558425641

References

• [1] Alquier, P. (2013). Bayesian methods for low-rank matrix estimation: Short survey and theoretical study. In Algorithmic Learning Theory. Lecture Notes in Computer Science 8139 309–323. Springer, Heidelberg.
• [2] Alquier, P., Ridgway, J. and Chopin, N. (2016). On the properties of variational approximations of Gibbs posteriors. J. Mach. Learn. Res. 17 239.
• [3] Alquier, P., Cottet, V. and Lecué, G. (2019). Supplement to “Estimation bounds and sharp oracle inequalities of regularized procedures with Lipschitz loss functions.” DOI:10.1214/18-AOS1742SUPP.
• [4] Audibert, J.-Y. and Tsybakov, A. B. (2007). Fast learning rates for plug-in classifiers. Ann. Statist. 35 608–633.
• [5] Barthe, F., Guédon, O., Mendelson, S. and Naor, A. (2005). A probabilistic approach to the geometry of the $l^{n}_{p}$-ball. Ann. Probab. 33 480–513.
• [6] Bartlett, P. L., Bousquet, O. and Mendelson, S. (2005). Local Rademacher complexities. Ann. Statist. 33 1497–1537.
• [7] Bartlett, P. L. and Mendelson, S. (2006). Empirical minimization. Probab. Theory Related Fields 135 311–334.
• [8] Bellec, P., Lecué, G. and Tsybakov, A. (2018). Slope meets Lasso: Improved oracle bounds and optimality. Ann. Statist. 46 3603–3642.
• [9] Belloni, A. and Chernozhukov, V. (2011). $\ell_{1}$-penalized quantile regression in high-dimensional sparse models. Ann. Statist. 39 82–130.
• [10] Bogdan, M., van den Berg, E., Sabatti, C., Su, W. and Candès, E. J. (2015). SLOPE—adaptive variable selection via convex optimization. Ann. Appl. Stat. 9 1103–1140.
• [11] Cai, T. and Zhou, W.-X. (2013). A max-norm constrained minimization approach to 1-bit matrix completion. J. Mach. Learn. Res. 14 3619–3647.
• [12] Candès, E. J. and Plan, Y. (2010). Matrix completion with noise. Proc. IEEE 98 925–936.
• [13] Catoni, O. (2004). Statistical Learning Theory and Stochastic Optimization: Ecole d’Eté de Probabilités de Saint-Flour, XXXI-2001. 31. Springer, Berlin.
• [14] Catoni, O. (2007). Pac-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning. Institute of Mathematical Statistics Lecture Notes—Monograph Series 56. IMS, Beachwood, OH.
• [15] Chafaï, D., Guédon, O., Lecué, G. and Pajor, A. (2012). Interactions Between Compressed Sensing Random Matrices and High Dimensional Geometry. Panoramas et Synthèses [Panoramas and Syntheses] 37. Société Mathématique de France, Paris.
• [16] Chandrasekaran, V., Recht, B., Parrilo, P. A. and Willsky, A. S. (2012). The convex geometry of linear inverse problems. Found. Comput. Math. 12 805–849.
• [17] Cottet, V. and Alquier, P. (2016). 1-bit matrix completion: PAC-Bayesian analysis of a variational approximation. Mach. Learn. To appear. Preprint arXiv:1604.04191.
• [18] Dudley, R. M. (2002). Real Analysis and Probability. Cambridge Studies in Advanced Mathematics 74. Cambridge Univ. Press, Cambridge. Revised reprint of the 1989 original.
• [19] Garcia-Magariños, M., Antoniadis, A., Cao, R. and González-Manteiga, W. (2010). Lasso logistic regression, GSoft and the cyclic coordinate descent algorithm: Application to gene expression data. Stat. Appl. Genet. Mol. Biol. 9 30.
• [20] Gordon, Y., Litvak, A. E., Mendelson, S. and Pajor, A. (2007). Gaussian averages of interpolated bodies and applications to approximate reconstruction. J. Approx. Theory 149 59–73.
• [21] Klopp, O. (2014). Noisy low-rank matrix completion with general sampling distribution. Bernoulli 20 282–303.
• [22] Koltchinskii, V. (2006). Local Rademacher complexities and oracle inequalities in risk minimization. Ann. Statist. 34 2593–2656.
• [23] Koltchinskii, V. (2011). Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems. Lecture Notes in Math. 2033. Springer, Heidelberg.
• [24] Koltchinskii, V., Lounici, K. and Tsybakov, A. B. (2011). Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. Ann. Statist. 39 2302–2329.
• [25] Koltchinskii, V. and Panchenko, D. (2002). Empirical margin distributions and bounding the generalization error of combined classifiers. Ann. Statist. 30 1–50.
• [26] Lafond, J., Klopp, O., Moulines, E. and Salmon, J. (2014). Probabilistic low-rank matrix completion on finite alphabets. In Advances in Neural Information Processing Systems 1727–1735.
• [27] Lecué, G. (2011). Interplay Between Concentration, Complexity and Geometry in Learning Theory with Applications to High Dimensional Data Analysis. Habilitation à diriger des recherches, Université Paris-Est Marne-la-Vallée.
• [28] Lecué, G. and Mendelson, S. (2012). General nonexact oracle inequalities for classes with a subexponential envelope. Ann. Statist. 40 832–860.
• [29] Lecué, G. and Mendelson, S. (2013). Learning subgaussian classes: Upper and minimax bounds. Technical report, CNRS, Ecole polytechnique and Technion. To appear in Topics in Learning Theory, Société Mathématique de France (S. Boucheron and N. Vayatis, eds.).
• [30] Lecué, G. and Mendelson, S. (2017). Regularization and the small-ball method II: Complexity dependent error rates. J. Mach. Learn. Res. 18 146.
• [31] Lecué, G. and Mendelson, S. (2018). Regularization and the small-ball method I: Sparse recovery. Ann. Statist. 46 611–641.
• [32] Mai, T. T. and Alquier, P. (2015). A Bayesian approach for noisy matrix completion: Optimal rate under general sampling distribution. Electron. J. Stat. 9 823–841.
• [33] Mak, C. (1999). Polychotomous Logistic Regression Via the Lasso. ProQuest LLC, Ann Arbor, MI. Thesis (Ph.D.)—Univ. Toronto.
• [34] Mammen, E. and Tsybakov, A. B. (1999). Smooth discrimination analysis. Ann. Statist. 27 1808–1829.
• [35] Meier, L., van de Geer, S. and Bühlmann, P. (2008). The group Lasso for logistic regression. J. R. Stat. Soc. Ser. B. Stat. Methodol. 70 53–71.
• [36] Mendelson, S. (2004). On the performance of kernel classes. J. Mach. Learn. Res. 4 759–771.
• [37] Rao, M. M. and Ren, Z. D. (1991). Theory of Orlicz Spaces. Monographs and Textbooks in Pure and Applied Mathematics 146. Dekker, New York.
• [38] Rohde, A. and Tsybakov, A. B. (2011). Estimation of high-dimensional low-rank matrices. Ann. Statist. 39 887–930.
• [39] Sabbe, N., Thas, O. and Ottoy, J.-P. (2013). EMLasso: Logistic lasso with missing data. Stat. Med. 32 3143–3157.
• [40] Srebro, N., Rennie, J. and Jaakkola, T. S. (2004). Maximum-margin matrix factorization. In Advances in Neural Information Processing Systems 1329–1336.
• [41] Su, W. and Candès, E. (2016). SLOPE is adaptive to unknown sparsity and asymptotically minimax. Ann. Statist. 44 1038–1068.
• [42] Tian, G.-L., Tang, M.-L., Fang, H.-B. and Tan, M. (2008). Efficient methods for estimating constrained parameters with applications to regularized (lasso) logistic regression. Comput. Statist. Data Anal. 52 3528–3542.
• [43] Tsybakov, A. B. (2004). Optimal aggregation of classifiers in statistical learning. Ann. Statist. 32 135–166.
• [44] van de Geer, S. (2016). Estimation and Testing Under Sparsity. Lecture Notes in Math. 2159. Springer, Cham.
• [45] van de Geer, S. A. (2008). High-dimensional generalized linear models and the lasso. Ann. Statist. 36 614–645.
• [46] Zhang, T. (2004). Statistical behavior and consistency of classification methods based on convex risk minimization. Ann. Statist. 32 56–85.

Supplemental materials

• Supplementary material to “Estimation bounds and sharp oracle inequalities of regularized procedures with Lipschitz loss functions”. In the supplementary material, we provide a simulation study of the different procedures introduced for matrix completion and develop the example of kernel estimation. All the proofs are gathered in the supplementary material. Finally, we propose a brief study of the ERM without penalization.