The Annals of Statistics

Best subset selection, persistence in high-dimensional statistical learning and optimization under l1 constraint

Eitan Greenshtein


Let (Y, X1, …, Xm) be a random vector. It is desired to predict Y based on (X1, …, Xm). Examples of prediction methods are regression, classification using logistic regression or separating hyperplanes, and so on.

We consider the problem of best subset selection, and study it in the context m = n^α, α > 1, where n is the number of observations. We investigate procedures that are based on empirical risk minimization. It is shown that, in common cases, we should aim to find the best subset among those of size of order o(n / log(n)). It is also shown that, in some “asymptotic sense,” when assuming a certain sparsity condition, there is no loss in letting m be much larger than n, for example, m = n^α, α > 1. This is in comparison to starting with the “best” subset of size smaller than n and regardless of the value of α.
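As a rough illustration of what best subset selection by empirical risk minimization means, the following sketch searches all subsets of a fixed size k and keeps the one with the smallest empirical squared-error risk. The abstract does not fix a loss function or a search algorithm, so the squared-error loss, the exhaustive search, the function name `best_subset` and the simulated data are all assumptions made for illustration only. Exhaustive search is feasible only for small m and k.

```python
from itertools import combinations

import numpy as np


def best_subset(X, y, k):
    """Exhaustive search (illustrative, hypothetical helper).

    Returns (subset, coefficients, empirical_risk) for the size-k subset of
    columns of X whose least-squares fit has the smallest mean squared error.
    """
    n, m = X.shape
    best = (None, None, np.inf)
    for subset in combinations(range(m), k):
        cols = list(subset)
        # Least-squares fit restricted to the chosen columns.
        coef, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
        # Empirical risk under squared-error loss (an assumed loss).
        risk = np.mean((y - X[:, cols] @ coef) ** 2)
        if risk < best[2]:
            best = (subset, coef, risk)
    return best


# Simulated example: only predictors 1 and 5 carry signal.
rng = np.random.default_rng(0)
n, m = 50, 8
X = rng.standard_normal((n, m))
y = 2.0 * X[:, 1] - 3.0 * X[:, 5] + 0.1 * rng.standard_normal(n)

subset, coef, risk = best_subset(X, y, k=2)
print(subset, risk)
```

The number of candidate subsets grows combinatorially in m, which is one reason convex surrogates such as an l1 constraint are of interest when m is much larger than n.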

We then study conditions under which empirical risk minimization subject to l1 constraint yields nearly the best subset. These results extend some recent results obtained by Greenshtein and Ritov.
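To make the phrase "empirical risk minimization subject to l1 constraint" concrete, here is a minimal sketch that minimizes the empirical squared-error risk over the set of coefficient vectors whose l1 norm is at most b. The squared-error loss, the projected gradient method, the sort-based projection onto the l1 ball, and the names `project_l1_ball` and `l1_constrained_erm` are assumptions chosen for illustration; they are not taken from the paper.

```python
import numpy as np


def project_l1_ball(v, b):
    """Euclidean projection of v onto {w : ||w||_1 <= b}, with b > 0."""
    if np.abs(v).sum() <= b:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]          # magnitudes in decreasing order
    css = np.cumsum(u)
    ks = np.arange(1, v.size + 1)
    # Largest index whose magnitude stays positive after soft-thresholding.
    rho = np.nonzero(u * ks > css - b)[0][-1]
    theta = (css[rho] - b) / (rho + 1.0)  # soft-threshold level
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)


def l1_constrained_erm(X, y, b, n_iter=500):
    """Projected gradient descent for
    min (1 / (2n)) * ||y - X beta||^2  subject to  ||beta||_1 <= b."""
    n, m = X.shape
    # Step size 1/L, where L = sigma_max(X)^2 / n bounds the gradient's
    # Lipschitz constant.
    step = n / (np.linalg.norm(X, 2) ** 2)
    beta = np.zeros(m)
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / n
        beta = project_l1_ball(beta - step * grad, b)
    return beta


# Simulated example with more predictors than observations (m > n).
rng = np.random.default_rng(1)
n, m = 40, 200
X = rng.standard_normal((n, m))
beta_true = np.zeros(m)
beta_true[3], beta_true[17] = 2.0, -1.5
y = X @ beta_true + 0.1 * rng.standard_normal(n)

b = np.abs(beta_true).sum()               # constraint level (assumed known here)
beta_hat = l1_constrained_erm(X, y, b)
risk_hat = np.mean((y - X @ beta_hat) ** 2)
risk_zero = np.mean(y ** 2)
print(np.abs(beta_hat).sum(), risk_hat, risk_zero)
```

In practice the constraint level b is not known and would have to be chosen, for example by cross-validation; the example sets it to the l1 norm of the simulated coefficient vector only to keep the sketch short.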

Finally we present a high-dimensional simulation study of a “boosting type” classification procedure.

Article information

Ann. Statist., Volume 34, Number 5 (2006), 2367-2386.

First available in Project Euclid: 23 January 2007

Primary: 62C99: None of the above, but in this section

Variable selection; persistence


Greenshtein, Eitan. Best subset selection, persistence in high-dimensional statistical learning and optimization under l1 constraint. Ann. Statist. 34 (2006), no. 5, 2367--2386. doi:10.1214/009053606000000768.

  • Bickel, P. and Levina, E. (2004). Some theory of Fisher's linear discriminant function, “naive Bayes,” and some alternatives where there are many more variables than observations. Bernoulli 10 989--1010.
  • Breiman, L. (2001). Statistical modeling: The two cultures (with discussion). Statist. Sci. 16 199--231.
  • Breiman, L. (2004). Population theory for boosting ensembles. Ann. Statist. 32 1--11.
  • Bühlmann, P. and Yu, B. (2004). Discussion of boosting papers. Ann. Statist. 32 96--101.
  • Chen, S., Donoho, D. and Saunders, M. (2001). Atomic decomposition by basis pursuit. SIAM Rev. 43 129--159.
  • Donoho, D. (2004). For most large underdetermined systems of linear equations the minimal $l^1$-norm solution is also the sparsest solution. Technical Report 2004-9, Dept. Statistics, Stanford Univ.
  • Donoho, D. (2004). For most large underdetermined systems of equations, the minimal $l^1$-norm near-solution approximates the sparsest near-solution. Technical Report 2004-10, Dept. Statistics, Stanford Univ.
  • Efron, B., Johnstone, I., Hastie, T. and Tibshirani, R. (2004). Least angle regression (with discussion). Ann. Statist. 32 407--499.
  • Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc. 96 1348--1360.
  • Fan, J. and Peng, H. (2004). Nonconcave penalized likelihood with a diverging number of parameters. Ann. Statist. 32 928--961.
  • Friedman, J., Hastie, T., Rosset, S., Tibshirani, R. and Zhu, J. (2004). Discussion of boosting papers. Ann. Statist. 32 102--107.
  • Greenshtein, E. (2005). Prediction, model selection and random dimension penalties. Sankhyā 67 46--73.
  • Greenshtein, E. and Ritov, Y. (2004). Persistence in high-dimensional predictor selection and the virtue of overparametrization. Bernoulli 10 971--988.
  • Hastie, T., Tibshirani, R. and Friedman, J. (2001). The Elements of Statistical Learning. Data Mining, Inference and Prediction. Springer, New York.
  • Huber, P. (1973). Robust regression: Asymptotics, conjectures, and Monte Carlo. Ann. Statist. 1 799--821.
  • Juditsky, A. and Nemirovski, A. (2000). Functional aggregation for nonparametric regression. Ann. Statist. 28 681--712.
  • Lee, W. S., Bartlett, P. L. and Williamson, R. C. (1996). Efficient agnostic learning of neural networks with bounded fan-in. IEEE Trans. Inform. Theory 42 2118--2132.
  • Lugosi, G. and Vayatis, N. (2004). On the Bayes-risk consistency of regularized boosting methods. Ann. Statist. 32 30--55.
  • Meinshausen, N. and Bühlmann, P. (2006). High-dimensional graphs and variable selection with the Lasso. Ann. Statist. 34 1436--1462.
  • Nemirovski, A. and Yudin, D. (1983). Problem Complexity and Method Efficiency in Optimization. Wiley, New York.
  • Nguyen, D. V., Arpat, A. B., Wang, N. and Carroll, R. J. (2002). DNA microarray experiments: Biological and technological aspects. Biometrics 58 701--717.
  • Pisier, G. (1981). Remarques sur un résultat non publié de B. Maurey. In Séminaire d'Analyse Fonctionnelle 112. École Polytechnique, Palaiseau.
  • Portnoy, S. (1984). Asymptotic behavior of M-estimators of $p$ regression parameters when $p^2/n$ is large. I. Consistency. Ann. Statist. 12 1298--1309.
  • Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B 58 267--288.
  • Vapnik, V. N. (1998). Statistical Learning Theory. Wiley, New York.
  • Yohai, V. J. and Maronna, R. A. (1979). Asymptotic behavior of M-estimators for the linear model. Ann. Statist. 7 258--268.