The Annals of Statistics

SLOPE is adaptive to unknown sparsity and asymptotically minimax

Weijie Su and Emmanuel Candès

Abstract

We consider high-dimensional sparse regression problems in which we observe $\mathbf{y}=\mathbf{X}\boldsymbol{\beta}+\mathbf{z}$, where $\mathbf{X}$ is an $n\times p$ design matrix and $\mathbf{z}$ is an $n$-dimensional vector of independent Gaussian errors, each with variance $\sigma^{2}$. Our focus is on the recently introduced SLOPE estimator [Ann. Appl. Stat. 9 (2015) 1103–1140], which regularizes the least-squares estimates with the rank-dependent penalty $\sum_{1\le i\le p}\lambda_{i}|\widehat{\beta}|_{(i)}$, where $|\widehat{\beta}|_{(i)}$ is the $i$th largest magnitude of the fitted coefficients. Under Gaussian designs, where the entries of $\mathbf{X}$ are i.i.d. $\mathcal{N}(0,1/n)$, we show that SLOPE with weights $\lambda_{i}$ just about equal to $\sigma\cdot\Phi^{-1}(1-iq/(2p))$ [$\Phi^{-1}(\alpha)$ is the $\alpha$th quantile of a standard normal and $q$ is a fixed number in $(0,1)$] achieves a squared estimation error obeying \[\sup_{\|\boldsymbol{\beta}\|_{0}\le k}\mathbb{P}\bigl(\|\widehat{\boldsymbol{\beta}}_{\mathrm{SLOPE}}-\boldsymbol{\beta}\|^{2}>(1+\varepsilon)\,2\sigma^{2}k\log(p/k)\bigr)\longrightarrow 0\] as the dimension $p$ increases to $\infty$, where $\varepsilon>0$ is an arbitrarily small constant. This holds under a weak assumption on the $\ell_{0}$-sparsity level, namely, $k/p\rightarrow 0$ and $(k\log p)/n\rightarrow 0$, and is sharp in the sense that this is the best possible error any estimator can achieve. A remarkable feature is that SLOPE does not require any knowledge of the degree of sparsity, yet automatically adapts to yield optimal total squared errors over a wide range of $\ell_{0}$-sparsity classes. We are not aware of any other estimator with this property.
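
To make the penalty and the weight sequence concrete, the following is a minimal sketch (not the authors' code) that computes the Benjamini–Hochberg-type weights $\lambda_{i}=\sigma\cdot\Phi^{-1}(1-iq/(2p))$ from the abstract and fits SLOPE by plain proximal gradient descent. The prox of the sorted-$\ell_{1}$ penalty is evaluated with a pool-adjacent-violators pass in the spirit of the FastProxSL1 algorithm of Bogdan et al. (2015) [15]. The function names, the choice $q=0.1$, the step size, the iteration count and the toy problem sizes are illustrative assumptions; note also that the abstract calls for weights "just about equal to" these quantiles rather than exactly equal.

```python
# Illustrative sketch of SLOPE with BH-type weights; not the authors' implementation.
import numpy as np
from scipy.stats import norm


def bh_weights(p, q=0.1, sigma=1.0):
    """Weights lambda_i = sigma * Phi^{-1}(1 - i*q/(2p)), i = 1, ..., p (nonincreasing)."""
    i = np.arange(1, p + 1)
    return sigma * norm.ppf(1 - i * q / (2 * p))


def prox_sorted_l1(v, lam):
    """Prox of the sorted-l1 norm: argmin_b 0.5*||b - v||^2 + sum_i lam_i * |b|_(i)."""
    sign = np.sign(v)
    u = np.abs(v)
    order = np.argsort(u)[::-1]      # indices sorting |v| in decreasing order
    z = u[order] - lam               # shift sorted magnitudes by the sorted weights
    # Pool adjacent violators: project z onto nonincreasing sequences by block averaging.
    vals, lens = [], []
    for x in z:
        vals.append(x)
        lens.append(1)
        while len(vals) > 1 and vals[-1] > vals[-2]:
            x2, l2 = vals.pop(), lens.pop()
            x1, l1 = vals.pop(), lens.pop()
            vals.append((l1 * x1 + l2 * x2) / (l1 + l2))
            lens.append(l1 + l2)
    w = np.maximum(np.repeat(vals, lens), 0.0)   # clip negative blocks at zero
    out = np.empty_like(u)
    out[order] = w                   # undo the sort
    return sign * out


def slope(X, y, lam, n_iter=500):
    """Plain proximal gradient (ISTA) for 0.5*||y - X b||^2 + sum_i lam_i |b|_(i)."""
    L = np.linalg.norm(X, 2) ** 2    # Lipschitz constant of the smooth part
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y)
        beta = prox_sorted_l1(beta - grad / L, lam / L)
    return beta


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, p, k, sigma = 500, 1000, 10, 1.0
    X = rng.normal(scale=1 / np.sqrt(n), size=(n, p))   # Gaussian design, entries N(0, 1/n)
    beta = np.zeros(p)
    beta[:k] = 10.0                                      # k-sparse signal
    y = X @ beta + sigma * rng.normal(size=n)
    beta_hat = slope(X, y, bh_weights(p, q=0.1, sigma=sigma))
    print("squared error       :", np.sum((beta_hat - beta) ** 2))
    print("2*sigma^2*k*log(p/k):", 2 * sigma ** 2 * k * np.log(p / k))
```

The benchmark $2\sigma^{2}k\log(p/k)$ from the abstract is printed alongside the realized squared error so the scale the minimax result refers to is visible; on a single toy draw the two need not be close.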

Article information

Source
Ann. Statist., Volume 44, Number 3 (2016), 1038–1068.

Dates
Received: April 2015
Revised: September 2015
First available in Project Euclid: 11 April 2016

Permanent link to this document
https://projecteuclid.org/euclid.aos/1460381686

Digital Object Identifier
doi:10.1214/15-AOS1397

Mathematical Reviews number (MathSciNet)
MR3485953

Zentralblatt MATH identifier
1338.62032

Subjects
Primary: 62C20: Minimax procedures
Secondary: 62G05: Estimation; 62G10: Hypothesis testing; 62J15: Paired and multiple comparisons

Keywords
SLOPE; sparse regression; adaptivity; false discovery rate (FDR); Benjamini–Hochberg procedure; FDR thresholding

Citation

Su, Weijie; Candès, Emmanuel. SLOPE is adaptive to unknown sparsity and asymptotically minimax. Ann. Statist. 44 (2016), no. 3, 1038–1068. doi:10.1214/15-AOS1397. https://projecteuclid.org/euclid.aos/1460381686

References

  • [1] Abramovich, F. and Benjamini, Y. (1996). Adaptive thresholding of wavelet coefficients. Comput. Statist. Data Anal. 22 351–361.
  • [2] Abramovich, F., Benjamini, Y., Donoho, D. L. and Johnstone, I. M. (2006). Adapting to unknown sparsity by controlling the false discovery rate. Ann. Statist. 34 584–653.
  • [3] Adke, S. R., Waikar, V. B. and Schuurmann, F. J. (1987). A two-stage shrinkage testimator for the mean of an exponential distribution. Comm. Statist. Theory Methods 16 1821–1834.
  • [4] Baraud, Y. (2002). Model selection for regression on a random design. ESAIM Probab. Statist. 6 127–146 (electronic).
  • [5] Barber, R. F. and Candès, E. J. (2015). Controlling the false discovery rate via knockoffs. Ann. Statist. 43 2055–2085.
  • [6] Barlow, R. E., Bartholomew, D. J., Bremner, J. M. and Brunk, H. D. (1972). Statistical Inference Under Order Restrictions: The Theory and Application of Isotonic Regression. Wiley, New York.
  • [7] Bayati, M. and Montanari, A. (2012). The LASSO risk for Gaussian matrices. IEEE Trans. Inform. Theory 58 1997–2017.
  • [8] Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. Roy. Statist. Soc. Ser. B 57 289–300.
  • [9] Benjamini, Y. and Yekutieli, D. (2001). The control of the false discovery rate in multiple testing under dependency. Ann. Statist. 29 1165–1188.
  • [10] Bickel, P. J. (1981). Minimax estimation of the mean of a normal distribution when the parameter space is restricted. Ann. Statist. 9 1301–1309.
  • [11] Bickel, P. J., Ritov, Y. and Tsybakov, A. B. (2009). Simultaneous analysis of lasso and Dantzig selector. Ann. Statist. 37 1705–1732.
  • [12] Birgé, L. (2004). Model selection for Gaussian regression with random design. Bernoulli 10 1039–1051.
  • [13] Birgé, L. and Massart, P. (2001). Gaussian model selection. J. Eur. Math. Soc. (JEMS) 3 203–268.
  • [14] Bland, J. M. and Altman, D. G. (1995). Multiple significance tests: The Bonferroni method. The British Medical Journal 310 170.
  • [15] Bogdan, M., van den Berg, E., Sabatti, C., Su, W. and Candès, E. J. (2015). SLOPE—Adaptive variable selection via convex optimization. Ann. Appl. Stat. 9 1103–1140.
  • [16] Bogdan, M., van den Berg, E., Su, W. and Candès, E. J. (2013). Statistical estimation and testing via the sorted $\ell_{1}$ norm. Preprint. Available at arXiv:1310.1969.
  • [17] Bondell, H. D. and Reich, B. J. (2008). Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR. Biometrics 64 115–123, 322–323.
  • [18] Brown, L. D., Cai, T. T., Low, M. G. and Zhang, C.-H. (2002). Asymptotic equivalence theory for nonparametric regression with random design. Ann. Statist. 30 688–707.
  • [19] Bunea, F., Tsybakov, A. B. and Wegkamp, M. H. (2007). Aggregation for Gaussian regression. Ann. Statist. 35 1674–1697.
  • [20] Cai, T. T. and Zhou, H. H. (2009). A data-driven block thresholding approach to wavelet estimation. Ann. Statist. 37 569–595.
  • [21] Candès, E. and Tao, T. (2007). The Dantzig selector: Statistical estimation when $p$ is much larger than $n$. Ann. Statist. 35 2313–2351.
  • [22] Candès, E. J. and Plan, Y. (2009). Near-ideal model selection by $\ell_{1}$ minimization. Ann. Statist. 37 2145–2177.
  • [23] Candès, E. J., Romberg, J. K. and Tao, T. (2006). Stable signal recovery from incomplete and inaccurate measurements. Comm. Pure Appl. Math. 59 1207–1223.
  • [24] Donoho, D. L. and Johnstone, I. M. (1994). Ideal spatial adaptation by wavelet shrinkage. Biometrika 81 425–455.
  • [25] Donoho, D. L. and Johnstone, I. M. (1994). Minimax risk over $l_{p}$-balls for $l_{q}$-error. Probab. Theory Related Fields 99 277–303.
  • [26] Donoho, D. L. and Johnstone, I. M. (1995). Adapting to unknown smoothness via wavelet shrinkage. J. Amer. Statist. Assoc. 90 1200–1224.
  • [27] Donoho, D. L., Johnstone, I. M., Maleki, A. and Montanari, A. (2011). Compressed sensing over $\ell_{p}$-balls: Minimax mean square error. In Proceedings of the IEEE International Symposium on Information Theory 129–133. IEEE, New York.
  • [28] Donoho, D. L. and Montanari, A. (2013). High dimensional robust M-estimation: Asymptotic variance via approximate message passing. Preprint. Available at arXiv:1310.7320.
  • [29] Donoho, D. L. and Tanner, J. (2009). Counting faces of randomly projected polytopes when the projection radically lowers dimension. J. Amer. Math. Soc. 22 1–53.
  • [30] Donoho, D. L. and Tanner, J. (2010). Exponential bounds implying construction of compressed sensing matrices, error-correcting codes, and neighborly polytopes by random sampling. IEEE Trans. Inform. Theory 56 2002–2016.
  • [31] Fan, J., Han, X. and Gu, W. (2012). Estimating false discovery proportion under arbitrary covariance dependence. J. Amer. Statist. Assoc. 107 1019–1035.
  • [32] Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc. 96 1348–1360.
  • [33] Figueiredo, M. and Nowak, R. (2014). Sparse estimation with strongly correlated variables using ordered weighted $\ell_{1}$ regularization. Preprint. Available at arXiv:1409.4005.
  • [34] Foster, D. P. and George, E. I. (1994). The risk inflation criterion for multiple regression. Ann. Statist. 22 1947–1975.
  • [35] Foster, D. P. and Stine, R. A. (1999). Local asymptotic coding and the minimum description length. IEEE Trans. Inform. Theory 45 1289–1293.
  • [36] G’Sell, M., Wager, S., Chouldechova, A. and Tibshirani, R. (2013). Sequential selection procedures and false discovery rate control. Preprint. Available at arXiv:1309.5352.
  • [37] Greenshtein, E. and Ritov, Y. (2004). Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli 10 971–988.
  • [38] Ji, P. and Zhao, Z. (2014). Rate optimal multiple testing procedure in high-dimensional regression. Preprint. Available at arXiv:1404.2961.
  • [39] Jiang, W. and Zhang, C.-H. (2013). A nonparametric empirical Bayes approach to adaptive minimax estimation. J. Multivariate Anal. 122 82–95.
  • [40] Johnstone, I. M. (2013). Gaussian estimation: Sequence and wavelet models. Available at http://statweb.stanford.edu/~imj/GE06-11-13.pdf.
  • [41] Kruskal, J. B. (1964). Nonmetric multidimensional scaling: A numerical method. Psychometrika 29 115–129.
  • [42] Liu, W. (2013). Gaussian graphical model estimation with false discovery rate control. Ann. Statist. 41 2948–2978.
  • [43] Lockhart, R., Taylor, J., Tibshirani, R. J. and Tibshirani, R. (2014). A significance test for the lasso. Ann. Statist. 42 413–468.
  • [44] Marshall, A. W., Olkin, I. and Arnold, B. C. (2011). Inequalities: Theory of Majorization and Its Applications, 2nd ed. Springer, New York.
  • [45] Meinshausen, N. and Yu, B. (2009). Lasso-type recovery of sparse representations for high-dimensional data. Ann. Statist. 37 246–270.
  • [46] Parikh, N. and Boyd, S. (2013). Proximal algorithms. Foundations and Trends in Optimization 1 123–231.
  • [47] Raskutti, G., Wainwright, M. J. and Yu, B. (2011). Minimax rates of estimation for high-dimensional linear regression over $\ell_{q}$-balls. IEEE Trans. Inform. Theory 57 6976–6994.
  • [48] Ravikumar, P., Wainwright, M. J. and Lafferty, J. D. (2010). High-dimensional Ising model selection using $\ell_{1}$-regularized logistic regression. Ann. Statist. 38 1287–1319.
  • [49] Stein, C. M. (1981). Estimation of the mean of a multivariate normal distribution. Ann. Statist. 9 1135–1151.
  • [50] Su, W. and Candès, E. (2015). Supplement to “SLOPE is adaptive to unknown sparsity and asymptotically minimax.” DOI:10.1214/15-AOS1397SUPP.
  • [51] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B 58 267–288.
  • [52] Tibshirani, R. and Knight, K. (1999). The covariance inflation criterion for adaptive model selection. J. R. Stat. Soc. Ser. B Stat. Methodol. 61 529–546.
  • [53] van de Geer, S. A. and Bühlmann, P. (2009). On the conditions used to prove oracle results for the Lasso. Electron. J. Stat. 3 1360–1392.
  • [54] Verzelen, N. (2012). Minimax risks for sparse regressions: Ultra-high dimensional phenomenons. Electron. J. Stat. 6 38–90.
  • [55] Wainwright, M. J. (2009). Sharp thresholds for high-dimensional and noisy sparsity recovery using $\ell_{1}$-constrained quadratic programming (Lasso). IEEE Trans. Inform. Theory 55 2183–2202.
  • [56] Wu, Z. and Zhou, H. H. (2013). Model selection and sharp asymptotic minimaxity. Probab. Theory Related Fields 156 165–191.
  • [57] Ye, F. and Zhang, C.-H. (2010). Rate minimaxity of the Lasso and Dantzig selector for the $\ell_{q}$ loss in $\ell_{r}$ balls. J. Mach. Learn. Res. 11 3519–3540.
  • [58] Zhang, C.-H. (2010). Nearly unbiased variable selection under minimax concave penalty. Ann. Statist. 38 894–942.
  • [59] Zou, H. (2006). The adaptive lasso and its oracle properties. J. Amer. Statist. Assoc. 101 1418–1429.

Supplemental materials

Supplement to "SLOPE is adaptive to unknown sparsity and asymptotically minimax." DOI:10.1214/15-AOS1397SUPP.