## Electronic Journal of Statistics

### False discovery rate control via debiased lasso

#### Abstract

We consider the problem of variable selection in high-dimensional statistical models, where the goal is to report a set of variables, out of many predictors $X_{1},\dotsc ,X_{p}$, that are relevant to a response of interest. For the high-dimensional linear model, where the number of parameters exceeds the number of samples $(p>n)$, we propose a procedure for variable selection and prove that it controls the directional false discovery rate (FDR) below a pre-assigned significance level $q\in [0,1]$. We further analyze the statistical power of our framework and show that, for designs with subgaussian rows and a common precision matrix $\Omega \in{\mathbb{R}} ^{p\times p}$, if the minimum nonzero parameter $\theta_{\min }$ satisfies $\sqrt{n}\theta_{\min }-\sigma \sqrt{2(\max_{i\in [p]}\Omega_{ii})\log \left(\frac{2p}{qs_{0}}\right)}\to \infty$, then the procedure achieves asymptotic power one.

Our framework is built upon the debiasing approach and assumes the standard condition $s_{0}=o(\sqrt{n}/(\log p)^{2})$, where $s_{0}$ indicates the number of true positives among the $p$ features. Notably, this framework achieves exact directional FDR control without any assumption on the amplitude of unknown regression parameters, and does not require any knowledge of the distribution of covariates or the noise level. We test our method in synthetic and real data experiments to assess its performance and to corroborate our theoretical results.
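To give a rough sense of scale for the power condition stated above, the following sketch evaluates the quantity $\sigma \sqrt{2(\max_{i}\Omega_{ii})\log (2p/(qs_{0}))}/\sqrt{n}$ for given problem sizes. The function name `signal_scale` and its argument names are hypothetical and chosen only for illustration; they do not come from the paper. The condition itself is asymptotic (the difference must diverge), so the returned number is only an indicative magnitude that $\theta_{\min }$ has to exceed, not a sharp finite-sample threshold.

```python
import math


def signal_scale(n, p, s0, q, sigma, omega_diag_max):
    """Evaluate sigma * sqrt(2 * max_i Omega_ii * log(2p / (q * s0))) / sqrt(n).

    All names here are illustrative assumptions:
      n              -- number of samples
      p              -- number of predictors
      s0             -- number of nonzero coefficients (1 <= s0 <= p)
      q              -- target FDR level, taken in (0, 1] so the log is defined
      sigma          -- noise standard deviation
      omega_diag_max -- largest diagonal entry of the precision matrix Omega
    """
    if n <= 0 or p <= 0:
        raise ValueError("n and p must be positive")
    if not 1 <= s0 <= p:
        raise ValueError("s0 must satisfy 1 <= s0 <= p")
    if not 0 < q <= 1:
        raise ValueError("q must lie in (0, 1]")
    if sigma < 0 or omega_diag_max <= 0:
        raise ValueError("sigma must be >= 0 and omega_diag_max must be > 0")

    # The logarithm is positive because 2p / (q * s0) >= 2 under the checks above.
    log_term = math.log(2.0 * p / (q * s0))
    return sigma * math.sqrt(2.0 * omega_diag_max * log_term) / math.sqrt(n)


if __name__ == "__main__":
    # Example values are arbitrary and not taken from the paper's experiments.
    print(signal_scale(n=400, p=1000, s0=10, q=0.1, sigma=1.0, omega_diag_max=1.0))
```

As expected from the formula, the scale shrinks at rate $1/\sqrt{n}$ and grows only logarithmically in $p/(qs_{0})$.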

#### Article information

Source
Electron. J. Statist., Volume 13, Number 1 (2019), 1212-1253.

Dates
First available in Project Euclid: 5 April 2019

https://projecteuclid.org/euclid.ejs/1554429628

Digital Object Identifier
doi:10.1214/19-EJS1554

Mathematical Reviews number (MathSciNet)
MR3935848

Zentralblatt MATH identifier
07056150

#### Citation

Javanmard, Adel; Javadi, Hamid. False discovery rate control via debiased lasso. Electron. J. Statist. 13 (2019), no. 1, 1212--1253. doi:10.1214/19-EJS1554. https://projecteuclid.org/euclid.ejs/1554429628
