## The Annals of Statistics

### A rate optimal procedure for recovering sparse differences between high-dimensional means under dependence

#### Abstract

This paper considers the problem of recovering the sparse set of components that differ between two high-dimensional means of column-wise dependent random vectors. We show that dependence can be exploited to lower the identification boundary for signal recovery. Moreover, an optimal convergence rate for the marginal false nondiscovery rate (mFNR) is established under dependence; this rate is faster than the optimal rate without dependence. To recover the sparse signal-bearing dimensions, we propose a Dependence-Assisted Thresholding and Excising (DATE) procedure, which is shown to be rate optimal for the mFNR with the marginal false discovery rate (mFDR) controlled at a pre-specified level. Extensions of DATE to recover differences in contrasts among multiple population means and differences between two covariance matrices are also provided. Simulation studies and a case study demonstrate the performance of the proposed signal identification procedure.

#### Article information

Source
Ann. Statist., Volume 45, Number 2 (2017), 557-590.

Dates
Revised: February 2016
First available in Project Euclid: 16 May 2017

Permanent link to this document
https://projecteuclid.org/euclid.aos/1494921950

Digital Object Identifier
doi:10.1214/16-AOS1459

Mathematical Reviews number (MathSciNet)
MR3650393

Zentralblatt MATH identifier
1368.62152

Subjects
Primary: 62H15: Hypothesis testing
Secondary: 62G20: Asymptotic properties

#### Citation

Li, Jun; Zhong, Ping-Shou. A rate optimal procedure for recovering sparse differences between high-dimensional means under dependence. Ann. Statist. 45 (2017), no. 2, 557–590. doi:10.1214/16-AOS1459. https://projecteuclid.org/euclid.aos/1494921950

#### References

• Allen, G. I. and Tibshirani, R. (2012). Inference with transposable data: Modelling the effects of row and column correlations. J. R. Stat. Soc. Ser. B Stat. Methodol. 74 721–743.
• Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. Roy. Statist. Soc. Ser. B 57 289–300.
• Benjamini, Y. and Yekutieli, D. (2001). The control of the false discovery rate in multiple testing under dependency. Ann. Statist. 29 1165–1188.
• Bickel, P. J. and Levina, E. (2008). Regularized estimation of large covariance matrices. Ann. Statist. 36 199–227.
• Cai, T., Liu, W. and Luo, X. (2011). A constrained $\ell_{1}$ minimization approach to sparse precision matrix estimation. J. Amer. Statist. Assoc. 106 594–607.
• Cai, T., Liu, W. and Xia, Y. (2013). Two-sample covariance matrix testing and support recovery in high-dimensional and sparse settings. J. Amer. Statist. Assoc. 108 265–277.
• Cai, T. T., Liu, W. and Xia, Y. (2014). Two-sample test of high dimensional means under dependence. J. R. Stat. Soc. Ser. B Stat. Methodol. 76 349–372.
• Donoho, D. and Jin, J. (2004). Higher criticism for detecting sparse heterogeneous mixtures. Ann. Statist. 32 962–994.
• Efron, B. (2007). Correlation and large-scale simultaneous significance testing. J. Amer. Statist. Assoc. 102 93–103.
• Fan, J., Han, X. and Gu, W. (2012). Estimating false discovery proportion under arbitrary covariance dependence. J. Amer. Statist. Assoc. 107 1019–1035.
• Friedman, J., Hastie, T. and Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9 432–441.
• Genovese, C. and Wasserman, L. (2002). Operating characteristics and extensions of the false discovery rate procedure. J. R. Stat. Soc. Ser. B Stat. Methodol. 64 499–517.
• Genovese, C. R., Jin, J., Wasserman, L. and Yao, Z. (2012). A comparison of the lasso and marginal regression. J. Mach. Learn. Res. 13 2107–2143.
• Hall, P. and Jin, J. (2010). Innovated higher criticism for detecting sparse signals in correlated noise. Ann. Statist. 38 1686–1732.
• Ji, P. and Jin, J. (2012). UPS delivers optimal phase diagram in high-dimensional variable selection. Ann. Statist. 40 73–103.
• Ji, P. and Zhao, Z. (2014). Rate optimal multiple testing procedure in high-dimensional regression. Technical report.
• Jin, J. (2012). Comment: “Estimating false discovery proportion under arbitrary covariance dependence”. J. Amer. Statist. Assoc. 107 1042–1045.
• Jin, J., Zhang, C.-H. and Zhang, Q. (2014). Optimality of graphlet screening in high dimensional variable selection. J. Mach. Learn. Res. 15 2723–2772.
• Ke, Z. T., Jin, J. and Fan, J. (2014). Covariate assisted screening and estimation. Ann. Statist. 42 2202–2242.
• Li, J. and Chen, S. X. (2012). Two sample tests for high-dimensional covariance matrices. Ann. Statist. 40 908–940.
• Li, J. and Zhong, P. (2016). Supplement to “A rate optimal procedure for recovering sparse differences between high-dimensional means under dependence.” DOI:10.1214/16-AOS1459SUPP.
• Meinshausen, N. and Bühlmann, P. (2005). Lower bounds for the number of false null hypotheses for multiple testing of associations under general dependence structures. Biometrika 92 893–907.
• Qiu, X., Klebanov, L. and Yakovlev, A. (2005). Correlation between gene expression levels and limitations of the empirical Bayes methodology for finding differentially expressed genes. Stat. Appl. Genet. Mol. Biol. 4 Art. 34.
• Richardson, A., Wang, Z., Nicolo, A., Lu, X., Brown, M., Miron, A., Liao, X., Iglehart, J., Livingston, D. and Ganesan, S. (2006). X chromosomal abnormalities in basal-like human breast cancer. Cancer Cell 9 121–132.
• Schott, J. R. (2007). A test for the equality of covariance matrices when the dimension is large relative to the sample sizes. Comput. Statist. Data Anal. 51 6535–6542.
• Schweder, T. and Spjøtvoll, E. (1982). Plots of $p$-values to evaluate many tests simultaneously. Biometrika 69 493–502.
• Srivastava, M. S. and Yanagihara, H. (2010). Testing the equality of several covariance matrices with fewer observations than the dimension. J. Multivariate Anal. 101 1319–1329.
• Storey, J. D. (2002). A direct approach to false discovery rates. J. R. Stat. Soc. Ser. B Stat. Methodol. 64 479–498.
• Sun, W. and Cai, T. T. (2007). Oracle and adaptive compound decision rules for false discovery rate control. J. Amer. Statist. Assoc. 102 901–912.
• Sun, W. and Cai, T. T. (2009). Large-scale multiple testing under dependence. J. R. Stat. Soc. Ser. B Stat. Methodol. 71 393–424.
• Xie, J., Cai, T. T. and Li, H. (2011). Sample size and power analysis for sparse signal recovery in genome-wide association studies. Biometrika 98 273–290.
• Zhao, T., Liu, H., Roeder, K., Lafferty, J. and Wasserman, L. (2012). The huge package for high-dimensional undirected graph estimation in R. J. Mach. Learn. Res. 13 1059–1062.

#### Supplemental materials

• Supplementary material for “A rate optimal procedure for recovering sparse differences between high-dimensional means under dependence”. The supplementary material provides the proofs of Lemmas 1–4 and Theorems 1–5.