## The Annals of Statistics

### Conditional mean and quantile dependence testing in high dimension

#### Abstract

Motivated by applications in biological science, we propose a novel test to assess the conditional mean dependence of a response variable on a large number of covariates. Our procedure is built on the martingale difference divergence recently proposed in Shao and Zhang [J. Amer. Statist. Assoc. 109 (2014) 1302–1318], and it is able to detect certain types of departure from the null hypothesis of conditional mean independence without making any specific model assumptions. Theoretically, we establish the asymptotic normality of the proposed test statistic under suitable assumptions on the eigenvalues of a Hermitian operator, which is constructed based on the characteristic function of the covariates. These conditions can be simplified under a banded dependence structure on the covariates or a Gaussian design. To account for heterogeneity within the data, we further develop a testing procedure for conditional quantile independence at a given quantile level and provide an asymptotic justification. Empirically, our test of conditional mean independence delivers comparable results to the competitor, which was constructed under the linear model framework, when the underlying model is linear. It significantly outperforms the competitor when the conditional mean admits a nonlinear form.
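For readers unfamiliar with the building block named above, the following is a minimal sketch of the sample (V-statistic) squared martingale difference divergence of Shao and Zhang (2014) for a scalar response, written as $-n^{-2}\sum_{k,l}(Y_k-\bar{Y})(Y_l-\bar{Y})\lvert X_k-X_l\rvert$. It is not the high-dimensional test statistic proposed in this paper, whose exact form the abstract does not state; the function name `sample_mdd_sq` and the use of NumPy are assumptions made only for illustration.

```python
import numpy as np


def sample_mdd_sq(x, y):
    """Sample squared martingale difference divergence of y given x.

    Illustrative sketch (hypothetical helper, not from the paper):
    computes -(1/n^2) * sum_{k,l} (y_k - ybar)(y_l - ybar) * |x_k - x_l|.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float).ravel()
    if x.ndim == 1:
        x = x[:, None]  # treat a 1-D input as n observations of one covariate
    n = y.shape[0]

    # Pairwise Euclidean distances between covariate vectors.
    diff = x[:, None, :] - x[None, :, :]
    a = np.sqrt((diff ** 2).sum(axis=-1))

    # Center the response and form the weighted double sum.
    yc = y - y.mean()
    return -(yc[:, None] * yc[None, :] * a).sum() / n ** 2
```

A value near zero is consistent with conditional mean independence of the response on the covariates, while larger values suggest dependence; turning this quantity into a calibrated test requires the normalization and limit theory developed in the paper itself.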

#### Article information

Source
Ann. Statist., Volume 46, Number 1 (2018), 219-246.

Dates
Revised: November 2016
First available in Project Euclid: 22 February 2018

https://projecteuclid.org/euclid.aos/1519268429

Digital Object Identifier
doi:10.1214/17-AOS1548

Mathematical Reviews number (MathSciNet)
MR3766951

Zentralblatt MATH identifier
06865110

Subjects
Primary: 62G10: Hypothesis testing
Secondary: 62G20: Asymptotic properties

#### Citation

Zhang, Xianyang; Yao, Shun; Shao, Xiaofeng. Conditional mean and quantile dependence testing in high dimension. Ann. Statist. 46 (2018), no. 1, 219–246. doi:10.1214/17-AOS1548. https://projecteuclid.org/euclid.aos/1519268429

#### References

• Chen, S. X. and Qin, Y.-L. (2010). A two-sample test for high-dimensional data with applications to gene-set testing. Ann. Statist. 38 808–835.
• Efron, B. and Tibshirani, R. (2007). On testing the significance of sets of genes. Ann. Appl. Stat. 1 107–129.
• Fan, J. and Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space. J. R. Stat. Soc. Ser. B. Stat. Methodol. 70 849–911.
• Fang, K. T., Kotz, S. and Ng, K. W. (1990). Symmetric Multivariate and Related Distributions. Monographs on Statistics and Applied Probability 36. Chapman & Hall, London.
• Feng, L., Zou, C., Wang, Z. and Chen, B. (2013). Rank-based score tests for high-dimensional regression coefficients. Electron. J. Stat. 7 2131–2149.
• Goeman, J. J., van de Geer, S. A. and van Houwelingen, H. C. (2006). Testing against a high dimensional alternative. J. R. Stat. Soc. Ser. B. Stat. Methodol. 68 477–493.
• Hall, P. (1984). Central limit theorem for integrated square error of multivariate nonparametric density estimators. J. Multivariate Anal. 14 1–16.
• He, X., Wang, L. and Hong, H. G. (2013). Quantile-adaptive model-free variable screening for high-dimensional heterogeneous data. Ann. Statist. 41 342–369.
• Koenker, R. and Bassett, G. Jr. (1978). Regression quantiles. Econometrica 46 33–50.
• Lan, W., Wang, H. and Tsai, C.-L. (2014). Testing covariates in high-dimensional regression. Ann. Inst. Statist. Math. 66 279–301.
• Li, Q., Hsiao, C. and Zinn, J. (2003). Consistent specification tests for semiparametric/nonparametric models based on series estimation methods. J. Econometrics 112 295–325.
• Liu, J., Zhong, W. and Li, R. (2015). A selective overview of feature screening for ultrahigh-dimensional data. Sci. China Math. 58 2033–2054.
• McKeague, I. W. and Qian, M. (2015). An adaptive resampling test for detecting the presence of significant predictors. J. Amer. Statist. Assoc. 110 1422–1433.
• Newton, M. A., Quintana, F. A., den Boon, J. A., Sengupta, S. and Ahlquist, P. (2007). Random-set methods identify distinct aspects of the enrichment signal in gene-set analysis. Ann. Appl. Stat. 1 85–106.
• Park, T., Shao, X. and Yao, S. (2015). Partial martingale difference correlation. Electron. J. Stat. 9 1492–1517.
• Sejdinovic, D., Sriperumbudur, B., Gretton, A. and Fukumizu, K. (2013). Equivalence of distance-based and RKHS-based statistics in hypothesis testing. Ann. Statist. 41 2263–2291.
• Serfling, R. J. (1980). Approximation Theorems of Mathematical Statistics. Wiley, New York.
• Shao, X. and Zhang, J. (2014). Martingale difference correlation and its use in high-dimensional variable screening. J. Amer. Statist. Assoc. 109 1302–1318.
• Subramanian, A., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., Paulovich, A., Pomeroy, S. L., Golub, T. R., Lander, E. S. and Mesirov, J. P. (2005). Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. USA 102 15545–15550.
• Székely, G. J. and Rizzo, M. L. (2013). The distance correlation $t$-test of independence in high dimension. J. Multivariate Anal. 117 193–213.
• Székely, G. J. and Rizzo, M. L. (2014). Partial distance correlation with methods for dissimilarities. Ann. Statist. 42 2382–2412.
• Székely, G. J., Rizzo, M. L. and Bakirov, N. K. (2007). Measuring and testing dependence by correlation of distances. Ann. Statist. 35 2769–2794.
• Wang, S. and Cui, H. (2013). Generalized $F$ test for high dimensional linear regression coefficients. J. Multivariate Anal. 117 134–149.
• Wang, H. and He, X. (2007). Detecting differential expressions in GeneChip microarray studies: A quantile approach. J. Amer. Statist. Assoc. 102 104–112.
• Wang, H. and He, X. (2008). An enhanced quantile approach for assessing differential gene expressions. Biometrics 64 449–457, 666.
• Wang, L., Wu, Y. and Li, R. (2012). Quantile regression for analyzing heterogeneity in ultra-high dimension. J. Amer. Statist. Assoc. 107 214–222.
• Westfall, P. H. and Young, S. S. (1993). Resampling-Based Multiple Testing: Examples and Methods for P-Value Adjustment. Wiley, New York.
• Wu, C. F. J. and Hamada, M. S. (2009). Experiments: Planning, Analysis, and Optimization, 2nd ed. Wiley, Hoboken, NJ.
• Yata, K. and Aoshima, M. (2013). Correlation tests for high-dimensional data using extended cross-data-matrix methodology. J. Multivariate Anal. 117 313–331.
• Zhang, X., Yao, S. and Shao, X. (2018). Supplement to “Conditional mean and quantile dependence testing in high dimension.” DOI:10.1214/17-AOS1548SUPP.
• Zhong, P.-S. and Chen, S. X. (2011). Tests for high-dimensional regression coefficients with factorial designs. J. Amer. Statist. Assoc. 106 260–274.

#### Supplemental materials

• Supplement to “Conditional mean and quantile dependence testing in high dimension”. This supplement contains proofs of the main results in the paper, extension to factorial designs, additional discussions and numerical results.