## The Annals of Statistics

### Are discoveries spurious? Distributions of maximum spurious correlations and their applications

#### Abstract

Over the last two decades, many exciting variable selection methods have been developed for finding a small group of covariates that are associated with the response from a large pool. Can the discoveries from these data mining approaches be spurious due to high dimensionality and limited sample size? Can our fundamental assumptions about the exogeneity of the covariates needed for such variable selection be validated with the data? To answer these questions, we need to derive the distributions of the maximum spurious correlations given a certain number of predictors, namely, the distribution of the correlation of a response variable $Y$ with the best $s$ linear combinations of $p$ covariates $\mathbf{X}$, even when $\mathbf{X}$ and $Y$ are independent. When the covariance matrix of $\mathbf{X}$ possesses the restricted eigenvalue property, we derive such distributions for both a finite $s$ and a diverging $s$, using Gaussian approximation and empirical process techniques. However, such a distribution depends on the unknown covariance matrix of $\mathbf{X}$. Hence, we use the multiplier bootstrap procedure to approximate the unknown distributions and establish the consistency of such a simple bootstrap approach. The results are further extended to the situation where the residuals are from regularized fits. Our approach is then used to construct the upper confidence limit for the maximum spurious correlation and to test the exogeneity of the covariates. The former provides a baseline for guarding against false discoveries and the latter tests whether our fundamental assumptions for high-dimensional model selection are statistically valid. Our techniques and results are illustrated with both numerical examples and real data analysis.
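To make the central quantity concrete, the following is a minimal sketch of the simplest case, $s = 1$, where the maximum spurious correlation reduces to the largest absolute sample correlation between $Y$ and any single covariate. It is not the authors' implementation. The function names `max_abs_corr` and `bootstrap_upper_limit`, the sample sizes, and the use of standard Gaussian multipliers in place of the response are all assumptions made for illustration. The general case with $s > 1$ requires a search over subsets of covariates and is not covered here.

```python
import numpy as np


def max_abs_corr(X, y):
    """Largest absolute sample correlation between y and any column of X (s = 1)."""
    Xc = X - X.mean(axis=0)          # center each covariate
    yc = y - y.mean()                # center the response
    num = Xc.T @ yc                  # inner products with every column
    den = np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc)
    return np.max(np.abs(num / den))


def bootstrap_upper_limit(X, alpha=0.05, n_boot=500, rng=None):
    """Approximate the (1 - alpha) quantile of the maximum spurious correlation.

    Assumption for this sketch: the response is replaced by i.i.d. standard
    Gaussian multipliers while X is held fixed, so the dependence among the
    covariates is retained without estimating their covariance matrix.
    """
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    stats = np.array(
        [max_abs_corr(X, rng.standard_normal(n)) for _ in range(n_boot)]
    )
    return np.quantile(stats, 1 - alpha)


# Hypothetical data: many covariates, few observations, no true association.
rng = np.random.default_rng(0)
n, p = 50, 1000
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)           # independent of X by construction

observed = max_abs_corr(X, y)
limit = bootstrap_upper_limit(X, alpha=0.05, n_boot=200, rng=1)
print(f"observed maximum spurious correlation: {observed:.3f}")
print(f"approximate 95% upper limit:           {limit:.3f}")
```

Even though `y` is generated independently of `X`, the observed maximum correlation is far from zero when $p$ is much larger than $n$. The bootstrap quantile plays the role of the baseline described above: a selected covariate whose correlation with the response does not exceed it is hard to distinguish from a spurious discovery.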

#### Article information

Source
Ann. Statist., Volume 46, Number 3 (2018), 989-1017.

Dates
Revised: April 2017
First available in Project Euclid: 3 May 2018

Permanent link to this document
https://projecteuclid.org/euclid.aos/1525313073

Digital Object Identifier
doi:10.1214/17-AOS1575

Mathematical Reviews number (MathSciNet)
MR3797994

Zentralblatt MATH identifier
1402.62097

#### Citation

Fan, Jianqing; Shao, Qi-Man; Zhou, Wen-Xin. Are discoveries spurious? Distributions of maximum spurious correlations and their applications. Ann. Statist. 46 (2018), no. 3, 989–1017. doi:10.1214/17-AOS1575. https://projecteuclid.org/euclid.aos/1525313073

#### Supplemental materials

• Supplement to “Are discoveries spurious? Distributions of maximum spurious correlations and their applications”. This supplemental material contains additional proofs for all the remaining theoretical results in the main text, including Lemmas 7.2–7.6, Theorems 3.2, 4.1 and 4.2 and Propositions 3.1 and 3.2. A discussion on the moment assumptions is also included.