The Annals of Statistics

Sharp detection in PCA under correlations: All eigenvalues matter

Edgar Dobriban


Abstract

Principal component analysis (PCA) is a widely used method for dimension reduction. In high-dimensional data, the “signal” eigenvalues corresponding to weak principal components (PCs) do not necessarily separate from the bulk of the “noise” eigenvalues. Therefore, popular tests based on the largest eigenvalue have little power to detect weak PCs. In the special case of the spiked model, certain tests asymptotically equivalent to linear spectral statistics (LSS)—averaging effects over all eigenvalues—were recently shown to achieve some power.

We consider a “local alternatives” model for the spectrum of covariance matrices that allows a general correlation structure. We develop new tests to detect PCs in this model. While the top eigenvalue contains little information, the strong correlations between the eigenvalues allow us to detect weak PCs by averaging over all eigenvalues using LSS. We show that the optimal LSS can be found by solving a certain integral equation. To solve it, we develop efficient algorithms that build on our recent method for computing the limit empirical spectrum [Dobriban (2015)]. The solvability of this equation also offers a new perspective on phase transitions in spiked models.
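To illustrate the kind of statistic the abstract refers to, the sketch below computes a linear spectral statistic — a sum of a test function over all sample covariance eigenvalues — for an n × p data matrix. The function `f` chosen here (the likelihood-ratio-type function x − log x − 1) is a generic illustration only; the paper's contribution is deriving the *optimal* f by solving an integral equation, which is not reproduced here.

```python
import numpy as np

def linear_spectral_statistic(X, f):
    """Compute the LSS sum_i f(lambda_i), where lambda_i are the
    eigenvalues of the sample covariance of the n x p matrix X
    (rows are observations)."""
    n, p = X.shape
    S = X.T @ X / n                   # sample covariance matrix
    eigvals = np.linalg.eigvalsh(S)   # all p eigenvalues, not just the top one
    return float(np.sum(f(eigvals)))

# Illustration: n = 200 white-noise observations in p = 100 dimensions,
# so the aspect ratio p/n = 0.5 is non-negligible.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 100))

# Generic test function; nonnegative, zero only at eigenvalue 1.
t = linear_spectral_statistic(X, lambda x: x - np.log(x) - 1.0)
```

In practice the statistic would be centered and scaled by its null limit (via the CLT for linear spectral statistics of Bai and Silverstein (2004)) before being compared to a critical value.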

Article information

Source
Ann. Statist., Volume 45, Number 4 (2017), 1810-1833.

Dates
Received: February 2016
Revised: August 2016
First available in Project Euclid: 28 June 2017

Permanent link to this document
https://projecteuclid.org/euclid.aos/1498636875

Digital Object Identifier
doi:10.1214/16-AOS1514

Mathematical Reviews number (MathSciNet)
MR3670197

Zentralblatt MATH identifier
06773292

Subjects
Primary: 62H25: Factor analysis and principal components; correspondence analysis
Secondary: 62H15: Hypothesis testing 45B05: Fredholm integral equations

Keywords
Principal component analysis; linear spectral statistic; random matrix theory; linear integral equation; optimal testing

Citation

Dobriban, Edgar. Sharp detection in PCA under correlations: All eigenvalues matter. Ann. Statist. 45 (2017), no. 4, 1810--1833. doi:10.1214/16-AOS1514. https://projecteuclid.org/euclid.aos/1498636875



References

  • Anderson, T. W. (1963). Asymptotic theory for principal component analysis. Ann. Math. Stat. 34 122–148.
  • Anderson, T. W. (2003). An Introduction to Multivariate Statistical Analysis, 3rd ed. Wiley-Interscience, Hoboken, NJ.
  • Bai, Z. D. and Silverstein, J. W. (2004). CLT for linear spectral statistics of large-dimensional sample covariance matrices. Ann. Probab. 32 553–605.
  • Bai, Z. and Silverstein, J. W. (2009). Spectral Analysis of Large Dimensional Random Matrices. Springer.
  • Bai, Z. and Yao, J. (2012). On sample eigenvalues in a generalized spiked population model. J. Multivariate Anal. 106 167–177.
  • Baik, J., Ben Arous, G. and Péché, S. (2005). Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices. Ann. Probab. 33 1643–1697.
  • Baik, J. and Silverstein, J. W. (2006). Eigenvalues of large sample covariance matrices of spiked population models. J. Multivariate Anal. 97 1382–1408.
  • Benaych-Georges, F. and Nadakuditi, R. R. (2011). The eigenvalues and eigenvectors of finite, low rank perturbations of large random matrices. Adv. Math. 227 494–521.
  • Bogachev, V. I. (2007). Measure Theory, Vol. 1. Springer, Berlin.
  • Couillet, R. and Debbah, M. (2011). Random Matrix Methods for Wireless Communications. Cambridge Univ. Press, Cambridge.
  • Dharmawansa, P., Johnstone, I. M. and Onatski, A. (2014). Local asymptotic normality of the spectrum of high-dimensional spiked F-ratios. arXiv preprint arXiv:1411.3875.
  • Dobriban, E. (2015). Efficient computation of limit spectra of sample covariance matrices. Random Matrices Theory Appl. 4 1550019, 36.
  • Dobriban, E. (2017). Supplement to “Sharp detection in PCA under correlations: All eigenvalues matter.” DOI:10.1214/16-AOS1514SUPP.
  • Dobriban, E. and Wager, S. (2015). High-dimensional asymptotics of prediction: Ridge regression and classification. arXiv preprint arXiv:1507.03003.
  • Groetsch, C. W. (1977). Generalized Inverses of Linear Operators: Representation and Approximation. Monographs and Textbooks in Pure and Applied Mathematics 37. Dekker, New York.
  • Hachem, W., Hardy, A. and Najim, J. (2015). A survey on the eigenvalues local behavior of large complex correlated Wishart matrices. In Modélisation Aléatoire et Statistique—Journées MAS 2014. ESAIM Proc. Surveys 51 150–174. EDP Sci., Les Ulis.
  • Huang, J. (2014). Mesoscopic perturbations of large random matrices. arXiv preprint arXiv:1412.4193.
  • Johnstone, I. M. (2001). On the distribution of the largest eigenvalue in principal components analysis. Ann. Statist. 29 295–327.
  • Johnstone, I. M. and Onatski, A. (2015). Testing in high-dimensional spiked models. arXiv preprint arXiv:1509.07269.
  • Kress, R. (2014). Linear Integral Equations, 3rd ed. Springer, New York.
  • Lehmann, E. L. and Romano, J. P. (2005). Testing Statistical Hypotheses, 3rd ed. Springer, New York.
  • Marchenko, V. A. and Pastur, L. A. (1967). Distribution of eigenvalues for some sets of random matrices. Mat. Sb. 72 (114) 507–536.
  • Onatski, A. (2012). Asymptotics of the principal components estimator of large factor models with weakly influential factors. J. Econometrics 168 244–258.
  • Onatski, A., Moreira, M. J. and Hallin, M. (2013). Asymptotic power of sphericity tests for high-dimensional data. Ann. Statist. 41 1204–1231.
  • Onatski, A., Moreira, M. J. and Hallin, M. (2014). Signal detection in high dimension: The multispiked case. Ann. Statist. 42 225–254.
  • Pan, G. M. and Zhou, W. (2008). Central limit theorem for signal-to-interference ratio of reduced rank linear receiver. Ann. Appl. Probab. 18 1232–1270.
  • Paul, D. (2007). Asymptotics of sample eigenstructure for a large dimensional spiked covariance model. Statist. Sinica 17 1617–1642.
  • Paul, D. and Aue, A. (2014). Random matrix theory in statistics: A review. J. Statist. Plann. Inference 150 1–29.
  • Silverstein, J. W. and Choi, S.-I. (1995). Analysis of the limiting spectral distribution of large-dimensional random matrices. J. Multivariate Anal. 54 295–309.
  • van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics 3. Cambridge Univ. Press, Cambridge.
  • Yao, J., Zheng, S. and Bai, Z. (2015). Large Sample Covariance Matrices and High-Dimensional Data Analysis. Cambridge Univ. Press, New York.
  • Zheng, S., Bai, Z. and Yao, J. (2015). Substitution principle for CLT of linear spectral statistics of high-dimensional sample covariance matrices with applications to hypothesis testing. Ann. Statist. 43 546–591.

Supplemental materials

  • Supplement to “Sharp detection in PCA under correlations: All eigenvalues matter”. In the supplementary material, we give the remaining details of the proofs, algorithms implementing our method, and further simulations.