The Annals of Statistics

Projected principal component analysis in factor models

Jianqing Fan, Yuan Liao, and Weichen Wang

Full-text: Open access


This paper introduces Projected Principal Component Analysis (Projected-PCA), which applies principal component analysis to the data matrix after it has been projected (smoothed) onto a given linear space spanned by covariates. When applied to high-dimensional factor analysis, the projection removes noise components. We show that the unobserved latent factors can be estimated more accurately than by conventional PCA if the projection is genuine, or more precisely, when the factor loading matrices are related to the projected linear space. When the dimensionality is large, the factors can be estimated accurately even when the sample size is finite. We propose a flexible semiparametric factor model, which decomposes the factor loading matrix into a component that can be explained by subject-specific covariates and an orthogonal residual component. The covariates’ effects on the factor loadings are further modeled by an additive model via sieve approximations. Using the newly proposed Projected-PCA, we obtain rates of convergence for the smooth factor loading matrices that are much faster than those of conventional factor analysis. The convergence is achieved even when the sample size is finite and is particularly appealing in the high-dimension-low-sample-size situation. This leads us to develop nonparametric tests of whether observed covariates have explanatory power on the loadings and whether they fully explain the loadings. The proposed method is illustrated with both simulated data and the returns of the components of the S&P 500 index.
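
The two steps described in the abstract (project the data onto a space spanned by covariates, then run PCA on the projected data) can be illustrated with a minimal NumPy sketch. It rests on several assumptions that are not taken from this page: the data matrix `Y` is p x T with one row per subject, the covariates `X` are p x d, an additive polynomial basis stands in for the sieve, and the factors are normalized so that F'F/T is the identity. The function names `sieve_basis` and `projected_pca` are hypothetical and are not the authors' code.

```python
import numpy as np

def sieve_basis(X, degree=3):
    """Additive polynomial basis (an assumed stand-in for the sieve):
    one intercept column plus x_j, x_j^2, ..., x_j^degree per covariate."""
    p, d = X.shape
    cols = [np.ones((p, 1))]
    for j in range(d):
        for k in range(1, degree + 1):
            cols.append(X[:, [j]] ** k)
    return np.hstack(cols)

def projected_pca(Y, X, n_factors, degree=3):
    """Sketch of PCA applied to data projected onto the covariate space.

    Y: (p, T) data matrix, X: (p, d) subject-specific covariates.
    Returns F_hat (T, n_factors) and G_hat (p, n_factors), the estimated
    factors and the covariate-explained part of the loadings.
    """
    T = Y.shape[1]
    Phi = sieve_basis(X, degree)
    # Project every column of Y onto span(Phi) by least squares,
    # which avoids forming the p x p projection matrix explicitly.
    coef, *_ = np.linalg.lstsq(Phi, Y, rcond=None)
    PY = Phi @ coef
    # Top eigenvectors of the T x T matrix (PY)'(PY) = Y'PY.
    eigvals, eigvecs = np.linalg.eigh(PY.T @ PY)  # ascending order
    top = eigvecs[:, ::-1][:, :n_factors]
    F_hat = np.sqrt(T) * top          # normalized so F_hat'F_hat / T = I
    G_hat = PY @ F_hat / T            # smooth loading component
    return F_hat, G_hat
```

In this sketch the projection is cheap because the eigen-decomposition is done on a T x T matrix, which is small in the high-dimension-low-sample-size setting the abstract emphasizes. The estimated factors are identified only up to sign and rotation.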

Article information

Ann. Statist., Volume 44, Number 1 (2016), 219-254.

Received: January 2015
Revised: July 2015
First available in Project Euclid: 10 December 2015

Primary: 62H25: Factor analysis and principal components; correspondence analysis
Secondary: 62H15: Hypothesis testing

Keywords: Semiparametric factor models; high-dimensionality; loading matrix modeling; sieve approximation


Fan, Jianqing; Liao, Yuan; Wang, Weichen. Projected principal component analysis in factor models. Ann. Statist. 44 (2016), no. 1, 219–254. doi:10.1214/15-AOS1364.

