The Annals of Statistics

Factor models and variable selection in high-dimensional regression analysis

Alois Kneip and Pascal Sarda



The paper considers linear regression problems where the number of predictor variables is possibly larger than the sample size. The basic motivation of the study is to combine the points of view of model selection and functional regression by using a factor approach: it is assumed that the predictor vector can be decomposed into a sum of two uncorrelated random components reflecting common factors and specific variabilities of the explanatory variables. It is shown that the traditional assumption of a sparse vector of parameters is restrictive in this context. Common factors may possess a significant influence on the response variable which cannot be captured by the specific effects of a small number of individual variables. We therefore propose to include principal components as additional explanatory variables in an augmented regression model. We give finite sample inequalities for estimates of these components. It is then shown that model selection procedures can be used to estimate the parameters of the augmented model, and we derive theoretical properties of the estimators. Finite sample performance is illustrated by a simulation study.
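The augmented-model idea described above can be illustrated with a minimal numpy sketch. The simulation setup, dimensions, and the plain coordinate-descent lasso solver below are illustrative choices, not the authors' estimator or theoretical setting: data are generated from a factor model, the leading principal components of the predictors are estimated and appended to the design matrix, and a sparse fit is computed on the augmented model.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 100, 200, 3  # sample size, number of predictors, number of common factors

# Factor model for the predictors: common factors plus specific variability
F = rng.standard_normal((n, k))             # common factors
L = rng.standard_normal((k, p))             # factor loadings
X = F @ L + 0.5 * rng.standard_normal((n, p))

# Response driven by the common factors AND a few specific variables,
# so a sparse model in X alone would miss the factor effects
beta = np.zeros(p)
beta[:5] = 1.0
y = F @ np.array([2.0, -1.0, 0.5]) + X @ beta + 0.1 * rng.standard_normal(n)

# Estimate factor scores by the first k principal components of X
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
PCs = U[:, :k] * s[:k]                      # estimated principal component scores

# Augmented design: principal components alongside the original predictors
Z = np.hstack([PCs, X])

def lasso_cd(Z, y, lam, n_iter=200):
    """Basic coordinate-descent lasso (illustrative stand-in for any
    sparse model-selection procedure applied to the augmented model)."""
    q = Z.shape[1]
    b = np.zeros(q)
    col_sq = (Z ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(q):
            r = y - Z @ b + Z[:, j] * b[j]  # partial residual excluding j
            rho = Z[:, j] @ r
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return b

b_hat = lasso_cd(Z, y - y.mean(), lam=25.0)
print("nonzero coefficients in augmented model:", np.count_nonzero(np.abs(b_hat) > 1e-6))
```

The first k entries of `b_hat` correspond to the principal components and absorb the common-factor effects, so the remaining coefficients can stay sparse, which is the motivation for the augmented model in the paper.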

Article information

Ann. Statist., Volume 39, Number 5 (2011), 2410-2447.

First available in Project Euclid: 30 November 2011

Primary: 62J05: Linear regression
Secondary: 62H25: Factor analysis and principal components; correspondence analysis
62F12: Asymptotic properties of estimators

Keywords: linear regression; model selection; functional regression; factor models


Kneip, Alois; Sarda, Pascal. Factor models and variable selection in high-dimensional regression analysis. Ann. Statist. 39 (2011), no. 5, 2410--2447. doi:10.1214/11-AOS905.



  • Bai, J. (2003). Inferential theory for factor models of large dimensions. Econometrica 71 135–171.
  • Bai, J. (2009). Panel data models with interactive fixed effects. Econometrica 77 1229–1279.
  • Bai, J. and Ng, S. (2002). Determining the number of factors in approximate factor models. Econometrica 70 191–221.
  • Bernanke, B. S. and Boivin, J. (2003). Monetary policy in a data-rich environment. Journal of Monetary Economics 50 525–546.
  • Bhatia, R. (1997). Matrix Analysis. Graduate Texts in Mathematics 169. Springer, New York.
  • Bickel, P. J. and Levina, E. (2008). Regularized estimation of large covariance matrices. Ann. Statist. 36 199–227.
  • Bickel, P. J., Ritov, Y. and Tsybakov, A. B. (2009). Simultaneous analysis of lasso and Dantzig selector. Ann. Statist. 37 1705–1732.
  • Cai, T. T. and Hall, P. (2006). Prediction in functional linear regression. Ann. Statist. 34 2159–2179.
  • Candes, E. and Tao, T. (2007). The Dantzig selector: Statistical estimation when p is much larger than n. Ann. Statist. 35 2313–2351.
  • Cardot, H., Ferraty, F. and Sarda, P. (1999). Functional linear model. Statist. Probab. Lett. 45 11–22.
  • Cardot, H., Mas, A. and Sarda, P. (2007). CLT in functional linear regression models. Probab. Theory Related Fields 138 325–361.
  • Crambes, C., Kneip, A. and Sarda, P. (2009). Smoothing splines estimators for functional linear regression. Ann. Statist. 37 35–72.
  • Cuevas, A., Febrero, M. and Fraiman, R. (2002). Linear functional regression: The case of fixed design and functional response. Canad. J. Statist. 30 285–300.
  • Forni, M. and Lippi, M. (1997). Aggregation and the Microfoundations of Dynamic Macroeconomics. Oxford Univ. Press, Oxford.
  • Forni, M., Hallin, M., Lippi, M. and Reichlin, L. (2000). The generalized dynamic factor model: Identification and estimation. Review of Economics and Statistics 82 540–554.
  • Hall, P. and Horowitz, J. L. (2007). Methodology and convergence rates for functional linear regression. Ann. Statist. 35 70–91.
  • Hall, P. and Hosseini-Nasab, M. (2006). On properties of functional principal components analysis. J. R. Stat. Soc. Ser. B Stat. Methodol. 68 109–126.
  • Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. J. Amer. Statist. Assoc. 58 13–30.
  • Kneip, A. and Utikal, K. J. (2001). Inference for density families using functional principal component analysis. J. Amer. Statist. Assoc. 96 519–542. With comments and a rejoinder by the authors.
  • Koltchinskii, V. (2009). The Dantzig selector and sparsity oracle inequalities. Bernoulli 15 799–828.
  • Meinshausen, N. and Bühlmann, P. (2006). High-dimensional graphs and variable selection with the lasso. Ann. Statist. 34 1436–1462.
  • Ramsay, J. O. and Dalzell, C. J. (1991). Some tools for functional data analysis (with discussion). J. Roy. Statist. Soc. Ser. B 53 539–572.
  • Stock, J. H. and Watson, M. W. (2002). Forecasting using principal components from a large number of predictors. J. Amer. Statist. Assoc. 97 1167–1179.
  • Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B 58 267–288.
  • van de Geer, S. A. (2008). High-dimensional generalized linear models and the lasso. Ann. Statist. 36 614–645.
  • Yao, F., Müller, H.-G. and Wang, J.-L. (2005). Functional linear regression analysis for longitudinal data. Ann. Statist. 33 2873–2903.
  • Zhao, P. and Yu, B. (2006). On model selection consistency of Lasso. J. Mach. Learn. Res. 7 2541–2563.
  • Zhou, S., Lafferty, J. and Wasserman, L. (2008). Time varying undirected graphs. In Proceedings of the 21st Annual Conference on Computational Learning Theory (COLT’08). Available at arXiv:0903.2515.
  • Zhou, S., van de Geer, S. and Bühlmann, P. (2009). Adaptive Lasso for high dimensional regression and Gaussian graphical modeling. Preprint.