## The Annals of Statistics

### Prediction when fitting simple models to high-dimensional data

#### Abstract

We study linear subset regression in the context of a high-dimensional linear model. Consider $y=\vartheta+\theta'z+\epsilon$ with univariate response $y$ and a $d$-vector of random regressors $z$, and a submodel where $y$ is regressed on a set of $p$ explanatory variables that are given by $x=M'z$, for some $d\times p$ matrix $M$. Here, "high-dimensional" means that the number $d$ of available explanatory variables in the overall model is much larger than the number $p$ of variables in the submodel. In this paper, we present Pinsker-type results for prediction of $y$ given $x$. In particular, we show that the mean squared prediction error of the best linear predictor of $y$ given $x$ is close to the mean squared prediction error of the corresponding Bayes predictor $\mathbb{E}[y\mid x]$, provided only that $p/\log d$ is small. We also show that the mean squared prediction error of the (feasible) least-squares predictor computed from $n$ independent observations of $(y,x)$ is close to that of the Bayes predictor, provided only that both $p/\log d$ and $p/n$ are small. Our results hold uniformly in the regression parameters and over large collections of distributions for the design variables $z$.
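The setup in the abstract can be illustrated with a small simulation. The sketch below is not from the paper; all dimensions, parameter values, and the choice of $M$ (a coordinate-selection matrix picking the first $p$ regressors) are illustrative assumptions. It generates data from the overall model $y=\vartheta+\theta'z+\epsilon$, fits the feasible least-squares predictor in the submodel with regressors $x=M'z$, and estimates its out-of-sample mean squared prediction error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not taken from the paper):
# d regressors in the overall model, p in the submodel, n training samples.
d, p, n = 1000, 5, 200

# Overall linear model y = vartheta + theta'z + eps with Gaussian design z.
vartheta = 1.0
theta = rng.normal(size=d) / np.sqrt(d)   # hypothetical parameter vector
Z = rng.normal(size=(n, d))               # n independent draws of the d-vector z
eps = rng.normal(scale=0.5, size=n)
y = vartheta + Z @ theta + eps

# Submodel regressors x = M'z, with M selecting the first p coordinates of z.
M = np.zeros((d, p))
M[:p, :p] = np.eye(p)
X = Z @ M

# Feasible least-squares predictor computed from the n observations of (y, x).
X1 = np.column_stack([np.ones(n), X])     # include an intercept
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)

# Estimate the mean squared prediction error on fresh draws of (y, x).
m = 10_000
Z_new = rng.normal(size=(m, d))
y_new = vartheta + Z_new @ theta + rng.normal(scale=0.5, size=m)
X_new1 = np.column_stack([np.ones(m), Z_new @ M])
mspe = np.mean((y_new - X_new1 @ beta_hat) ** 2)
print(f"estimated MSPE of least-squares submodel predictor: {mspe:.3f}")
```

Under this Gaussian design the Bayes predictor $\mathbb{E}[y\mid x]$ is itself linear, so the estimated MSPE is close to the best attainable value in the submodel (here, the noise variance plus the contribution of the $d-p$ omitted coordinates); the paper's results concern how this closeness holds uniformly, far beyond the Gaussian case.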

#### Article information

Source
Ann. Statist., Volume 47, Number 3 (2019), 1408–1442.

Dates
Revised: April 2017
First available in Project Euclid: 13 February 2019

https://projecteuclid.org/euclid.aos/1550026843

Digital Object Identifier
doi:10.1214/18-AOS1719

Mathematical Reviews number (MathSciNet)
MR3911117

Zentralblatt MATH identifier
07053513

#### Citation

Steinberger, Lukas; Leeb, Hannes. Prediction when fitting simple models to high-dimensional data. Ann. Statist. 47 (2019), no. 3, 1408--1442. doi:10.1214/18-AOS1719. https://projecteuclid.org/euclid.aos/1550026843

#### References

• Abadie, A., Imbens, G. W. and Zheng, F. (2014). Inference for misspecified models with fixed regressors. J. Amer. Statist. Assoc. 109 1601–1614.
• Bachoc, F., Leeb, H. and Pötscher, B. M. (2015). Valid confidence intervals for post-model-selection prediction. arXiv preprint. Available at arXiv:1412.4605.
• Beran, R. and Dümbgen, L. (1998). Modulation of estimators and confidence sets. Ann. Statist. 26 1826–1856.
• Berk, R., Brown, L., Buja, A., Zhang, K. and Zhao, L. (2013). Valid post-selection inference. Ann. Statist. 41 802–837.
• Brannath, W. and Scharpenberg, M. (2014). Interpretation of linear regression coefficients under mean model miss-specification. arXiv preprint. Available at arXiv:1409.8544.
• Bühlmann, P. and van de Geer, S. (2011). Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Series in Statistics. Springer, Heidelberg.
• Buja, A., Berk, R., Brown, L. D., George, E., Pitkin, E., Traskin, M., Zhang, K. and Zhao, L. (2014). A conspiracy of random predictors and model violations against classical inference in regression. arXiv preprint. Available at arXiv:1404.1578.
• Diaconis, P. and Freedman, D. (1984). Asymptotics of graphical projection pursuit. Ann. Statist. 12 793–815.
• Dümbgen, L. and Del Conte-Zerial, P. (2013). On low-dimensional projections of high-dimensional distributions. In From Probability to Statistics and Back: High-Dimensional Models and Processes. Inst. Math. Stat. (IMS) Collect. 9 91–104. IMS, Beachwood, OH.
• Eaton, M. L. (1986). A characterization of spherical distributions. J. Multivariate Anal. 20 272–276.
• El Karoui, N. (2010). The spectrum of kernel random matrices. Ann. Statist. 38 1–50.
• Greenshtein, E. and Ritov, Y. (2004). Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli 10 971–988.
• Hall, P. and Li, K.-C. (1993). On almost linearity of low-dimensional projections from high-dimensional data. Ann. Statist. 21 867–889.
• Huber, P. J. (1967). The behavior of maximum likelihood estimates under nonstandard conditions. In Proc. Fifth Berkeley Sympos. Math. Statist. and Probability (Berkeley, Calif., 1965/66), Vol. I: Statistics 221–233. Univ. California Press, Berkeley, CA.
• Lee, J. D., Sun, D. L., Sun, Y. and Taylor, J. E. (2016). Exact post-selection inference, with application to the lasso. Ann. Statist. 44 907–927.
• Leeb, H. (2008). Evaluation and selection of models for out-of-sample prediction when the sample size is small relative to the complexity of the data-generating process. Bernoulli 14 661–690.
• Leeb, H. (2009). Conditional predictive inference post model selection. Ann. Statist. 37 2838–2876.
• Leeb, H. (2013). On the conditional distributions of low-dimensional projections from high-dimensional data. Ann. Statist. 41 464–483.
• Leeb, H., Pötscher, B. M. and Ewald, K. (2015). On various confidence intervals post-model-selection. Statist. Sci. 30 216–227.
• Pinsker, M. S. (1980). Optimal filtration of square-integrable signals in Gaussian noise. Probl. Inf. Transm. 16 120–133.
• Rosenthal, H. P. (1970). On the subspaces of $L^{p}$ ($p>2$) spanned by sequences of independent random variables. Israel J. Math. 8 273–303.
• Srivastava, N. and Vershynin, R. (2013). Covariance estimation for distributions with $2+\varepsilon$ moments. Ann. Probab. 41 3081–3111.
• Steinberger, L. (2015). Statistical inference in high-dimensional linear regression based on simple working models. Ph.D. thesis, Univ. Vienna.
• Steinberger, L. and Leeb, H. (2018). On conditional moments of high-dimensional random vectors given lower-dimensional projections. Bernoulli 24 565–591.
• Taylor, J., Lockhart, R., Tibshirani, R. J. and Tibshirani, R. (2014). Exact post-selection inference for forward stepwise least angle regression. arXiv preprint. Available at arXiv:1401.3889.