Open Access
Valid post-selection inference in model-free linear regression
Arun K. Kuchibhotla, Lawrence D. Brown, Andreas Buja, Junhui Cai, Edward I. George, Linda H. Zhao
Ann. Statist. 48(5): 2953-2981 (October 2020). DOI: 10.1214/19-AOS1917


Modern data-driven approaches to modeling make extensive use of covariate/model selection. Such selection incurs a cost: it invalidates classical statistical inference. A conservative remedy to the problem was proposed by Berk et al. (Ann. Statist. 41 (2013) 802–837) and further extended by Bachoc, Preinerstorfer and Steinberger (2016). These proposals, labeled “PoSI methods,” provide valid inference after arbitrary model selection. They are computationally NP-hard and have limitations in their theoretical justifications. We therefore propose computationally efficient confidence regions, named “UPoSI” (“U” stands for “uniform” or “universal”), and prove large-$p$ asymptotics for them. We do this for linear OLS regression allowing misspecification of the normal linear model, for both fixed and random covariates, and for independent as well as some types of dependent data. We start by proving a general equivalence result for the post-selection inference problem and a simultaneous inference problem in a setting that strips inessential features still present in a related result of Berk et al. (Ann. Statist. 41 (2013) 802–837). We then construct valid PoSI confidence regions that are the first to have vastly improved computational efficiency in that the required computation times grow only quadratically rather than exponentially with the total number $p$ of covariates. These are also the first PoSI confidence regions with guaranteed asymptotic validity when the total number of covariates $p$ diverges (almost exponentially) with the sample size $n$. Under standard tail assumptions, we only require $(\log p)^{7}=o(n)$ and $k=o(\sqrt{n/\log p})$, where $k$ ($\le p$) is the largest number of covariates (model size) considered for selection. We study various properties of these confidence regions, including their Lebesgue measures, and compare them theoretically with those proposed previously.
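To give a feel for the scaling claims in the abstract, the following minimal Python sketch counts the submodels an exhaustive search over models of size at most $k$ would visit, and evaluates the two ratios behind the stated rate conditions at finite $n$, $p$ and $k$. It is only an arithmetic illustration, not the authors' UPoSI procedure; the function names `submodel_count` and `rate_diagnostics` are hypothetical, and because the conditions are asymptotic, small finite-sample ratios are suggestive at best.

```python
import math


def submodel_count(p: int, k: int) -> int:
    """Number of non-empty covariate subsets of size at most k out of p.

    This is the number of candidate models an exhaustive search would
    visit; for k = p it equals 2**p - 1, i.e. exponential in p.
    """
    return sum(math.comb(p, j) for j in range(1, k + 1))


def rate_diagnostics(n: int, p: int, k: int) -> dict:
    """Finite-sample versions of the two ratios in the rate conditions.

    (log p)^7 = o(n)          -> (log p)^7 / n should tend to 0
    k = o(sqrt(n / log p))    -> k / sqrt(n / log p) should tend to 0
    Requires p >= 2 so that log p > 0.
    """
    log_p = math.log(p)
    return {
        "log_p_pow7_over_n": log_p ** 7 / n,
        "k_over_sqrt_n_over_log_p": k / math.sqrt(n / log_p),
    }


if __name__ == "__main__":
    # Exhaustive model count versus a quadratic budget p**2.
    for p in (10, 20, 40):
        print(p, submodel_count(p, p), p ** 2)

    # Illustrative (assumed) values of n, p, k; not taken from the paper.
    print(rate_diagnostics(n=10**6, p=1000, k=5))
```

The first loop contrasts the exponential growth of the model count with a quadratic budget in $p$; the second call shows how the two ratios can be inspected for a given design.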



Arun K. Kuchibhotla, Lawrence D. Brown, Andreas Buja, Junhui Cai, Edward I. George, Linda H. Zhao. "Valid post-selection inference in model-free linear regression." Ann. Statist. 48 (5) 2953-2981, October 2020.


Received: 1 October 2018; Revised: 1 September 2019; Published: October 2020
First available in Project Euclid: 19 September 2020

MathSciNet: MR4152630
Digital Object Identifier: 10.1214/19-AOS1917

Primary: 62F12, 62F25, 62F40, 62J05

Keywords: concentration inequalities, high-dimensional linear regression, model selection, multiplier bootstrap, Orlicz norms, simultaneous inference, uniform consistency

Rights: Copyright © 2020 Institute of Mathematical Statistics
