## The Annals of Statistics

### Bootstrapping and sample splitting for high-dimensional, assumption-lean inference

#### Abstract

Several new methods have recently been proposed for performing valid inference after model selection. An older method is sample splitting: use part of the data for model selection and the rest for inference. In this paper, we revisit sample splitting combined with the bootstrap (or the Normal approximation). We show that this leads to a simple, assumption-lean approach to inference, and we establish results on the accuracy of the method. In fact, we find new bounds on the accuracy of the bootstrap and the Normal approximation for general nonlinear parameters with increasing dimension, which we then use to assess the accuracy of regression inference. We define new parameters that measure variable importance and that can be inferred with greater accuracy than the usual regression coefficients. Finally, we elucidate an inference-prediction trade-off: splitting increases the accuracy and robustness of inference but can decrease the accuracy of the predictions.
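The sample-splitting recipe described above can be sketched in a few lines. The following is a minimal illustration, not the paper's procedure: it uses a simple correlation screen as a stand-in for a generic model-selection step (the paper allows any selector, e.g. the lasso), then fits least squares on the held-out half and forms Normal-approximation confidence intervals. The data-generating model and the cutoff `k` are hypothetical choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: only the first 2 of 10 covariates matter.
n, d = 400, 10
X = rng.normal(size=(n, d))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=n)

# Split the sample: one half for model selection, the other for inference.
idx = rng.permutation(n)
sel, inf = idx[: n // 2], idx[n // 2:]

# Selection stage (a stand-in for any selector, e.g. the lasso):
# keep the k covariates most correlated with the response.
k = 2
score = np.abs(X[sel].T @ (y[sel] - y[sel].mean()))
chosen = np.sort(np.argsort(score)[-k:])

# Inference stage: ordinary least squares on the held-out half,
# restricted to the selected covariates, with Normal-approximation CIs.
Xi, yi = X[inf][:, chosen], y[inf]
beta, *_ = np.linalg.lstsq(Xi, yi, rcond=None)
resid = yi - Xi @ beta
sigma2 = resid @ resid / (len(yi) - k)
cov = sigma2 * np.linalg.inv(Xi.T @ Xi)
se = np.sqrt(np.diag(cov))
z = 1.96  # approximate 95% Normal quantile
for j, b, s in zip(chosen, beta, se):
    print(f"x{j}: {b:.2f} +/- {z * s:.2f}")
```

Because selection and inference use disjoint halves, the intervals on the second half are ordinary fixed-design intervals for the selected submodel; the bootstrap can replace the Normal approximation in the inference stage. This independence of the two stages is what makes the approach assumption-lean.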

#### Article information

Source
Ann. Statist., Volume 47, Number 6 (2019), 3438-3469.

Dates
Revised: November 2018
First available in Project Euclid: 31 October 2019

https://projecteuclid.org/euclid.aos/1572487399

Digital Object Identifier
doi:10.1214/18-AOS1784

Mathematical Reviews number (MathSciNet)
MR4025748

#### Citation

Rinaldo, Alessandro; Wasserman, Larry; G’Sell, Max. Bootstrapping and sample splitting for high-dimensional, assumption-lean inference. Ann. Statist. 47 (2019), no. 6, 3438--3469. doi:10.1214/18-AOS1784. https://projecteuclid.org/euclid.aos/1572487399

#### References

• [1] Anastasiou, A. and Gaunt, R. E. (2016). Multivariate normal approximation of the maximum likelihood estimator via the delta method. Preprint. Available at arXiv:1609.03970.
• [2] Anastasiou, A. and Ley, C. (2015). New simpler bounds to assess the asymptotic normality of the maximum likelihood estimator. Preprint. Available at arXiv:1508.04948.
• [3] Anastasiou, A. and Reinert, G. (2017). Bounds for the normal approximation of the maximum likelihood estimator. Bernoulli 23 191–218.
• [4] Andrews, D. W. K. and Guggenberger, P. (2009). Hybrid and size-corrected subsampling methods. Econometrica 77 721–762.
• [5] Bachoc, F., Leeb, H. and Pötscher, B. M. (2014). Valid confidence intervals for post-model-selection predictors. Available at arXiv:1412.4605.
• [6] Bachoc, F., Preinerstorfer, D. and Steinberger, L. (2016). Uniformly valid confidence intervals post-model-selection. Available at arXiv:1611.01043.
• [7] Barber, R. F. and Candès, E. J. (2015). Controlling the false discovery rate via knockoffs. Ann. Statist. 43 2055–2085.
• [8] Barnard, G. A. (1974). Discussion of “Cross-validatory choice and assessment of statistical predictions,” by M. Stone. J. Roy. Statist. Soc. Ser. B 133–135.
• [9] Belloni, A., Chernozhukov, V. and Hansen, C. B. (2013). Inference for High-Dimensional Sparse Econometric Models. vol. 3 245–295. Cambridge Univ. Press.
• [10] Belloni, A., Chernozhukov, V. and Kato, K. (2015). Uniform post-selection inference for least absolute deviation regression and other Z-estimation problems. Biometrika 102 77–94.
• [11] Bentkus, V. Y. (1985). Lower bounds for the rate of convergence in the central limit theorem in Banach spaces. Lith. Math. J. 25 312–320.
• [12] Berk, R., Brown, L., Buja, A., Zhang, K. and Zhao, L. (2013). Valid post-selection inference. Ann. Statist. 41 802–837.
• [13] Bühlmann, P. (2013). Statistical significance in high-dimensional linear models. Bernoulli 19 1212–1242.
• [14] Bühlmann, P. and van de Geer, S. (2015). High-dimensional inference in misspecified linear models. Electron. J. Stat. 9 1449–1473.
• [15] Buja, A., Berk, R., Brown, L., George, E., Pitkin, E., Traskin, M., Zhao, L. and Zhang, K. (2015). Models as approximations—A conspiracy of random regressors and model deviations against classical inference in regression. Statist. Sci.
• [16] Candès, E., Fan, Y., Janson, L. and Lv, J. (2018). Panning for gold: “model-X” knockoffs for high dimensional controlled variable selection. J. R. Stat. Soc. Ser. B. Stat. Methodol. 80 551–577.
• [17] Chatterjee, A. and Lahiri, S. N. (2011). Bootstrapping lasso estimators. J. Amer. Statist. Assoc. 106 608–625.
• [18] Chatterjee, A. and Lahiri, S. N. (2013). Rates of convergence of the adaptive LASSO estimators to the oracle distribution and higher order refinements by the bootstrap. Ann. Statist. 41 1232–1259.
• [19] Chen, L. H. Y. and Shao, Q.-M. (2007). Normal approximation for nonlinear statistics using a concentration inequality approach. Bernoulli 13 581–599.
• [20] Chernozhukov, V., Chetverikov, D. and Kato, K. (2013). Gaussian approximations and multiplier bootstrap for maxima of sums of high-dimensional random vectors. Ann. Statist. 41 2786–2819.
• [21] Chernozhukov, V., Chetverikov, D. and Kato, K. (2015). Comparison and anti-concentration bounds for maxima of Gaussian random vectors. Probab. Theory Related Fields 162 47–70.
• [22] Chernozhukov, V., Chetverikov, D. and Kato, K. (2017). Central limit theorems and bootstrap in high dimensions. Ann. Probab. 45 2309–2352.
• [23] Cox, D. R. (1975). A note on data-splitting for the evaluation of significance levels. Biometrika 62 441–444.
• [24] Dezeure, R., Bühlmann, P., Meier, L. and Meinshausen, N. (2015). High-dimensional inference: Confidence intervals, $p$-values and R-software hdi. Statist. Sci. 30 533–558.
• [25] Dezeure, R., Bühlmann, P. and Zhang, C.-H. (2017). High-dimensional simultaneous inference with the bootstrap. TEST 26 685–719.
• [26] Efron, B. (2014). Estimation and accuracy after model selection. J. Amer. Statist. Assoc. 109 991–1007.
• [27] Faraway, J. J. (1995). Data splitting strategies for reducing the effect of model selection on inference. Technical report, Citeseer.
• [28] Fithian, W., Sun, D. L. and Taylor, J. (2014). Optimal inference after model selection. Available at arXiv:1410.2597.
• [29] Hartigan, J. A. (1969). Using subsample values as typical values. J. Amer. Statist. Assoc. 64 1303–1317.
• [30] Hjort, N. L. and Claeskens, G. (2003). Frequentist model average estimators. J. Amer. Statist. Assoc. 98 879–899.
• [31] Hsu, D., Kakade, S. M. and Zhang, T. (2014). Random design analysis of ridge regression. Found. Comput. Math. 14 569–600.
• [32] Hurvich, C. M. and Tsai, C. (1990). The impact of model selection on inference in linear regression. Amer. Statist. 44 214–217.
• [33] Javanmard, A. and Montanari, A. (2014). Confidence intervals and hypothesis testing for high-dimensional regression. J. Mach. Learn. Res. 15 2869–2909.
• [34] Lee, J. D., Sun, D. L., Sun, Y. and Taylor, J. E. (2016). Exact post-selection inference, with application to the lasso. Ann. Statist. 44 907–927.
• [35] Leeb, H. and Pötscher, B. M. (2008). Can one estimate the unconditional distribution of post-model-selection estimators? Econometric Theory 24 338–376.
• [36] Lei, J., G’Sell, M., Rinaldo, A., Tibshirani, R. J. and Wasserman, L. (2018). Distribution-free predictive inference for regression. J. Amer. Statist. Assoc. 113 1094–1111.
• [37] Li, K.-C. (1989). Honest confidence regions for nonparametric regression. Ann. Statist. 17 1001–1008.
• [38] Lockhart, R., Taylor, J., Tibshirani, R. J. and Tibshirani, R. (2014). A significance test for the lasso. Ann. Statist. 42 413–468.
• [39] Loftus, J. R. and Taylor, J. E. (2015). Selective inference in regression models with groups of variables. Preprint. Available at arXiv:1511.01478.
• [40] Markovic, J. and Taylor, J. (2016). Bootstrap inference after using multiple queries for model selection. Available at arXiv:1612.07811.
• [41] Markovic, J., Xia, L. and Taylor, J. (2017). Comparison of prediction errors: Adaptive p-values after cross-validation. Available at arXiv:1703.06559.
• [42] Meinshausen, N. (2015). Group bound: Confidence intervals for groups of variables in sparse high dimensional regression without assumptions on the design. J. R. Stat. Soc. Ser. B. Stat. Methodol. 77 923–945.
• [43] Meinshausen, N. and Bühlmann, P. (2010). Stability selection. J. R. Stat. Soc. Ser. B. Stat. Methodol. 72 417–473.
• [44] Meinshausen, N., Meier, L. and Bühlmann, P. (2009). $p$-values for high-dimensional regression. J. Amer. Statist. Assoc. 104 1671–1681.
• [45] Mentch, L. and Hooker, G. (2016). Quantifying uncertainty in random forests via confidence intervals and hypothesis tests. J. Mach. Learn. Res. 17 Paper No. 26, 41.
• [46] Miller, A. J. (1990). Subset Selection in Regression. Monographs on Statistics and Applied Probability 40. CRC Press, London.
• [47] Moran, P. A. P. (1973). Dividing a sample into two parts. A statistical dilemma. Sankhyā Ser. A 35 329–333.
• [48] Mosteller, F. and Tukey, J. W. (1977). Data Analysis and Regression: A Second Course in Statistics. Addison-Wesley Series in Behavioral Science: Quantitative Methods.
• [49] Nazarov, F. (2003). On the maximal perimeter of a convex set in ${\mathbb{R}}^{n}$ with respect to a Gaussian measure. In Geometric Aspects of Functional Analysis. Lecture Notes in Math. 1807 169–187. Springer, Berlin.
• [50] Nickl, R. and van de Geer, S. (2013). Confidence sets in sparse regression. Ann. Statist. 41 2852–2876.
• [51] Picard, R. R. and Berk, K. N. (1990). Data splitting. Amer. Statist. 44 140–147.
• [52] Pinelis, I. and Molzon, R. (2016). Optimal-order bounds on the rate of convergence to normality in the multivariate delta method. Electron. J. Stat. 10 1001–1063.
• [53] Portnoy, S. (1987). A central limit theorem applicable to robust regression estimators. J. Multivariate Anal. 22 24–50.
• [54] Pouzo, D. (2015). Bootstrap consistency for quadratic forms of sample averages with increasing dimension. Electron. J. Stat. 9 3046–3097.
• [55] Rinaldo, A., Wasserman, L. and G’Sell, M. (2019). Supplement to “Bootstrapping and sample splitting for high-dimensional, assumption-lean inference.” DOI:10.1214/18-AOS1784SUPP.
• [56] Shah, R. D. and Bühlmann, P. (2018). Goodness-of-fit tests for high dimensional linear models. J. R. Stat. Soc. Ser. B. Stat. Methodol. 80 113–135.
• [57] Shah, R. D. and Samworth, R. J. (2013). Variable selection with error control: Another look at stability selection. J. R. Stat. Soc. Ser. B. Stat. Methodol. 75 55–80.
• [58] Shao, J. (1993). Linear model selection by cross-validation. J. Amer. Statist. Assoc. 88 486–494.
• [59] Shao, Q.-M., Zhang, K. and Zhou, W.-X. (2016). Stein’s method for nonlinear statistics: A brief survey and recent progress. J. Statist. Plann. Inference 168 68–89.
• [60] Shorack, G. R. (2000). Probability for Statisticians. Springer Texts in Statistics. Springer, New York.
• [61] Tian, X. and Taylor, J. (2018). Selective inference with a randomized response. Ann. Statist. 46 679–710.
• [62] Tibshirani, R. J., Rinaldo, A., Tibshirani, R. and Wasserman, L. (2018). Uniform asymptotic inference and the bootstrap after model selection. Ann. Statist. 46 1255–1287.
• [63] Tibshirani, R. J., Taylor, J., Lockhart, R. and Tibshirani, R. (2016). Exact post-selection inference for sequential regression procedures. J. Amer. Statist. Assoc. 111 600–620.
• [64] van de Geer, S., Bühlmann, P., Ritov, Y. and Dezeure, R. (2014). On asymptotically optimal confidence regions and tests for high-dimensional models. Ann. Statist. 42 1166–1202.
• [65] Wager, S., Hastie, T. and Efron, B. (2014). Confidence intervals for random forests: The jackknife and the infinitesimal jackknife. J. Mach. Learn. Res. 15 1625–1651.
• [66] Wasserman, L. (2014). Discussion: “A significance test for the lasso” [MR3210970]. Ann. Statist. 42 501–508.
• [67] Wasserman, L. and Roeder, K. (2009). High-dimensional variable selection. Ann. Statist. 37 2178–2201.
• [68] Zhang, C.-H. and Zhang, S. S. (2014). Confidence intervals for low dimensional parameters in high dimensional linear models. J. R. Stat. Soc. Ser. B. Stat. Methodol. 76 217–242.
• [69] Zhang, X. and Cheng, G. (2017). Simultaneous inference for high-dimensional linear models. J. Amer. Statist. Assoc. 112 757–768.

#### Supplemental materials

• Supplement to “Bootstrapping and sample splitting for high-dimensional, assumption-lean inference”. This supplement provides additional material, including numerical examples, comments on other approaches, an alternative bootstrap approach, and algorithmic statements of the studied procedures. The supplement also includes proofs of many of the results stated in this paper.