The Annals of Statistics

Selective inference with a randomized response

Abstract

Inspired by sample splitting and the reusable holdout introduced in the field of differential privacy, we consider selective inference with a randomized response. We discuss two major advantages of using a randomized response for model selection. First, selectively valid tests are more powerful after randomized selection. Second, randomization allows consistent estimation and weak convergence of selective inference procedures. Under independent sampling, we prove a selective (or privatized) central limit theorem that transfers procedures valid under asymptotic normality without selection to their corresponding selective counterparts. This allows selective inference in nonparametric settings. Finally, we propose a framework for inference after combining multiple randomized selection procedures. We focus on the classical asymptotic setting, leaving the interesting high-dimensional asymptotic questions for future work.
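
As a toy illustration of the idea in the abstract (not the paper's general procedure), the sketch below assumes a single observation Z ~ N(mu, 1) that is reported only when the randomized response Z + omega exceeds a threshold, with omega ~ N(0, tau^2) drawn independently. Conditional on selection, the null density of Z is proportional to phi(z) * Phi((z - threshold) / tau), so a one-sided selective p-value is a ratio of two integrals. The function names, the threshold rule, and the Gaussian randomization are assumptions made for this example only.

```python
import math
import random


def norm_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def norm_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def selective_pvalue(z_obs, threshold, tau, n_grid=1000):
    """One-sided p-value for H0: mu = 0, conditional on Z + omega > threshold.

    Hypothetical helper: integrates the selective null density with a
    trapezoid rule on [z_obs, max(z_obs, 0) + 10].
    """
    def weight(z):
        # Null density of Z times P(omega > threshold - z).
        return norm_pdf(z) * norm_cdf((z - threshold) / tau)

    upper = max(z_obs, 0.0) + 10.0
    step = (upper - z_obs) / n_grid
    total = 0.5 * (weight(z_obs) + weight(upper))
    for i in range(1, n_grid):
        total += weight(z_obs + i * step)
    numerator = total * step
    # Under H0, Z + omega ~ N(0, 1 + tau^2), which gives the selection probability.
    denominator = 1.0 - norm_cdf(threshold / math.sqrt(1.0 + tau * tau))
    return numerator / denominator


def simulate_null_pvalues(n_draws, threshold, tau, seed=0):
    """Draw (Z, omega) under H0 and return selective p-values of selected draws."""
    rng = random.Random(seed)
    pvalues = []
    for _ in range(n_draws):
        z = rng.gauss(0.0, 1.0)
        omega = rng.gauss(0.0, tau)
        if z + omega > threshold:
            pvalues.append(selective_pvalue(z, threshold, tau))
    return pvalues


if __name__ == "__main__":
    pvals = simulate_null_pvalues(4000, threshold=1.0, tau=1.0)
    print(len(pvals), sum(pvals) / len(pvals))
```

If the conditioning is done correctly, the p-values of the selected draws should look roughly uniform under the null, which the printed mean gives a quick check of.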

Article information

Source
Ann. Statist., Volume 46, Number 2 (2018), 679-710.

Dates
Revised: February 2017
First available in Project Euclid: 3 April 2018

Permanent link to this document
https://projecteuclid.org/euclid.aos/1522742433

Digital Object Identifier
doi:10.1214/17-AOS1564

Mathematical Reviews number (MathSciNet)
MR3782381

Zentralblatt MATH identifier
06870276

Subjects
Primary: 62M40: Random fields; image analysis
Secondary: 62J05: Linear regression

Citation

Tian, Xiaoying; Taylor, Jonathan. Selective inference with a randomized response. Ann. Statist. 46 (2018), no. 2, 679–710. doi:10.1214/17-AOS1564. https://projecteuclid.org/euclid.aos/1522742433

Supplemental materials

• Supplement to “Selective inference with a randomized response”. The supplementary materials provide additional sampling schemes, technical details for plug-in variance estimators, and proofs of all theorems and lemmas.