## Annals of Statistics

### A likelihood ratio framework for high-dimensional semiparametric regression

#### Abstract

We propose a new inferential framework for high-dimensional semiparametric generalized linear models. This framework addresses a variety of challenging problems in high-dimensional data analysis, including incomplete data, selection bias and heterogeneity. Our work has three main contributions: (i) We develop a regularized statistical chromatography approach to infer the parameter of interest under the proposed semiparametric generalized linear model without estimating the unknown base measure function. (ii) We propose a new likelihood-ratio-based framework to construct post-regularization confidence regions and tests for the low-dimensional components of high-dimensional parameters. Unlike existing post-regularization inferential methods, our approach is based on a novel directional likelihood. (iii) We develop new concentration inequalities and normal approximation results for U-statistics with unbounded kernels, which are of independent interest. We further extend the theoretical results to missing data problems and to inference across multiple datasets. Extensive simulation studies and a real data analysis illustrate the proposed approach.
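To make contribution (i) concrete, the following is a minimal sketch of how a pairwise conditional likelihood can avoid the base measure altogether. It assumes a conditional density of the form p(y | x) ∝ f(y) exp(y · xᵀβ), where f is the unknown base measure; under that assumption, conditioning on the unordered pair of responses {y_i, y_j} cancels both f and the normalizing term, leaving a logistic-type loss in the differences (y_i − y_j) and (x_i − x_j). The function name `pairwise_loss` and the toy data are hypothetical, the sparsity penalty is omitted, and the sketch should not be read as the exact estimator of the paper.

```python
import math
from itertools import combinations


def pairwise_loss(beta, X, y):
    """Average pairwise logistic loss over all pairs i < j.

    Assumption (not taken from the abstract): p(y | x) is proportional to
    f(y) * exp(y * x'beta). Conditioning on the unordered pair {y_i, y_j}
    removes f, so the loss depends on the data only through differences.
    The average over pairs is a second-order U-statistic.
    """
    n = len(y)
    total = 0.0
    for i, j in combinations(range(n), 2):
        # Difference of linear predictors, (x_i - x_j)' beta.
        d_eta = sum(b * (xi - xj) for b, xi, xj in zip(beta, X[i], X[j]))
        z = -(y[i] - y[j]) * d_eta
        # Numerically stable evaluation of log(1 + exp(z)).
        total += max(z, 0.0) + math.log1p(math.exp(-abs(z)))
    return total / (n * (n - 1) / 2)


# Toy data (hypothetical): three observations, two covariates.
X = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0]]
y = [1.0, 0.5, 3.0]

loss_at_zero = pairwise_loss([0.0, 0.0], X, y)  # equals log(2) at beta = 0
loss_at_beta = pairwise_loss([1.0, 0.0], X, y)  # smaller when beta orders pairs correctly
```

Because the loss is an average over pairs instead of over independent observations, its analysis calls for U-statistic tools, which is consistent with contribution (iii). In a high-dimensional setting one would add a sparsity-inducing penalty to this loss before minimizing over β.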

#### Article information

Source
Ann. Statist., Volume 45, Number 6 (2017), 2299-2327.

Dates
Revised: May 2016
First available in Project Euclid: 15 December 2017

https://projecteuclid.org/euclid.aos/1513328574

Digital Object Identifier
doi:10.1214/16-AOS1483

Mathematical Reviews number (MathSciNet)
MR3737893

Zentralblatt MATH identifier
06838134

Subjects
Primary: 62E20: Asymptotic distribution theory; 62G20: Asymptotic properties
Secondary: 62G10: Hypothesis testing

#### Citation

Ning, Yang; Zhao, Tianqi; Liu, Han. A likelihood ratio framework for high-dimensional semiparametric regression. Ann. Statist. 45 (2017), no. 6, 2299–2327. doi:10.1214/16-AOS1483. https://projecteuclid.org/euclid.aos/1513328574

#### Supplemental materials

• Supplement to “A likelihood ratio framework for high-dimensional semiparametric regression”. The supplementary material contains additional technical details, simulation results and proofs.