Bernoulli


Simultaneous variable selection and estimation in semiparametric modeling of longitudinal/clustered data

Shujie Ma, Qiongxia Song, and Li Wang


Abstract

We consider the problem of simultaneous variable selection and estimation in additive, partially linear models for longitudinal/clustered data. We propose an estimation procedure via polynomial splines to estimate the nonparametric components and apply proper penalty functions to achieve sparsity in the linear part. Under reasonable conditions, we obtain the asymptotic normality of the estimators for the linear components and the consistency of the estimators for the nonparametric components. We further demonstrate that, with proper choice of the regularization parameter, the penalized estimators of the non-zero coefficients achieve the asymptotic oracle property. The finite sample behavior of the penalized estimators is evaluated with simulation studies and illustrated by a longitudinal CD4 cell count data set.
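The estimation idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and several choices are assumptions made only for illustration: the SCAD penalty of Fan and Li (2001) stands in for the "proper penalty functions", a truncated-power basis stands in for the polynomial splines, the fit uses working independence (within-cluster correlation is simulated but ignored in estimation), and the penalty is handled by a local quadratic approximation. All function names, the simulated data, and the value of the regularization parameter are hypothetical.

```python
import numpy as np

def scad_derivative(beta, lam, a=3.7):
    """Derivative of the SCAD penalty evaluated at |beta| (assumed penalty choice)."""
    b = np.abs(beta)
    return lam * ((b <= lam) + np.maximum(a * lam - b, 0.0) / ((a - 1.0) * lam) * (b > lam))

def spline_basis(z, n_knots=4, degree=3):
    """Centered truncated-power polynomial spline basis with quantile knots."""
    knots = np.quantile(z, np.linspace(0.0, 1.0, n_knots + 2)[1:-1])
    cols = [z ** d for d in range(1, degree + 1)]
    cols += [np.maximum(z - k, 0.0) ** degree for k in knots]
    basis = np.column_stack(cols)
    return basis - basis.mean(axis=0)  # centering keeps the intercept identifiable

def fit_penalized_aplm(y, X, Z, lam, n_iter=100, eps=1e-6, zero_tol=1e-4):
    """Penalized least squares: spline terms unpenalized, SCAD on the linear part only."""
    n, p = X.shape
    B = np.column_stack([spline_basis(Z[:, l]) for l in range(Z.shape[1])])
    D = np.column_stack([np.ones(n), X, B])  # intercept, linear part, spline part
    q = D.shape[1]
    theta = np.linalg.lstsq(D, y, rcond=None)[0]  # unpenalized starting value
    for _ in range(n_iter):
        beta = theta[1:p + 1]
        # Local quadratic approximation: SCAD becomes a coefficient-specific ridge term.
        ridge = np.zeros(q)
        ridge[1:p + 1] = n * scad_derivative(beta, lam) / np.maximum(np.abs(beta), eps)
        # Solve the ridge problem as an augmented least squares system for stability.
        A = np.vstack([D, np.diag(np.sqrt(ridge))])
        rhs = np.concatenate([y, np.zeros(q)])
        theta = np.linalg.lstsq(A, rhs, rcond=None)[0]
    beta_hat = np.where(np.abs(theta[1:p + 1]) < zero_tol, 0.0, theta[1:p + 1])
    eta_hat = B @ theta[p + 1:]  # fitted (centered) additive nonparametric part
    return beta_hat, eta_hat

# Hypothetical simulated clustered data: 100 clusters with 4 observations each.
rng = np.random.default_rng(0)
n_clusters, m = 100, 4
n = n_clusters * m
beta_true = np.array([2.0, -1.5, 0.0, 0.0, 1.0, 0.0])
X = rng.normal(size=(n, beta_true.size))
Z = rng.uniform(size=(n, 2))
eta_true = np.sin(2 * np.pi * Z[:, 0]) + 4 * (Z[:, 1] - 0.5) ** 2
cluster_effect = np.repeat(rng.normal(scale=0.5, size=n_clusters), m)  # shared within cluster
y = X @ beta_true + eta_true + cluster_effect + rng.normal(scale=0.5, size=n)

beta_hat, eta_hat = fit_penalized_aplm(y, X, Z, lam=0.2)
print(np.round(beta_hat, 3))
```

In this sketch the zero coefficients are driven to zero by the growing ridge weights, while coefficients well above the penalty threshold are left essentially unpenalized, which is the behavior the oracle property formalizes. The regularization parameter here is fixed by hand; in practice it would be chosen by a data-driven criterion.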

Article information

Source
Bernoulli, Volume 19, Number 1 (2013), 252-274.

Dates
First available in Project Euclid: 18 January 2013

Permanent link to this document
https://projecteuclid.org/euclid.bj/1358531749

Digital Object Identifier
doi:10.3150/11-BEJ386

Mathematical Reviews number (MathSciNet)
MR3019494

Zentralblatt MATH identifier
1259.62021

Keywords
additive partially linear model; clustered data; longitudinal data; model selection; penalized least squares; spline

Citation

Ma, Shujie; Song, Qiongxia; Wang, Li. Simultaneous variable selection and estimation in semiparametric modeling of longitudinal/clustered data. Bernoulli 19 (2013), no. 1, 252–274. doi:10.3150/11-BEJ386. https://projecteuclid.org/euclid.bj/1358531749




Supplemental materials

  • Supplementary material: Supplement to “Simultaneous variable selection and estimation in semiparametric modeling of longitudinal/clustered data”. We provide detailed proofs of Lemmas A.2 to A.7 stated in the Appendix.