The Annals of Applied Statistics

Biased sampling designs to improve research efficiency: Factors influencing pulmonary function over time in children with asthma

Jonathan S. Schildcrout, Paul J. Rathouz, Leila R. Zelnick, Shawn P. Garbett, and Patrick J. Heagerty

Full-text: Open access


Substudies of the Childhood Asthma Management Program [Control. Clin. Trials 20 (1999) 91–120; N. Engl. J. Med. 343 (2000) 1054–1063] seek to identify patient characteristics associated with asthma symptoms and lung function. To determine if genetic measures are associated with trajectories of lung function as measured by forced vital capacity (FVC), children in the primary cohort study retrospectively had candidate loci evaluated. Given participant burden and constraints on financial resources, it is often desirable to target a subsample for ascertainment of costly measures. Methods that can leverage the longitudinal outcome on the full cohort to selectively measure informative individuals have been promising, but have been restricted in their use to analysis of the targeted subsample. In this paper we detail two multiple imputation analysis strategies that exploit outcome and partially observed covariate data on the nonsampled subjects, and we characterize alternative design and analysis combinations that could be used for future studies of pulmonary function and other outcomes. Candidate predictor (e.g., IL10 cytokine polymorphisms) associations obtained from targeted sampling designs can be estimated with very high efficiency compared to standard designs. Further, even though multiple imputation can dramatically improve estimation efficiency for covariates available on all subjects (e.g., gender and baseline age), relatively modest efficiency gains were observed in parameters associated with predictors that are exclusive to the targeted sample. Our results suggest that future studies of longitudinal trajectories can be efficiently conducted by use of outcome-dependent designs and associated full cohort analysis.

Article information

Ann. Appl. Stat., Volume 9, Number 2 (2015), 731–753.

Received: December 2013
Revised: March 2015
First available in Project Euclid: 20 July 2015

Keywords: Biased sampling; childhood asthma; conditional likelihood; epidemiological study design; forced vital capacity; linear mixed effect models; longitudinal data analysis; multiple imputation; outcome dependent sampling; time-dependent covariates


Schildcrout, Jonathan S.; Rathouz, Paul J.; Zelnick, Leila R.; Garbett, Shawn P.; Heagerty, Patrick J. Biased sampling designs to improve research efficiency: Factors influencing pulmonary function over time in children with asthma. Ann. Appl. Stat. 9 (2015), no. 2, 731–753. doi:10.1214/15-AOAS826.

References

  • Bates, D. and Maechler, M. (2010). lme4: Linear mixed-effects models using S4 classes. R package version 0.999375-34.
  • Breslow, N. E., Lumley, T., Ballantyne, C. M., Chambless, L. E. and Kulich, M. (2009a). Improved Horvitz–Thompson estimation of model parameters from two-phase stratified samples: Applications in epidemiology. Stat. Biosci. 1 32.
  • Breslow, N. E., Lumley, T., Ballantyne, C. M., Chambless, L. E. and Kulich, M. (2009b). Using the whole cohort in the analysis of case-cohort data. Am. J. Epidemiol. 169 1398–1405.
  • Bůžková, P. and Lumley, T. (2009). Semiparametric modeling of repeated measurements under outcome-dependent follow-up. Stat. Med. 28 987–1003.
  • Carroll, R. J., Ruppert, D., Stefanski, L. A. and Crainiceanu, C. M. (2006). Measurement Error in Nonlinear Models: A Modern Perspective, 2nd ed. Monographs on Statistics and Applied Probability 105. Chapman & Hall/CRC, Boca Raton, FL.
  • Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics 7 179–188.
  • CAMP Research Group (1999). The childhood asthma management program (CAMP): Design, rationale, and methods. Childhood asthma management program research group. Control. Clin. Trials 20 91–120.
  • CAMP Research Group (2000). Long-term effects of budesonide or nedocromil in children with asthma. N. Engl. J. Med. 343 1054–1063.
  • Horvitz, D. G. and Thompson, D. J. (1952). A generalization of sampling without replacement from a finite universe. J. Amer. Statist. Assoc. 47 663–685.
  • Kish, L. (1965). Survey Sampling. Wiley, New York.
  • Korn, E. L. and Graubard, B. I. (2011). Analysis of Health Surveys. Wiley, New York.
  • Laird, N. M. and Ware, J. H. (1982). Random-effects models for longitudinal data. Biometrics 38 963–974.
  • Lawless, J. F., Kalbfleisch, J. D. and Wild, C. J. (1999). Semiparametric methods for response-selective and missing data problems in regression. J. R. Stat. Soc. Ser. B. Stat. Methodol. 61 413–438.
  • Lin, D. Y. and Ying, Z. (2001). Semiparametric and nonparametric regression analysis of longitudinal data. J. Amer. Statist. Assoc. 96 103–126.
  • Lipsitz, S. R., Fitzmaurice, G. M., Ibrahim, J. G., Gelber, R. and Lipshultz, S. (2002). Parameter estimation in longitudinal studies with outcome-dependent follow-up. Biometrics 58 621–630.
  • Little, R. J. A. and Rubin, D. B. (2002). Statistical Analysis with Missing Data, 2nd ed. Wiley, Hoboken, NJ.
  • Little, R. J. A. and Schluchter, M. D. (1985). Maximum likelihood estimation for mixed continuous and categorical data with missing values. Biometrika 72 497–512.
  • Lyon, H., Lange, C., Lake, S., Silverman, E. K., Randolph, A. G., Kwiatkowski, D., Raby, B. A., Lazarus, R., Weiland, K. M., Laird, N. and Weiss, S. T. (2004). IL10 gene polymorphisms are associated with asthma phenotypes in children. Genet. Epidemiol. 26 155–165.
  • Marti, H. and Chavance, M. (2011). Multiple imputation analysis of case-cohort studies. Stat. Med. 30 1595–1607.
  • Neuhaus, J., Scott, A. J. and Wild, C. J. (2002). The analysis of retrospective family studies. Biometrika 89 23–37.
  • Neuhaus, J. M., Scott, A. J. and Wild, C. J. (2006). Family-specific approaches to the analysis of case–control family data. Biometrics 62 488–494.
  • Neuhaus, J. M., Scott, A. J., Wild, C. J., Jiang, Y., McCulloch, C. E. and Boylan, R. (2014). Likelihood-based analysis of longitudinal data from outcome-related sampling designs. Biometrics 70 44–52.
  • R Core Team (2013). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
  • Raghunathan, T. E., Lepkowski, J. M., Hoewyk, J. V. and Solenberger, P. (2001). A multivariate technique for multiply imputing missing values using a sequence of regression models. Surv. Methodol. 27 85–95.
  • Robins, J. M., Rotnitzky, A. and Zhao, L. P. (1994). Estimation of regression coefficients when some regressors are not always observed. J. Amer. Statist. Assoc. 89 846–866.
  • Rubin, D. B. (1976). Inference and missing data. Biometrika 63 581–592.
  • Schafer, J. L. (2010). Analysis of Incomplete Multivariate Data. CRC Press, Boca Raton, FL.
  • Schafer, J. L. and Graham, J. W. (2002). Missing data: Our view of the state of the art. Psychol. Methods 7 147–177.
  • Schildcrout, J. S., Garbett, S. P. and Heagerty, P. J. (2013). Outcome vector dependent sampling with longitudinal continuous response data: Stratified sampling based on summary statistics. Biometrics 69 405–416.
  • Schildcrout, J. S. and Heagerty, P. J. (2008). On outcome-dependent sampling designs for longitudinal binary response data with time-varying covariates. Biostatistics 9 735–749.
  • Schildcrout, J. S. and Heagerty, P. J. (2011). Outcome-dependent sampling from existing cohorts with longitudinal binary response data: Study planning and analysis. Biometrics 67 1583–1593.
  • Schildcrout, J. S. and Rathouz, P. J. (2010). Longitudinal studies of binary response data following case–control and stratified case–control sampling: Design and analysis. Biometrics 66 365–373.
  • Schildcrout, J. S., Mumford, S. L., Chen, Z., Heagerty, P. J. and Rathouz, P. J. (2012). Outcome-dependent sampling for longitudinal binary response data based on a time-varying auxiliary variable. Stat. Med. 31 2441–2456.
  • Schildcrout, J. S., Rathouz, P. J., Zelnick, L. R., Garbett, S. P. and Heagerty, P. J. (2015). Supplement to “Biased sampling designs to improve research efficiency: Factors influencing pulmonary function over time in children with asthma.” DOI:10.1214/15-AOAS826SUPPA, DOI:10.1214/15-AOAS826SUPPB.
  • Van Buuren, S. (2012). Flexible Imputation of Missing Data. CRC Press, Boca Raton, FL.
  • Weaver, M. A. and Zhou, H. (2005). An estimated likelihood method for continuous outcome regression models with outcome-dependent sampling. J. Amer. Statist. Assoc. 100 459–469.
  • White, I. R., Royston, P. and Wood, A. M. (2011). Multiple imputation using chained equations: Issues and guidance for practice. Stat. Med. 30 377–399.
  • Zhou, H., Weaver, M. A., Qin, J., Longnecker, M. P. and Wang, M. C. (2002). A semiparametric empirical likelihood method for data from an outcome-dependent sampling scheme with a continuous outcome. Biometrics 58 413–421.
  • Zhou, H., Chen, J., Rissanen, T. H., Korrick, S. A., Hu, H., Salonen, J. T. and Longnecker, M. P. (2007). Outcome-dependent sampling: An efficient sampling and inference procedure for studies with a continuous outcome. Epidemiology 18 461–468.
  • Zhou, H., Wu, Y., Liu, Y. and Cai, J. (2011). Semiparametric inference for a 2-stage outcome-auxiliary-dependent sampling design with continuous outcome. Biostatistics 12 521–534.
