Statistical Science

Fractional Imputation in Survey Sampling: A Comparative Review

Shu Yang and Jae Kwang Kim

Full-text: Open access

Abstract

Fractional imputation (FI) is a relatively new method of imputation for handling item nonresponse in survey sampling. In FI, several imputed values with their fractional weights are created for each record with missing items. Each fractional weight represents the conditional probability of the imputed value given the observed data, and the parameters in the conditional probabilities are often computed by an iterative method such as the EM algorithm. The underlying model for FI can be fully parametric, semiparametric or nonparametric, depending on the plausibility of assumptions and the data structure.

In this paper, we give an overview of FI, introduce key ideas and methods to readers who are new to the FI literature, and highlight some new developments. We also provide guidance on practical implementation of FI and valid inferential tools after imputation. We compare the empirical performance of FI with that of multiple imputation using a pseudo finite population generated from a sample from the Monthly Retail Trade Survey conducted by the US Census Bureau.
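The fractional-weight idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a simple normal linear model y | x ~ N(b0 + b1*x, sigma^2), a fully observed covariate x, responses missing at random, and equal sampling weights. The function names (`normal_pdf`, `fit_wls`, `parametric_fi`, `fi_mean`) and the choice of a complete-case fit as the proposal distribution are assumptions made for illustration. Each missing record receives M imputed values drawn once from the proposal. The E-step recomputes fractional weights proportional to the ratio of the current model density to the proposal density, and the M-step refits the model by weighted maximum likelihood.

```python
import numpy as np


def normal_pdf(y, mean, sd):
    """Density of N(mean, sd^2) evaluated at y."""
    return np.exp(-0.5 * ((y - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))


def fit_wls(x, y, w):
    """Weighted ML fit of y | x ~ N(b0 + b1*x, sigma^2)."""
    X = np.column_stack([np.ones_like(x), x])
    XtW = X.T * w
    beta = np.linalg.solve(XtW @ X, XtW @ y)
    resid = y - X @ beta
    sigma = np.sqrt(np.sum(w * resid ** 2) / np.sum(w))
    return beta, sigma


def parametric_fi(x, y, observed, M=50, n_iter=100, tol=1e-8, seed=0):
    """Illustrative parametric fractional imputation (assumed normal model)."""
    rng = np.random.default_rng(seed)
    x_obs, y_obs = x[observed], y[observed]
    x_mis = x[~observed]

    # Proposal distribution h: the complete-case fit (an assumption of this sketch).
    beta0, sigma0 = fit_wls(x_obs, y_obs, np.ones(x_obs.size))
    mean0 = (beta0[0] + beta0[1] * x_mis)[:, None]

    # Draw M imputed values per missing record once; they stay fixed afterwards.
    y_imp = mean0 + sigma0 * rng.standard_normal((x_mis.size, M))
    h = normal_pdf(y_imp, mean0, sigma0)

    # Stack observed records and imputed values for the weighted M-step.
    x_all = np.concatenate([x_obs, np.repeat(x_mis, M)])
    y_all = np.concatenate([y_obs, y_imp.ravel()])

    beta, sigma = beta0.copy(), sigma0
    for _ in range(n_iter):
        # E-step: fractional weights proportional to f(y* | x; theta) / h(y* | x),
        # normalized to sum to one within each missing record.
        mean_t = (beta[0] + beta[1] * x_mis)[:, None]
        w = normal_pdf(y_imp, mean_t, sigma) / h
        w /= w.sum(axis=1, keepdims=True)

        # M-step: observed records get weight 1, imputed values their fractions.
        w_all = np.concatenate([np.ones(x_obs.size), w.ravel()])
        beta_new, sigma_new = fit_wls(x_all, y_all, w_all)

        change = max(np.max(np.abs(beta_new - beta)), abs(sigma_new - sigma))
        beta, sigma = beta_new, sigma_new
        if change < tol:
            break
    return beta, sigma, y_imp, w


def fi_mean(y_obs, y_imp, w, n):
    """FI estimate of the mean of y under equal sampling weights."""
    return (y_obs.sum() + np.sum(w * y_imp)) / n
```

Because the imputed values are fixed and only the weights change across iterations, the final data set can be analyzed as a single weighted file. In a real survey application, each record's sampling weight would multiply its fractional weights, which this sketch omits.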

Article information

Source
Statist. Sci., Volume 31, Number 3 (2016), 415-432.

Dates
First available in Project Euclid: 27 September 2016

Permanent link to this document
https://projecteuclid.org/euclid.ss/1475001236

Digital Object Identifier
doi:10.1214/16-STS569

Mathematical Reviews number (MathSciNet)
MR3552742

Zentralblatt MATH identifier
06946233

Keywords
Item nonresponse; missing at random; Monte Carlo EM; multiple imputation; synthetic imputation

Citation

Yang, Shu; Kim, Jae Kwang. Fractional Imputation in Survey Sampling: A Comparative Review. Statist. Sci. 31 (2016), no. 3, 415–432. doi:10.1214/16-STS569. https://projecteuclid.org/euclid.ss/1475001236


