The Annals of Statistics

Semiparametric efficiency in GMM models with auxiliary data

Xiaohong Chen, Han Hong, and Alessandro Tarozzi

Full-text: Open access


We study semiparametric efficiency bounds and efficient estimation of parameters defined through general moment restrictions with missing data. Identification relies on auxiliary data containing information about the distribution of the missing variables conditional on proxy variables that are observed in both the primary and the auxiliary database, when such distribution is common to the two data sets. The auxiliary sample can be independent of the primary sample, or can be a subset of it. For both cases, we derive bounds when the probability of missing data given the proxy variables is unknown, or known, or belongs to a correctly specified parametric family. We find that the conditional probability is not ancillary when the two samples are independent. For all cases, we discuss efficient semiparametric estimators. An estimator based on a conditional expectation projection is shown to require milder regularity conditions than one based on inverse probability weighting.
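The two estimator families named in the abstract can be illustrated with a minimal sketch. It assumes a toy setup that is not taken from the paper: a single discrete proxy `x` with three cells, an auxiliary sample independent of the primary sample, a scalar outcome `y` observed only in the auxiliary sample, and the primary-population mean of `y` as the target. The function names, sample sizes, and cell probabilities are all hypothetical, and simple cell averages stand in for the sieve estimators the paper studies.

```python
import numpy as np

K = 3  # number of proxy cells (illustrative assumption)


def simulate(rng, n_primary=5000, n_aux=2000):
    """Toy data: the proxy distribution differs across samples, Y | X is common."""
    x_p = rng.choice(K, size=n_primary, p=[0.2, 0.3, 0.5])  # primary: X only
    x_a = rng.choice(K, size=n_aux, p=[0.5, 0.3, 0.2])      # auxiliary: X and Y
    y_a = 1.0 + 2.0 * x_a + rng.normal(size=n_aux)
    return x_p, x_a, y_a


def cep_mean(x_p, x_a, y_a):
    """Conditional expectation projection: estimate E[Y | X] on the auxiliary
    sample, then average the fitted values over the primary sample's proxies."""
    cell_means = np.array([y_a[x_a == v].mean() for v in range(K)])
    return cell_means[x_p].mean()


def ipw_mean(x_p, x_a, y_a):
    """Inverse probability weighting: reweight auxiliary outcomes by the odds
    p(x) / (1 - p(x)) of belonging to the primary sample given the proxy."""
    n_p = np.bincount(x_p, minlength=K)
    n_a = np.bincount(x_a, minlength=K)
    p = n_p / (n_p + n_a)      # estimated P(primary | X = x) in the pooled data
    odds = p / (1.0 - p)
    return (odds[x_a] * y_a).sum() / len(x_p)


x_p, x_a, y_a = simulate(np.random.default_rng(0))
true_mean = 1.0 + 2.0 * (0.3 * 1 + 0.5 * 2)  # equals 3.6 under the assumed design
naive = y_a.mean()                            # ignores the shift in the proxy distribution
cep = cep_mean(x_p, x_a, y_a)
ipw = ipw_mean(x_p, x_a, y_a)
print(f"truth={true_mean:.2f} naive={naive:.3f} cep={cep:.3f} ipw={ipw:.3f}")
```

With a discrete proxy and saturated cell estimates the two estimators coincide algebraically, so the sketch cannot show the difference in regularity conditions that the abstract refers to. That difference concerns settings where the conditional expectation and the conditional probability must be estimated nonparametrically.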

Article information

Ann. Statist. Volume 36, Number 2 (2008), 808–843.

First available in Project Euclid: 13 March 2008

Digital Object Identifier: doi:10.1214/009053607000000947

Primary: 62H12: Estimation; 62D05: Sampling theory, sample surveys
Secondary: 62F12: Asymptotic properties of estimators; 62G20: Asymptotic properties

Keywords: Semiparametric efficiency bounds; GMM; measurement error; missing data; auxiliary data; sieve estimation


Chen, Xiaohong; Hong, Han; Tarozzi, Alessandro. Semiparametric efficiency in GMM models with auxiliary data. Ann. Statist. 36 (2008), no. 2, 808–843. doi:10.1214/009053607000000947.


