## The Annals of Applied Statistics

### People born in the Middle East but residing in the Netherlands: Invariant population size estimates and the role of active and passive covariates

#### Abstract

Including covariates in loglinear models of population registers improves population size estimates for two reasons. First, it makes it possible to take heterogeneity of inclusion probabilities over the levels of a covariate into account; second, it allows the estimated population to be subdivided by the levels of the covariates, giving insight into the characteristics of individuals who are not included in any of the registers. Whether marginalizing the full table of registers by covariates over one or more covariates leaves the population size estimate invariant is intimately related to the collapsibility of contingency tables [Biometrika 70 (1983) 567–578]. We show that, with information from two registers, population size invariance is equivalent to the simultaneous collapsibility of each margin consisting of one register and the covariates. We give a short path characterization of the loglinear model which describes when marginalizing over a covariate leads to different population size estimates. Covariates that are collapsible are called passive, to distinguish them from covariates that are not collapsible, which are termed active. We make the case that it can be useful to include passive covariates within the estimation model, because they allow a finer description of the population in terms of these covariates. As an example we discuss the estimation of the population size of people born in the Middle East but residing in the Netherlands.
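The invariance question can be illustrated with a minimal Python sketch. It is not the authors' estimation procedure: it uses the classical two-register estimator under independence (observed count plus n10·n01/n11), applied once per covariate level and once to the table collapsed over the covariate. All counts and function names are hypothetical, chosen so that in the first table only the inclusion probability of register A varies with the covariate, while in the second table both registers' inclusion probabilities vary.

```python
import math

def two_register_estimate(n11, n10, n01):
    """Population size under independence of two registers.

    n11: in both registers, n10: only in A, n01: only in B.
    The unobserved cell is estimated as n10 * n01 / n11.
    """
    return n11 + n10 + n01 + n10 * n01 / n11

def stratified_estimate(strata):
    """Sum of per-level estimates; strata is a list of (n11, n10, n01)."""
    return sum(two_register_estimate(*cells) for cells in strata)

def collapsed_estimate(strata):
    """Estimate after marginalizing the table over the covariate."""
    n11 = sum(s[0] for s in strata)
    n10 = sum(s[1] for s in strata)
    n01 = sum(s[2] for s in strata)
    return two_register_estimate(n11, n10, n01)

# Hypothetical counts for a covariate with two levels.
# Covariate related to register A only (behaves as passive here).
passive = [(200, 300, 200), (160, 240, 640)]
# Covariate related to both registers (behaves as active here).
active = [(300, 200, 300), (80, 320, 320)]

passive_stratified = stratified_estimate(passive)
passive_collapsed = collapsed_estimate(passive)
active_stratified = stratified_estimate(active)
active_collapsed = collapsed_estimate(active)

print(passive_stratified, passive_collapsed)  # equal: estimate is invariant
print(active_stratified, active_collapsed)    # differ: collapsing changes it
```

With these made-up counts, collapsing over the covariate leaves the estimate unchanged in the first table and lowers it in the second, which is the distinction the abstract draws between passive and active covariates.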

#### Article information

Source
Ann. Appl. Stat., Volume 6, Number 3 (2012), 831-852.

Dates
First available in Project Euclid: 31 August 2012

https://projecteuclid.org/euclid.aoas/1346418564

Digital Object Identifier
doi:10.1214/12-AOAS536

Mathematical Reviews number (MathSciNet)
MR3012511

Zentralblatt MATH identifier
06096512

#### Citation

van der Heijden, Peter G. M.; Whittaker, Joe; Cruyff, Maarten; Bakker, Bart; van der Vliet, Rik. People born in the Middle East but residing in the Netherlands: Invariant population size estimates and the role of active and passive covariates. Ann. Appl. Stat. 6 (2012), no. 3, 831–852. doi:10.1214/12-AOAS536. https://projecteuclid.org/euclid.aoas/1346418564

#### References

• Asmussen, S. and Edwards, D. (1983). Collapsibility and response variables in contingency tables. Biometrika 70 567–578.
• Baker, S. (1990). A simple EM algorithm for capture–recapture data with categorical covariates (with discussion). Biometrics 46 1193–1197.
• Bartolucci, F. and Forcina, A. (2001). Analysis of capture–recapture data with a Rasch-type model allowing for conditional dependence and multidimensionality. Biometrics 57 714–719.
• Bishop, Y. M. M., Fienberg, S. E. and Holland, P. W. (1975). Discrete Multivariate Analysis: Theory and Practice. The MIT Press, Cambridge, MA.
• Buckland, S. and Garthwaite, P. (1991). Quantifying precision of mark-recapture estimates using the bootstrap and related methods. Biometrics 47 255–268.
• Chao, A., Tsay, P. K., Lin, S. H., Shau, W. Y. and Chao, D. Y. (2001). The applications of capture–recapture models to epidemiological data. Stat. Med. 20 3123–3157.
• Cormack, R. (1989). Log-linear models for capture–recapture. Biometrics 45 395–413.
• Fienberg, S. E. (1972). The multiple recapture census for closed populations and incomplete $2^{k}$ contingency tables. Biometrika 59 591–603.
• Fienberg, S., Johnson, M. and Junker, B. (1999). Classical multilevel and Bayesian approaches to population size estimation using multiple lists. J. Roy. Statist. Soc. Ser. A 162 383–406.
• Hessen, D. J. (2011). Loglinear representations of multivariate Bernoulli Rasch models. British J. Math. Statist. Psych. 64 337–354.
• Hickman, L. J. and Suttorp, M. J. (2008). Are deportable aliens a unique threat to public safety? Comparing the recidivism of deportable and nondeportable aliens. Criminology and Public Policy 7 59–82.
• IWGDMF: International Working Group for Disease Monitoring and Forecasting (1995). Capture–recapture and multiple record systems estimation. Part I. History and theoretical development. American Journal of Epidemiology 142 1059–1068.
• Kim, S.-H. and Kim, S.-H. (2006). A note on collapsibility in DAG models of contingency tables. Scand. J. Stat. 33 575–590.
• Little, R. J. A. and Rubin, D. B. (1987). Statistical Analysis with Missing Data. Wiley, New York.
• Meng, X. L. and Rubin, D. B. (1991). IPF for contingency tables with missing data via the ECM algorithm. In Proceedings of the Statistical Computing Section of the American Statistical Association 244–247. Amer. Statist. Assoc., Washington, DC.
• Pollock, K. H. (2002). The use of auxiliary variables in capture–recapture modelling: An overview. J. Appl. Stat. 29 85–106.
• Schafer, J. L. (1997a). Analysis of Incomplete Multivariate Data. Monographs on Statistics and Applied Probability 72. Chapman & Hall, London.
• Schafer, J. (1997b). Imputation of missing covariates under a general linear mixed model. Dept. Statistics, Penn State Univ.
• Sutherland, J. M., Schwarz, C. J. and Rivest, L.-P. (2007). Multilist population estimation with incomplete and partial stratification. Biometrics 63 910–916.
• Valente, P. (2010). Main results of the UNECE/UNSD survey on the 2010/2011 round of censuses in the UNECE region. Eurostat, Luxembourg.
• van der Heijden, P. G. M., Zwane, E. and Hessen, D. (2009). Structurally missing data problems in multiple list capture–recapture data. AStA Adv. Stat. Anal. 93 5–21.
• van der Heijden, P. G. M., Whittaker, J., Cruyff, M., Bakker, B. and van der Vliet, R. (2012). Supplement to “People born in the Middle East but residing in the Netherlands: Invariant population size estimates and the role of active and passive covariates.” DOI:10.1214/12-AOAS536SUPP.
• Whittaker, J. (1990). Graphical Models in Applied Multivariate Statistics. Wiley, Chichester.
• Zwane, E. N. and van der Heijden, P. G. M. (2007). Analysing capture–recapture data when some variables of heterogeneous catchability are not collected or asked in all registrations. Stat. Med. 26 1069–1089.
• Zwane, E., van der Pal, K. and van der Heijden, P. G. M. (2004). The multiple-record systems estimator when registrations refer to different but overlapping populations. Stat. Med. 23 2267–2281.

#### Supplemental materials

• Supplementary material: Estimation in R. We make use of the CAT-procedure in R [Meng and Rubin (1991); Schafer (1997a), Chapters 7 and 8; Schafer (1997b)]. The CAT-procedure is a routine for the analysis of categorical-variable data sets with missing values. We describe our application of this procedure in detail in the supplemental article [van der Heijden et al. (2012)].