Sequential category aggregation and partitioning approaches for multi-way contingency tables based on survey and census data



The Annals of Applied Statistics

L. Fraser Jackson, Alistair G. Gray, and Stephen E. Fienberg

Source: Ann. Appl. Stat. Volume 2, Number 3 (2008), 955-981.

Abstract

Large contingency tables arise in many contexts, but especially in the collection of survey and census data by government statistical agencies. Because the vast majority of the variables in this context have a large number of categories, agencies and users need a systematic way of constructing tables that summarize such contingency tables. We propose such an approach in this paper: we find members of a class of restricted log-linear models that maximize the likelihood of the data and use them to obtain a parsimonious representation of the table. In contrast with more standard approaches to model search in hierarchical log-linear models (HLLMs), our procedure systematically reduces the number of categories of the variables. Through a series of examples, we illustrate the extent to which it can preserve the interaction structure found with HLLMs and how it can be used as a data simplification procedure prior to HLL modeling. A feature of the procedure is that it can easily be applied to tables with millions of cells, providing a new way of summarizing large data sets in many disciplines. The focus is on information and description rather than statistical testing. The procedure can treat each variable in the table differently: preserving full detail, treating it as fully nominal, or preserving ordinality.
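The idea of reducing categories while losing as little information as possible can be illustrated with a minimal sketch. It is not the authors' procedure: it assumes a single two-way table of counts, measures information with the likelihood-ratio statistic G^2 for independence (which equals 2n times the Kullback-Leibler divergence of the observed proportions from the independence fit), and greedily merges the pair of row categories whose merge loses the least G^2. The function names `g2_independence` and `best_merge` and the example counts are hypothetical. Setting `ordinal=True` restricts merges to adjacent categories, a simple stand-in for preserving ordinality.

```python
import math
from itertools import combinations


def g2_independence(table):
    """Likelihood-ratio statistic G^2 for independence in a two-way count table."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    g2 = 0.0
    for i, row in enumerate(table):
        for j, x in enumerate(row):
            if x > 0:  # empty cells contribute nothing
                g2 += 2.0 * x * math.log(x * n / (row_tot[i] * col_tot[j]))
    return g2


def best_merge(table, labels, ordinal=False):
    """Merge the pair of row categories that loses the least G^2.

    Returns (loss, new_table, new_labels). With ordinal=True only
    adjacent rows are candidates, so the category order is preserved.
    """
    base = g2_independence(table)
    if ordinal:
        pairs = [(i, i + 1) for i in range(len(table) - 1)]
    else:
        pairs = list(combinations(range(len(table)), 2))
    best = None
    for a, b in pairs:  # a < b in every candidate pair
        merged_row = [x + y for x, y in zip(table[a], table[b])]
        new_table = table[:a] + [merged_row] + table[a + 1:b] + table[b + 1:]
        new_labels = (labels[:a] + [labels[a] + "+" + labels[b]]
                      + labels[a + 1:b] + labels[b + 1:])
        loss = base - g2_independence(new_table)
        if best is None or loss < best[0]:
            best = (loss, new_table, new_labels)
    return best


# Hypothetical counts: rows A and B have the same column profile (1:2),
# so merging them loses essentially no information about the association.
counts = [[10, 20], [20, 40], [30, 10]]
loss, merged, names = best_merge(counts, ["A", "B", "C"])
print(names, merged, round(loss, 6))
```

Calling `best_merge` repeatedly on its own output gives a sequential aggregation; the accumulated losses partition the original G^2 into the part retained by the summary table and the part discarded at each merge.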

Keywords: Collapsibility; Kullback–Leibler distance; level merging; log-linear modeling; partitioning information; reducing dimensionality

Links and Identifiers

Permanent link to this document: http://projecteuclid.org/euclid.aoas/1223908047
Digital Object Identifier: doi:10.1214/08-AOAS175

2009 © Institute of Mathematical Statistics