The Annals of Applied Statistics

A smoothing approach for masking spatial data

Yijie Zhou, Francesca Dominici, and Thomas A. Louis

Full-text: Open access


Individual-level health data are often not publicly available due to confidentiality; masked data are released instead. Therefore, it is important to evaluate the utility of using the masked data in statistical analyses such as regression. In this paper we propose a data masking method which is based on spatial smoothing techniques. The proposed method allows for selecting both the form and the degree of masking, thus resulting in a large degree of flexibility. We investigate the utility of the masked data sets in terms of the mean square error (MSE) of regression parameter estimates when fitting a Generalized Linear Model (GLM) to the masked data. We also show that incorporating prior knowledge on the spatial pattern of the exposure into the data masking may reduce the bias and MSE of the parameter estimates. By evaluating both utility and disclosure risk as functions of the form and the degree of masking, our method produces a risk-utility profile which can facilitate the selection of masking parameters. We apply the method to a study of racial disparities in mortality rates using data on more than 4 million Medicare enrollees residing in 2095 zip codes in the Northeast region of the United States.
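The link between the degree of masking and the MSE of a regression estimate can be illustrated with a small simulation. The following is a minimal sketch, not the authors' R code. It assumes that masked values are kernel-weighted averages of the original values, uses an isotropic Gaussian kernel whose bandwidth plays the role of the degree of masking, and substitutes an ordinary least-squares slope for the GLM fit. All function names, the simulated data, and the parameter values are hypothetical.

```python
import math
import random


def kernel_weights(coords, bandwidth):
    """Row-normalized Gaussian kernel weights between all pairs of locations."""
    weights = []
    for (xi, yi) in coords:
        row = [
            math.exp(-((xi - xj) ** 2 + (yi - yj) ** 2) / (2.0 * bandwidth ** 2))
            for (xj, yj) in coords
        ]
        total = sum(row)
        weights.append([w / total for w in row])
    return weights


def mask(values, weights):
    """Replace each value by a weighted average over all locations."""
    return [sum(w * v for w, v in zip(row, values)) for row in weights]


def ols_slope(x, y):
    """Slope of the least-squares line of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / sxx


def slope_mse(bandwidth, true_slope=2.0, n_sites=40, n_reps=200, seed=0):
    """Monte Carlo MSE of the slope estimated from masked exposure and outcome."""
    rng = random.Random(seed)
    # Hypothetical site locations on the unit square.
    coords = [(rng.random(), rng.random()) for _ in range(n_sites)]
    weights = kernel_weights(coords, bandwidth)
    squared_errors = []
    for _ in range(n_reps):
        # Simulated exposure with a spatial trend, and a linear outcome.
        x = [cx + rng.gauss(0.0, 0.5) for (cx, _) in coords]
        y = [true_slope * xi + rng.gauss(0.0, 1.0) for xi in x]
        estimate = ols_slope(mask(x, weights), mask(y, weights))
        squared_errors.append((estimate - true_slope) ** 2)
    return sum(squared_errors) / n_reps


if __name__ == "__main__":
    # A larger bandwidth means heavier smoothing, that is, more masking.
    for h in (0.01, 0.1, 0.5):
        print(h, slope_mse(h))
```

Evaluating `slope_mse` over a grid of bandwidths gives the utility half of a risk-utility profile under these assumptions. A disclosure-risk measure would have to be computed separately and is not sketched here.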

Article information

Ann. Appl. Stat., Volume 4, Number 3 (2010), 1451-1475.

First available in Project Euclid: 18 October 2010


Keywords

Statistical disclosure limitation, data masking, data utility, disclosure risk, spatial smoothing


Zhou, Yijie; Dominici, Francesca; Louis, Thomas A. A smoothing approach for masking spatial data. Ann. Appl. Stat. 4 (2010), no. 3, 1451–1475. doi:10.1214/09-AOAS325.




Supplemental materials

  • Supplementary material: R code. We provide R code for (1) the utility part of the simulation study for the three examples, (2) a function that computes the disclosure risk, and (3) the calculation of the bivariate normal density kernel weight matrix.