## The Annals of Applied Statistics

### Providing access to confidential research data through synthesis and verification: An application to data on employees of the U.S. federal government

#### Abstract

Data stewards seeking to provide access to large-scale social science data face a difficult challenge. They have to share data in ways that protect privacy and confidentiality, are informative for many analyses and purposes, and are relatively straightforward to use by data analysts. One approach suggested in the literature is that data stewards generate and release synthetic data, that is, data simulated from statistical models, while also providing users access to a verification server that allows them to assess the quality of inferences from the synthetic data. We present an application of the synthetic data plus verification server approach to longitudinal data on employees of the U.S. federal government. As part of the application, we present a novel model for generating synthetic career trajectories, as well as strategies for generating high dimensional, longitudinal synthetic datasets. We also present novel verification algorithms for regression coefficients that satisfy differential privacy. We illustrate the integrated use of synthetic data plus verification via analysis of differentials in pay by race. The integrated system performs as intended, allowing users to explore the synthetic data for potential pay differentials and learn through verifications which findings in the synthetic data hold up and which do not. The analysis on the confidential data reveals pay differentials across races not documented in published studies.

#### Article information

Source
Ann. Appl. Stat., Volume 12, Number 2 (2018), 1124-1156.

Dates
Revised: June 2018
First available in Project Euclid: 28 July 2018

https://projecteuclid.org/euclid.aoas/1532743488

Digital Object Identifier
doi:10.1214/18-AOAS1194

Mathematical Reviews number (MathSciNet)
MR3834297

Zentralblatt MATH identifier
06980487

#### Citation

Barrientos, Andrés F.; Bolton, Alexander; Balmat, Tom; Reiter, Jerome P.; de Figueiredo, John M.; Machanavajjhala, Ashwin; Chen, Yan; Kneifel, Charley; DeLong, Mark. Providing access to confidential research data through synthesis and verification: An application to data on employees of the U.S. federal government. Ann. Appl. Stat. 12 (2018), no. 2, 1124–1156. doi:10.1214/18-AOAS1194. https://projecteuclid.org/euclid.aoas/1532743488

#### Supplemental materials

• Supplement A to “Providing access to confidential research data through synthesis and verification: An application to data on employees of the U.S. federal government”. This document provides supporting material for aspects of the OPM synthesis plus verification application. In Section 1, we provide a formal description of the three submodels used to model the employee’s career. In Section 2, we discuss the modeling strategies used to synthesize variables in the OPM data. In Section 3, we provide the full list of the synthesized variables along with a brief description of each of them. In Section 4, we present the analyses of wage gaps conditional on six broad categories of occupation rather than the 803 used in the main text. In Section 5, we describe a method for empirical disclosure risk assessment for OPM synthetic data. In Section 6, we formally describe the verification measures for longitudinal trends in regression coefficients. In Section 7, we examine the performance of the $\varepsilon$-differentially private verification measures used in the text, and we present a verification measure that is suitable for analyses where some regression coefficients are nonestimable.
• Supplement B to “Providing access to confidential research data through synthesis and verification: An application to data on employees of the U.S. federal government”. This document provides graphical analyses comparing the OPM synthetic and confidential data used in the main text.
• Supplement C to “Providing access to confidential research data through synthesis and verification: An application to data on employees of the U.S. federal government”. This file contains the code used to generate the synthetic OPM data and compute the verification measures proposed in the main text.