Bayesian Analysis

Bayesian Bootstraps for Massive Data

Andrés F. Barrientos and Víctor Peña

Advance publication

This article is in its final form and can be cited using the date of online publication and the DOI.

Full-text: Open access

Abstract

In this article, we present data-subsetting algorithms that allow for the approximate and scalable implementation of the Bayesian bootstrap. They are analogous to two existing algorithms in the frequentist literature: the bag of little bootstraps (Kleiner et al., 2014) and the subsampled double bootstrap (Sengupta et al., 2016). Our algorithms have appealing theoretical and computational properties that are comparable to those of their frequentist counterparts. Additionally, we provide a strategy for performing lossless inference for a class of functionals of the Bayesian bootstrap and briefly introduce extensions to the Dirichlet process.
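The classical Bayesian bootstrap (Rubin, 1981), which the article's algorithms scale to massive data, replaces frequentist multinomial resampling with random Dirichlet(1, …, 1) weights over the observations. A minimal sketch, assuming a NumPy environment (the function and variable names here are illustrative, not from the article):

```python
import numpy as np

def bayesian_bootstrap(data, stat, n_draws=2000, seed=0):
    """Rubin's (1981) Bayesian bootstrap: draw Dirichlet(1, ..., 1)
    weights over the observations and evaluate a weighted statistic
    under each draw, yielding posterior draws of the functional."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    n = len(data)
    # Each row is one posterior draw of the observation weights.
    weights = rng.dirichlet(np.ones(n), size=n_draws)  # shape (n_draws, n)
    return np.array([stat(data, w) for w in weights])

# Posterior draws for the mean functional: a weighted average of the data.
data = np.random.default_rng(42).normal(loc=1.0, scale=2.0, size=500)
draws = bayesian_bootstrap(data, lambda x, w: np.sum(w * x))
```

The subset-based algorithms in the article avoid repeatedly touching all n observations; in the spirit of the bag of little bootstraps, the expensive full-data weighting step is replaced by work on small subsets, but the exact construction is the article's contribution and is not reproduced here.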

Article information

Source
Bayesian Anal., Advance publication (2019), 26 pages.

Dates
First available in Project Euclid: 10 May 2019

Permanent link to this document
https://projecteuclid.org/euclid.ba/1557475224

Digital Object Identifier
doi:10.1214/19-BA1155

Keywords
bootstrap; big data; Bayesian nonparametric; scalable inference

Rights
Creative Commons Attribution 4.0 International License.

Citation

Barrientos, Andrés F.; Peña, Víctor. Bayesian Bootstraps for Massive Data. Bayesian Anal., advance publication, 10 May 2019. doi:10.1214/19-BA1155. https://projecteuclid.org/euclid.ba/1557475224

References

  • Barrientos, A. F. and Peña, V. (2019). “Supplementary material: Bayesian bootstraps for massive data.” Bayesian Analysis.
  • Barrientos, A. F., Bolton, A., Balmat, T., Reiter, J. P., de Figueiredo, J. M., Machanavajjhala, A., Chen, Y., Kneifel, C., and DeLong, M. (2018). “Providing access to confidential research data through synthesis and verification: An application to data on employees of the U.S. federal government.” The Annals of Applied Statistics, 12(2): 1124–1156.
  • Bolton, A. D. and de Figueiredo, J. M. (2016a). “Measuring and explaining the gender wage gap in the federal government.” Paper presented at the 2016 annual meeting of the American Political Science Association, Philadelphia, Pennsylvania.
  • Bolton, A. D. and de Figueiredo, J. M. (2016b). “Rising wages and human capital in the federal government.” Paper presented at the 2016 annual meeting of the Southern Political Science Association, San Juan, Puerto Rico.
  • Carroll, R. J. and Pederson, S. (1993). “On robustness in the logistic regression model.” Journal of the Royal Statistical Society. Series B (Methodological), 693–706.
  • Castillo, I. and Nickl, R. (2014). “On the Bernstein–von Mises phenomenon for nonparametric Bayes procedures.” The Annals of Statistics, 42(5): 1941–1969.
  • Chamberlain, G. and Imbens, G. W. (2003). “Nonparametric applications of Bayesian inference.” Journal of Business & Economic Statistics, 21(1): 12–18.
  • Choudhuri, N. (1998). “Bayesian bootstrap credible sets for multidimensional mean functional.” The Annals of Statistics, 26(6): 2104–2127.
  • Cifarelli, D. M. and Melilli, E. (2000). “Some new results for Dirichlet priors.” The Annals of Statistics, 28(5): 1390–1413.
  • Clyde, M. and Lee, H. (2001). “Bagging and the Bayesian bootstrap.” In Richardson, T. and Jaakkola, T. (eds.), Artificial Intelligence and Statistics, 169–174.
  • Datta, J. and Ghosh, J. K. (2014). “Bootstrap—an exploration.” Statistical Methodology, 20: 63–72.
  • Dong, Q., Elliott, M. R., and Raghunathan, T. E. (2014). “A nonparametric method to generate synthetic populations to adjust for complex sampling design features.” Survey Methodology, 40(1): 29–46.
  • Efron, B. (1979). “Bootstrap methods: another look at the jackknife.” The Annals of Statistics, 7(1): 1–26.
  • Fushiki, T. (2010). “Bayesian bootstrap prediction.” Journal of Statistical Planning and Inference, 140(1): 65–74.
  • Gasparini, M. (1995). “Exact multivariate Bayesian bootstrap distributions of moments.” The Annals of Statistics, 23(3): 762–768.
  • Graham, D. J., McCoy, E. J., and Stephens, D. A. (2016). “Approximate Bayesian inference for doubly robust estimation.” Bayesian Analysis, 11(1): 47–69.
  • Gu, J., Ghosal, S., and Roy, A. (2008). “Bayesian bootstrap estimation of ROC curve.” Statistics in Medicine, 27(26): 5407–5420.
  • Hahn, J. (1997). “Bayesian bootstrap of the quantile regression estimator: a large sample study.” International Economic Review, 38(4): 795–808.
  • Heckelei, T. and Mittelhammer, R. C. (2003). “Bayesian bootstrap multivariate regression.” Journal of Econometrics, 112(2): 241–264.
  • Ishwaran, H., James, L. F., and Zarepour, M. (2009). “An alternative to the m out of n bootstrap.” Journal of Statistical Planning and Inference, 139(3): 788–801.
  • Ishwaran, H. and Zarepour, M. (2002). “Exact and approximate sum representations for the Dirichlet process.” Canadian Journal of Statistics, 30(2): 269–283.
  • Jacqmin-Gadda, H., Sibillot, S., Proust, C., Molina, J.-M., and Thiébaut, R. (2007). “Robustness of the linear mixed model to misspecified error distribution.” Computational Statistics & Data Analysis, 51(10): 5142–5154.
  • James, L. F. (1997). “A study of a class of weighted bootstraps for censored data.” The Annals of Statistics, 25(4): 1595–1621.
  • James, L. F. (2008). “Large sample asymptotics for the two-parameter Poisson–Dirichlet process.” In Pushing the limits of contemporary statistics: contributions in honor of Jayanta K. Ghosh, 187–199. Institute of Mathematical Statistics.
  • Kim, Y. and Lee, J. (2003). “Bayesian bootstrap for proportional hazards models.” The Annals of Statistics, 31(6): 1905–1922.
  • Kingman, J. F. (1975). “Random discrete distributions.” Journal of the Royal Statistical Society. Series B (Methodological), 1–22.
  • Kleiner, A., Talwalkar, A., Sarkar, P., and Jordan, M. I. (2014). “A scalable bootstrap for massive data.” Journal of the Royal Statistical Society. Series B. Statistical Methodology, 76(4): 795–816.
  • Lee, H. K. H. and Clyde, M. A. (2004). “Lossless online Bayesian bagging.” Journal of Machine Learning Research, 5: 143–151.
  • Li, C., Srivastava, S., and Dunson, D. B. (2017). “Simple, scalable and accurate posterior interval estimation.” Biometrika, 104(3): 665–680.
  • Liang, K.-Y. and Zeger, S. L. (1986). “Longitudinal data analysis using generalized linear models.” Biometrika, 73(1): 13–22.
  • Lo, A. Y. (1983). “Weak convergence for Dirichlet processes.” Sankhyā: The Indian Journal of Statistics, Series A, 105–111.
  • Lo, A. Y. (1987). “A large sample study of the Bayesian bootstrap.” The Annals of Statistics, 15(1): 360–375.
  • Lo, A. Y. (1988). “A Bayesian bootstrap for a finite population.” The Annals of Statistics, 16(4): 1684–1695.
  • Lo, A. Y. (1991). “Bayesian bootstrap clones and a biometry function.” Sankhyā Ser. A, 53(3): 320–333.
  • Lo, A. Y. (1993). “A Bayesian bootstrap for censored data.” The Annals of Statistics, 21(1): 100–123.
  • Lyddon, S., Holmes, C., and Walker, S. (2019). “Generalized Bayesian updating and the loss-likelihood bootstrap.” Biometrika (in press).
  • Meeden, G. (1993). “Noninformative nonparametric Bayesian estimation of quantiles.” Statistics & Probability Letters, 16(2): 103–109.
  • Minsker, S., Srivastava, S., Lin, L., and Dunson, D. B. (2017). “Robust and scalable Bayes via a median of subset posterior measures.” The Journal of Machine Learning Research, 18(1): 4488–4527.
  • Muliere, P. and Secchi, P. (1996). “Bayesian nonparametric predictive inference and bootstrap techniques.” Annals of the Institute of Statistical Mathematics, 48(4): 663–673.
  • Neiswanger, W., Wang, C., and Xing, E. P. (2014). “Asymptotically exact, embarrassingly parallel MCMC.” In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, UAI’14, 623–632. AUAI Press.
  • Pinheiro, J., Bates, D., DebRoy, S., Sarkar, D., and R Core Team (2018). nlme: Linear and nonlinear mixed effects models. R package version 3.1-137. URL https://CRAN.R-project.org/package=nlme
  • Pitman, J. (1995). “Exchangeable and partially exchangeable random partitions.” Probability Theory and Related Fields, 102: 145–158.
  • Pitman, J. (1996). “Some developments of the Blackwell-MacQueen urn scheme.” Lecture Notes-Monograph Series, 245–267.
  • Pustejovsky, J. (2018). clubSandwich: Cluster-robust (Sandwich) variance estimators with small-sample corrections. R package version 0.3.2. URL https://CRAN.R-project.org/package=clubSandwich
  • R Core Team (2015). “R: A language and environment for statistical computing.” URL https://www.R-project.org/
  • Rubin, D. B. (1981). “The Bayesian bootstrap.” The Annals of Statistics, 9(1): 130–134.
  • Rubin, D. B. and Schenker, N. (1986). “Multiple imputation for interval estimation from simple random samples with ignorable nonresponse.” Journal of the American Statistical Association, 81(394): 366–374.
  • Scott, S. L., Blocker, A. W., Bonassi, F. V., Chipman, H. A., George, E. I., and McCulloch, R. E. (2016). “Bayes and big data: The consensus Monte Carlo algorithm.” International Journal of Management Science and Engineering Management, 11(2): 78–88.
  • Sengupta, S., Volgushev, S., and Shao, X. (2016). “A subsampled double bootstrap for massive data.” Journal of the American Statistical Association, 111(515): 1222–1232.
  • Sethuraman, J. (1994). “A constructive definition of Dirichlet priors.” Statistica Sinica, 4: 639–650.
  • Siddique, J. and Belin, T. R. (2008). “Using an approximate Bayesian bootstrap to multiply impute nonignorable missing data.” Computational Statistics & Data Analysis, 53(2): 405–415.
  • Srivastava, S., Cevher, V., Tran-Dinh, Q., and Dunson, D. B. (2015). “WASP: Scalable Bayes via barycenters of subset posteriors.” In Artificial Intelligence and Statistics.
  • Srivastava, S., Li, C., and Dunson, D. B. (2018). “Scalable Bayes via barycenter in Wasserstein space.” The Journal of Machine Learning Research, 19(1): 312–346.
  • Taddy, M., Chen, C.-S., Yu, J., and Wyle, M. (2015). “Bayesian and empirical Bayesian forests.” In Blei, D. and Bach, F. (eds.), Proceedings of the 32nd International Conference on Machine Learning (ICML-15), 967–976.
  • Taddy, M., Gardner, M., Chen, L., and Draper, D. (2016). “A nonparametric Bayesian analysis of heterogeneous treatment effects in digital experimentation.” Journal of Business & Economic Statistics, 34(4): 661–672.
  • van der Vaart, A. W. and Wellner, J. A. (1996). Weak convergence and empirical processes. Springer Series in Statistics. Springer-Verlag, New York. With applications to statistics.
  • Varron, D. (2014). “Donsker and Glivenko-Cantelli theorems for a class of processes generalizing the empirical process.” Electronic Journal of Statistics, 8(2): 2296–2320.
  • Wang, X. and Dunson, D. B. (2013). “Parallelizing MCMC via Weierstrass sampler.” arXiv preprint arXiv:1312.4605.
  • Wang, X., Guo, F., Heller, K. A., and Dunson, D. B. (2015). “Parallelizing MCMC with random partition trees.” In Advances in Neural Information Processing Systems, 451–459.
  • Welsh, A. and Richardson, A. (1997). “Approaches to the robust estimation of mixed models.” Handbook of Statistics, 15: 343–384.
  • Weng, C.-S. (1989). “On a second-order asymptotic property of the Bayesian bootstrap mean.” The Annals of Statistics, 17(2): 705–710.
  • Zhou, H., Elliott, M. R., and Raghunathan, T. E. (2016). “Multiple imputation in two-stage cluster samples using the weighted finite population Bayesian bootstrap.” Journal of Survey Statistics and Methodology, 4(2): 139–170.

Supplemental materials

  • Supplementary material: Bayesian Bootstraps for Massive Data. The supplementary material has six sections: the first provides theoretical results for the processes proposed in Sections 2.1, 2.2, and 2.3; the second contains a figure detailing the Monte Carlo algorithm for performing lossless inference for the class of functionals described in Section 2.3; the third gives a scheme for lossless simulation for the example from Chamberlain and Imbens (2003) discussed in Section 2.3; the fourth explains how to perform lossless inference for the Dirichlet-Multinomial process; the fifth includes a table with relative and absolute errors for the linear regression coefficients estimated from the OPM-2011 dataset in Section 4.1; and the sixth assesses the performance of the BLBB, SDBB, ANS, and AN approximations to the coefficients of a quantile regression fitted to the OPM-2011 dataset.