The Annals of Applied Statistics

Bootstrapping data arrays of arbitrary order

Art B. Owen and Dean Eckles

Full-text: Open access

Abstract

In this paper we study a bootstrap strategy for estimating the variance of a mean taken over large multifactor crossed random effects data sets. We apply bootstrap reweighting independently to the levels of each factor, giving each observation the product of independently sampled factor weights. No exact bootstrap exists for this problem [McCullagh (2000) Bernoulli 6 285–301]. We show that the proposed bootstrap is mildly conservative, meaning biased toward overestimating the variance, under sufficient conditions that allow very unbalanced and heteroscedastic inputs. Earlier results for a resampling bootstrap apply only to two factors and use multinomial weights that are poorly suited to online computation. The proposed reweighting approach can be implemented in parallel and online settings. The results for this method apply to any number of factors. The method is illustrated using a three-factor data set of comment lengths from Facebook.
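The product reweighting described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function name `product_weight_bootstrap`, the `(levels, value)` row layout, and the choice of "double-or-nothing" weights (0 or 2 with equal probability, giving mean 1) are all assumptions made for illustration; the abstract does not specify a weight distribution. Each replicate draws one weight per level of each factor, multiplies the factor weights to get an observation weight, and records the weighted mean.

```python
import random
from statistics import pvariance


def product_weight_bootstrap(rows, n_reps=200, seed=0):
    """Return reweighted means, one per usable bootstrap replicate.

    rows: list of (levels, value) pairs, where levels is a tuple holding
    one level identifier per factor (hypothetical layout for this sketch).
    """
    rng = random.Random(seed)
    n_factors = len(rows[0][0])
    means = []
    for _ in range(n_reps):
        # One independent weight per level of each factor, drawn on first use.
        weights = [dict() for _ in range(n_factors)]
        num = 0.0
        den = 0.0
        for levels, y in rows:
            w = 1.0
            for j, level in enumerate(levels):
                if level not in weights[j]:
                    # Assumed weight distribution: 0 or 2, each with prob. 1/2.
                    weights[j][level] = rng.choice((0.0, 2.0))
                w *= weights[j][level]
            num += w * y
            den += w
        # Skip replicates in which every observation received weight zero.
        if den > 0:
            means.append(num / den)
    return means


# Toy three-factor example with made-up values.
toy_rows = [((u, s, t), float(u + 2 * s + t))
            for u in range(4) for s in range(3) for t in range(2)]
toy_means = product_weight_bootstrap(toy_rows, n_reps=100, seed=1)
toy_variance_estimate = pvariance(toy_means)
```

The spread of the replicate means, here summarized by `pvariance`, serves as the bootstrap variance estimate for the overall mean. Because each weight depends only on a factor level, the same scheme could in principle be computed in a single pass over the data, which is the property the abstract highlights for parallel and online settings.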

Article information

Source
Ann. Appl. Stat., Volume 6, Number 3 (2012), 895-927.

Dates
First available in Project Euclid: 31 August 2012

Permanent link to this document
https://projecteuclid.org/euclid.aoas/1346418567

Digital Object Identifier
doi:10.1214/12-AOAS547

Mathematical Reviews number (MathSciNet)
MR3012514

Zentralblatt MATH identifier
06096515

Keywords
Bayesian pigeonhole bootstrap; online bagging; online bootstrap; relational data; tensor data; unbalanced random effects

Citation

Owen, Art B.; Eckles, Dean. Bootstrapping data arrays of arbitrary order. Ann. Appl. Stat. 6 (2012), no. 3, 895–927. doi:10.1214/12-AOAS547. https://projecteuclid.org/euclid.aoas/1346418567



References

  • Bennett, J. and Lanning, S. (2007). The Netflix prize. In Proceedings of KDD Cup and Workshop 2007 35. ACM, New York.
  • Brennan, R. L., Harris, D. J. and Hanson, B. A. (1987). The bootstrap and other procedures for examining the variability of estimated variance components. Technical report, ACT.
  • Efron, B. (1979). Bootstrap methods: Another look at the jackknife. Ann. Statist. 7 1–26.
  • Hall, P. (1992). The Bootstrap and Edgeworth Expansion. Springer, New York.
  • Lee, H. K. H. and Clyde, M. A. (2004). Lossless online Bayesian bagging. J. Mach. Learn. Res. 5 143–151.
  • Mammen, E. (1992). When Does Bootstrap Work? Lecture Notes in Statistics 77. Springer, New York.
  • Mammen, E. (1993). Bootstrap and wild bootstrap for high-dimensional linear models. Ann. Statist. 21 255–285.
  • McCarthy, P. J. (1969). Pseudo-replication: Half samples. Review of the International Statistical Institute 37 239–264.
  • McCullagh, P. (2000). Resampling and exchangeable arrays. Bernoulli 6 285–301.
  • Newton, M. A. and Raftery, A. E. (1994). Approximate Bayesian inference with the weighted likelihood bootstrap. J. Roy. Statist. Soc. Ser. B 56 3–48.
  • Owen, A. B. (2007). The pigeonhole bootstrap. Ann. Appl. Stat. 1 386–411.
  • Oza, N. and Russell, S. (2001). Online bagging and boosting. In Artificial Intelligence and Statistics 2001 105–112. Morgan Kaufmann, San Mateo, CA.
  • Rubin, D. B. (1981). The Bayesian bootstrap. Ann. Statist. 9 130–134.
  • Searle, S. R., Casella, G. and McCulloch, C. E. (1992). Variance Components. Wiley, New York.
  • Thusoo, A., Sarma, J. S., Jain, N., Shao, Z., Chakka, P., Anthony, S., Liu, H., Wyckoff, P. and Murthy, R. (2009). Hive: A warehousing solution over a map-reduce framework. In Proceedings of the VLDB Endowment, Vol. 2 1626–1629. VLDB Endowment.
  • Wiley, E. W. (2001). Bootstrap strategies for variance component estimation: Theoretical and empirical results. Ph.D. thesis, Stanford Univ.