Distributed statistical inference for massive data

Song Xi Chen; Liuhua Peng

doi:10.1214/21-AOS2062

Abstract

This paper considers distributed statistical inference for general symmetric statistics in the context of massive data with efficient computation. Estimation efficiency and asymptotic distributions of the distributed statistics are provided; these reveal different results for the nondegenerate and degenerate cases and show that the number of data subsets plays an important role. Two distributed bootstrap methods are proposed and analyzed to approximate the underlying distribution of the distributed statistics with improved computational efficiency over existing methods. The accuracy of the distributional approximation by the bootstrap is studied theoretically. One of the methods, the pseudo-distributed bootstrap, is particularly attractive if the number of datasets is large, as it directly resamples the subset-based statistics, requires less stringent conditions, and can be improved by studentization.
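As a rough illustration of the idea of resampling subset-based statistics, the following Python sketch splits a sample into subsets, averages a symmetric statistic over them, and then resamples the subset-level values with replacement. It is a minimal sketch under stated assumptions, not the authors' algorithm: the function names, the equal-weight average, the choice of the sample variance as the statistic, the simulated normal data, and the percentile interval are all illustrative choices that do not come from the paper, and no studentization is applied.

```python
import numpy as np


def distributed_statistic(data, n_subsets, statistic):
    """Average a statistic over (near-)equal-sized subsets of the data.

    Hypothetical helper: the equal-weight average is an assumption
    made for this illustration.
    """
    blocks = np.array_split(data, n_subsets)
    subset_stats = np.array([statistic(block) for block in blocks])
    return subset_stats.mean(), subset_stats


def resample_subset_statistics(subset_stats, n_boot=2000, seed=0):
    """Resample the subset-level statistics with replacement.

    Each bootstrap replicate is the mean of K values drawn with
    replacement from the K subset statistics; the raw data are not
    touched again.
    """
    rng = np.random.default_rng(seed)
    k = len(subset_stats)
    idx = rng.integers(0, k, size=(n_boot, k))
    return subset_stats[idx].mean(axis=1)


def sample_variance(x):
    # Unbiased sample variance, a nondegenerate symmetric statistic.
    return np.var(x, ddof=1)


# Simulated data (assumption): 100,000 standard normal draws, 200 subsets.
rng = np.random.default_rng(1)
data = rng.normal(loc=0.0, scale=1.0, size=100_000)

estimate, subset_stats = distributed_statistic(
    data, n_subsets=200, statistic=sample_variance
)
boot = resample_subset_statistics(subset_stats)

# Percentile interval from the bootstrap replicates (illustrative choice).
lo, hi = np.quantile(boot, [0.025, 0.975])
print(f"estimate = {estimate:.4f}, 95% interval = ({lo:.4f}, {hi:.4f})")
```

The resampling step works only on the 200 subset-level values, so its cost does not grow with the size of the full sample; this is the computational appeal the abstract points to when the number of subsets is large.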

Funding Statement

Chen’s research is partially supported by National Natural Science Foundation of China grants 92046021, 12026607, 12071013 and 71973005 and LMEQF at Peking University.

Citation

Song Xi Chen. Liuhua Peng. "Distributed statistical inference for massive data." Ann. Statist. 49 (5) 2851 - 2869, October 2021. https://doi.org/10.1214/21-AOS2062

Information

Received: 1 August 2020; Revised: 1 January 2021; Published: October 2021

First available in Project Euclid: 12 November 2021

MathSciNet: MR4338895

zbMATH: 1486.62123

Digital Object Identifier: 10.1214/21-AOS2062

Subjects:

Primary: 62G09

Secondary: 62G20

Keywords: Distributed bootstrap, distributed statistics, massive data, pseudo-distributed bootstrap
