Open Access
Surprise sampling: Improving and extending the local case-control sampling
Xinwei Shen, Kani Chen, Wen Yu
Electron. J. Statist. 15(1): 2454-2482 (2021). DOI: 10.1214/21-EJS1844

Abstract

Fithian & Hastie [7] proposed a sampling scheme called local case-control (LCC) sampling that achieves stability and efficiency through a clever adjustment pertaining to the logistic model. It is particularly useful for classification with large and imbalanced data. This paper proposes a more general sampling scheme based on the working principle that data points deserve a higher sampling probability if they contain more information or appear “surprising” in the sense of, for example, a large error of pilot prediction or a large absolute score. Compared with the relevant existing sampling schemes, as reported in [7] and [1], the proposed one has several advantages. It adaptively yields the optimal forms for a variety of objectives, with LCC and the scheme of [1] as special cases. Under the same model specifications, the proposed estimator also performs no worse than those in the literature. The estimation procedure is valid even if the model is misspecified and/or the pilot estimator is inconsistent or dependent on the full data. We present theoretical justifications of the claimed advantages and of the optimality of the estimation and the sampling design. Unlike [1], our large-sample theory is population-wise rather than data-wise. Moreover, the proposed approach can be applied to unsupervised learning, since it essentially requires only a specific loss function and no response-covariate structure in the data. Numerical studies are carried out and provide evidence in support of the theory.
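For readers unfamiliar with the LCC baseline that the abstract refers to, the following is a minimal sketch of LCC sampling as described by Fithian & Hastie [7]: each point is kept with probability equal to the absolute error of a pilot prediction, a logistic model is fitted on the subsample, and the pilot coefficients are added back. It does not implement the surprise sampling scheme of this paper, whose sampling forms are not given in the abstract. The simulated data, the helper names (`sigmoid`, `fit_logistic`, `lcc_subsample`), and the choice of a small uniform subsample for the pilot fit are assumptions made for illustration only.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_logistic(X, y, n_iter=25):
    """Unpenalized logistic regression by Newton-Raphson (hypothetical helper).

    X is assumed to already contain an intercept column.
    """
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ theta)
        grad = X.T @ (y - p)
        hess = (X * (p * (1.0 - p))[:, None]).T @ X
        # Tiny ridge term only for numerical stability of the solve.
        theta = theta + np.linalg.solve(hess + 1e-8 * np.eye(X.shape[1]), grad)
    return theta


def lcc_subsample(X, y, theta_pilot, rng):
    """Keep each point with probability |y - pilot prediction|."""
    accept_prob = np.abs(y - sigmoid(X @ theta_pilot))
    keep = rng.random(len(y)) < accept_prob
    return keep, accept_prob


# Simulated imbalanced data (assumed setup, not from the paper).
rng = np.random.default_rng(0)
n = 20000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
theta_true = np.array([-3.0, 1.0])
y = (rng.random(n) < sigmoid(X @ theta_true)).astype(float)

# Pilot estimate from a small uniform subsample (one possible choice).
pilot_idx = rng.choice(n, size=2000, replace=False)
theta_pilot = fit_logistic(X[pilot_idx], y[pilot_idx])

# LCC step: subsample "surprising" points, fit, then add the pilot back.
keep, accept_prob = lcc_subsample(X, y, theta_pilot, rng)
theta_sub = fit_logistic(X[keep], y[keep])
theta_lcc = theta_sub + theta_pilot

print("subsample size:", int(keep.sum()), "of", n)
print("LCC estimate:", theta_lcc)
```

Points that the pilot already predicts well are rarely kept, so the subsample concentrates on the cases the pilot finds surprising. This is the intuition the abstract generalizes to other measures of surprise, such as a large absolute score.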

Funding Statement

Kani Chen was supported by Hong Kong GRF grants 16309816 and 1616212117. Wen Yu was supported by the National Natural Science Foundation of China Grants (12071088).

Acknowledgments

The authors thank Professor Cheng Zhang and Pengfei Ma for providing the micro-blog data, and thank the referee for their constructive comments and suggestions.

Citation

Xinwei Shen, Kani Chen, Wen Yu. "Surprise sampling: Improving and extending the local case-control sampling." Electron. J. Statist. 15 (1) 2454-2482, 2021. https://doi.org/10.1214/21-EJS1844

Information

Received: 1 August 2020; Published: 2021
First available in Project Euclid: 3 May 2021

Digital Object Identifier: 10.1214/21-EJS1844

Subjects:
Primary: 62D05
Secondary: 62J12

Keywords: generalized linear models, Horvitz-Thompson estimator, local case-control sampling, model mis-specification, subsampling

Vol.15 • No. 1 • 2021