Electronic Journal of Statistics
- Electron. J. Statist.
- Volume 13, Number 2 (2019), 3775-3804.
A Good-Turing estimator for feature allocation models
Fadhel Ayed, Marco Battiston, Federico Camerlenghi, and Stefano Favaro
Abstract
Feature allocation models generalize classical species sampling models by allowing every observation to belong to more than one species, now called features. Under the popular Bernoulli product model for feature allocation, we assume $n$ observable samples and consider the problem of estimating the expected number $M_{n}$ of hitherto unseen features that would be observed if one additional individual were sampled. The interest in estimating $M_{n}$ is motivated by numerous applied problems where the sampling procedure is expensive, in terms of time and/or financial resources, and further samples can only be justified by the possibility of recording new, previously unobserved features. We consider a nonparametric estimator $\hat{M}_{n}$ of $M_{n}$ which has the same analytic form as the popular Good-Turing estimator of the missing mass in the context of species sampling models. We show that $\hat{M}_{n}$ admits a natural interpretation both as a jackknife estimator and as a nonparametric empirical Bayes estimator. Furthermore, we give provable guarantees for the performance of $\hat{M}_{n}$ in terms of minimax rate optimality, and we provide an interesting connection between $\hat{M}_{n}$ and the Good-Turing estimator for species sampling. Finally, we derive non-asymptotic confidence intervals for $\hat{M}_{n}$, which are easily computable and do not rely on any asymptotic approximation. Our approach is illustrated with synthetic data and SNP data from the ENCODE sequencing genome project.
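As a rough illustration of the estimator described above, the following Python sketch assumes that $\hat{M}_{n}$ takes the classical Good-Turing form, namely the number of features observed in exactly one of the $n$ samples divided by $n$. The abstract states only that the two estimators share the same analytic form, so this formula, the function name `good_turing_unseen_features`, and the toy data are assumptions made for illustration, not the authors' code.

```python
from collections import Counter
from typing import Hashable, Iterable, Sequence


def good_turing_unseen_features(samples: Sequence[Iterable[Hashable]]) -> float:
    """Estimate the expected number of new features in one additional sample.

    Assumed form (Good-Turing analogue): K1 / n, where K1 is the number of
    features displayed by exactly one of the n observed samples.
    """
    n = len(samples)
    if n == 0:
        raise ValueError("at least one sample is required")

    # Count, for each feature, how many samples display it.
    # set() guards against a feature being listed twice within one sample.
    feature_counts = Counter()
    for sample in samples:
        feature_counts.update(set(sample))

    # K1: features seen in exactly one sample.
    k1 = sum(1 for count in feature_counts.values() if count == 1)
    return k1 / n


# Toy data (hypothetical): four individuals, each with a set of features.
toy_samples = [{"a", "b"}, {"a", "c"}, {"a"}, {"d", "b"}]
estimate = good_turing_unseen_features(toy_samples)
print(estimate)  # features "c" and "d" appear once, so 2 / 4 = 0.5
```

Each sample is represented as a set of feature labels rather than a row of a binary matrix, which keeps the sketch free of external dependencies. The same count could be obtained from a binary matrix by summing columns and counting those equal to one.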
Article information
Source
Electron. J. Statist., Volume 13, Number 2 (2019), 3775-3804.
Dates
Received: December 2018
First available in Project Euclid: 1 October 2019
Permanent link to this document
https://projecteuclid.org/euclid.ejs/1569895287
Digital Object Identifier
doi:10.1214/19-EJS1614
Subjects
Primary: 62G05: Estimation; 62C20: Minimax procedures
Keywords
Bernoulli product model; feature allocation model; Good-Turing estimator; minimax rate optimality; missing mass; non-asymptotic uncertainty quantification; nonparametric empirical Bayes; SNP data
Rights
Creative Commons Attribution 4.0 International License.
Citation
Ayed, Fadhel; Battiston, Marco; Camerlenghi, Federico; Favaro, Stefano. A Good-Turing estimator for feature allocation models. Electron. J. Statist. 13 (2019), no. 2, 3775-3804. doi:10.1214/19-EJS1614. https://projecteuclid.org/euclid.ejs/1569895287