## Electronic Journal of Statistics

### A Good-Turing estimator for feature allocation models

#### Abstract

Feature allocation models generalize classical species sampling models by allowing every observation to belong to more than one species, now called features. Under the popular Bernoulli product model for feature allocation, we assume $n$ observable samples and we consider the problem of estimating the expected number $M_{n}$ of hitherto unseen features that would be observed if one additional individual were sampled. The interest in estimating $M_{n}$ is motivated by numerous applied problems where the sampling procedure is expensive, in terms of time and/or financial resources, and further sampling can be justified only by the possibility of recording new, previously unobserved features. We consider a nonparametric estimator $\hat{M}_{n}$ of $M_{n}$ which has the same analytic form as the popular Good-Turing estimator of the missing mass in the context of species sampling models. We show that $\hat{M}_{n}$ admits a natural interpretation both as a jackknife estimator and as a nonparametric empirical Bayes estimator. Furthermore, we give provable guarantees for the performance of $\hat{M}_{n}$ in terms of minimax rate optimality, and we provide an interesting connection between $\hat{M}_{n}$ and the Good-Turing estimator for species sampling. Finally, we derive non-asymptotic confidence intervals for $\hat{M}_{n}$, which are easily computable and do not rely on any asymptotic approximation. Our approach is illustrated with synthetic data and SNP data from the ENCODE sequencing genome project.
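As a rough illustration of what a Good-Turing-type estimate looks like in this setting, the sketch below assumes that the data are stored as a binary matrix with one row per sample and one column per feature, and that the estimator takes the Good-Turing form $K_{n,1}/n$, where $K_{n,1}$ is the number of features observed in exactly one of the $n$ samples. The abstract does not state the formula explicitly, so this form, the function name `good_turing_new_features`, and the matrix layout are assumptions made for illustration only.

```python
from typing import Sequence


def good_turing_new_features(samples: Sequence[Sequence[int]]) -> float:
    """Good-Turing-type estimate of the expected number of new features
    in one additional sample.

    `samples` is a binary matrix (assumed layout): samples[i][j] == 1 if
    sample i displays feature j, and 0 otherwise.
    Assumed estimator form: K_{n,1} / n, where K_{n,1} counts the features
    seen in exactly one of the n samples.
    """
    n = len(samples)
    if n == 0:
        raise ValueError("at least one sample is required")

    # Frequency of each feature across the n samples (column sums).
    feature_counts = [sum(column) for column in zip(*samples)]

    # K_{n,1}: number of features that appear in exactly one sample.
    k1 = sum(1 for count in feature_counts if count == 1)

    return k1 / n


# Toy data: 3 samples, 4 candidate features.
toy = [
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]
estimate = good_turing_new_features(toy)
```

In the toy matrix, the second and third features each occur in exactly one sample, so the assumed estimate is 2/3. Features that were never observed (all-zero columns) do not affect the result, which is consistent with an estimator that depends only on observed frequencies.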

#### Article information

Source
Electron. J. Statist., Volume 13, Number 2 (2019), 3775-3804.

Dates
First available in Project Euclid: 1 October 2019

https://projecteuclid.org/euclid.ejs/1569895287

Digital Object Identifier
doi:10.1214/19-EJS1614

Subjects
Primary: 62G05: Estimation; 62C20: Minimax procedures

#### Citation

Ayed, Fadhel; Battiston, Marco; Camerlenghi, Federico; Favaro, Stefano. A Good-Turing estimator for feature allocation models. Electron. J. Statist. 13 (2019), no. 2, 3775--3804. doi:10.1214/19-EJS1614. https://projecteuclid.org/euclid.ejs/1569895287
