The Annals of Statistics

Estimating a probability mass function with unknown labels

Dragi Anevski, Richard D. Gill, and Stefan Zohren

Abstract

In the context of a species sampling problem, we discuss a nonparametric maximum likelihood estimator for the underlying probability mass function. The estimator is known in the computer science literature as the high profile estimator. We prove strong consistency and derive the rates of convergence, for an extended model version of the estimator. We also study a sieved estimator for which similar consistency results are derived. Numerical computation of the sieved estimator is of great interest for practical problems, such as forensic DNA analysis, and we present a computational algorithm based on the stochastic approximation of the expectation maximisation algorithm. As an interesting byproduct of the numerical analyses, we introduce an algorithm for bounded isotonic regression for which we also prove convergence.
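To make the "unknown labels" setting concrete: when species labels carry no information, the sample can be reduced to its counts listed in decreasing order. The Python sketch below computes that reduction and a naive plug-in estimate, the ordered empirical frequencies. It is an illustrative baseline under stated assumptions, not the estimator studied in the article, and the function names `sorted_counts` and `naive_ordered_pmf` are hypothetical.

```python
from collections import Counter


def sorted_counts(sample):
    """Return the observed counts in decreasing order, discarding the labels."""
    return sorted(Counter(sample).values(), reverse=True)


def naive_ordered_pmf(sample):
    """Naive plug-in estimate: ordered empirical frequencies.

    Assumption: this is only a simple baseline for illustration; it is not
    the nonparametric maximum likelihood estimator described in the abstract.
    """
    counts = sorted_counts(sample)
    n = sum(counts)
    return [c / n for c in counts]


if __name__ == "__main__":
    sample = ["a", "b", "a", "c", "a", "b"]
    print(sorted_counts(sample))      # counts in decreasing order
    print(naive_ordered_pmf(sample))  # frequencies in decreasing order
```

Relabeling the species (for example, swapping "a" and "c") leaves both outputs unchanged, which is the invariance that the unknown-labels setting requires.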

Article information

Source
Ann. Statist., Volume 45, Number 6 (2017), 2708-2735.

Dates
Received: May 2016
First available in Project Euclid: 15 December 2017

Permanent link to this document
https://projecteuclid.org/euclid.aos/1513328588

Digital Object Identifier
doi:10.1214/17-AOS1542

Mathematical Reviews number (MathSciNet)
MR3737907

Zentralblatt MATH identifier
06838148

Subjects
Primary: 62G05: Estimation; 62G20: Asymptotic properties; 65C60: Computational problems in statistics; 62P10: Applications to biology and medical sciences

Keywords
NPMLE; high profile; probability mass function; strong consistency; sieve; ordered; monotone rearrangement; nonparametric; SA-EM; rates

Citation

Anevski, Dragi; Gill, Richard D.; Zohren, Stefan. Estimating a probability mass function with unknown labels. Ann. Statist. 45 (2017), no. 6, 2708-2735. doi:10.1214/17-AOS1542. https://projecteuclid.org/euclid.aos/1513328588


Supplemental materials

  • Supplement to “Estimating a probability mass function with unknown labels”. The supplement consists of Supplement A: Existence of the PML; Supplement B: Computation of the PML; and Supplement C: An algorithm for estimating a decreasing multinomial probability with lower bound.