Determining Bounds on Integrals with Applications to Cataloging Problems

Bernard Harris

doi:10.1214/aoms/1177706266

Ann. Math. Statist. 30(2): 521-548 (June, 1959). DOI: 10.1214/aoms/1177706266

Abstract

Assume that a random sample of size $N$ has been drawn from a multinomial population with an unknown and perhaps countably infinite number of classes. Hence, if $X_j$ is the $j$th observation and $M_i$ the $i$th class, then $$P\{X_j \in M_i\} = p_i \geqq 0, \quad i = 1, 2, \cdots, \text{ for all } j,$$ and $\sum^\infty_{i = 1} p_i = 1$. If the number of classes is finite, then $p_i = 0$ for all $i > S$, where $S$ is the number of classes. We do not suppose the classes to have a natural ordering, since the classes may be species of insects, or chess openings. Let $n_r$ be the number of classes which occur exactly $r$ times in the sample. Then $$\sum^\infty_{r = 0} rn_r = N \quad\text{and}\quad d = \sum^N_{r = 1} n_r,$$ where $d$ is the number of distinct classes observed in the sample.

It is the purpose of this paper to present some techniques to aid the experimenter in answering the following kinds of questions.

1) Prediction of the number of distinct classes that will be observed in a second sample of size $\alpha N$, $\alpha \geqq 1$.
2) Prediction of the number of additional classes that will be observed when the sample size is increased by $(\alpha - 1)N$ additional observations.
3) Estimation of the coverage of the sample, where coverage, denoted by $C$, is defined as follows: \begin{equation*}\tag{1}C = \sum_j p_j.\end{equation*} The sum is to be taken over those classes for which at least one representative has been observed.
4) Prediction of the coverage of a second sample of size $\alpha N$.
5) Prediction of the increased coverage to be obtained when the sample size is augmented by $(\alpha - 1)N$ additional observations.

Let $d(\alpha)$ and $C(\alpha)$ denote the number of classes and coverage obtained from $\alpha N$ observations, either in the case of a second sample, or an augmented sample.
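As a minimal illustration of the quantities $N$, $d$, and $n_r$ defined above, the following Python sketch tabulates them from a list of observed class labels. The function name `frequency_counts` and the toy sample are hypothetical and not taken from the paper. Note that $n_0$, the number of classes never seen, is not observable and is therefore absent from the tabulation.

```python
from collections import Counter

def frequency_counts(sample):
    """Return (N, d, n), where n[r] is the number of classes seen exactly r times.

    Only r >= 1 appears in n; n_0 (unseen classes) cannot be observed.
    """
    class_counts = Counter(sample)               # occurrences of each observed class
    n = dict(Counter(class_counts.values()))     # n[r] = number of classes with count r
    N = len(sample)                              # sample size
    d = len(class_counts)                        # number of distinct classes observed
    return N, d, n

# Hypothetical toy sample: labels have no natural ordering.
sample = ["a", "b", "a", "c", "d", "a", "b", "e"]
N, d, n = frequency_counts(sample)

# The two identities stated in the text.
assert sum(r * n_r for r, n_r in n.items()) == N
assert sum(n.values()) == d
```

Since the $r = 0$ term contributes nothing to $\sum r n_r$, the first identity can be checked using observed counts alone.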
Subsequently, we will show that \begin{equation*}\tag{2} Ed(\alpha) \backsim d + n_1 E\Big\{\frac{1 - e^{-(\alpha-1)x}}{x}\Big\}\end{equation*} and \begin{equation*}\tag{3}EC(\alpha) \backsim 1 - \frac{n_1}{N} + \frac{n_1}{N} E\{1 - e^{-(\alpha-1)x}\}\end{equation*} where the expectations on the right-hand sides of (2) and (3) are taken with respect to $F(x)$, a constructed cumulative distribution function on $[0, \infty)$. This distribution is unknown to the experimenter, but estimates of its moments are available. It will be shown that \begin{equation*}\tag{4}\mu_r \backsim \frac{(r + 1)!\, n_{r+1}}{n_1}\end{equation*} where $\mu_r = \int^\infty_0 x^r dF(x)$.

Hence, a reasonable procedure is to compute upper (lower) predictors for $d(\alpha)$ and $C(\alpha)$ by computing the supremum (infimum) of (2) and (3) respectively, where the supremum (infimum) is taken over all cumulative distribution functions whose first $k$ moments are specified by (4). It will be shown that $$\varphi(x) = \frac{1 - e^{-(\alpha-1)x}}{x} \quad\text{and}\quad\psi(x) = 1 - e^{-(\alpha-1)x}$$ may be treated identically in the computation of extrema, with the observation that the computation of the supremum of $E\{\varphi(x)\}$ will be identical with the computation of the infimum of $E\{\psi(x)\}$, and similarly for the computation of the infimum of $E\{\varphi(x)\}$. Thus, we will restrict the discussion to $\varphi(x)$.

The solution will be of the form \begin{equation*}\tag{5}\sup(\inf) E\{\varphi(x)\} = \sum^r_{i = 1} \lambda_i\varphi(x_i)\end{equation*} where $\lambda_i \geqq 0$, $\sum^r_{i = 1} \lambda_i = 1$, and $x_i$ belongs to the extended non-negative real numbers. To determine the $x_i$ and $\lambda_i$, the following system of equations must be solved.
\begin{equation*}\tag{6}\begin{split}\lambda_1x_1 + \lambda_2x_2 + \cdots + \lambda_rx_r &= \mu_1\\ \lambda_1x^2_1 + \lambda_2x^2_2 + \cdots + \lambda_rx^2_r &= \mu_2\\ &\ \,\vdots\\ \lambda_1x^k_1 + \lambda_2x^k_2 + \cdots + \lambda_rx^k_r &= \mu_k\end{split}\end{equation*} where $x_1 \leqq x_2 \leqq \cdots \leqq x_r$.

If $k = 2c$, $c$ an integer, the supremum will be obtained by solving (6) with $x_1 = 0$, $r \leqq (k + 2)/2$, and $x_2, \cdots, x_r$ interior points of $[0, \infty)$. If $k = 2c + 1$, the solution of (6) for the supremum will coincide with that for $k - 1$ with the addition of a mass point at infinity, with mass tending to zero at a rate which will satisfy the $k$th moment constraint. Since $\varphi(\infty) = 0$, no change in (5) is obtained by the addition of the $k$th moment constraint.

If $k = 2c + 1$, the infimum is obtained by solving (6) with $r \leqq (k + 1)/2$, where the $x_i$ are all interior points of $[0, \infty)$. If $k = 2c$, $c > 0$, the solution of (6) for the infimum is obtained from that for $k - 1$ by the addition of a mass point at infinity with mass tending to zero at a rate which will satisfy the last moment constraint.

Explicit solutions are computed for $k \leqq 3$, and applied to several examples. In addition, the low order moments of $d$, $n_r$, and $C$ are computed, and the asymptotic sampling error of $\hat C(1) = 1 - (n_1/N)$ as an estimate of sample coverage is given. It will be convenient before proceeding to the general problem to compute the low order moments of $d$, $n_r$, and $C$.
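The recipe above can be sketched in Python for the smallest cases. This is an illustration worked out from the stated structure of the solution, not the paper's own explicit formulas. For $k = 1$ the infimum uses a single interior point, so (6) gives $x_1 = \mu_1$. For $k = 2$ the supremum uses $x_1 = 0$ and one interior point, so (6) gives $x_2 = \mu_2/\mu_1$ and $\lambda_2 = \mu_1^2/\mu_2$. All function names and the frequency counts are hypothetical. The moment estimates from (4) are noisy and may violate $\mu_2 \geqq \mu_1^2$, in which case the sketch raises an error.

```python
import math

def phi(x, alpha):
    """phi(x) = (1 - exp(-(alpha-1)x)) / x, with phi(0) = alpha - 1 and phi(inf) = 0."""
    a = alpha - 1.0
    if x == 0.0:
        return a                          # limit of phi as x -> 0
    if math.isinf(x):
        return 0.0
    return -math.expm1(-a * x) / x        # expm1 keeps accuracy for small a*x

def moment_estimate(n, r):
    """Estimate mu_r by (r+1)! n_{r+1} / n_1, as in (4). Requires n[1] > 0."""
    return math.factorial(r + 1) * n.get(r + 1, 0) / n[1]

def phi_bounds(mu1, mu2, alpha):
    """Return (lower, upper) bounds on E{phi(x)} given two moments.

    lower: one interior mass point at x_1 = mu1 (the k = 1 infimum, unchanged for k = 2).
    upper: mass lam1 at x_1 = 0 and mass lam2 at x_2 = mu2 / mu1 (the k = 2 supremum).
    """
    if mu1 <= 0 or mu2 < mu1 ** 2:
        raise ValueError("moments are not those of a distribution on [0, inf)")
    x2 = mu2 / mu1
    lam2 = mu1 ** 2 / mu2
    lam1 = 1.0 - lam2
    upper = lam1 * phi(0.0, alpha) + lam2 * phi(x2, alpha)
    lower = phi(mu1, alpha)
    return lower, upper

# Hypothetical frequency counts n[r] (not data from the paper).
n = {1: 30, 2: 12, 3: 6, 4: 2}
N = sum(r * n_r for r, n_r in n.items())  # sample size
d = sum(n.values())                       # distinct classes observed
alpha = 2.0                               # second sample of the same size

mu1, mu2 = moment_estimate(n, 1), moment_estimate(n, 2)
lower, upper = phi_bounds(mu1, mu2, alpha)

# Lower and upper predictors for d(alpha) from (2), and the coverage estimate C-hat(1).
d_lower = d + n[1] * lower
d_upper = d + n[1] * upper
coverage_hat = 1.0 - n[1] / N
```

Substituting the two-point solution back into the first two equations of (6) recovers $\mu_1$ and $\mu_2$, which is a quick check on any hand-derived solution of this kind.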

Citation

Bernard Harris. "Determining Bounds on Integrals with Applications to Cataloging Problems." Ann. Math. Statist. 30 (2) 521 - 548, June, 1959. https://doi.org/10.1214/aoms/1177706266