Large-scale multiple testing problems require the simultaneous assessment of many p-values. This paper compares several methods to assess the evidence in multiple binomial counts of p-values: the maximum of the binomial counts after standardization (the “higher criticism statistic”), the maximum of the binomial counts after a log-likelihood ratio transformation (the “Berk–Jones statistic”), and a newly introduced average of the binomial counts after a likelihood ratio transformation. Simulations show that the higher criticism statistic outperforms the Berk–Jones statistic in the case of very sparse alternatives (sparsity coefficient $\beta \gtrapprox 0.75$), while the situation is reversed for $\beta \lessapprox 0.75$. The average likelihood ratio is found to combine the favorable performance of higher criticism in the very sparse case with that of the Berk–Jones statistic in the less sparse case and thus appears to dominate both statistics. Some asymptotic optimality theory is considered but found to set in too slowly to illuminate the above findings, at least for sample sizes up to one million. In contrast, asymptotic approximations to the critical values of the Berk–Jones statistic that have been developed by [In High Dimensional Probability III (2003) 321–332 Birkhäuser] and [Ann. Statist. 35 (2007) 2018–2053] are found to give surprisingly accurate approximations even for quite small sample sizes.
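The three statistics described above can be sketched numerically as follows. This is a minimal illustration under stated assumptions, not the paper's exact definitions: the binomial counts are evaluated at the sorted p-values, only the smaller half of the p-values is scanned, the one-sided versions are used, and the average likelihood ratio is taken with uniform weights (the paper's weighting may differ). The function name `scan_statistics` and the simulated data are hypothetical.

```python
import numpy as np


def scan_statistics(pvalues):
    """Return HC, BJ and log-average-LR statistics for a vector of p-values.

    Illustrative only: the order-statistic form, the restriction to the
    smaller half of the p-values, and the uniform averaging weights are
    assumptions of this sketch, not taken from the paper.
    """
    p = np.sort(np.asarray(pvalues, dtype=float))
    n = p.size
    if n < 2:
        raise ValueError("need at least two p-values")
    half = n // 2
    # Clip away exact 0 and 1 so logarithms and square roots stay finite.
    t = np.clip(p[:half], 1e-300, 1.0 - 1e-12)
    # Observed fraction of p-values at or below each threshold t.
    x = np.arange(1, half + 1) / n

    # Higher criticism: maximum of the standardized binomial counts.
    hc = float(np.max(np.sqrt(n) * (x - t) / np.sqrt(t * (1.0 - t))))

    # Binomial log-likelihood ratio n * K(x, t), counted only when there
    # is an excess of small p-values (x > t).
    kl = x * np.log(x / t) + (1.0 - x) * np.log((1.0 - x) / (1.0 - t))
    log_lr = np.where(x > t, n * kl, 0.0)

    # Berk-Jones: maximum of the log-likelihood ratios.
    bj = float(log_lr.max())

    # Log of the uniformly weighted average likelihood ratio, computed
    # relative to the maximum to avoid overflow in exp().
    log_alr = bj + float(np.log(np.mean(np.exp(log_lr - bj))))
    return {"hc": hc, "bj": bj, "log_alr": log_alr}


# Hypothetical usage: a global null versus a sparse alternative.
rng = np.random.default_rng(0)
p_null = rng.uniform(size=10_000)
p_alt = p_null.copy()
p_alt[:30] = rng.uniform(0.0, 1e-5, size=30)  # a few very small p-values
print(scan_statistics(p_null))
print(scan_statistics(p_alt))
```

Working on the log scale is a design choice of this sketch: the likelihood ratios grow exponentially in the sample size, so averaging them directly would overflow for large samples.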
Digital Object Identifier: 10.1214/12-IMSCOLL923