## The Annals of Statistics

### Ball Divergence: Nonparametric two sample test

#### Abstract

In this paper, we first introduce Ball Divergence, a novel measure of the difference between two probability measures in separable Banach spaces, and show that the Ball Divergence of two probability measures is zero if and only if these two probability measures are identical, without any moment assumption. Using Ball Divergence, we present a metric rank test procedure to detect the equality of distribution measures underlying independent samples. Because the procedure is rank-based, it is robust to outliers and heavy-tailed data. We show that this multivariate two sample test statistic is consistent with the Ball Divergence, and that it converges to a mixture of $\chi^{2}$ distributions under the null hypothesis and to a normal distribution under the alternative hypothesis. Importantly, we prove its consistency against a general alternative hypothesis. Moreover, this result does not depend on the ratio of the two sample sizes, ensuring that the test can be applied to imbalanced data. Numerical studies confirm that our test is superior to several existing tests in terms of Type I error and power. We conclude our paper with two applications of our method: one is for virtual screening in the drug development process and the other is for genome-wide expression analysis in hormone replacement therapy.
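The abstract does not spell out the sample statistic, so the following is only a minimal sketch under stated assumptions. It assumes the commonly stated empirical form of Ball Divergence: for every closed ball centered at one sample point with radius equal to its distance to another point of the same sample, compare the fraction of each sample that falls inside, and average the squared differences over all such balls from both samples. The Euclidean norm, the function names `ball_divergence` and `permutation_pvalue`, and the permutation calibration (swapped in here as a simple stand-in for the asymptotic $\chi^{2}$-mixture null distribution) are illustrative choices, not taken from the paper.

```python
import numpy as np


def _pairwise_dist(a, b):
    # Euclidean distances between rows of a and rows of b (assumed metric).
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)


def _half_divergence(a, b):
    # Balls are centered at a[i] with radius d(a[i], a[j]).
    d_aa = _pairwise_dist(a, a)
    d_ab = _pairwise_dist(a, b)
    radius = d_aa[:, :, None]                            # indexed [i, j, .]
    frac_a = (d_aa[:, None, :] <= radius).mean(axis=2)   # share of a in ball (i, j)
    frac_b = (d_ab[:, None, :] <= radius).mean(axis=2)   # share of b in ball (i, j)
    return np.mean((frac_a - frac_b) ** 2)


def ball_divergence(x, y):
    """Assumed empirical Ball Divergence; O(n^3) memory, small samples only."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    y = np.asarray(y, dtype=float).reshape(len(y), -1)
    return _half_divergence(x, y) + _half_divergence(y, x)


def permutation_pvalue(x, y, n_perm=199, seed=0):
    """Permutation p-value for the statistic above (a stand-in calibration)."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    y = np.asarray(y, dtype=float).reshape(len(y), -1)
    rng = np.random.default_rng(seed)
    pooled = np.vstack([x, y])
    n = len(x)
    observed = ball_divergence(x, y)
    exceed = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        stat = ball_divergence(pooled[idx[:n]], pooled[idx[n:]])
        exceed += int(stat >= observed)
    # Add-one correction keeps the p-value strictly positive.
    return observed, (exceed + 1) / (n_perm + 1)
```

Because the statistic uses the data only through comparisons of pairwise distances, a single extreme observation changes which balls contain which points but cannot inflate the statistic through its magnitude, which is one way to read the robustness claim in the abstract. Sample sizes may differ, since each fraction is normalized by its own sample size.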

#### Article information

Source
Ann. Statist., Volume 46, Number 3 (2018), 1109-1137.

Dates
Revised: February 2017
First available in Project Euclid: 3 May 2018

Permanent link to this document
https://projecteuclid.org/euclid.aos/1525313077

Digital Object Identifier
doi:10.1214/17-AOS1579

Mathematical Reviews number (MathSciNet)
MR3797998

Zentralblatt MATH identifier
1395.62101

Subjects
Primary: 62H15: Hypothesis testing
Secondary: 62G10: Hypothesis testing

#### Citation

Pan, Wenliang; Tian, Yuan; Wang, Xueqin; Zhang, Heping. Ball Divergence: Nonparametric two sample test. Ann. Statist. 46 (2018), no. 3, 1109–1137. doi:10.1214/17-AOS1579. https://projecteuclid.org/euclid.aos/1525313077
