Statistics Surveys

Measuring multivariate association and beyond

Julie Josse and Susan Holmes

Abstract

Simple correlation coefficients between two variables have been generalized in many ways to measure association between two matrices. Coefficients such as the RV coefficient, the distance covariance (dCov) coefficient and kernel-based coefficients are used by different research communities. Scientists use these coefficients to test whether two random vectors are linked. Once testing has established that such an association exists, a next step, often ignored, is to explore and uncover the association’s underlying patterns.

This article surveys various measures of dependence between random vectors and the associated tests of independence, and emphasizes the connections and differences between the approaches. After defining the coefficients and their tests, we present recent improvements that enhance their statistical properties and ease of interpretation. We summarize multi-table approaches and describe scenarios in which the indices provide useful summaries of heterogeneous multi-block data. We illustrate these strategies on several real data examples and suggest directions for future research.
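
As a rough illustration of the kind of coefficient and test discussed above, the following minimal sketch computes an RV coefficient between two tables observed on the same n individuals and attaches a permutation p-value to it. It assumes the usual definition RV(X, Y) = tr(W_X W_Y) / sqrt(tr(W_X^2) tr(W_Y^2)) with W_X = XX' and W_Y = YY' built from column-centered tables. The function names, the simulated data and the number of permutations are hypothetical choices made for this example only; this is not the authors' implementation.

```python
import math
import random


def center_columns(rows):
    """Subtract each column mean so that every column sums to zero."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]


def gram(rows):
    """Return the n x n cross-product matrix W = X X' of a table."""
    return [[sum(a * b for a, b in zip(r, s)) for s in rows] for r in rows]


def inner(a, b):
    """tr(AB) for symmetric A and B, i.e. the sum of elementwise products."""
    return sum(u * v for ra, rb in zip(a, b) for u, v in zip(ra, rb))


def rv_coefficient(x, y):
    """RV coefficient between two tables sharing the same rows."""
    wx = gram(center_columns(x))
    wy = gram(center_columns(y))
    return inner(wx, wy) / math.sqrt(inner(wx, wx) * inner(wy, wy))


def rv_permutation_test(x, y, n_perm=199, seed=0):
    """Permute the rows of y to approximate the null distribution of RV."""
    rng = random.Random(seed)
    observed = rv_coefficient(x, y)
    hits = 0
    for _ in range(n_perm):
        y_perm = rng.sample(y, len(y))  # random reordering of the rows
        if rv_coefficient(x, y_perm) >= observed:
            hits += 1
    # Add one to numerator and denominator so the p-value is never zero.
    return observed, (hits + 1) / (n_perm + 1)


# Simulated data (an assumption for the example): y is a noisy linear
# function of x, so the two tables are strongly associated.
data_rng = random.Random(1)
x = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(30)]
y = [
    [a + b + data_rng.gauss(0, 0.1),
     a - b + data_rng.gauss(0, 0.1),
     2 * a + data_rng.gauss(0, 0.1)]
    for a, b in x
]

observed, p_value = rv_permutation_test(x, y)
print(f"RV = {observed:.3f}, permutation p-value = {p_value:.3f}")
```

Permuting the rows of one table breaks the pairing between individuals while leaving each table's internal structure intact, which is what makes the permuted values a plausible reference distribution under independence. A small p-value only signals that some association exists; exploring its pattern is the separate step that the abstract emphasizes.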

Article information

Source
Statist. Surv., Volume 10 (2016), 132-167.

Dates
Received: December 2015
First available in Project Euclid: 17 November 2016

Permanent link to this document
https://projecteuclid.org/euclid.ssu/1479351622

Digital Object Identifier
doi:10.1214/16-SS116

Mathematical Reviews number (MathSciNet)
MR3573303

Zentralblatt MATH identifier
1359.62212

Keywords
measures of association between matrices; RV coefficient; dCov coefficient; $k$ nearest-neighbor graph; HHG test; distance matrix; tests of independence; permutation tests; multi-block data analyses

Citation

Josse, Julie; Holmes, Susan. Measuring multivariate association and beyond. Statist. Surv. 10 (2016), 132–167. doi:10.1214/16-SS116. https://projecteuclid.org/euclid.ssu/1479351622


