Electronic Journal of Statistics

Detecting column dependence when rows are correlated and estimating the strength of the row correlation

Omkar Muralidharan

Full-text: Open access


Microarray experiments often yield a normal data matrix X whose rows correspond to genes and columns to samples. We commonly calculate test statistics Z = Xw, where Z_i is the test statistic for the ith gene, and apply false discovery rate (FDR) controlling methods to find interesting genes. For example, Z could measure the difference in expression levels between treatment and control groups, and we could seek differentially expressed genes. The empirical cdf of Z is important for FDR methods, since its mean and variance determine the bias and variance of FDR estimates. Efron (2009b) has shown that if the columns of X are independent, the variance of the empirical cdf of Z depends only on the mean-squared row correlation.
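As a rough illustration of the notation, the sketch below builds a weight vector w for a two-group comparison, forms Z = Xw, and computes a naive plug-in value for the mean-squared row correlation. The function names, the simulated data, and the unscaled difference of means are assumptions made for this example; the plug-in average is not the conservative estimator proposed in the paper, and a real test statistic would usually be standardized.

```python
import numpy as np

def two_group_weights(n_treat, n_ctrl):
    """Hypothetical helper: weights w such that X @ w is the
    treatment mean minus the control mean for every row (gene)."""
    return np.concatenate([
        np.full(n_treat, 1.0 / n_treat),
        np.full(n_ctrl, -1.0 / n_ctrl),
    ])

def mean_squared_row_correlation(X):
    """Naive plug-in: average squared sample correlation over all
    distinct pairs of rows. Sampling noise inflates this value
    (about 1/(n-1) for n columns when the true correlation is zero)."""
    R = np.corrcoef(X)                  # rows of X are treated as variables
    iu = np.triu_indices_from(R, k=1)   # distinct row pairs only
    return float(np.mean(R[iu] ** 2))

# Simulated stand-in for a genes-by-samples matrix (assumption: 200 genes,
# 10 treatment columns followed by 10 control columns).
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 20))

w = two_group_weights(10, 10)
Z = X @ w                               # one statistic per gene
msrc = mean_squared_row_correlation(X)
```

Here Z has one entry per row of X, and w sums to zero, so adding a constant to every sample in a row leaves that row's statistic unchanged.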

Microarray data, however, frequently shows signs of column dependence. In this paper, we show that Efron’s result still holds under column dependence, and give a conservative (upwardly biased) estimator for the mean-squared row correlation. We show that Fisher’s transformation for sample correlations is still normalizing and variance stabilizing under column dependence, and use it to construct a permutation-invariant test of column independence. Finally, we argue that estimating the mean-squared row correlation under column dependence is impossible in general. Code to perform our test is available in the R package “colcor” on CRAN.
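For readers unfamiliar with Fisher’s transformation, the following sketch applies z = arctanh(r) to the sample correlations between columns. It covers only the classical setting with independent rows, where z has variance of roughly 1/(n - 3) for n rows; it is neither the test developed in the paper nor the interface of the “colcor” package, and the function names and simulated data are assumptions.

```python
import numpy as np

def fisher_z(r):
    """Fisher's transformation z = arctanh(r) = 0.5 * log((1 + r) / (1 - r))."""
    return np.arctanh(r)

def column_fisher_z(X):
    """Hypothetical helper: Fisher-transformed sample correlations
    for every distinct pair of columns of X."""
    R = np.corrcoef(X, rowvar=False)    # columns of X are treated as variables
    iu = np.triu_indices_from(R, k=1)   # distinct column pairs only
    return fisher_z(R[iu])

# Simulated matrix with independent rows and columns (assumption).
rng = np.random.default_rng(1)
n_rows, n_cols = 500, 8
X = rng.standard_normal((n_rows, n_cols))

z = column_fisher_z(X)
# With independent rows, sqrt(n_rows - 3) * z is roughly standard normal
# when the true column correlation is zero.
scaled = np.sqrt(n_rows - 3) * z
```

When the rows are correlated, the effective number of independent observations is smaller than n_rows, so the 1/(n - 3) scaling used above no longer applies directly.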

Article information

Electron. J. Statist., Volume 4 (2010), 1527-1546.

First available in Project Euclid: 23 December 2010

Keywords: Fisher transformation; sample correlation; column dependence; root mean squared correlation; matrix normal


Muralidharan, Omkar. Detecting column dependence when rows are correlated and estimating the strength of the row correlation. Electron. J. Statist. 4 (2010), 1527--1546. doi:10.1214/10-EJS592. https://projecteuclid.org/euclid.ejs/1293113417


  • Genevera Allen and Robert Tibshirani. Inference with transposable data: Modeling the effects of row and column correlations. 2010.
  • James J. Chen, Robert R. Delongchamp, Chen-An Tsai, Huey-miin Hsueh, Frank Sistare, Karol L. Thompson, Varsha G. Desai, and James C. Fuscoe. Analysis of variance components in gene expression data. Bioinformatics, 20(9):1436–1446, 2004. DOI 10.1093/bioinformatics/bth118. URL http://bioinformatics.oxfordjournals.org/content/20/9/1436.abstract.
  • Elissa Cosgrove, Timothy Gardner, and Eric Kolaczyk. On the choice and number of microarrays for transcriptional regulatory network inference. BMC Bioinformatics, 11(1):454, 2010. ISSN 1471-2105. DOI 10.1186/1471-2105-11-454. URL http://www.biomedcentral.com/1471-2105/11/454.
  • David Donoho and Jiashun Jin. Higher criticism for detecting sparse heterogeneous mixtures. The Annals of Statistics, 32(3):962–994, 2004. ISSN 00905364. URL http://www.jstor.org/stable/3448581.
  • Bradley Efron. Are a set of microarrays independent of each other? Annals of Applied Statistics, 3, 2009a.
  • Bradley Efron. Correlated z-values and the accuracy of large-scale statistical estimates. 2009b.
  • T.R. Golub, D.K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J.P. Mesirov, H. Coller, M.L. Loh, J.R. Downing, M.A. Caligiuri, C.D. Bloomfield, and E.S. Lander. Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. Science, 286(5439):531–537, 1999. DOI 10.1126/science.286.5439.531. URL http://www.sciencemag.org/cgi/content/abstract/286/5439/531.
  • I. Hedenfalk, D. Duggan, Y.D. Chen, M. Radmacher, M. Bittner, R. Simon, P. Meltzer, B. Gusterson, M. Esteller, O.P. Kallioniemi, B. Wilfond, A. Borg, J. Trent, M. Raffeld, Z. Yakhini, A. Ben-Dor, E. Dougherty, J. Kononen, L. Bubendorf, W. Fehrle, S. Pittaluga, and Gruvberg. Gene-expression profiles in hereditary breast cancer. New England Journal of Medicine, 344:539–548, 2001. URL http://dx.doi.org/10.1056/NEJM200102223440801.
  • L. Isserlis. On a formula for the product-moment coefficient of any order of a normal frequency distribution in any number of variables. Biometrika, 12(1/2):134–139, 1918. ISSN 00063444. URL http://www.jstor.org/stable/2331932.
  • Ingram Olkin and John W. Pratt. Unbiased estimation of certain correlation coefficients. Annals of Mathematical Statistics, 29(1):201–211, 1958.
  • Richard A. Olshen and Bala Rajaratnam. Successive normalization of rectangular arrays. The Annals of Statistics, 38:1638–1664, 2010.
  • Art B. Owen. Variance of the number of false discoveries. Journal of the Royal Statistical Society, Series B, 67:411–426, 2005.
  • Matthew D. W. Piper, Pascale Daran-Lapujade, Christoffer Bro, Birgitte Regenberg, Steen Knudsen, Jens Nielsen, and Jack T. Pronk. Reproducibility of oligonucleotide microarray transcriptome analyses. Journal of Biological Chemistry, 277(40):37001–37008, 2002. DOI 10.1074/jbc.M204490200. URL http://www.jbc.org/content/277/40/37001.abstract.
  • Xing Qiu, Andrew Brooks, Lev Klebanov, and Andrei Yakovlev. The effects of normalization on the correlation structure of microarray data. BMC Bioinformatics, 6(1):120, 2005. ISSN 1471-2105. DOI 10.1186/1471-2105-6-120. URL http://www.biomedcentral.com/1471-2105/6/120.
  • Dinesh Singh, Phillip G. Febbo, Kenneth Ross, Donald G. Jackson, Judith Manola, Christine Ladd, Pablo Tamayo, Andrew A. Renshaw, Anthony V. D’Amico, Jerome P. Richie, Eric S. Lander, Massimo Loda, Philip W. Kantoff, Todd R. Golub, and William R. Sellers. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell, 1(2):203–209, 2002. ISSN 1535-6108. DOI 10.1016/S1535-6108(02)00030-2. URL http://www.sciencedirect.com/science/article/B6WWK-45J85YN-F/2/b0c5e920001196813bd1821d0191f4b9.