The Annals of Applied Statistics

Multiple hypothesis testing adjusted for latent variables, with an application to the AGEMAP gene expression data

Yunting Sun, Nancy R. Zhang, and Art B. Owen

Full-text: Open access

Abstract

In high throughput settings we inspect a great many candidate variables (e.g., genes) searching for associations with a primary variable (e.g., a phenotype). High throughput hypothesis testing can be made difficult by the presence of systemic effects and other latent variables. It is well known that those variables alter the level of tests and induce correlations between tests. They also change the relative ordering of significance levels among hypotheses. Poor rankings lead to wasteful and ineffective follow-up studies. The problem becomes acute for latent variables that are correlated with the primary variable. We propose a two-stage analysis to counter the effects of latent variables on the ranking of hypotheses. Our method, called LEAPP, statistically isolates the latent variables from the primary one. In simulations, it gives better ordering of hypotheses than competing methods such as SVA and EIGENSTRAT. For an illustration, we turn to data from the AGEMAP study relating gene expression to age for 16 tissues in the mouse. LEAPP generates rankings with greater consistency across tissues than the rankings attained by the other methods.

Article information

Source
Ann. Appl. Stat., Volume 6, Number 4 (2012), 1664-1688.

Dates
First available in Project Euclid: 27 December 2012

Permanent link to this document
https://projecteuclid.org/euclid.aoas/1356629055

Digital Object Identifier
doi:10.1214/12-AOAS561

Mathematical Reviews number (MathSciNet)
MR3058679

Zentralblatt MATH identifier
1257.62115

Keywords
EIGENSTRAT; empirical null; surrogate variable analysis

Citation

Sun, Yunting; Zhang, Nancy R.; Owen, Art B. Multiple hypothesis testing adjusted for latent variables, with an application to the AGEMAP gene expression data. Ann. Appl. Stat. 6 (2012), no. 4, 1664–1688. doi:10.1214/12-AOAS561. https://projecteuclid.org/euclid.aoas/1356629055


