The Annals of Applied Statistics

Joint and individual variation explained (JIVE) for integrated analysis of multiple data types

Eric F. Lock, Katherine A. Hoadley, J. S. Marron, and Andrew B. Nobel

Full-text: Open access

Abstract

Research in several fields now requires the analysis of data sets in which multiple high-dimensional types of data are available for a common set of objects. In particular, The Cancer Genome Atlas (TCGA) includes data from several diverse genomic technologies on the same cancerous tumor samples. In this paper we introduce Joint and Individual Variation Explained (JIVE), a general decomposition of variation for the integrated analysis of such data sets. The decomposition consists of three terms: a low-rank approximation capturing joint variation across data types, low-rank approximations for structured variation individual to each data type, and residual noise. JIVE quantifies the amount of joint variation between data types, reduces the dimensionality of the data and provides new directions for the visual exploration of joint and individual structures. The proposed method represents an extension of Principal Component Analysis and has clear advantages over popular two-block methods such as Canonical Correlation Analysis and Partial Least Squares. A JIVE analysis of gene expression and miRNA data on Glioblastoma Multiforme tumor samples reveals gene–miRNA associations and provides better characterization of tumor types.

Data and software are available at https://genome.unc.edu/jive/.
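The three-term decomposition described in the abstract can be illustrated with a short numerical sketch. The code below is not the authors' software. It is a minimal illustration that assumes each data type is stored as a matrix whose columns are the common objects, that the joint and individual ranks are already known, and that the blocks need no centering or scaling. The function name `jive_sketch` and its parameters are hypothetical. It alternates between a truncated SVD of the stacked data (joint term) and a truncated SVD of each block's remainder (individual terms), keeping the individual structure orthogonal to the joint structure.

```python
import numpy as np

def jive_sketch(blocks, joint_rank, indiv_ranks, n_iter=50):
    """Illustrative alternating fit of X_i ~ J_i + A_i (ranks assumed known).

    blocks: list of (p_i x n) float arrays that share the same n columns.
    Returns the per-block joint terms J_i and individual terms A_i.
    """
    splits = np.cumsum([X.shape[0] for X in blocks])[:-1]
    X = np.vstack(blocks)
    A = [np.zeros_like(Xi) for Xi in blocks]
    for _ in range(n_iter):
        # Joint step: rank-r approximation of the stacked data
        # after removing the current individual structure.
        U, s, Vt = np.linalg.svd(X - np.vstack(A), full_matrices=False)
        J = (U[:, :joint_rank] * s[:joint_rank]) @ Vt[:joint_rank]
        J_blocks = np.split(J, splits, axis=0)
        # Projector onto the orthogonal complement of the joint row space.
        V = Vt[:joint_rank].T
        P = np.eye(X.shape[1]) - V @ V.T
        # Individual step: low-rank fit to each block's projected remainder.
        A = []
        for Xi, Ji, ri in zip(blocks, J_blocks, indiv_ranks):
            Ui, si, Vti = np.linalg.svd((Xi - Ji) @ P, full_matrices=False)
            A.append((Ui[:, :ri] * si[:ri]) @ Vti[:ri])
    return J_blocks, A

# Synthetic example (made-up data): two blocks sharing one joint component,
# each with one individual component plus noise.
rng = np.random.default_rng(0)
n = 30
shared = rng.normal(size=(1, n))

def make_block(p):
    joint = rng.normal(size=(p, 1)) @ shared
    indiv = rng.normal(size=(p, 1)) @ rng.normal(size=(1, n))
    return joint + indiv + 0.1 * rng.normal(size=(p, n))

blocks = [make_block(20), make_block(15)]
J_blocks, A = jive_sketch(blocks, joint_rank=1, indiv_ranks=[1, 1])
residuals = [Xi - Ji - Ai for Xi, Ji, Ai in zip(blocks, J_blocks, A)]
```

Each block is thereby split into a joint term, an individual term, and a residual. In this sketch the rows of every individual term are orthogonal to the rows of the joint term, so the two kinds of structure do not overlap.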

Article information

Source
Ann. Appl. Stat., Volume 7, Number 1 (2013), 523-542.

Dates
First available in Project Euclid: 9 April 2013

Permanent link to this document
https://projecteuclid.org/euclid.aoas/1365527209

Digital Object Identifier
doi:10.1214/12-AOAS597

Mathematical Reviews number (MathSciNet)
MR3086429

Zentralblatt MATH identifier
06171282

Keywords
Data integration; multi-block data; principal component analysis; data fusion

Citation

Lock, Eric F.; Hoadley, Katherine A.; Marron, J. S.; Nobel, Andrew B. Joint and individual variation explained (JIVE) for integrated analysis of multiple data types. Ann. Appl. Stat. 7 (2013), no. 1, 523–542. doi:10.1214/12-AOAS597. https://projecteuclid.org/euclid.aoas/1365527209



References

  • Adourian, A., Jennings, E., Balasubramanian, R., Hines, W., Damian, D., Plasterer, T., Clish, C., Stroobant, P., McBurney, R., Verheij, E., Bobeldijk, I., Greef, J., Lindberg, J., Kenne, K., Andersson, U., Hellmold, H., Nilsson, K., Salter, H. and Schuppe-Koistinen, I. (2008). Correlation network analysis for data integration and biomarker selection. Molecular BioSystems 4 249–259.
  • Bekaert, G., Hodrick, R. and Zhang, X. (2009). International stock return comovements. J. Finance 64 2591–2626.
  • Bredel, M., Scholtens, D. M., Harsh, G. R., Bredel, C., Chandler, J. P., Renfrow, J. J., Yadav, A. K., Vogel, H., Scheck, A. C., Tibshirani, R. and Sikic, B. I. (2009). A network model of a cooperative genetic landscape in brain tumors. JAMA 302 261–275.
  • Cabanski, C. R., Qi, Y., Yin, X., Bair, E., Hayward, M. C., Fan, C., Li, J., Wilkerson, M. D., Marron, J. S., Perou, C. M. and Hayes, D. N. (2010). SWISS MADE: Standardized within class sum of squares to evaluate methodologies and dataset elements. PLoS ONE 5 e9905.
  • Cancer Genome Atlas Research Network (2008). Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature 455 1061–1068.
  • Candes, E., Li, X., Ma, Y. and Wright, J. (2009). Robust principal component analysis? Available at arXiv:0912.3599.
  • Di, C.-Z., Crainiceanu, C. M., Caffo, B. S. and Punjabi, N. M. (2009). Multilevel functional principal component analysis. Ann. Appl. Stat. 3 458–488.
  • Dweep, H., Sticht, C., Pandey, P. and Gretz, N. (2011). miRWalk-database: Prediction of possible miRNA binding sites by “walking” the genes of three genomes. J. Biomed. Inform. 44 839–847.
  • Fowler, A., Thompson, D., Giles, K., Maleki, S., Mreich, E., Wheeler, H., Leedman, P., Biggs, M., Cook, R., Little, N., Robinson, B. and McDonald, K. (2011). miR-124a is frequently down-regulated in glioblastoma and is involved in migration and invasion. European Journal of Cancer 47 953–963.
  • Galperin, M. and Cochrane, G. (2011). The 2011 nucleic acids research database issue and the online molecular biology database collection. Nucleic Acids Res. 39 D1–D6.
  • Gilad, Y., Rifkin, S. A. and Pritchard, J. K. (2008). Revealing the architecture of gene regulation: The promise of eQTL studies. Trends Genet. 24 408–415.
  • Gillan, L., Matei, D., Fishman, D., Gerbin, C., Karlan, B. and Chang, D. (2002). Periostin secreted by epithelial ovarian carcinoma is a ligand for alpha(V)beta(3) and alpha(V)beta(5) integrins and promotes cell motility. Cancer Research 62 5358–5364.
  • Hotelling, H. (1936). Relations between two sets of variates. Biometrika 28 321–377.
  • Lê Cao, K.-A., Rossouw, D., Robert-Granié, C. and Besse, P. (2008). A sparse PLS for variable selection when integrating omics data. Stat. Appl. Genet. Mol. Biol. 7 Art. 35, 31.
  • Lee, M., Shen, H., Huang, J. Z. and Marron, J. S. (2010). Biclustering via sparse singular value decomposition. Biometrics 66 1087–1095.
  • Lock, E., Hoadley, K., Marron, J. and Nobel, A. (2012). Supplement to “Joint and individual variation explained (JIVE) for integrated analysis of multiple data types”. DOI:10.1214/12-AOAS597SUPP.
  • Parkhomenko, E., Tritchler, D. and Beyene, J. (2009). Sparse canonical correlation analysis with application to genomic data integration. Stat. Appl. Genet. Mol. Biol. 8 Art. 1, 36.
  • Parkinson, H., Kapushesky, M., Kolesnikov, N., Rustici, G., Shojatalab, M., Abeygunawardena, N., Berube, H., Dylag, M., Emam, I., Farne, A., Holloway, E., Lukk, M., Malone, J., Mani, R., Pilicheva, E., Rayner, T., Rezwan, F., Sharma, A., Williams, E., Bradley, X., Adamusiak, T., Brandizi, M., Burdett, T., Coulson, R., Krestyaninova, M., Kurnosov, P., Maguire, E., Neogi, S., Rocca-Serra, P., Sansone, S., Sklyar, N., Zhao, M., Sarkans, U. and Brazma, A. (2009). ArrayExpress update—from an archive of functional genomics experiments to the atlas of gene expression. Nucleic Acids Res. 37 868–872.
  • Peter, M. E. (2010). Targeting of mRNAs by multiple miRNAs: The next step. Oncogene 29 2161–2164.
  • Rhead, B., Karolchik, D., Kuhn, R., Hinrichs, A., Zweig, A., Fujita, P., Diekhans, M., Smith, K., Rosenbloom, K., Raney, B., Pohl, A., Pheasant, M., Meyer, L., Learned, K., Hsu, F., Hillman-Jackson, J., Harte, R., Giardine, B., Dreszer, T., Clawson, H., Barber, G., Haussler, D. and Kent, W. (2010). The UCSC genome browser database: Update 2010. Nucleic Acids Res. 38 613–619.
  • Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statist. 6 461–464.
  • Shen, H. and Huang, J. Z. (2008). Sparse principal component analysis via regularized low rank matrix approximation. J. Multivariate Anal. 99 1015–1034.
  • Shen, R., Olshen, A. B. and Ladanyi, M. (2009). Integrative clustering of multiple genomic data types using a joint latent variable model with application to breast and lung cancer subtype analysis. Bioinformatics 25 2906–2912.
  • Sporns, O., Tononi, G. and Kötter, R. (2005). The human connectome: A structural description of the human brain. PLoS Comput. Biol. 1 e42.
  • Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B 58 267–288.
  • Trygg, J. and Wold, S. (2003). O2-PLS, a two-block (X-Y) latent variable regression (LVR) method with an integral OSC filter. Journal of Chemometrics 17 53–64.
  • Verhaak, R. G. W., Hoadley, K. A., Purdom, E., Wang, V., Qi, Y., Wilkerson, M. D., Miller, C. R., Ding, L., Golub, T., Mesirov, J. P., Alexe, G., Lawrence, M., O’Kelly, M., Tamayo, P., Weir, B. A., Gabriel, S., Winckler, W., Gupta, S., Jakkula, L., Feiler, H. S., Hodgson, J. G., James, C. D., Sarkaria, J. N., Brennan, C., Kahn, A., Spellman, P. T., Wilson, R. K., Speed, T. P., Gray, J. W., Meyerson, M., Getz, G., Perou, C. M., Hayes, D. N. and Cancer Genome Atlas Research Network (2010). Integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in PDGFRA, IDH1, EGFR, and NF1. Cancer Cell 17 98–110.
  • Westerhuis, J., Kourti, T. and MacGregor, J. (1998). Analysis of multiblock and hierarchical PCA and PLS models. Journal of Chemometrics 12 301–321.
  • Witten, D. M. and Tibshirani, R. J. (2009). Extensions of sparse canonical correlation analysis with applications to genomic data. Stat. Appl. Genet. Mol. Biol. 8 Art. 28, 29.
  • Wold, H. (1985). Partial Least Squares. In Encyclopedia of Statistical Sciences (Vol. 6) (S. Kotz and N. Johnson, eds.) 581–591. Wiley, New York.
  • Wold, S., Kettaneh, N. and Tjessem, K. (1996). Hierarchical multiblock PLS and PC models for easier model interpretation and as an alternative to variable selection. Journal of Chemometrics 10 463–482.
  • Zinn, P., Mahajan, B., Sathyan, P., Singh, K., Majumder, S., Jolesz, F. and Colen, R. (2011). Radiogenomic mapping of edema/cellular invasion MRI-phenotypes in glioblastoma multiforme. PLoS ONE 6 e25451.

Supplemental materials

  • Supplementary material: Additional Material. The supplementary article Lock et al. (2012) provides additional details and further validation of the JIVE method. This includes a proof concerning the existence and uniqueness of the decomposition; a description of the permutation approach to rank selection; pseudocode for the algorithm; a discussion of computing time and efficiency; a discussion of invariance properties; and results from the application of JIVE to many diverse simulated data sets.