The Annals of Applied Statistics

A Bayesian feature allocation model for tumor heterogeneity

Juhee Lee, Peter Müller, Kamalakar Gulukota, and Yuan Ji

Abstract

We develop a feature allocation model for inference on genetic tumor variation using next-generation sequencing data. Specifically, we record single nucleotide variants (SNVs) based on short reads mapped to the human reference genome and characterize tumor heterogeneity by latent haplotypes, defined as a scaffold of SNVs on the same homologous genome. For multiple samples from a single tumor, assuming that each sample is composed of some sample-specific proportions of these haplotypes, we then fit the observed variant allele fractions of SNVs for each sample and estimate the proportions of haplotypes. Varying proportions of haplotypes across samples are evidence of tumor heterogeneity, since they imply varying composition of cell subpopulations. Taking a Bayesian perspective, we proceed with a prior probability model for all relevant unknown quantities, including, in particular, a prior probability model on the binary indicators that characterize the latent haplotypes. Such prior models are known as feature allocation models. Specifically, we define a simplified version of the Indian buffet process, one of the most traditional feature allocation models. The proposed model allows overlapping clustering of SNVs in defining latent haplotypes, which reflects the evolutionary process of subclonal expansion in tumor samples.
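
The structure described in the abstract (binary indicators assigning SNVs to latent haplotypes, and sample-specific haplotype proportions that determine variant allele fractions) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and not taken from the paper: the function names, the finite beta-Bernoulli prior used as a stand-in for the simplified Indian buffet process, the binomial read-count model, and the toy numbers. The sketch also ignores normal-cell contamination and any background haplotype.

```python
import random


def draw_feature_matrix(num_snvs, num_haplotypes, alpha, rng):
    """Draw a binary SNV-by-haplotype matrix from a finite beta-Bernoulli prior.

    Assumption (not the paper's prior): pi_c ~ Beta(alpha / C, 1) and
    z[s][c] | pi_c ~ Bernoulli(pi_c), independently over SNVs s.
    """
    probs = [rng.betavariate(alpha / num_haplotypes, 1.0)
             for _ in range(num_haplotypes)]
    return [[1 if rng.random() < p else 0 for p in probs]
            for _ in range(num_snvs)]


def expected_vaf(z, weights):
    """Expected variant allele fraction for each SNV (rows) and sample (columns).

    weights[t][c] is the proportion of haplotype c in sample t; each row of
    weights is assumed to sum to 1. An SNV may belong to several haplotypes,
    which is what makes this a feature allocation rather than a clustering.
    """
    return [[sum(w_c * z_c for w_c, z_c in zip(w_t, z_s)) for w_t in weights]
            for z_s in z]


def simulate_counts(vaf, depth, rng):
    """Simulate variant-read counts as Binomial(depth, vaf) draws (assumed model)."""
    return [[sum(1 for _ in range(depth) if rng.random() < p) for p in row]
            for row in vaf]


# Toy example: 3 SNVs, 2 haplotypes, 2 samples with different compositions.
z = [[1, 0],   # SNV 0 is carried by haplotype 0 only
     [1, 1],   # SNV 1 is shared by both haplotypes (overlapping allocation)
     [0, 1]]   # SNV 2 is carried by haplotype 1 only
weights = [[0.75, 0.25],   # sample 0
           [0.25, 0.75]]   # sample 1

print(expected_vaf(z, weights))
# → [[0.75, 0.25], [1.0, 1.0], [0.25, 0.75]]

rng = random.Random(0)
print(draw_feature_matrix(num_snvs=5, num_haplotypes=3, alpha=1.0, rng=rng))
print(simulate_counts(expected_vaf(z, weights), depth=20, rng=rng))
```

In this toy example the two samples share the same haplotypes but in different proportions, so SNVs 0 and 2 have different expected allele fractions across samples, while the shared SNV 1 does not. Differences of that kind are what the abstract describes as evidence of heterogeneity.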

Article information

Source
Ann. Appl. Stat., Volume 9, Number 2 (2015), 621-639.

Dates
Received: July 2014
Revised: January 2015
First available in Project Euclid: 20 July 2015

Permanent link to this document
https://projecteuclid.org/euclid.aoas/1437397104

Digital Object Identifier
doi:10.1214/15-AOAS817

Mathematical Reviews number (MathSciNet)
MR3371328

Zentralblatt MATH identifier
06499923

Keywords
Haplotypes; feature allocation models; Indian buffet process; Markov chain Monte Carlo; next-generation sequencing; random binary matrices; variant calling

Citation

Lee, Juhee; Müller, Peter; Gulukota, Kamalakar; Ji, Yuan. A Bayesian feature allocation model for tumor heterogeneity. Ann. Appl. Stat. 9 (2015), no. 2, 621–639. doi:10.1214/15-AOAS817. https://projecteuclid.org/euclid.aoas/1437397104


