The Annals of Applied Statistics

TreeClone: Reconstruction of tumor subclone phylogeny based on mutation pairs using next generation sequencing data

Tianjian Zhou, Subhajit Sengupta, Peter Müller, and Yuan Ji



We present TreeClone, a latent feature allocation model to reconstruct tumor subclones subject to phylogenetic evolution that mimics tumor evolution. Similar to most current methods, we consider data from next-generation sequencing of tumor DNA. Unlike most methods that use information in short reads mapped to single nucleotide variants (SNVs), we consider subclone phylogeny reconstruction using pairs of two proximal SNVs that can be mapped by the same short reads. As part of the Bayesian inference model, we construct a phylogenetic tree prior. The use of the tree structure in the prior greatly strengthens inference. Only subclones that can be explained by a phylogenetic tree are assigned non-negligible probabilities. The proposed Bayesian framework implies posterior distributions on the number of subclones, their genotypes, cellular proportions and the phylogenetic tree spanned by the inferred subclones. The proposed method is validated against different sets of simulated and real-world data using single and multiple tumor samples. An open source software package is available at
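To make the notion of a mutation pair concrete, the following is a minimal illustrative sketch, not TreeClone's actual implementation. It assumes a hypothetical read format of `(start, sequence)` tuples with 0-based coordinates, and the function name `count_pair_haplotypes` and the toy reads are invented for illustration. It tallies, for each short read that spans both SNV loci, which of the four joint allele states the read carries; this kind of phased count is the information that is lost when each SNV is summarized on its own.

```python
from collections import Counter


def count_pair_haplotypes(reads, pos_a, pos_b, alt_a, alt_b):
    """Tally joint allele states for reads that cover both SNV loci.

    reads  : iterable of (start, sequence) tuples, 0-based start
             (a hypothetical, simplified read format).
    pos_a, pos_b : 0-based genomic positions of the two SNVs.
    alt_a, alt_b : variant (alternate) base at each locus.

    Returns a Counter keyed by (a, b), where 1 means the read carries
    the variant allele at that locus and 0 means it does not.
    """
    counts = Counter()
    for start, seq in reads:
        end = start + len(seq)
        # Keep only reads that span both loci, so the two alleles are phased.
        if not (start <= pos_a < end and start <= pos_b < end):
            continue
        a = int(seq[pos_a - start] == alt_a)
        b = int(seq[pos_b - start] == alt_b)
        counts[(a, b)] += 1
    return counts


# Toy data (assumed): SNV A at position 3 (variant G), SNV B at position 6 (variant A).
reads = [
    (0, "ACGGACAT"),  # carries both variants
    (2, "GTACGTA"),   # carries neither variant
    (1, "CGGACGT"),   # carries variant A only
    (0, "ACGG"),      # covers only SNV A, so it is skipped
    (3, "GACATAC"),   # carries both variants
]
counts = count_pair_haplotypes(reads, pos_a=3, pos_b=6, alt_a="G", alt_b="A")
print(dict(counts))
```

Marginal counts alone would report three variant reads at locus A and two at locus B; the paired tally additionally shows that every read carrying variant B also carries variant A, which is the kind of constraint a tree-structured model can exploit.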

Article information

Ann. Appl. Stat., Volume 13, Number 2 (2019), 874-899.

Received: October 2017
Revised: August 2018
First available in Project Euclid: 17 June 2019


Keywords: Latent feature model, mutation pair, NGS data, phylogenetic tree, subclone, tumor heterogeneity


Zhou, Tianjian; Sengupta, Subhajit; Müller, Peter; Ji, Yuan. TreeClone: Reconstruction of tumor subclone phylogeny based on mutation pairs using next generation sequencing data. Ann. Appl. Stat. 13 (2019), no. 2, 874–899. doi:10.1214/18-AOAS1224.




Supplemental materials

  • Supplement to “TreeClone: Reconstruction of Tumor Subclone Phylogeny Based on Mutation Pairs using Next Generation Sequencing Data”. We provide the R package TreeClone, a glossary of biological terms and the supplementary details referenced in the main text.