The Annals of Applied Statistics

Inferring rooted population trees using asymmetric neighbor joining

Yongliang Zhai and Alexandre Bouchard-Côté


We introduce a new inference method to estimate evolutionary distances from any two populations to their most recent common ancestral population using single-nucleotide polymorphism allele frequencies. Our model takes fixation into consideration, making it nonreversible, and guarantees that the distribution of reconstructed ancestral frequencies is contained in the interval $[0,1]$. To scale this method to large numbers of populations, we introduce the asymmetric neighbor joining algorithm, an efficient method for reconstructing rooted bifurcating nonclock trees. Asymmetric neighbor joining provides a scalable rooting method applicable to any nonreversible evolutionary modeling setup. We explore the statistical properties of asymmetric neighbor joining and demonstrate its accuracy on synthetic data. We validate our method by reconstructing rooted phylogenetic trees from the Human Genome Diversity Panel data. Our results are obtained without using an outgroup and are consistent with the prevalent recent single-origin model.

Article information

Ann. Appl. Stat., Volume 10, Number 4 (2016), 2047-2074.

Received: November 2015
Revised: June 2016
First available in Project Euclid: 5 January 2017

Keywords: asymmetric neighbor-joining algorithm; fixation and drift; phylogenetics; population histories; rooted tree inference; single-nucleotide polymorphism


Zhai, Yongliang; Bouchard-Côté, Alexandre. Inferring rooted population trees using asymmetric neighbor joining. Ann. Appl. Stat. 10 (2016), no. 4, 2047–2074. doi:10.1214/16-AOAS964.


  • Balding, D. J. and Nichols, R. A. (1995). A method for quantifying differentiation between populations at multi-allelic loci and its implications for investigating identity and paternity. Genetica 96 3–12.
  • Battistuzzi, F. U., Filipski, A., Hedges, S. B. and Kumar, S. (2010). Performance of relaxed-clock methods in estimating evolutionary divergence times and their credibility intervals. Mol. Biol. Evol. 27 1289–1300.
  • Benner, P., Bačák, M. and Bourguignon, P.-Y. (2014). Point estimates in phylogenetic reconstructions. Bioinformatics 30 i534–i540.
  • Billera, L. J., Holmes, S. P. and Vogtmann, K. (2001). Geometry of the space of phylogenetic trees. Adv. in Appl. Math. 27 733–767.
  • Bryant, D., Bouckaert, R., Felsenstein, J., Rosenberg, N. A. and RoyChoudhury, A. (2012). Inferring species trees directly from biallelic genetic markers: Bypassing gene trees in a full coalescent analysis. Mol. Biol. Evol. 29 1917–1932.
  • Cann, H. M., de Toma, C., Cazes, L., Legrand, M.-F., Morel, V., Piouffre, L., Bodmer, J., Bodmer, W. F., Bonne-Tamir, B., Cambon-Thomsen, A. et al. (2002). A human genome diversity cell line panel. Science 296 261–262.
  • Cavalli-Sforza, L. L. and Feldman, M. W. (2003). The application of molecular genetic approaches to the study of human evolution. Nat. Genet. 33 Suppl 266–275.
  • Chakerian, J. and Holmes, S. (2012). Computational tools for evaluating phylogenetic and hierarchical clustering trees. J. Comput. Graph. Statist. 21 581–599.
  • Edwards, A. and Cavalli-Sforza, L. (1964). Reconstruction of evolutionary trees. Systematics Association Publ. 6 67–76.
  • Ewens, W. J. (1973). Conditional diffusion processes in population genetics. Theor. Popul. Biol. 4 21–30.
  • Felsenstein, J. (1973). Maximum-likelihood estimation of evolutionary trees from continuous characters. Am. J. Hum. Genet. 25 471–492.
  • Felsenstein, J. (1981). Evolutionary trees from gene frequencies and quantitative characters: Finding maximum likelihood estimates. Evolution 35 1229–1242.
  • Felsenstein, J. (1983). Statistical inference of phylogenies. J. R. Stat. Soc., A 146 246–272.
  • Felsenstein, J. (1989). PHYLIP—Phylogeny inference package (Version 3.2). Cladistics 5 164–166.
  • Felsenstein, J. (2004). Inferring Phylogenies. Sinauer, Sunderland, Massachusetts.
  • Gascuel, O. (1997). Concerning the NJ algorithm and its unweighted version, UNJ. In Mathematical Hierarchies and Biology (Piscataway, NJ, 1996). DIMACS Ser. Discrete Math. Theoret. Comput. Sci. 37 149–170. AMS, Providence, RI.
  • Gray, R. D. and Atkinson, Q. D. (2003). Language-tree divergence times support the Anatolian theory of Indo-European origin. Nature 426 435–439.
  • Gray, R. D., Drummond, A. J. and Greenhill, S. J. (2009). Language phylogenies reveal expansion pulses and pauses in Pacific settlement. Science 323 479–483.
  • Hasegawa, M., Kishino, H. and Yano, T.-A. (1985). Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22 160–174.
  • Hernandez, R. D., Kelley, J. L., Elyashiv, E., Melton, S. C., Auton, A., McVean, G., Sella, G., Przeworski, M. et al. (2011). Classic selective sweeps were rare in recent human evolution. Science 331 920–924.
  • Huelsenbeck, J. P., Bollback, J. P. and Levine, A. M. (2002). Inferring the root of a phylogenetic tree. Syst. Biol. 51 32–43.
  • Huelsenbeck, J. P., Ronquist, F., Nielsen, R. and Bollback, J. P. (2001). Bayesian inference of phylogeny and its impact on evolutionary biology. Science 294 2310–2314.
  • Iwabe, N., Kuma, K-i., Hasegawa, M., Osawa, S. and Miyata, T. (1989). Evolutionary relationship of archaebacteria, eubacteria, and eukaryotes inferred from phylogenetic trees of duplicated genes. Proc. Natl. Acad. Sci. USA 86 9355–9359.
  • Jenkins, P. A. and Spano, D. (2015). Exact simulation of the Wright–Fisher diffusion. preprint. Available at arXiv:1506.06998.
  • Kuhner, M. K. and Felsenstein, J. (1994). A simulation comparison of phylogeny algorithms under equal and unequal evolutionary rates. Mol. Biol. Evol. 11 459–468.
  • Li, S., Pearl, D. K. and Doss, H. (2000). Phylogenetic tree construction using Markov chain Monte Carlo. J. Amer. Statist. Assoc. 95 493–508.
  • Li, J. Z., Absher, D. M., Tang, H., Southwick, A. M., Casto, A. M., Ramachandran, S., Cann, H. M., Barsh, G. S., Feldman, M., Cavalli-Sforza, L. L. and Myers, R. M. (2008). Worldwide human relationships inferred from genome-wide patterns of variation. Science 319 1100–1104.
  • Lipo, C. P. (2006). Mapping Our Ancestors: Phylogenetic Approaches in Anthropology and Prehistory. Transaction Publishers. New Brunswick and London.
  • Mau, B., Newton, M. A. and Larget, B. (1999). Bayesian phylogenetic inference via Markov chain Monte Carlo methods. Biometrics 55 1–12.
  • Nei, M. (1972). Genetic distance between populations. Amer. Nat. 106 283–292.
  • Nichols, J. and Warnow, T. (2008). Tutorial on computational linguistic phylogeny. Language and Linguistics Compass 2 760–820.
  • Nicholson, G., Smith, A. V., Jónsson, F., Gústafsson, O., Stefánsson, K. and Donnelly, P. (2002). Assessing population differentiation and isolation from single-nucleotide polymorphism data. J. R. Stat. Soc. Ser. B. Stat. Methodol. 64 695–715.
  • Outlaw, D. C. and Ricklefs, R. E. (2011). Rerooting the evolutionary tree of malaria parasites. Proc. Natl. Acad. Sci. USA 108 13183–13187.
  • Owen, M. and Provan, J. S. (2011). A fast algorithm for computing geodesic distances in tree space. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB) 8 2–13.
  • Paradis, E. (2012). Analysis of Phylogenetics and Evolution with R, 2nd ed. Springer, New York.
  • Paradis, E., Claude, J. and Strimmer, K. (2004). Ape: Analyses of phylogenetics and evolution in R language. Bioinformatics 20 289–290.
  • Pearson, T., Hornstra, H. M., Sahl, J. W., Schaack, S., Schupp, J. M., Beckstrom-Sternberg, S. M., O’Neill, M. W., Priestley, R. A., Champion, M. D., Beckstrom-Sternberg, J. S., Kersh, G. J., Samuel, J. E., Massung, R. F. and Keim, P. (2013). When outgroups fail; phylogenomics of rooting the emerging pathogen, Coxiella burnetii. Syst. Biol. 62 752–762.
  • Penny, D. and Hendy, M. (1985). The use of tree comparison metrics. Syst. Zool. 34 75–82.
  • Pickrell, J. K. and Pritchard, J. K. (2012). Inference of population splits and mixtures from genome-wide allele frequency data. PLoS Genet. 8 e1002967.
  • Pickrell, J. K., Patterson, N., Barbieri, C., Berthold, F., Gerlach, L., Güldemann, T., Kure, B., Mpoloka, S. W., Nakagawa, H., Naumann, C. et al. (2012). The genetic prehistory of southern Africa. Nature Communications 3 1–6.
  • Revell, L. J. (2012). Phytools: An R package for phylogenetic comparative biology (and other things). Methods in Ecology and Evolution 3 217–223.
  • Roch, S. (2006). A short proof that phylogenetic tree reconstruction by maximum likelihood is hard. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB) 3 92.
  • RoyChoudhury, A., Felsenstein, J. and Thompson, E. A. (2008). A two-stage pruning algorithm for likelihood computation for a population tree. Genetics 180 1095–1105.
  • Saitou, N. and Nei, M. (1987). The neighbor-joining method: A new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4 406–425.
  • Semple, C. and Steel, M. (2003). Phylogenetics. Oxford Lecture Series in Mathematics and Its Applications 24. Oxford Univ. Press, Oxford.
  • Sirén, J., Hanage, W. P. and Corander, J. (2013). Inference on population histories by approximating infinite alleles diffusion. Mol. Biol. Evol. 30 457–468.
  • Sirén, J., Marttinen, P. and Corander, J. (2011). Reconstructing population histories from single nucleotide polymorphism data. Mol. Biol. Evol. 28 673–683.
  • Smeulders, M. J., Barends, T. R. M., Pol, A., Scherer, A., Zandvoort, M. H., Udvarhelyi, A., Khadem, A. F., Menzel, A., Hermans, J., Shoeman, R. L. et al. (2011). Evolution of a new enzyme for carbon disulphide conversion by an acidothermophilic archaeon. Nature 478 412–416.
  • Song, Y. S. and Steinrücken, M. (2012). A simple method for finding explicit analytic transition densities of diffusion processes with general diploid selection. Genetics 190 1117–1129.
  • Sukumaran, J. and Holder, M. T. (2010). DendroPy: A Python library for phylogenetic computing. Bioinformatics 26 1569–1571.
  • Swofford, D. L., Olsen, G. J., Waddell, P. J. and Hillis, D. M. (1996). Phylogenetic inference. In Molecular Systematics (D. M. Hillis and C. Moritz, eds.) 407–514. Sinauer Associates, Sunderland.
  • Tavaré, S. (1986). Some probabilistic and statistical problems in the analysis of DNA sequences. In Some Mathematical Questions in Biology—DNA Sequence Analysis (New York, 1984). Lectures Math. Life Sci. 17 57–86. AMS, Providence, RI.
  • Wang, L., Bouchard-Côté, A. and Doucet, A. (2015). Bayesian phylogenetic inference using a combinatorial sequential Monte Carlo method. J. Amer. Statist. Assoc. 110 1362–1374.
  • Weir, B. S. and Cockerham, C. C. (1984). Estimating F-statistics for the analysis of population structure. Evolution 38 1358–1370.
  • Wheeler, W. C. (1990). Nucleic acid sequence phylogeny and random outgroups. Cladistics 6 363–367.
  • Yang, Z., Goldman, N. and Friday, A. (1995). Maximum likelihood trees from DNA sequences: A peculiar statistical estimation problem. Syst. Biol. 44 384–399.
  • Yang, Z. and Rannala, B. (1997). Bayesian phylogenetic inference using DNA sequences: A Markov chain Monte Carlo method. Mol. Biol. Evol. 14 717–724.
  • Zhai, Y. and Bouchard-Côté, A. (2016). Supplement to “Inferring rooted population trees using asymmetric neighbor joining.” DOI:10.1214/16-AOAS964SUPP.
  • Zharkikh, A. and Li, W. H. (1995). Estimation of confidence in phylogeny: The complete-and-partial bootstrap technique. Mol. Phylogenet. Evol. 4 44–63.

Supplemental materials

  • Supplement to: “Inferring rooted population trees using asymmetric neighbor joining”. We provide additional simulation studies and proofs on the properties of the algorithms in the supplementary material [Zhai and Bouchard-Côté (2016)].