The perennial problem of “how many clusters?” remains of substantial interest in the data mining and machine learning communities. It becomes particularly salient in large data sets such as population genomic data, where the number of clusters needs to be relatively large and open-ended. The problem is further complicated in a co-clustering scenario, in which multiple clustering problems must be solved simultaneously because common centroids (e.g., ancestors) are shared by clusters (e.g., possible descendants of a certain ancestor) from different multiple-cluster samples (e.g., different human subpopulations). In this paper we present a hierarchical nonparametric Bayesian model to address this problem in the context of multi-population haplotype inference.
Uncovering the haplotypes of single nucleotide polymorphisms (SNPs) is essential for many biological and medical applications. While it is not uncommon for genotype data to be pooled from multiple ethnically distinct populations, few existing programs have explicitly leveraged individual ethnic information for haplotype inference. In this paper we present a new haplotype inference program, Haploi, which makes use of such information and is readily applicable to genotype sequences with thousands of SNPs from heterogeneous populations, with speed and accuracy that are competitive with, and sometimes superior to, those of state-of-the-art programs. Underlying Haploi is a new haplotype distribution model based on a nonparametric Bayesian formalism known as the hierarchical Dirichlet process, which represents a tractable surrogate to the coalescent process. The proposed model is exchangeable, unbounded, and capable of coupling demographic information of different populations. It offers a well-founded statistical framework for posterior inference of individual haplotypes, the size and configuration of haplotype ancestor pools, and other parameters of interest given genotype data.
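To make the "unbounded" and "coupling" properties concrete, the following is a minimal sketch of drawing cluster labels from a Chinese restaurant franchise, a standard constructive view of the hierarchical Dirichlet process prior. It is an illustration only and not the Haploi implementation: it ignores genotype data, mutation, and posterior inference entirely. The function name `sample_crf` and the parameters `alpha`, `gamma`, and `seed` are assumptions introduced here. Groups stand in for populations, items for individual haplotypes, and global cluster ids for shared ancestors.

```python
import random


def sample_crf(group_sizes, alpha=1.0, gamma=1.0, seed=0):
    """Draw global cluster ids for each group from a Chinese restaurant franchise.

    group_sizes: number of items (e.g., haplotypes) in each group (e.g., population).
    alpha: within-group concentration (hypothetical name); larger means more tables.
    gamma: top-level concentration (hypothetical name); larger means more clusters.
    Returns one list of global cluster ids per group.
    """
    rng = random.Random(seed)
    global_counts = []  # global_counts[k] = tables, over all groups, serving cluster k
    assignments = []
    for n in group_sizes:
        table_counts = []   # number of items seated at each table of this group
        table_cluster = []  # global cluster id served at each table of this group
        labels = []
        for _ in range(n):
            # Join an existing table with weight equal to its size,
            # or open a new table with weight alpha.
            weights = table_counts + [alpha]
            t = rng.choices(range(len(weights)), weights=weights)[0]
            if t == len(table_counts):
                # A new table draws its cluster from the urn shared by all groups:
                # an existing cluster with weight equal to its table count,
                # or a brand-new cluster with weight gamma.
                top_weights = global_counts + [gamma]
                k = rng.choices(range(len(top_weights)), weights=top_weights)[0]
                if k == len(global_counts):
                    global_counts.append(0)
                global_counts[k] += 1
                table_counts.append(0)
                table_cluster.append(k)
            table_counts[t] += 1
            labels.append(table_cluster[t])
        assignments.append(labels)
    return assignments


if __name__ == "__main__":
    draws = sample_crf([10, 10, 10], alpha=1.0, gamma=2.0, seed=42)
    for j, labels in enumerate(draws):
        print(f"group {j}: {labels}")
```

In this sketch the number of clusters is not fixed in advance and grows with the data, and because every group draws new tables from the same top-level urn, a cluster id that appears in one group can reappear in another. That sharing is the sense in which ancestors are common to several populations.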