The Annals of Applied Statistics

A sticky HDP-HMM with application to speaker diarization

Emily B. Fox, Erik B. Sudderth, Michael I. Jordan, and Alan S. Willsky

Full-text: Open access


We consider speaker diarization: the problem of segmenting an audio recording of a meeting into temporal segments corresponding to individual speakers. The problem is particularly difficult because the number of people participating in the meeting cannot be assumed known. To address this, we take a Bayesian nonparametric approach that builds on the hierarchical Dirichlet process hidden Markov model (HDP-HMM) of Teh et al. [J. Amer. Statist. Assoc. 101 (2006) 1566–1581]. The basic HDP-HMM tends to over-segment the audio data, creating redundant states and rapidly switching among them; we describe an augmented HDP-HMM that provides effective control over the switching rate. We also show that this augmentation makes it possible to treat emission distributions nonparametrically. To scale the resulting architecture to realistic diarization problems, we develop a sampling algorithm that employs a truncated approximation of the Dirichlet process to jointly resample the full state sequence, greatly improving mixing rates. Working with a benchmark NIST data set, we show that our Bayesian nonparametric architecture yields state-of-the-art speaker diarization results.
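As a rough illustration of how an augmented transition prior can control the switching rate, the Python sketch below draws transition distributions from a finite (truncated) Dirichlet approximation with an added self-transition bias. The parameter names `gamma`, `alpha`, `kappa` and the truncation level `L` follow the common formulation of the sticky HDP-HMM and are assumptions here: the abstract does not define them, and this is not the authors' implementation.

```python
import random


def sample_dirichlet(concentrations, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in concentrations]
    total = sum(draws)
    return [d / total for d in draws]


def sample_sticky_transitions(L, gamma, alpha, kappa, rng):
    """Sample global weights and L transition rows under a truncation to L states.

    Assumed model (not stated in the abstract):
      beta ~ Dirichlet(gamma / L, ..., gamma / L)
      pi_j ~ Dirichlet(alpha * beta + kappa * e_j), with e_j the j-th unit vector.
    """
    beta = sample_dirichlet([gamma / L] * L, rng)
    pi = []
    for j in range(L):
        # kappa adds prior mass to the self-transition entry of row j only.
        # The small floor keeps every Gamma shape parameter strictly positive.
        conc = [
            max(alpha * beta[k] + (kappa if k == j else 0.0), 1e-12)
            for k in range(L)
        ]
        pi.append(sample_dirichlet(conc, rng))
    return beta, pi
```

With `kappa = 0` the sketch reduces to the non-sticky truncated prior; larger values of `kappa` place more prior mass on self-transitions, which is one way to discourage rapid switching among states.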

Article information

Ann. Appl. Stat., Volume 5, Number 2A (2011), 1020-1056.

First available in Project Euclid: 13 July 2011


Keywords

Bayesian nonparametrics; hierarchical Dirichlet processes; hidden Markov models; speaker diarization


Fox, Emily B.; Sudderth, Erik B.; Jordan, Michael I.; Willsky, Alan S. A sticky HDP-HMM with application to speaker diarization. Ann. Appl. Stat. 5 (2011), no. 2A, 1020--1056. doi:10.1214/10-AOAS395.

References


  • Barras, C., Zhu, X., Meignier, S. and Gauvain, J.-L. (2004). Improving speaker diarization. In Proc. Fall 2004 Rich Transcription Workshop (RT-04), November 2004.
  • Beal, M. J. and Krishnamurthy, P. (2006). Gene expression time course clustering with countably infinite hidden Markov models. In Proc. Conference on Uncertainty in Artificial Intelligence, Cambridge, MA.
  • Beal, M. J., Ghahramani, Z. and Rasmussen, C. E. (2002). The infinite hidden Markov model. In Advances in Neural Information Processing Systems 14 577–584. MIT Press, Cambridge, MA.
  • Blackwell, D. and MacQueen, J. B. (1973). Ferguson distributions via Pólya urn schemes. Ann. Statist. 1 353–355.
  • Chen, S. S. and Gopalakrishnan, P. S. (1998). Speaker, environment and channel change detection and clustering via the Bayesian information criterion. In Proc. DARPA Broadcast News Transcription and Understanding Workshop 127–132. Morgan Kaufmann, San Francisco, CA.
  • Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. Ann. Statist. 1 209–230.
  • Fox, E. B., Sudderth, E. B., Jordan, M. I. and Willsky, A. S. (2008). An HDP-HMM for systems with state persistence. In Proc. International Conference on Machine Learning, Helsinki, Finland, July 2008.
  • Fox, E. B., Sudderth, E. B., Jordan, M. I. and Willsky, A. S. (2009). Nonparametric Bayesian learning of switching dynamical systems. In Advances in Neural Information Processing Systems 21 457–464.
  • Fox, E. B., Sudderth, E. B., Jordan, M. I. and Willsky, A. S. (2010). Supplement to “A sticky HDP-HMM with application to speaker diarization.” DOI: 10.1214/10-AOAS395SUPP.
  • Gales, M. and Young, S. (2007). The application of hidden Markov models in speech recognition. Foundations and Trends in Signal Processing 1 195–304.
  • Gauvain, J.-L., Lamel, L. and Adda, G. (1998). Partitioning and transcription of broadcast news data. In Proc. International Conference on Spoken Language Processing, Sydney, Australia 1335–1338.
  • Hoffman, M., Cook, P. and Blei, D. (2008). Data-driven recomposition using the hierarchical Dirichlet process hidden Markov model. In Proc. International Computer Music Conference, Belfast, UK.
  • Ishwaran, H. and Zarepour, M. (2000). Markov chain Monte Carlo in approximate Dirichlet and beta two-parameter process hierarchical models. Biometrika 87 371–390.
  • Ishwaran, H. and Zarepour, M. (2002a). Dirichlet prior sieves in finite normal mixtures. Statist. Sinica 12 941–963.
  • Ishwaran, H. and Zarepour, M. (2002b). Exact and approximate sum-representations for the Dirichlet process. Canad. J. Statist. 30 269–283.
  • Jain, S. and Neal, R. M. (2004). A split-merge Markov chain Monte Carlo procedure for the Dirichlet process mixture model. J. Comput. Graph. Statist. 13 158–182.
  • Jasra, A., Holmes, C. C. and Stephens, D. A. (2005). Markov chain Monte Carlo methods and the label switching problem in Bayesian mixture modeling. Statist. Sci. 20 50–67.
  • Johnson, M. (2007). Why doesn’t EM find good HMM POS-taggers? In Proc. Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Prague, Czech Republic.
  • Kivinen, J. J., Sudderth, E. B. and Jordan, M. I. (2007). Learning multiscale representations of natural scenes using Dirichlet processes. In Proc. International Conference on Computer Vision, Rio de Janeiro, Brazil 1–8.
  • Kurihara, K., Welling, M. and Teh, Y. W. (2007). Collapsed variational Dirichlet process mixture models. In Proc. International Joint Conferences on Artificial Intelligence, Hyderabad, India.
  • Meignier, S., Bonastre, J.-F., Fredouille, C. and Merlin, T. (2000). Evolutive HMM for multi-speaker tracking system. In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Istanbul, Turkey, June 2000.
  • Meignier, S., Bonastre, J.-F. and Igounet, S. (2001). E-HMM approach for learning and adapting sound models for speaker indexing. In Proc. Odyssey Speaker Language Recognition Workshop, June 2001.
  • Munkres, J. (1957). Algorithms for the assignment and transportation problems. J. Soc. Industr. Appl. Math. 5 32–38.
  • NIST. Rich transcriptions database. Available at, 2007.
  • Papaspiliopoulos, O. and Roberts, G. O. (2008). Retrospective Markov chain Monte Carlo methods for Dirichlet process hierarchical models. Biometrika 95 169–186.
  • Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77 257–286.
  • Reynolds, D. A. and Torres-Carrasquillo, P. A. (2004). The MIT Lincoln Laboratory RT-04F diarization systems: Applications to broadcast news and telephone conversations. In Proc. Fall 2004 Rich Transcription Workshop (RT-04), November 2004.
  • Robert, C. P. (2007). The Bayesian Choice. Springer, New York.
  • Rodriguez, A., Dunson, D. B. and Gelfand, A. E. (2008). The nested Dirichlet process. J. Amer. Statist. Assoc. 103 1131–1154.
  • Scott, S. L. (2002). Bayesian methods for hidden Markov models: Recursive computing in the 21st century. J. Amer. Statist. Assoc. 97 337–351.
  • Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statist. Sinica 4 639–650.
  • Siegler, M., Jain, U., Raj, B. and Stern, R. M. (1997). Automatic segmentation, classification and clustering of broadcast news audio. In Proc. DARPA Speech Recognition Workshop 97–99. Morgan Kaufmann, San Francisco, CA.
  • Teh, Y. W., Jordan, M. I., Beal, M. J. and Blei, D. M. (2006). Hierarchical Dirichlet processes. J. Amer. Statist. Assoc. 101 1566–1581.
  • Tranter, S. E. and Reynolds, D. A. (2006). An overview of automatic speaker diarization systems. IEEE Trans. Audio, Speech Language Process. 14 1557–1565.
  • Van Gael, J., Saatci, Y., Teh, Y. W. and Ghahramani, Z. (2008). Beam sampling for the infinite hidden Markov model. In Proc. International Conference on Machine Learning, Helsinki, Finland, July 2008.
  • Walker, S. G. (2007). Sampling the Dirichlet mixture model with slices. Commun. Statist. Simul. Comput. 36 45–54.
  • Wooters, C. and Huijbregts, M. (2007). The ICSI RT07s speaker diarization system. Lecture Notes in Computer Science 4625 509–519.
  • Wooters, C., Fung, J., Peskin, B. and Anguera, X. (2004). Towards robust speaker segmentation: The ICSI-SRI Fall 2004 diarization system. In Proc. Fall 2004 Rich Transcription Workshop (RT-04), November 2004.
  • Xing, E. P. and Sohn, K.-A. (2007). Hidden Markov Dirichlet process: Modeling genetic inference in open ancestral space. Bayesian Anal. 2 501–528.

Supplemental materials

  • Supplementary material: Notational conventions, Chinese restaurant franchises and derivations of Gibbs samplers. We present detailed derivations of the conditional distributions used for both the direct assignment and blocked Gibbs samplers, as well as the associated pseudo-code. The description of these derivations relies on the Chinese restaurant analogies associated with the HDP and sticky HDP-HMM, which are expounded upon in this supplementary material. We also provide a list of notational conventions used throughout the paper.
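The samplers' pseudo-code is in the supplement and is not reproduced on this page. As a minimal sketch of the general idea of jointly resampling a full state sequence once the model is truncated to L states, the Python below runs backward message passing followed by forward sampling for a finite HMM. The function name `sample_state_sequence` and its argument layout are hypothetical, chosen for illustration only.

```python
import random


def sample_state_sequence(pi0, pi, likelihoods, rng):
    """Jointly sample z_1..z_T for a finite HMM.

    pi0: initial state distribution (length L).
    pi: L x L transition matrix, pi[j][k] = p(z_t = k | z_{t-1} = j).
    likelihoods: T x L table, likelihoods[t][k] = p(y_t | z_t = k).
    """
    T, L = len(likelihoods), len(pi0)

    # Backward pass: messages[t][j] is proportional to p(y_{t+1..T} | z_t = j).
    # Each message is renormalized to avoid numerical underflow on long sequences.
    messages = [[1.0] * L for _ in range(T)]
    for t in range(T - 2, -1, -1):
        raw = [
            sum(pi[j][k] * likelihoods[t + 1][k] * messages[t + 1][k]
                for k in range(L))
            for j in range(L)
        ]
        norm = sum(raw)
        messages[t] = [r / norm for r in raw]

    # Forward pass: sample each state given the previous sampled state,
    # the current observation likelihood, and the backward message.
    states = []
    prev = None
    for t in range(T):
        prior = pi0 if prev is None else pi[prev]
        weights = [prior[k] * likelihoods[t][k] * messages[t][k]
                   for k in range(L)]
        prev = rng.choices(range(L), weights=weights)[0]
        states.append(prev)
    return states
```

Sampling the whole sequence in one sweep, rather than one time step at a time, is what allows a blocked sampler of this kind to move between segmentations quickly.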