The Annals of Statistics

Convergence of contrastive divergence algorithm in exponential family

Bai Jiang, Tung-Yu Wu, Yifan Jin, and Wing H. Wong

Full-text: Open access


The Contrastive Divergence (CD) algorithm has achieved notable success in training energy-based models including Restricted Boltzmann Machines and played a key role in the emergence of deep learning. The idea of this algorithm is to approximate the intractable term in the exact gradient of the log-likelihood function by using short Markov chain Monte Carlo (MCMC) runs. The approximate gradient is computationally cheap but biased. Whether and why the CD algorithm provides an asymptotically consistent estimate are still open questions. This paper studies the asymptotic properties of the CD algorithm in canonical exponential families, which are special cases of the energy-based model. Suppose the CD algorithm runs $m$ MCMC transition steps at each iteration $t$ and iteratively generates a sequence of parameter estimates $\{\theta_{t}\}_{t\ge 0}$ given an i.i.d. data sample $\{X_{i}\}_{i=1}^{n}\sim p_{\theta_{\star }}$. Under conditions that are commonly satisfied by the CD algorithm in practice, we prove the existence of some bounded $m$ such that any limit point of the time average $\sum_{s=0}^{t-1}\theta_{s}/t$ as $t\to\infty $ is a consistent estimate for the true parameter $\theta_{\star }$. Our proof is based on the fact that $\{\theta_{t}\}_{t\ge 0}$ is a homogeneous Markov chain conditional on the data sample $\{X_{i}\}_{i=1}^{n}$. This chain meets the Foster–Lyapunov drift criterion and converges to a random walk around the maximum likelihood estimate. The range of the random walk shrinks to zero at rate $\mathcal{O}(1/\sqrt[3]{{n}})$ as the sample size $n\to \infty $.
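As a rough illustration of the procedure described in the abstract, the sketch below runs CD on a toy canonical exponential family. It is not the authors' implementation, and every concrete choice is an assumption made for illustration: the model is the Gaussian location family $p_{\theta}(x)\propto\exp(\theta x-x^{2}/2)$ with sufficient statistic $\phi(x)=x$, the MCMC kernel is random-walk Metropolis, and the learning rate, proposal scale, sample size and function names are arbitrary. At each iteration the intractable expectation of $\phi$ under $p_{\theta_{t}}$ is replaced by the average of $\phi$ over chains started at the data points and run for $m$ steps, and the quantity reported is the time average $\sum_{s=0}^{t-1}\theta_{s}/t$.

```python
import numpy as np


def log_density(x, theta):
    # Unnormalised log-density of the toy family: theta * x - x^2 / 2,
    # i.e. N(theta, 1). This model is an assumption made for illustration.
    return theta * x - 0.5 * x**2


def mcmc_steps(x, theta, m, rng, step=1.0):
    """Run m random-walk Metropolis steps targeting p_theta, one chain per entry of x."""
    for _ in range(m):
        proposal = x + step * rng.standard_normal(x.shape)
        log_accept = log_density(proposal, theta) - log_density(x, theta)
        accept = np.log(rng.random(x.shape)) < log_accept
        x = np.where(accept, proposal, x)
    return x


def contrastive_divergence(data, m=1, lr=0.1, n_iter=2000, theta0=0.0, seed=0):
    """Return the sequence theta_0, ..., theta_{n_iter-1} produced by CD-m."""
    rng = np.random.default_rng(seed)
    theta = theta0
    thetas = np.empty(n_iter)
    data_stat = data.mean()  # average of phi(X_i) = X_i over the data sample
    for t in range(n_iter):
        thetas[t] = theta
        # Chains start at the data points and run only m steps under p_theta.
        samples = mcmc_steps(data, theta, m, rng)
        # Biased gradient estimate: data statistic minus short-run MCMC statistic.
        theta = theta + lr * (data_stat - samples.mean())
    return thetas


rng = np.random.default_rng(42)
theta_star = 1.5  # hypothetical true parameter
data = theta_star + rng.standard_normal(500)

thetas = contrastive_divergence(data, m=1)
time_average = thetas.mean()  # sum_{s=0}^{t-1} theta_s / t
mle = data.mean()             # maximum likelihood estimate in this toy model
print(f"time average: {time_average:.3f}, MLE: {mle:.3f}, truth: {theta_star}")
```

In this toy setting the iterates typically settle into a small fluctuation near the sample mean, which is the maximum likelihood estimate here. The example only illustrates the mechanics; it does not verify the paper's conditions or its $\mathcal{O}(1/\sqrt[3]{{n}})$ rate.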

Article information

Ann. Statist., Volume 46, Number 6A (2018), 3067-3098.

Received: July 2016
Revised: September 2017
First available in Project Euclid: 7 September 2018

Primary: 68W99: None of the above, but in this section; 62F12: Asymptotic properties of estimators
Secondary: 60J20: Applications of Markov chains and discrete-time Markov processes on general state spaces (social mobility, learning theory, industrial processes, etc.) [See also 90B30, 91D10, 91D35, 91E40]

Keywords: Contrastive Divergence; exponential family; convergence rate


Jiang, Bai; Wu, Tung-Yu; Jin, Yifan; Wong, Wing H. Convergence of contrastive divergence algorithm in exponential family. Ann. Statist. 46 (2018), no. 6A, 3067–3098. doi:10.1214/17-AOS1649.


