Bayesian Analysis

Deep Learning: A Bayesian Perspective

Nicholas G. Polson and Vadim Sokolov

Full-text: Open access


Deep learning is a form of machine learning for nonlinear, high-dimensional pattern matching and prediction. By taking a Bayesian probabilistic perspective, we provide a number of insights into more efficient algorithms for optimisation and hyper-parameter tuning. Traditional high-dimensional data reduction techniques, such as principal component analysis (PCA), partial least squares (PLS), reduced rank regression (RRR), and projection pursuit regression (PPR), are all shown to be shallow learners. Their deep learning counterparts exploit multiple deep layers of data reduction, which provide predictive performance gains. Stochastic gradient descent (SGD) training optimisation and Dropout (DO) regularization provide estimation and variable selection. Bayesian regularization is central to finding weights and connections in networks to optimize the predictive bias-variance trade-off. To illustrate our methodology, we provide an analysis of international bookings on Airbnb. Finally, we conclude with directions for future research.

Article information

Bayesian Anal., Volume 12, Number 4 (2017), 1275-1304.

First available in Project Euclid: 16 November 2017

Keywords: deep learning, machine learning, Artificial Intelligence, LSTM models, prediction, Bayesian hierarchical models, pattern matching, TensorFlow

Creative Commons Attribution 4.0 International License.


Polson, Nicholas G.; Sokolov, Vadim. Deep Learning: A Bayesian Perspective. Bayesian Anal. 12 (2017), no. 4, 1275–1304. doi:10.1214/17-BA1082.


