The Annals of Statistics

On deep learning as a remedy for the curse of dimensionality in nonparametric regression

Benedikt Bauer and Michael Kohler


Assuming that a smoothness condition and a suitable restriction on the structure of the regression function hold, it is shown that least squares estimates based on multilayer feedforward neural networks are able to circumvent the curse of dimensionality in nonparametric regression. The proof is based on new approximation results concerning multilayer feedforward neural networks with bounded weights and a bounded number of hidden neurons. The estimates are compared with various other approaches by using simulated data.
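The abstract describes least squares estimates based on multilayer feedforward neural networks, evaluated on simulated data. As a purely illustrative sketch (not the paper's estimator), the following fits a one-hidden-layer sigmoid network by gradient descent on the empirical L2 risk, using simulated data whose regression function has the kind of low-dimensional structure (here a single-index form) that such structural restrictions capture; all parameter choices and names are assumptions.

```python
# Illustrative sketch only: least squares fit of a one-hidden-layer
# feedforward network on simulated single-index data y = sin(2 a'x) + noise.
# Network size, learning rate, and data model are arbitrary choices.
import numpy as np

rng = np.random.default_rng(0)

d, n, K = 10, 500, 8                 # input dimension, sample size, hidden neurons
a = rng.normal(size=d)
a /= np.linalg.norm(a)               # unit projection direction
X = rng.uniform(-1, 1, size=(n, d))
y = np.sin(2 * X @ a) + 0.1 * rng.normal(size=n)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Network: f(x) = c0 + sum_k c_k * sigmoid(W[:,k]'x + b_k)
W = rng.normal(scale=0.5, size=(d, K))
b = rng.normal(scale=0.5, size=K)
c = rng.normal(scale=0.5, size=K)
c0 = 0.0

def empirical_l2_risk():
    return np.mean((c0 + sigmoid(X @ W + b) @ c - y) ** 2)

loss0 = empirical_l2_risk()
lr = 0.05
for _ in range(2000):                # plain gradient descent on squared loss
    H = sigmoid(X @ W + b)           # n x K hidden activations
    r = (c0 + H @ c) - y             # residuals
    M = (r[:, None] * H * (1 - H)) * c   # shared factor for W and b gradients
    c -= lr * (2 / n) * (H.T @ r)
    c0 -= lr * 2 * r.mean()
    b -= lr * 2 * M.mean(axis=0)
    W -= lr * (2 / n) * (X.T @ M)

loss1 = empirical_l2_risk()
print(loss0, loss1)                  # empirical risk before and after fitting
```

The empirical risk should drop substantially, reflecting that the network can exploit the one-dimensional projection structure even though the nominal input dimension is ten.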

Article information

Ann. Statist., Volume 47, Number 4 (2019), 2261–2285.

Received: November 2017
Revised: April 2018
First available in Project Euclid: 21 May 2019


Primary: 62G08: Nonparametric regression
Secondary: 62G20: Asymptotic properties

Keywords: Curse of dimensionality; neural networks; nonparametric regression; rate of convergence


Bauer, Benedikt; Kohler, Michael. On deep learning as a remedy for the curse of dimensionality in nonparametric regression. Ann. Statist. 47 (2019), no. 4, 2261–2285. doi:10.1214/18-AOS1747.

References


  • Anthony, M. and Bartlett, P. L. (1999). Neural Network Learning: Theoretical Foundations. Cambridge Univ. Press, Cambridge.
  • Bagirov, A. M., Clausen, C. and Kohler, M. (2009). Estimation of a regression function by maxima of minima of linear functions. IEEE Trans. Inform. Theory 55 833–845.
  • Barron, A. R. (1991). Complexity regularization with application to artificial neural networks. In Nonparametric Functional Estimation and Related Topics (Spetses, 1990). NATO Adv. Sci. Inst. Ser. C Math. Phys. Sci. 335 561–576. Kluwer Academic, Dordrecht.
  • Barron, A. R. (1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans. Inform. Theory 39 930–945.
  • Barron, A. R. (1994). Approximation and estimation bounds for artificial neural networks. Mach. Learn. 14 115–133.
  • Bauer, B. and Kohler, M. (2019). Supplement to “On deep learning as a remedy for the curse of dimensionality in nonparametric regression.” DOI:10.1214/18-AOS1747SUPPA, DOI:10.1214/18-AOS1747SUPPB.
  • Devroye, L., Györfi, L. and Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition. Applications of Mathematics (New York) 31. Springer, New York.
  • Eldan, R. and Shamir, O. (2015). The power of depth for feedforward neural networks. arXiv preprint.
  • Friedman, J. H. and Stuetzle, W. (1981). Projection pursuit regression. J. Amer. Statist. Assoc. 76 817–823.
  • Györfi, L., Kohler, M., Krzyżak, A. and Walk, H. (2002). A Distribution-Free Theory of Nonparametric Regression. Springer, New York.
  • Härdle, W., Hall, P. and Ichimura, H. (1993). Optimal smoothing in single-index models. Ann. Statist. 21 157–178.
  • Härdle, W. and Stoker, T. M. (1989). Investigating smooth multiple regression by the method of average derivatives. J. Amer. Statist. Assoc. 84 986–995.
  • Haykin, S. O. (2008). Neural Networks and Learning Machines, 3rd ed. Prentice Hall, New York.
  • Hertz, J., Krogh, A. and Palmer, R. G. (1991). Introduction to the Theory of Neural Computation. Addison-Wesley, Redwood City, CA.
  • Horowitz, J. L. and Mammen, E. (2007). Rate-optimal estimation for a general class of nonparametric regression models with unknown link functions. Ann. Statist. 35 2589–2619.
  • Kohler, M. and Krzyżak, A. (2005). Adaptive regression estimation with multilayer feedforward neural networks. J. Nonparametr. Stat. 17 891–913.
  • Kohler, M. and Krzyżak, A. (2017). Nonparametric regression based on hierarchical interaction models. IEEE Trans. Inform. Theory 63 1620–1630.
  • Kong, E. and Xia, Y. (2007). Variable selection for the single-index model. Biometrika 94 217–229.
  • Lazzaro, D. and Montefusco, L. B. (2002). Radial basis functions for the multivariate interpolation of large scattered data sets. J. Comput. Appl. Math. 140 521–536.
  • Lugosi, G. and Zeger, K. (1995). Nonparametric estimation via empirical risk minimization. IEEE Trans. Inform. Theory 41 677–687.
  • McCaffrey, D. F. and Gallant, A. R. (1994). Convergence rates for single hidden layer feedforward networks. Neural Netw. 7 147–158.
  • Mhaskar, H. N. and Poggio, T. (2016). Deep vs. shallow networks: An approximation theory perspective. Anal. Appl. (Singap.) 14 829–848.
  • Mielniczuk, J. and Tyrcha, J. (1993). Consistency of multilayer perceptron regression estimators. Neural Netw. 6 1019–1022.
  • Ripley, B. D. (2008). Pattern Recognition and Neural Networks. Cambridge Univ. Press, Cambridge. Reprint of the 1996 original.
  • Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural Netw. 61 85–117.
  • Schmidt-Hieber, J. (2017). Nonparametric regression using deep neural networks with ReLU activation function. Available at arXiv:1708.06633v2.
  • Stone, C. J. (1982). Optimal global rates of convergence for nonparametric regression. Ann. Statist. 10 1040–1053.
  • Stone, C. J. (1985). Additive regression and other nonparametric models. Ann. Statist. 13 689–705.
  • Stone, C. J. (1994). The use of polynomial splines and their tensor products in multivariate function estimation. Ann. Statist. 22 118–184.
  • Yu, Y. and Ruppert, D. (2002). Penalized spline estimation for partially linear single-index models. J. Amer. Statist. Assoc. 97 1042–1054.
