The Annals of Statistics

Minimax-optimal nonparametric regression in high dimensions

Yun Yang and Surya T. Tokdar



Minimax $L_{2}$ risks for high-dimensional nonparametric regression are derived under two sparsity assumptions: (1) the true regression surface is a sparse function that depends only on $d=O(\log n)$ important predictors among a list of $p$ predictors, with $\log p=o(n)$; (2) the true regression surface depends on $O(n)$ predictors but is an additive function where each additive component is sparse but may contain two or more interacting predictors and may have a smoothness level different from other components. For either modeling assumption, a practicable extension of the widely used Bayesian Gaussian process regression method is shown to adaptively attain the optimal minimax rate (up to $\log n$ terms) asymptotically as both $n,p\to\infty$ with $\log p=o(n)$.
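Sparsity assumption (1) — a regression surface that depends on only a few of many predictors — can be illustrated with a toy Gaussian process fit whose covariance kernel looks only at a selected coordinate subset. This is a minimal sketch for intuition, not the authors' procedure: the squared-exponential kernel, fixed length scale, and noise variance below are all illustrative assumptions (the paper treats these quantities adaptively).

```python
import numpy as np

def gp_posterior_mean(X_train, y_train, X_test, subset,
                      length_scale=1.0, noise_var=0.1):
    """Posterior mean of GP regression using only the predictors in `subset`.

    Illustrative only: squared-exponential kernel on the selected
    coordinates, with hand-picked hyperparameters.
    """
    Xs, Xt = X_train[:, subset], X_test[:, subset]

    def kernel(A, B):
        # Squared-exponential kernel on the selected coordinates.
        sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-0.5 * sq_dists / length_scale ** 2)

    # Standard GP posterior mean: k(X*, X) (K + sigma^2 I)^{-1} y.
    K = kernel(Xs, Xs) + noise_var * np.eye(Xs.shape[0])
    return kernel(Xt, Xs) @ np.linalg.solve(K, y_train)
```

For example, with $p = 20$ predictors but a response driven by only the first two coordinates, passing `subset=[0, 1]` fits a 2-dimensional surface rather than a 20-dimensional one, which is the dimension reduction that makes the faster minimax rate attainable.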

Article information

Ann. Statist., Volume 43, Number 2 (2015), 652-674.

First available in Project Euclid: 3 March 2015


Primary: 62G08 (nonparametric regression); 62C20 (minimax procedures)
Secondary: 60G15 (Gaussian processes)

Keywords: Adaptive estimation; high-dimensional regression; minimax risk; model selection; nonparametric regression


Yang, Yun; Tokdar, Surya T. Minimax-optimal nonparametric regression in high dimensions. Ann. Statist. 43 (2015), no. 2, 652--674. doi:10.1214/14-AOS1289.


