## The Annals of Statistics

### Minimax optimal rates of estimation in high dimensional additive models

#### Abstract

We establish minimax optimal rates of convergence for estimation in a high dimensional additive model assuming that it is approximately sparse. Our results reveal a behavior universal to this class of high dimensional problems. In the sparse regime when the components are sufficiently smooth or the dimensionality is sufficiently large, the optimal rates are identical to those for high dimensional linear regression and, therefore, there is no additional cost to entertain a nonparametric model. Otherwise, in the so-called smooth regime, the rates coincide with the optimal rates for estimating a univariate function and, therefore, they are immune to the “curse of dimensionality.”

#### Article information

Source
Ann. Statist., Volume 44, Number 6 (2016), 2564-2593.

Dates
Revised: November 2015
First available in Project Euclid: 23 November 2016

https://projecteuclid.org/euclid.aos/1479891628

Digital Object Identifier
doi:10.1214/15-AOS1422

Mathematical Reviews number (MathSciNet)
MR3576554

Zentralblatt MATH identifier
1360.62200

#### Citation

Yuan, Ming; Zhou, Ding-Xuan. Minimax optimal rates of estimation in high dimensional additive models. Ann. Statist. 44 (2016), no. 6, 2564--2593. doi:10.1214/15-AOS1422. https://projecteuclid.org/euclid.aos/1479891628

#### References

• Aronszajn, N. (1950). Theory of reproducing kernels. Trans. Amer. Math. Soc. 68 337–404.
• Bach, F. R. (2008). Consistency of the group lasso and multiple kernel learning. J. Mach. Learn. Res. 9 1179–1225.
• Bartlett, P. L. and Mendelson, S. (2002). Rademacher and Gaussian complexities: Risk bounds and structural results. J. Mach. Learn. Res. 3 463–482.
• Bickel, P. J., Ritov, Y. and Tsybakov, A. B. (2009). Simultaneous analysis of lasso and Dantzig selector. Ann. Statist. 37 1705–1732.
• Bousquet, O. (2002). A Bennett concentration inequality and its application to suprema of empirical processes. C. R. Math. Acad. Sci. Paris 334 495–500.
• Bousquet, O. and Herrmann, D. (2003). On the complexity of learning the kernel matrix. In Advances in Neural Information Processing Systems 15 415–422.
• Breiman, L. (1995). Better subset regression using the nonnegative garrote. Technometrics 37 373–384.
• Bühlmann, P. and van de Geer, S. (2011). Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer, Heidelberg.
• Candes, E. and Tao, T. (2007). The Dantzig selector: Statistical estimation when $p$ is much larger than $n$. Ann. Statist. 35 2313–2351.
• Cover, T. M. and Thomas, J. A. (1991). Elements of Information Theory. Wiley, New York.
• Crammer, K., Keshet, J. and Singer, Y. (2003). Kernel design using boosting. In Advances in Neural Information Processing Systems 15 553–560. MIT Press, Cambridge, MA.
• Cui, X., Peng, H., Wen, S. and Zhu, L. (2013). Component selection in the additive regression model. Scand. J. Stat. 40 491–510.
• Fan, J., Feng, Y. and Song, R. (2011). Nonparametric independence screening in sparse ultra-high-dimensional additive models. J. Amer. Statist. Assoc. 106 544–557.
• Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc. 96 1348–1360.
• Hastie, T. J. and Tibshirani, R. J. (1990). Generalized Additive Models. Monographs on Statistics and Applied Probability 43. Chapman & Hall, London.
• Huang, J., Horowitz, J. L. and Wei, F. (2010). Variable selection in nonparametric additive models. Ann. Statist. 38 2282–2313.
• Koltchinskii, V. (2011). Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems. Lecture Notes in Math. 2033. Springer, Heidelberg.
• Koltchinskii, V. and Yuan, M. (2008). Sparse recovery in large ensembles of kernel machines. In Proceedings of 19th Annual Conference on Learning Theory (COLT 2008) 229–238. Omnipress, Madison, WI.
• Koltchinskii, V. and Yuan, M. (2010). Sparsity in multiple kernel learning. Ann. Statist. 38 3660–3695.
• Lanckriet, G. R. G., Cristianini, N., Bartlett, P., El Ghaoui, L. and Jordan, M. I. (2004). Learning the kernel matrix with semidefinite programming. J. Mach. Learn. Res. 5 27–72.
• Ledoux, M. and Talagrand, M. (1991). Probability in Banach Spaces: Isoperimetry and Processes. Ergebnisse der Mathematik und Ihrer Grenzgebiete (3) 23. Springer, Berlin.
• Lin, Y. and Zhang, H. H. (2006). Component selection and smoothing in multivariate nonparametric regression. Ann. Statist. 34 2272–2297.
• Massart, P. (2007). Concentration Inequalities and Model Selection. Lecture Notes in Math. 1896. Springer, Berlin.
• Meier, L., van de Geer, S. and Bühlmann, P. (2009). High-dimensional additive modeling. Ann. Statist. 37 3779–3821.
• Mendelson, S. (2002). Geometric parameters of kernel machines. In Computational Learning Theory (Sydney, 2002). Lecture Notes in Computer Science 2375 29–43. Springer, Berlin.
• Micchelli, C. A. and Pontil, M. (2005). Learning the kernel function via regularization. J. Mach. Learn. Res. 6 1099–1125.
• Raskutti, G., Wainwright, M. J. and Yu, B. (2011). Minimax rates of estimation for high-dimensional linear regression over $\ell_{q}$-balls. IEEE Trans. Inform. Theory 57 6976–6994.
• Raskutti, G., Wainwright, M. J. and Yu, B. (2012). Minimax-optimal rates for sparse additive models over kernel classes via convex programming. J. Mach. Learn. Res. 13 389–427.
• Ravikumar, P., Lafferty, J., Liu, H. and Wasserman, L. (2009). Sparse additive models. J. R. Stat. Soc. Ser. B. Stat. Methodol. 71 1009–1030.
• Srebro, N. and Ben-David, S. (2006). Learning bounds for support vector machines with learned kernels. In Learning Theory. Lecture Notes in Computer Science 4005 169–183. Springer, Berlin.
• Stone, C. J. (1980). Optimal rates of convergence for nonparametric estimators. Ann. Statist. 8 1348–1360.
• Stone, C. J. (1982). Optimal global rates of convergence for nonparametric regression. Ann. Statist. 10 1040–1053.
• Stone, C. J. (1985). Additive regression and other nonparametric models. Ann. Statist. 13 689–705.
• Suzuki, T. and Sugiyama, M. (2013). Fast learning rate of multiple kernel learning: Trade-off between sparsity and smoothness. Ann. Statist. 41 1381–1405.
• Talagrand, M. (1996). New concentration inequalities in product spaces. Invent. Math. 126 505–563.
• Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B. Stat. Methodol. 58 267–288.
• Tsybakov, A. B. (2009). Introduction to Nonparametric Estimation. Springer, New York.
• van de Geer, S. (2000). Empirical Processes in M-Estimation. Cambridge Univ. Press, Cambridge.
• van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes. Springer, New York.
• Wahba, G. (1990). Spline Models for Observational Data. CBMS-NSF Regional Conference Series in Applied Mathematics 59. SIAM, Philadelphia, PA.
• Yang, Y. and Barron, A. (1999). Information-theoretic determination of minimax rates of convergence. Ann. Statist. 27 1564–1599.
• Ye, F. and Zhang, C. (2010). Rate minimaxity of the Lasso and Dantzig selector for the $\ell_{q}$ loss in $\ell_{r}$ balls. J. Mach. Learn. Res. 11 3519–3540.
• Yuan, M. (2007). Nonnegative garrote component selection in functional ANOVA models. In Conference on Artificial Intelligence and Statistics 660–666.