Bernoulli, Volume 23, Number 2 (2017), 789–824.

Empirical entropy, minimax regret and minimax risk

Alexander Rakhlin, Karthik Sridharan, and Alexandre B. Tsybakov



Abstract

We consider the random design regression model with square loss. We propose a method that aggregates empirical risk minimizers (ERM) over appropriately chosen random subsets and reduces to ERM in the extreme case, and we establish sharp oracle inequalities for its risk. We show that, under the $\varepsilon^{-p}$ growth of the empirical $\varepsilon$-entropy, the excess risk of the proposed method attains the rate $n^{-2/(2+p)}$ for $p\in(0,2)$ and $n^{-1/p}$ for $p>2$, where $n$ is the sample size. Furthermore, for $p\in(0,2)$, the excess risk rate matches the behavior of the minimax risk of function estimation in regression problems under the well-specified model. This yields the conclusion that the rates of statistical estimation in well-specified models (minimax risk) and in misspecified models (minimax regret) are equivalent in the regime $p\in(0,2)$. In other words, for $p\in(0,2)$ the problem of statistical learning enjoys the same minimax rate as the problem of statistical estimation. On the contrary, for $p>2$ we show that the rates of the minimax regret are, in general, slower than those of the minimax risk. Our oracle inequalities also imply the $v\log(n/v)/n$ rates for Vapnik–Chervonenkis type classes of dimension $v$ without the usual convexity assumption on the class; we show that these rates are optimal. Finally, for a slightly modified method, we derive a bound on the excess risk of $s$-sparse convex aggregation improving that of Lounici [Math. Methods Statist. 16 (2007) 246–259] and providing the optimal rate.
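The rate statements in the abstract can be made concrete numerically. The sketch below (an illustration only, not code from the paper; the function name is ours) encodes the excess-risk rate exponent as a function of the entropy growth parameter $p$, showing the phase transition at $p=2$ between the regime where minimax regret matches minimax risk and the regime where it is slower:

```python
def regret_rate_exponent(p: float) -> float:
    """Exponent a such that the minimax regret scales as n**(-a),
    under epsilon**(-p) growth of the empirical epsilon-entropy
    (rates as stated in the abstract)."""
    if p <= 0:
        raise ValueError("p must be positive")
    if p < 2:
        # regime where minimax regret matches the minimax risk
        return 2 / (2 + p)
    # p >= 2: regret rate n**(-1/p), slower than the minimax risk
    return 1 / p

# For example, p = 1 gives the rate n**(-2/3),
# while p = 4 gives the slower rate n**(-1/4).
```

Note that the two branches agree at $p=2$ (both give exponent $1/2$), so the rate is continuous across the transition even though the regimes differ in whether regret and risk rates coincide.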

Article information


Received: March 2014
Revised: July 2014
First available in Project Euclid: 4 February 2017


Keywords: aggregation; empirical risk minimization; entropy; minimax regret; minimax risk


Rakhlin, Alexander; Sridharan, Karthik; Tsybakov, Alexandre B. Empirical entropy, minimax regret and minimax risk. Bernoulli 23 (2017), no. 2, 789–824. doi:10.3150/14-BEJ679.



  • [1] Aizerman, M.A., Braverman, E.M. and Rozonoer, L.I. (1970). The Method of Potential Functions in the Theory of Machine Learning. Moscow: Nauka. (in Russian).
  • [2] Audibert, J.Y. (2007). Progressive mixture rules are deviation suboptimal. Adv. Neural Inf. Process. Syst. 20 41–48.
  • [3] Bartlett, P.L. (2006). CS 281B Statistical Learning Theory Course Notes, U.C. Berkeley: Berkeley, CA.
  • [4] Bartlett, P.L., Bousquet, O. and Mendelson, S. (2005). Local Rademacher complexities. Ann. Statist. 33 1497–1537.
  • [5] Bartlett, P.L. and Mendelson, S. (2002). Rademacher and Gaussian complexities: Risk bounds and structural results. J. Mach. Learn. Res. 3 463–482.
  • [6] Birgé, L. (1983). Approximation dans les espaces métriques et théorie de l’estimation. Z. Wahrsch. Verw. Gebiete 65 181–237.
  • [7] Bousquet, O. (2002). Concentration inequalities and empirical processes theory applied to the analysis of learning algorithms. Ph.D. thesis, Ecole Polytechnique.
  • [8] Bousquet, O., Koltchinskii, V. and Panchenko, D. (2002). Some local measures of complexity of convex hulls and generalization bounds. In Computational Learning Theory (Sydney, 2002). Lecture Notes in Computer Science 2375 59–73. Berlin: Springer.
  • [9] Buescher, K.L. and Kumar, P.R. (1996). Learning by canonical smooth estimation. I. Simultaneous estimation. IEEE Trans. Automat. Control 41 545–556.
  • [10] Catoni, O. (2004). Statistical Learning Theory and Stochastic Optimization. Lecture Notes in Math. 1851. Berlin: Springer.
  • [11] Cesa-Bianchi, N. and Lugosi, G. (2001). Worst-case bounds for the logarithmic loss of predictors. Mach. Learn. 43 247–264.
  • [12] Dai, D., Rigollet, P. and Zhang, T. (2012). Deviation optimal learning using greedy $Q$-aggregation. Ann. Statist. 40 1878–1905.
  • [13] Dalalyan, A.S. and Tsybakov, A.B. (2012). Mirror averaging with sparsity priors. Bernoulli 18 914–944.
  • [14] Devroye, L. (1987). A Course in Density Estimation. Progress in Probability and Statistics 14. Boston, MA: Birkhäuser.
  • [15] Devroye, L., Györfi, L. and Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition. Applications of Mathematics (New York) 31. New York: Springer.
  • [16] Dudley, R.M. (1978). Central limit theorems for empirical measures. Ann. Probab. 6 899–929 (1979).
  • [17] Dudley, R.M. (1987). Universal Donsker classes and metric entropy. Ann. Probab. 15 1306–1326.
  • [18] Dudley, R.M. (1999). Uniform Central Limit Theorems. Cambridge Studies in Advanced Mathematics 63. Cambridge: Cambridge Univ. Press.
  • [19] Dudley, R.M., Giné, E. and Zinn, J. (1991). Uniform and universal Glivenko–Cantelli classes. J. Theoret. Probab. 4 485–510.
  • [20] Ibragimov, I.A. and Has’minskiĭ, R.Z. (1980). An estimate of the density of a distribution. Zap. Nauchn. Sem. Leningrad. Otdel. Mat. Inst. Steklov. (LOMI) 98 61–85.
  • [21] Juditsky, A., Rigollet, P. and Tsybakov, A.B. (2008). Learning by mirror averaging. Ann. Statist. 36 2183–2206.
  • [22] Kolmogorov, A.N. and Tihomirov, V.M. (1959). $\varepsilon$-entropy and $\varepsilon$-capacity of sets in function spaces. Uspehi Mat. Nauk 14 3–86.
  • [23] Koltchinskii, V. (2001). Rademacher penalties and structural risk minimization. IEEE Trans. Inform. Theory 47 1902–1914.
  • [24] Koltchinskii, V. (2006). Local Rademacher complexities and oracle inequalities in risk minimization. Ann. Statist. 34 2593–2656.
  • [25] Koltchinskii, V. (2011). Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems. Lecture Notes in Math. 2033. Heidelberg: Springer.
  • [26] Koltchinskii, V. and Panchenko, D. (2000). Rademacher processes and bounding the risk of function learning. In High Dimensional Probability, II (Seattle, WA, 1999). Progress in Probability 47 443–457. Boston, MA: Birkhäuser.
  • [27] LeCam, L. (1973). Convergence of estimates under dimensionality restrictions. Ann. Statist. 1 38–53.
  • [28] Lecué, G. (2011). Interplay between concentration, complexity and geometry in learning theory with applications to high dimensional data analysis. Habilitation thesis, Univ. Paris-Est.
  • [29] Lecué, G. (2013). Empirical risk minimization is optimal for the convex aggregation problem. Bernoulli 19 2153–2166.
  • [30] Lecué, G. and Mendelson, S. (2009). Aggregation via empirical risk minimization. Probab. Theory Related Fields 145 591–613.
  • [31] Lecué, G. and Rigollet, P. (2014). Optimal learning with $Q$-aggregation. Ann. Statist. 42 211–224.
  • [32] Lee, W.S., Bartlett, P.L. and Williamson, R.C. (1998). The importance of convexity in learning with squared loss. IEEE Trans. Inform. Theory 44 1974–1980.
  • [33] Lounici, K. (2007). Generalized mirror averaging and $D$-convex aggregation. Math. Methods Statist. 16 246–259.
  • [34] Lugosi, G. and Nobel, A.B. (1999). Adaptive model selection using empirical complexities. Ann. Statist. 27 1830–1864.
  • [35] Nemirovski, A. (2000). Topics in non-parametric statistics. In Lectures on Probability Theory and Statistics (Saint-Flour, 1998). Lecture Notes in Math. 1738 85–277. Berlin: Springer.
  • [36] Pollard, D. (1984). Convergence of Stochastic Processes. New York: Springer.
  • [37] Raginsky, M. and Rakhlin, A. (2011). Lower bounds for passive and active learning. In Advances in Neural Information Processing Systems 24 1026–1034.
  • [38] Rigollet, P. and Tsybakov, A. (2011). Exponential screening and optimal rates of sparse estimation. Ann. Statist. 39 731–771.
  • [39] Rigollet, P. and Tsybakov, A.B. (2012). Sparse estimation by exponential weighting. Statist. Sci. 27 558–575.
  • [40] Srebro, N., Sridharan, K. and Tewari, A. (2010). Smoothness, low-noise and fast rates. In NIPS. Available at arXiv:1009.3896.
  • [41] Tsybakov, A.B. (2003). Optimal rates of aggregation. In Proceedings of COLT-2003 303–313. Berlin: Springer.
  • [42] Tsybakov, A.B. (2009). Introduction to Nonparametric Estimation. New York: Springer.
  • [43] van de Geer, S. (1990). Estimating a regression function. Ann. Statist. 18 907–924.
  • [44] Vapnik, V. (1982). Estimation of Dependences Based on Empirical Data. New York: Springer.
  • [45] Vapnik, V.N. and Chervonenkis, A.Ya. (1968). Uniform convergence of frequencies of occurrence of events to their probabilities. Dokl. Akad. Nauk SSSR 181 915–918.
  • [46] Vapnik, V.N. and Chervonenkis, A.Ya. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory Probab. Appl. 16 264–280.
  • [47] Vapnik, V.N. and Chervonenkis, A.Ya. (1974). Theory of Pattern Recognition. Moscow: Nauka.
  • [48] Yang, Y. (2004). Aggregating regression procedures to improve performance. Bernoulli 10 25–47.
  • [49] Yang, Y. and Barron, A. (1999). Information-theoretic determination of minimax rates of convergence. Ann. Statist. 27 1564–1599.
  • [50] Yuditskiĭ, A.B., Nazin, A.V., Tsybakov, A.B. and Vayatis, N. (2005). Recursive aggregation of estimators by the mirror descent method with averaging. Problems of Information Transmission 41 368–384.