The Annals of Statistics

Generalization bounds for averaged classifiers

Yoav Freund, Yishay Mansour, and Robert E. Schapire


We study a simple learning algorithm for binary classification. Instead of predicting with the best hypothesis in the hypothesis class, that is, the hypothesis that minimizes the training error, our algorithm predicts with a weighted average of all hypotheses, weighted exponentially with respect to their training error. We show that the prediction of this algorithm is much more stable than the prediction of an algorithm that predicts with the best hypothesis. By allowing the algorithm to abstain from predicting on some examples, we show that the predictions it makes when it does not abstain are very reliable. Finally, we show that the probability that the algorithm abstains is comparable to the generalization error of the best hypothesis in the class.
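The averaging scheme sketched in the abstract can be illustrated in a few lines. This is a minimal, hypothetical rendering: the weighting rate `eta` and the abstention threshold `margin` are illustrative parameters (not the paper's tuned constants), and `averaged_classifier` is an assumed helper name; the paper's precise weighting and abstention rule may differ.

```python
import math

def averaged_classifier(hypotheses, train_set, eta=2.0, margin=0.1):
    """Weight each hypothesis exponentially in its training error, then
    predict with the sign of the weighted-average vote, abstaining
    (returning 0) when the vote is too close to even.

    hypotheses: callables mapping x to a label in {-1, +1}
    train_set:  list of (x, y) pairs with y in {-1, +1}
    """
    n = len(train_set)
    # Training error of each hypothesis.
    errors = [sum(h(x) != y for x, y in train_set) / n for h in hypotheses]
    # Exponential weights: low-error hypotheses dominate the average.
    weights = [math.exp(-eta * n * e) for e in errors]
    total = sum(weights)

    def predict(x):
        # Weighted-average vote, a value in [-1, +1].
        vote = sum(w * h(x) for w, h in zip(weights, hypotheses)) / total
        if abs(vote) < margin:
            return 0          # abstain: the averaged prediction is unreliable
        return 1 if vote > 0 else -1

    return predict
```

For example, with a class of one-dimensional threshold classifiers, hypotheses whose thresholds fit the sample poorly receive exponentially small weight, so the averaged vote is dominated by the near-optimal thresholds and abstains only near the decision boundary.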

Article information

Ann. Statist., Volume 32, Number 4 (2004), 1698-1722.

First available in Project Euclid: 4 August 2004

Digital Object Identifier: doi:10.1214/009053604000000058


Primary: 62C12: Empirical decision procedures; empirical Bayes procedures

Keywords: Classification; ensemble methods; averaging; Bayesian methods; generalization bounds


Freund, Yoav; Mansour, Yishay; Schapire, Robert E. Generalization bounds for averaged classifiers. Ann. Statist. 32 (2004), no. 4, 1698--1722. doi:10.1214/009053604000000058.

