The Annals of Statistics

Generalization bounds for averaged classifiers

Yoav Freund, Yishay Mansour, and Robert E. Schapire

Full-text: Open access

Abstract

We study a simple learning algorithm for binary classification. Instead of predicting with the best hypothesis in the hypothesis class, that is, the hypothesis that minimizes the training error, our algorithm predicts with a weighted average of all hypotheses, weighted exponentially with respect to their training error. We show that the prediction of this algorithm is much more stable than the prediction of an algorithm that predicts with the best hypothesis. By allowing the algorithm to abstain from predicting on some examples, we show that the predictions it makes when it does not abstain are very reliable. Finally, we show that the probability that the algorithm abstains is comparable to the generalization error of the best hypothesis in the class.
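The averaging scheme the abstract describes can be sketched in a few lines. This is an illustrative reconstruction, not the paper's exact procedure: the function names, the temperature parameter `eta`, and the abstention threshold `margin` are assumptions for the sketch; the paper's analysis fixes these quantities precisely.

```python
import math

def averaged_predict(hypotheses, train_errors, x, eta=1.0, margin=0.0):
    """Predict with an exponentially weighted average of hypotheses.

    hypotheses:   list of callables mapping an example x to -1 or +1
    train_errors: training error of each hypothesis, in the same order
    eta:          temperature of the exponential weighting (assumed name)
    margin:       abstention threshold on the normalized weighted vote
    Returns +1 or -1, or 0 when the algorithm abstains.
    """
    # Weight each hypothesis exponentially in its training error,
    # so low-error hypotheses dominate the vote.
    weights = [math.exp(-eta * err) for err in train_errors]
    total = sum(weights)
    vote = sum(w * h(x) for w, h in zip(weights, hypotheses)) / total
    # Abstain when the weighted vote is too close to a tie.
    if abs(vote) <= margin:
        return 0
    return 1 if vote > 0 else -1
```

For example, with two constant hypotheses predicting +1 (training error 0.1) and -1 (training error 0.4), the weighted vote is mildly positive, so the classifier predicts +1 at `margin=0.0` but abstains at `margin=0.2`.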

Article information

Source
Ann. Statist., Volume 32, Number 4 (2004), 1698-1722.

Dates
First available in Project Euclid: 4 August 2004

Permanent link to this document
https://projecteuclid.org/euclid.aos/1091626184

Digital Object Identifier
doi:10.1214/009053604000000058

Mathematical Reviews number (MathSciNet)
MR2089139

Zentralblatt MATH identifier
1045.62056

Subjects
Primary: 62C12: Empirical decision procedures; empirical Bayes procedures

Keywords
Classification; ensemble methods; averaging; Bayesian methods; generalization bounds

Citation

Freund, Yoav; Mansour, Yishay; Schapire, Robert E. Generalization bounds for averaged classifiers. Ann. Statist. 32 (2004), no. 4, 1698--1722. doi:10.1214/009053604000000058. https://projecteuclid.org/euclid.aos/1091626184

References

  • Allwein, E. L., Schapire, R. E. and Singer, Y. (2000). Reducing multiclass to binary: A unifying approach for margin classifiers. J. Mach. Learn. Res. 1 113–141.
  • Blumer, A., Ehrenfeucht, A., Haussler, D. and Warmuth, M. K. (1987). Occam's razor. Inform. Process. Lett. 24 377–380.
  • Bousquet, O. and Elisseeff, A. (2002). Stability and generalization. J. Mach. Learn. Res. 2 499–526.
  • Breiman, L. (1996). Bagging predictors. Machine Learning 24 123–140.
  • Breiman, L. (1996). Heuristics of instability and stabilization in model selection. Ann. Statist. 24 2350–2383.
  • Cesa-Bianchi, N., Freund, Y., Haussler, D., Helmbold, D. P., Schapire, R. E. and Warmuth, M. K. (1997). How to use expert advice. J. ACM 44 427–485.
  • de Bruijn, N. G. (1981). Asymptotic Methods in Analysis, 3rd ed. Dover, New York.
  • Freund, Y. (2003). Predicting a binary sequence almost as well as the optimal biased coin. Inform. and Comput. 182 73–94.
  • Freund, Y. and Mason, L. (1999). The alternating decision tree learning algorithm. In Proc. Sixteenth International Conference on Machine Learning 124–133. Morgan Kaufmann, San Francisco.
  • Freund, Y. and Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. System Sci. 55 119–139.
  • Friedman, J. H. (1997). On bias, variance, 0/1-loss, and the curse-of-dimensionality. Data Min. Knowl. Discov. 1 55–77.
  • Helmbold, D. P. and Schapire, R. E. (1997). Predicting nearly as well as the best pruning of a decision tree. Machine Learning 27 51–68.
  • Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. J. Amer. Statist. Assoc. 58 13–30.
  • Littlestone, N. and Warmuth, M. K. (1994). The weighted majority algorithm. Inform. and Comput. 108 212–261.
  • MacKay, D. J. C. (1991). Bayesian methods for adaptive models. Ph.D. dissertation, California Institute of Technology.
  • McAllester, D. A. (1999). Some PAC–Bayesian theorems. Machine Learning 37 355–363.
  • McDiarmid, C. (1989). On the method of bounded differences. In Surveys in Combinatorics 1989 148–188. Cambridge Univ. Press.
  • Schapire, R. E., Freund, Y., Bartlett, P. and Lee, W. S. (1998). Boosting the margin: A new explanation for the effectiveness of voting methods. Ann. Statist. 26 1651–1686.
  • Shawe-Taylor, J., Bartlett, P. L., Williamson, R. C. and Anthony, M. (1998). Structural risk minimization over data-dependent hierarchies. IEEE Trans. Inform. Theory 44 1926–1940.
  • Shawe-Taylor, J. and Williamson, R. C. (1997). A PAC analysis of a Bayesian estimator. In Proc. Tenth Annual Conference on Computational Learning Theory 2–9. ACM Press, New York.
  • Vapnik, V. N. (1998). Statistical Learning Theory. Wiley, New York.
  • Vovk, V. (2001). Transductive confidence machines. Unpublished manuscript.
  • Willems, F. M. J., Shtarkov, Y. M. and Tjalkens, T. J. (1995). The context-tree weighting method: Basic properties. IEEE Trans. Inform. Theory 41 653–664.