The Annals of Statistics

Evaluating probability forecasts

Tze Leung Lai, Shulamith T. Gross, and David Bo Shen


Probability forecasts of events are routinely used in climate predictions, in forecasting default probabilities on bank loans or in estimating the probability of a patient’s positive response to treatment. Scoring rules have long been used to assess the efficacy of the forecast probabilities after observing the occurrence, or nonoccurrence, of the predicted events. We develop herein a statistical theory for scoring rules and propose an alternative approach to the evaluation of probability forecasts. This approach uses loss functions relating the predicted to the actual probabilities of the events and applies martingale theory to exploit the temporal structure between the forecast and the subsequent occurrence or nonoccurrence of the event.
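
As a rough illustration of the distinction drawn in the abstract, the following sketch contrasts a scoring rule, which compares forecasts with observed 0/1 outcomes, with a loss function, which compares forecasts with the actual event probabilities. It is not the authors' method. The Brier score is used as one familiar scoring rule, squared error is assumed as the loss, and the simulated data, the function names and the biased forecaster are all hypothetical choices made for this example.

```python
import random


def brier_score(forecasts, outcomes):
    """Scoring rule: mean squared difference between forecasts and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(forecasts, outcomes)) / len(forecasts)


def squared_error_loss(forecasts, true_probs):
    """Loss function: mean squared difference between forecast and actual probabilities."""
    return sum((p - q) ** 2 for p, q in zip(forecasts, true_probs)) / len(forecasts)


def simulate(n=10000, seed=0):
    """Draw hypothetical actual probabilities and the events they generate."""
    rng = random.Random(seed)
    true_probs = [rng.uniform(0.1, 0.9) for _ in range(n)]
    outcomes = [1 if rng.random() < q else 0 for q in true_probs]
    return true_probs, outcomes


if __name__ == "__main__":
    true_probs, outcomes = simulate()

    # Assumed forecasters: one reports the actual probabilities,
    # the other overstates every probability by 0.1.
    exact = list(true_probs)
    biased = [min(1.0, q + 0.1) for q in true_probs]

    # The loss needs the actual probabilities, which are unobservable in practice.
    print("loss (biased):", squared_error_loss(biased, true_probs))

    # The Brier score needs only the outcomes. For a forecast p and actual
    # probability q, E[(p - Y)^2] = (p - q)^2 + q(1 - q), so the difference in
    # Brier scores between two forecasters estimates the difference in losses.
    diff = brier_score(biased, outcomes) - brier_score(exact, outcomes)
    print("Brier score difference:", diff)
```

In this simulation the actual probabilities are known, so the loss can be computed directly. With real data only the outcomes are observed, which is why the two quantities have to be connected through an identity such as the one in the comment.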

Article information

Ann. Statist., Volume 39, Number 5 (2011), 2356-2382.

First available in Project Euclid: 30 November 2011

Primary: 60G42: Martingales with discrete parameter; 62P99: None of the above, but in this section
Secondary: 62P05: Applications to actuarial sciences and financial mathematics

Keywords: forecasting, loss functions, martingales, scoring rules


Lai, Tze Leung; Gross, Shulamith T.; Shen, David Bo. Evaluating probability forecasts. Ann. Statist. 39 (2011), no. 5, 2356–2382. doi:10.1214/11-AOS902.
