Electronic Journal of Statistics

A note on the use of empirical AUC for evaluating probabilistic forecasts

Simon Byrne

Full-text: Open access


Scoring functions are used to evaluate and compare partially probabilistic forecasts. We investigate the use of rank-sum functions such as the empirical Area Under the Curve (AUC), a widely used measure of classification performance, as a scoring function for the prediction of probabilities of a set of binary outcomes. It is shown that the AUC is not generally a proper scoring function; that is, under certain circumstances it is possible to improve on the expected AUC by modifying the quoted probabilities from their true values. However, with some restrictions, or with certain modifications, it can be made proper.
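To make the quantities in the abstract concrete, the following Python sketch computes the empirical AUC as a normalized rank-sum (Mann-Whitney) statistic and its exact expectation when the outcomes are independent Bernoulli draws. It is an illustration under stated assumptions, not the paper's own construction: the function names, the independence of outcomes, the half-credit treatment of ties, and the `degenerate` value returned when all outcomes fall in one class are all choices made here for the example.

```python
from itertools import product


def empirical_auc(scores, outcomes, degenerate=0.5):
    """Empirical AUC: fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    `degenerate` is an assumed convention for the case where all outcomes
    are in one class and the AUC is otherwise undefined.
    """
    pos = [s for s, y in zip(scores, outcomes) if y == 1]
    neg = [s for s, y in zip(scores, outcomes) if y == 0]
    if not pos or not neg:
        return degenerate
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def expected_auc(quoted, true_probs, degenerate=0.5):
    """Exact expected empirical AUC of the quoted probabilities.

    Assumes outcomes are independent Bernoulli(true_probs[i]) and
    enumerates all 2**n outcome vectors, so it is only practical for small n.
    """
    total = 0.0
    for outcomes in product((0, 1), repeat=len(true_probs)):
        weight = 1.0
        for y, p in zip(outcomes, true_probs):
            weight *= p if y == 1 else 1.0 - p
        total += weight * empirical_auc(quoted, outcomes, degenerate)
    return total
```

Under this setup, AUC would be proper if `expected_auc(true_probs, true_probs)` were never exceeded by `expected_auc(quoted, true_probs)` for any other `quoted`. Comparing the two for small examples is one way to explore the abstract's claim. Because the empirical AUC depends on the quoted values only through their ordering, any strictly increasing transformation of `quoted` leaves both functions unchanged.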

Article information

Electron. J. Statist., Volume 10, Number 1 (2016), 380-393.

Received: August 2015
First available in Project Euclid: 17 February 2016

Permanent link to this document
https://projecteuclid.org/euclid.ejs/1455715967

Digital Object Identifier
doi:10.1214/16-EJS1109

Primary: 62C99: None of the above, but in this section

Keywords: rank-sum; area under the curve; probabilistic prediction; scoring rule; scoring function


Byrne, Simon. A note on the use of empirical AUC for evaluating probabilistic forecasts. Electron. J. Statist. 10 (2016), no. 1, 380–393. doi:10.1214/16-EJS1109. https://projecteuclid.org/euclid.ejs/1455715967



Supplemental materials