## Electronic Journal of Statistics

### Surrogate losses in passive and active learning

#### Abstract

Active learning is a type of sequential design for supervised machine learning, in which the learning algorithm sequentially requests the labels of selected instances from a large pool of unlabeled data points. The objective is to produce a classifier of relatively low risk, as measured under the $0$-$1$ loss, ideally using fewer label requests than the number of random labeled data points sufficient to achieve the same. This work investigates the potential uses of surrogate loss functions in the context of active learning. Specifically, it presents an active learning algorithm based on an arbitrary classification-calibrated surrogate loss function, along with an analysis of the number of label requests sufficient for the classifier returned by the algorithm to achieve a given risk under the $0$-$1$ loss. Interestingly, these results cannot be obtained by simply optimizing the surrogate risk via active learning to an extent sufficient to provide a guarantee on the $0$-$1$ loss, as is common practice in the analysis of surrogate losses for passive learning. Some of the results have additional implications for the use of surrogate losses in passive learning.

#### Article information

Source
Electron. J. Statist., Volume 13, Number 2 (2019), 4646-4708.

Dates
First available in Project Euclid: 13 November 2019

Permanent link to this document
https://projecteuclid.org/euclid.ejs/1573635664

Digital Object Identifier
doi:10.1214/19-EJS1635

Mathematical Reviews number (MathSciNet)
MR4030368

Zentralblatt MATH identifier
07136627

#### Citation

Hanneke, Steve; Yang, Liu. Surrogate losses in passive and active learning. Electron. J. Statist. 13 (2019), no. 2, 4646–4708. doi:10.1214/19-EJS1635. https://projecteuclid.org/euclid.ejs/1573635664
