The Annals of Statistics

Optimal weighted nearest neighbour classifiers

Richard J. Samworth

Abstract

We derive an asymptotic expansion for the excess risk (regret) of a weighted nearest-neighbour classifier. This allows us to find the asymptotically optimal vector of nonnegative weights, which has a rather simple form. We show that the ratio of the regret of this classifier to that of an unweighted $k$-nearest neighbour classifier depends asymptotically only on the dimension $d$ of the feature vectors, and not on the underlying populations. The improvement is greatest when $d=4$, but thereafter decreases as $d\rightarrow\infty$. The popular bagged nearest neighbour classifier can also be regarded as a weighted nearest neighbour classifier, and we show that its corresponding weights are somewhat suboptimal when $d$ is small (in particular, worse than those of the unweighted $k$-nearest neighbour classifier when $d=1$), but are close to optimal when $d$ is large. Finally, we argue that improvements in the rate of convergence are possible under stronger smoothness assumptions, provided we allow negative weights. Our findings are supported by an empirical performance comparison on both simulated and real data sets.
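To make the objects in the abstract concrete, the following is a minimal Python sketch of a weighted nearest-neighbour classifier for two classes, with three weight vectors: the unweighted $k$-nearest neighbour weights, the bagged nearest neighbour weights, and a candidate for the "rather simple form" of the optimal nonnegative weights. The abstract does not state the optimal weight formula, so the expression in `optimal_weights` is an assumption recalled from the body of the paper and should be checked against it. The bagged weights assume with-replacement resamples of size `m` averaged over infinitely many resamples. All function names, the class labels 1 and 2, and the parameter `k_star` (supplied by the user, not estimated here) are hypothetical choices made for illustration.

```python
import math
from typing import List, Sequence


def knn_weights(n: int, k: int) -> List[float]:
    """Unweighted k-NN: weight 1/k on each of the k nearest points, 0 elsewhere."""
    return [1.0 / k if i <= k else 0.0 for i in range(1, n + 1)]


def bagged_weights(n: int, m: int) -> List[float]:
    """Bagged 1-NN weights, assuming with-replacement resamples of size m
    and infinitely many resamples: the i-th nearest point is the resample's
    nearest neighbour with probability (1-(i-1)/n)^m - (1-i/n)^m."""
    return [(1 - (i - 1) / n) ** m - (1 - i / n) ** m for i in range(1, n + 1)]


def optimal_weights(n: int, k_star: int, d: int) -> List[float]:
    """ASSUMED form of the optimal nonnegative weights (not given in the
    abstract): w_i = (1/k*) [1 + d/2 - d/(2 k*^(2/d)) (i^(1+2/d) - (i-1)^(1+2/d))]
    for i <= k*, and 0 otherwise."""
    if not 1 <= k_star <= n:
        raise ValueError("k_star must lie between 1 and n")
    p = 1.0 + 2.0 / d
    scale = d / (2.0 * k_star ** (2.0 / d))
    weights = []
    for i in range(1, n + 1):
        if i <= k_star:
            weights.append((1.0 + d / 2.0 - scale * (i ** p - (i - 1) ** p)) / k_star)
        else:
            weights.append(0.0)
    return weights


def classify(
    x: Sequence[float],
    train_x: Sequence[Sequence[float]],
    train_y: Sequence[int],
    weights: Sequence[float],
) -> int:
    """Weighted NN rule for labels in {1, 2}: sort the training points by
    Euclidean distance to x, give the i-th nearest point weight weights[i-1],
    and return class 1 if the total weight on class-1 points is at least 1/2."""
    order = sorted(range(len(train_x)), key=lambda j: math.dist(x, train_x[j]))
    class_one_weight = sum(w for w, j in zip(weights, order) if train_y[j] == 1)
    return 1 if class_one_weight >= 0.5 else 2


if __name__ == "__main__":
    # Toy one-dimensional data, made up purely for illustration.
    train_x = [(0.0,), (1.0,), (2.0,), (10.0,)]
    train_y = [1, 1, 2, 2]
    n = len(train_x)
    for name, w in [
        ("unweighted k-NN", knn_weights(n, 3)),
        ("bagged NN", bagged_weights(n, 2)),
        ("assumed optimal", optimal_weights(n, 3, 1)),
    ]:
        print(name, [round(v, 3) for v in w], classify((0.4,), train_x, train_y, w))
```

Each weight vector is nonnegative and sums to one, so the three classifiers differ only in how quickly the weight decays with neighbour rank. Under the assumed formula the optimal weights decrease to zero at rank `k_star`, whereas the unweighted $k$-nearest neighbour weights are constant and then drop abruptly.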

Article information

Source
Ann. Statist., Volume 40, Number 5 (2012), 2733-2763.

Dates
First available in Project Euclid: 4 February 2013

Permanent link to this document
https://projecteuclid.org/euclid.aos/1359987536

Digital Object Identifier
doi:10.1214/12-AOS1049

Mathematical Reviews number (MathSciNet)
MR3097618

Zentralblatt MATH identifier
1373.62317

Subjects
Primary: 62G20: Asymptotic properties

Citation

Samworth, Richard J. Optimal weighted nearest neighbour classifiers. Ann. Statist. 40 (2012), no. 5, 2733–2763. doi:10.1214/12-AOS1049. https://projecteuclid.org/euclid.aos/1359987536

Supplemental materials

• Supplementary material: Supplement to “Optimal weighted nearest neighbour classifiers”. We complete the proof of Theorem 1, and give the proofs of the other results in the paper. We also discuss minimax properties of weighted nearest neighbour classifiers and a plug-in approach to estimating $k^{*}$.