Electronic Journal of Statistics

A notion of stability for k-means clustering

T. Le Gouic and Q. Paris

Full-text: Open access

Abstract

In this paper, we define and study a new notion of stability for the $k$-means clustering scheme, building upon the field of quantization of a probability measure. We connect this definition of stability to a geometric feature of the underlying distribution of the data, called the absolute margin condition, which is inspired by recent work on the subject.

Article information

Source
Electron. J. Statist., Volume 12, Number 2 (2018), 4239-4263.

Dates
Received: March 2018
First available in Project Euclid: 15 December 2018

Permanent link to this document
https://projecteuclid.org/euclid.ejs/1544842901

Digital Object Identifier
doi:10.1214/18-EJS1500

Mathematical Reviews number (MathSciNet)
MR3892141

Zentralblatt MATH identifier
07003242

Keywords
Clustering; k-means; stability

Rights
Creative Commons Attribution 4.0 International License.

Citation

Le Gouic, T.; Paris, Q. A notion of stability for k-means clustering. Electron. J. Statist. 12 (2018), no. 2, 4239-4263. doi:10.1214/18-EJS1500. https://projecteuclid.org/euclid.ejs/1544842901


