Electronic Journal of Statistics

Quantile-based clustering

Christian Hennig, Cinzia Viroli, and Laura Anderlucci

Full-text: Open access

Abstract

A new cluster analysis method, $K$-quantiles clustering, is introduced. $K$-quantiles clustering can be computed by a simple greedy algorithm in the style of the classical Lloyd’s algorithm for $K$-means. It can be applied to large and high-dimensional datasets. It allows for within-cluster skewness and internal variable scaling based on within-cluster variation. Different versions allow for different levels of parsimony and computational efficiency. Although $K$-quantiles clustering is conceived as nonparametric, it can be connected to a fixed partition model of generalized asymmetric Laplace-distributions. The consistency of $K$-quantiles clustering is proved, and it is shown that $K$-quantiles clusters correspond to well separated mixture components in a nonparametric mixture. In a simulation, $K$-quantiles clustering is compared with a number of popular clustering methods with good results. A high-dimensional microarray dataset is clustered by $K$-quantiles.

Article information

Source
Electron. J. Statist., Volume 13, Number 2 (2019), 4849-4883.

Dates
Received: April 2019
First available in Project Euclid: 4 December 2019

Permanent link to this document
https://projecteuclid.org/euclid.ejs/1575428462

Digital Object Identifier
doi:10.1214/19-EJS1640

Mathematical Reviews number (MathSciNet)
MR4038727

Zentralblatt MATH identifier
07147366

Keywords
Fixed partition model quantile discrepancy high dimensional clustering nonparametric mixture

Rights
Creative Commons Attribution 4.0 International License.

Citation

Hennig, Christian; Viroli, Cinzia; Anderlucci, Laura. Quantile-based clustering. Electron. J. Statist. 13 (2019), no. 2, 4849--4883. doi:10.1214/19-EJS1640. https://projecteuclid.org/euclid.ejs/1575428462


Export citation

References

  • [1] Banerjee, A., I. S. Dhillon, J. Ghosh, and S. Sra (2005). Clustering on the unit hypersphere using von Mises-Fisher distributions., Journal of Machine Learning Research 6(Sep), 1345–1382.
  • [2] Ben-Israel, A. and C. Iyigun (2008). Probabilistic d-clustering., Journal of Classification 25(1), 5.
  • [3] Bryant, P. and J. A. Williamson (1978). Asymptotic behaviour of classification maximum likelihood estimates., Biometrika 65(2), 273–281.
  • [4] Bubeck, S. and U. von Luxburg (2009). Nearest neighbor clustering: A baseline method for consistent clustering with arbitrary objective functions., Journal of Machine Learning Research 10, 657–698.
  • [5] Clemencon, S. (2014). A statistical view of clustering performance through the theory of u-processes., Journal of Multivariate Analysis 124, 42–56.
  • [6] Diaconis, P. (1988)., Grouped Representations in probability and statistics, Volume 11. Hayward, CA: Institute of Mathematical Statistics Lecture Notes – Monograph Series.
  • [7] Fligner, M. A. and J. S. Verducci (1986). Distance based ranking models., Journal of the Royal Statistical Society. Series B (Methodological) 48(3), 359–369.
  • [8] Frey, B. J. and D. Dueck (2007). Clustering by passing messages between data points., Science 315(5814), 972–976.
  • [9] Friedman, J. H. and J. J. Meulman (2004). Clustering objects on subsets of attributes (with discussion)., Journal of the Royal Statistical Society: Series B (Statistical Methodology) 66(4), 815–849.
  • [10] Gnanadesikan, R., J. R. Kettenring, and S. L. Tsao (1995). Weighting and selection of variables., Journal of Classification 12, 113–136.
  • [11] Golub, T., D. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. Mesirov, H. Coller, M. Loh, J. Downing, M. Caligiuri, C. Bloomfield, and E. Lander (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring., Science 286, 531–537.
  • [12] Halkidi, M., M. Vazirgiannis, and C. Hennig (2016). Method-independent indices for cluster validation and estimating the number of clusters. In C. Hennig, M. Meila, F. Murtagh, and R. Rocci (Eds.), Handbook of Cluster Analysis, Chapter 26, pp. 595–618. Chapman & Hall/CRC, Boca Raton, FL.
  • [13] Hennig, C. (2015). What are the true clusters?, Pattern Recognition Letters 64, 53–62. Philosophical Aspects of Pattern Recognition.
  • [14] Hennig, C. (2016). Clustering strategy and method selection. In C. Hennig, M. Meila, F. Murtagh, and R. Rocci (Eds.), Handbook of Cluster Analysis, Chapter 31, pp. 703–730. Chapman & Hall/CRC, Boca Raton FL.
  • [15] Hennig, C. and C. Viroli (2016). Quantile-based classifiers., Biometrika 103(2), 435.
  • [16] Hennig, C., C. Viroli, and L. Anderlucci (2019). Quantile-based clustering, arXiv:1806.10403 (supplement included from v2).
  • [17] Hubert, L. and P. Arabie (1985). Comparing partitions., Journal of the Classification 2, 193–218.
  • [18] Iyigun, C. (2010). Probabilistic distance clustering. In J. J. Cochran, L. A. Cox, P. Keskinocak, J. P. Kharoufeh, and J. C. Smith (Eds.), Wiley Encyclopedia of Operations Research and Management Science. John Wiley & Sons.
  • [19] Jain, A. K. (2010). Data clustering: 50 years beyond k-means., Pattern Recognition Letters 31(8), 651–666.
  • [20] Joe, H. (2006). Generating random correlation matrices based on partial correlations., Journal of Multivariate Analysis 97, 2177–2189.
  • [21] Kaufman, L. and P. Rousseeuw (2005)., Finding Groups in Data: An Introduction to Cluster Analysis. Wiley.
  • [22] Kozubowski, T. J. and K. Podgorski (2000). A multivariate and asymmetric generalization of Laplace distribution., Computational Statistics 15(4), 531–540.
  • [23] Leisch, F. (2016). Resampling methods for exploring clustering stability. In C. Hennig, M. Meila, F. Murtagh, and R. Rocci (Eds.), Handbook of Cluster Analysis, Chapter 28, pp. 637–652. Chapman & Hall/CRC, Boca Raton, FL.
  • [24] Linder, T., G. Lugosi, and K. Zeger (1994). Rates of convergence in the source coding theorem, in empirical quantizer design, and in universal lossy source coding., IEEE Transactions on Information Theory 40(6), 1728–1740.
  • [25] Lloyd, S. (1982). Least squares quantization in pcm., IEEE Trans. Inf. Theor. 28(2), 129–137.
  • [26] Mallows, C. L. (1957). Non-null ranking models. I., Biometrika 44(1/2), 359–369.
  • [27] McLachlan, G. J. and D. Peel (2000)., Finite Mixture Models. Wiley.
  • [28] Murphy, T. B. and D. Martin (2003). Mixtures of distance-based models for ranking data., Computational Statistics & Data Analysis 41, 645–655.
  • [29] Ng, A. Y., M. I. Jordan, and Y. Weiss (2002). On spectral clustering: Analysis and an algorithm. In, Advances in Neural Information Processing Systems, pp. 849–856.
  • [30] Ostrovsky, R., Y. Rabani, L. J. Schulman, and C. Swamy (2006). The effectiveness of lloyd-type methods for the k-means problem. In, Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science, Berkeley. IEEE.
  • [31] Pollard, D. (1981). Strong consistency of $k$-means clustering., Ann. Statist. 9(1), 135–140.
  • [32] van der Vaart, A. W. (1998)., Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press.
  • [33] von Luxburg, U., M. Belkin, and O. Bousquet (2008, 04). Consistency of spectral clustering., The Annals of Statistics 36(2), 555–586.