The Annals of Statistics

A general trimming approach to robust cluster analysis

Luis A. García-Escudero, Alfonso Gordaliza, Carlos Matrán, and Agustin Mayo-Iscar

Full-text: Open access

Abstract

We introduce a new clustering method that aims to fit clusters with different scatters and weights. To guarantee robustness, the method is designed to handle a proportion α of contaminating data. As a characteristic feature, restrictions are imposed on the ratio between the maximum and the minimum eigenvalues of the groups' scatter matrices. These restrictions make the problem well defined and guarantee that the sample solutions are consistent for the population ones.

The method covers a wide range of clustering approaches depending on the strength of the chosen restrictions. Our proposal includes an algorithm for approximately solving the sample problem.
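
To illustrate the eigenvalue-ratio restriction described above, the following Python sketch checks whether a set of group scatter matrices satisfies it. It assumes the restriction takes the form M/m ≤ c, where M and m are the largest and smallest eigenvalues over all group scatter matrices and c ≥ 1 is a user-chosen constant. The function names, the constant name `c`, and the example matrices are hypothetical and not taken from the paper; NumPy is used only for convenience.

```python
import numpy as np


def eigenvalue_ratio(scatter_matrices):
    """Return max eigenvalue / min eigenvalue over all group scatter matrices.

    Assumes every matrix is symmetric and positive definite.
    """
    eigenvalues = np.concatenate(
        [np.linalg.eigvalsh(np.asarray(s, dtype=float)) for s in scatter_matrices]
    )
    return float(eigenvalues.max() / eigenvalues.min())


def satisfies_restriction(scatter_matrices, c):
    """Check the assumed restriction M / m <= c for a constant c >= 1."""
    return eigenvalue_ratio(scatter_matrices) <= c


# Hypothetical example: two groups in the plane.
sigma_1 = [[1.0, 0.0], [0.0, 4.0]]  # elongated group, eigenvalues 1 and 4
sigma_2 = [[2.0, 0.0], [0.0, 2.0]]  # spherical group, eigenvalues 2 and 2

ratio = eigenvalue_ratio([sigma_1, sigma_2])
loose_ok = satisfies_restriction([sigma_1, sigma_2], c=5.0)
strict_ok = satisfies_restriction([sigma_1, sigma_2], c=2.0)
```

Under this reading, c = 1 forces all eigenvalues to coincide, so every group is spherical with the same scatter, while larger values of c allow progressively more heterogeneous groups.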

Article information

Source
Ann. Statist., Volume 36, Number 3 (2008), 1324-1345.

Dates
First available in Project Euclid: 26 May 2008

Permanent link to this document
https://projecteuclid.org/euclid.aos/1211819566

Digital Object Identifier
doi:10.1214/07-AOS515

Mathematical Reviews number (MathSciNet)
MR2418659

Zentralblatt MATH identifier
1360.62328

Subjects
Primary: 62H3
Secondary: 62H3

Keywords
Robustness; cluster analysis; trimming; asymptotics; trimmed k-means; EM-algorithm; fast-MCD algorithm; Dykstra’s algorithm

Citation

García-Escudero, Luis A.; Gordaliza, Alfonso; Matrán, Carlos; Mayo-Iscar, Agustin. A general trimming approach to robust cluster analysis. Ann. Statist. 36 (2008), no. 3, 1324–1345. doi:10.1214/07-AOS515. https://projecteuclid.org/euclid.aos/1211819566

