The Annals of Statistics

A robust method for cluster analysis

María Teresa Gallegos and Gunter Ritter

Full-text: Open access

Abstract

Let there be given a contaminated list of $n$ $\mathbb{R}^d$-valued observations coming from $g$ different, normally distributed populations with a common covariance matrix. We compute the ML-estimator with respect to a certain statistical model with $n-r$ outliers for the parameters of the $g$ populations; it detects outliers and simultaneously partitions their complement into $g$ clusters. It turns out that the estimator unites both the minimum-covariance-determinant rejection method and the well-known pooled determinant criterion of cluster analysis. We also propose an efficient algorithm for approximating this estimator and study its breakdown points for mean values and pooled SSP matrix.
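
As a rough illustration of how outlier rejection and the pooled determinant criterion might be combined, the sketch below scores a candidate solution by the determinant of the pooled within-cluster SSP matrix, computed only over the retained observations. It is not the authors' algorithm; it only evaluates a criterion of this kind for a given labelling. The function names and the convention of marking discarded observations with `None` are assumptions made for this example.

```python
from typing import List, Optional, Sequence

Vector = Sequence[float]


def pooled_ssp(points: Sequence[Vector],
               labels: Sequence[Optional[int]],
               g: int) -> List[List[float]]:
    """Pooled within-cluster SSP matrix of the retained points.

    labels[i] is a cluster index in 0..g-1, or None if point i is
    treated as an outlier and discarded (hypothetical convention).
    """
    d = len(points[0])
    w = [[0.0] * d for _ in range(d)]
    for j in range(g):
        members = [x for x, lab in zip(points, labels) if lab == j]
        if not members:
            continue
        mean = [sum(x[k] for x in members) / len(members) for k in range(d)]
        for x in members:
            dev = [x[k] - mean[k] for k in range(d)]
            for a in range(d):
                for b in range(d):
                    w[a][b] += dev[a] * dev[b]
    return w


def determinant(matrix: Sequence[Sequence[float]]) -> float:
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [list(row) for row in matrix]
    n = len(a)
    det = 1.0
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(a[r][c]))
        if a[pivot][c] == 0.0:
            return 0.0
        if pivot != c:
            a[c], a[pivot] = a[pivot], a[c]
            det = -det
        det *= a[c][c]
        for r in range(c + 1, n):
            factor = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= factor * a[c][k]
    return det


def trimmed_determinant_criterion(points: Sequence[Vector],
                                  labels: Sequence[Optional[int]],
                                  g: int) -> float:
    """Smaller values indicate tighter clusters among the retained points."""
    return determinant(pooled_ssp(points, labels, g))


# Toy data: two tight groups of four points and one gross outlier (n = 9, r = 8).
points = [(0, 0), (1, 0), (0, 1), (1, 1),
          (10, 10), (11, 10), (10, 11), (11, 11),
          (100, -50)]
# Candidate A discards the gross outlier.
labels_a = [0, 0, 0, 0, 1, 1, 1, 1, None]
# Candidate B discards a regular point and keeps the outlier in cluster 0.
labels_b = [None, 0, 0, 0, 1, 1, 1, 1, 0]

score_a = trimmed_determinant_criterion(points, labels_a, g=2)
score_b = trimmed_determinant_criterion(points, labels_b, g=2)
print(score_a < score_b)
```

On this toy data the labelling that discards the gross outlier gets the smaller determinant, so a search over labellings that minimizes this score would prefer it. How to carry out that search efficiently is the subject of the paper's algorithm and is not attempted here.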

Article information

Source
Ann. Statist., Volume 33, Number 1 (2005), 347-380.

Dates
First available in Project Euclid: 8 April 2005

Permanent link to this document
https://projecteuclid.org/euclid.aos/1112967709

Digital Object Identifier
doi:10.1214/009053604000000940

Mathematical Reviews number (MathSciNet)
MR2157806

Zentralblatt MATH identifier
1064.62074

Subjects
Primary: 62H30: Classification and discrimination; cluster analysis [See also 68T10, 91C20]
Secondary: 62F35: Robustness and adaptive procedures

Keywords
Cluster analysis; multivariate data; outliers; robustness; breakdown point; determinant criterion; minimal distance partition

Citation

Gallegos, María Teresa; Ritter, Gunter. A robust method for cluster analysis. Ann. Statist. 33 (2005), no. 1, 347–380. doi:10.1214/009053604000000940. https://projecteuclid.org/euclid.aos/1112967709


