## The Annals of Statistics

### Generalized cluster trees and singular measures

Yen-Chi Chen

#### Abstract

In this paper we study the $\alpha$-cluster tree ($\alpha$-tree) under both singular and nonsingular measures. The $\alpha$-tree uses probability contents within a set created by the ordering of points to construct a cluster tree so that it is well defined even for singular measures. We first derive the convergence rate for a density level set around critical points, which leads to the convergence rate for estimating an $\alpha$-tree under nonsingular measures. For singular measures, we study how the kernel density estimator (KDE) behaves and prove that the KDE is not uniformly consistent but pointwise consistent after rescaling. We further prove that the estimated $\alpha$-tree fails to converge in the $L_{\infty }$ metric but is still consistent under the integrated distance. We also observe a new type of critical points—the dimensional critical points (DCPs)—of a singular measure. DCPs are points that contribute to cluster tree topology but cannot be defined using density gradient. Building on the analysis of the KDE and DCPs, we prove the topological consistency of an estimated $\alpha$-tree.

#### Article information

Source
Ann. Statist., Volume 47, Number 4 (2019), 2174-2203.

Dates
Revised: February 2018
First available in Project Euclid: 21 May 2019

https://projecteuclid.org/euclid.aos/1558425642

Digital Object Identifier
doi:10.1214/18-AOS1744

Mathematical Reviews number (MathSciNet)
MR3953448

Zentralblatt MATH identifier
07082283

Subjects
Primary: 62G20: Asymptotic properties
Secondary: 62G05: Estimation 62G07: Density estimation

#### Citation

Chen, Yen-Chi. Generalized cluster trees and singular measures. Ann. Statist. 47 (2019), no. 4, 2174--2203. doi:10.1214/18-AOS1744. https://projecteuclid.org/euclid.aos/1558425642

#### References

• Balakrishnan, S., Narayanan, S., Rinaldo, A., Singh, A. and Wasserman, L. (2012). Cluster trees on manifolds. In Advances in Neural Information Processing Systems (C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani and K. Q. Weinberger, eds.) 26 2679–2687. Curran Associates, Red Hook, NY.
• Baryshnikov, Y., Bubenik, P. and Kahle, M. (2014). Min-type Morse theory for configuration spaces of hard spheres. Int. Math. Res. Not. IMRN 9 2577–2592.
• Bobrowski, O., Mukherjee, S. and Taylor, J. E. (2017). Topological consistency via kernel estimation. Bernoulli 23 288–328.
• Bubenik, P. (2015). Statistical topological data analysis using persistence landscapes. J. Mach. Learn. Res. 16 77–102.
• Cadre, B. (2006). Kernel estimation of density level sets. J. Multivariate Anal. 97 999–1023.
• Cadre, B., Pelletier, B. and Pudlo, P. (2009). Clustering by estimation of density level sets at a fixed probability. Available at https://hal.archives-ouvertes.fr/file/index/docid/397437/filename/tlevel.pdf.
• Carlsson, G. (2009). Topology and data. Bull. Amer. Math. Soc. (N.S.) 46 255–308.
• Chaudhuri, K. and Dasgupta, S. (2010). Rates of convergence for the cluster tree. In Advances in Neural Information Processing Systems 343–351.
• Chaudhuri, K., Dasgupta, S., Kpotufe, S. and von Luxburg, U. (2014). Consistent procedures for cluster tree estimation and pruning. IEEE Trans. Inform. Theory 60 7900–7912.
• Chazal, F., Fasy, B., Lecci, F., Michel, B., Rinaldo, A. and Wasserman, L. (2017). Robust topological inference: Distance to a measure and kernel distance. J. Mach. Learn. Res. 18 Paper No. 159, 40.
• Chen, Y.-C. (2017). A tutorial on kernel density estimation and recent advances. ArXiv preprint. Available at arXiv:1704.03924.
• Chen, Y.-C. (2019). Supplement to “Generalized cluster trees and singular measures”. DOI:10.1214/18-AOS1744SUPP.
• Chen, Y.-C. and Dobra, A. (2017). Measuring human activity spaces with density ranking based on GPS data. ArXiv preprint. Available at arXiv:1708.05017.
• Chen, Y.-C., Genovese, C. R. and Wasserman, L. (2015). Asymptotic theory for density ridges. Ann. Statist. 43 1896–1928.
• Chen, Y.-C., Genovese, C. R. and Wasserman, L. (2016). A comprehensive approach to mode clustering. Electron. J. Stat. 10 210–241.
• Chen, Y.-C., Genovese, C. R. and Wasserman, L. (2017). Density level sets: Asymptotics, inference, and visualization. J. Amer. Statist. Assoc. 112 1684–1696.
• Chen, Y.-C., Kim, J., Balakrishnan, S., Rinaldo, A. and Wasserman, L. (2016). Statistical inference for cluster trees. ArXiv preprint. Available at arXiv:1605.06416.
• Cohen-Steiner, D., Edelsbrunner, H. and Harer, J. (2007). Stability of persistence diagrams. Discrete Comput. Geom. 37 103–120.
• Edelsbrunner, H. and Harer, J. (2008). Persistent homology—A survey. In Surveys on Discrete and Computational Geometry. Contemp. Math. 453 257–282. Amer. Math. Soc., Providence, RI.
• Edelsbrunner, H. and Morozov, D. (2013). Persistent homology: Theory and practice. In European Congress of Mathematics 31–50. Eur. Math. Soc., Zürich.
• Einmahl, U. and Mason, D. M. (2005). Uniform in bandwidth consistency of kernel-type function estimators. Ann. Statist. 33 1380–1403.
• Eldridge, J., Belkin, M. and Wang, Y. (2015). Beyond hartigan consistency: Merge distortion metric for hierarchical clustering. In Proceedings of the 28th Conference on Learning Theory 588–606.
• Fasy, B. T., Lecci, F., Rinaldo, A., Wasserman, L., Balakrishnan, S. and Singh, A. (2014). Confidence sets for persistence diagrams. Ann. Statist. 42 2301–2339.
• Federer, H. (1959). Curvature measures. Trans. Amer. Math. Soc. 93 418–491.
• Friedman, G. (2014). Singular intersection homology. Preprint.
• Genovese, C. R., Perone-Pacifico, M., Verdinelli, I. and Wasserman, L. (2009). On the path density of a gradient field. Ann. Statist. 37 3236–3271.
• Genovese, C. R., Perone-Pacifico, M., Verdinelli, I. and Wasserman, L. (2014). Nonparametric ridge estimation. Ann. Statist. 42 1511–1545.
• Giné, E. and Guillou, A. (2002). Rates of strong uniform consistency for multivariate kernel density estimators. Ann. Inst. Henri Poincaré B, Probab. Stat. 38 907–921.
• Goresky, M. and MacPherson, R. (1980). Intersection homology theory. Topology 19 135–162.
• Goresky, M. and MacPherson, R. (1988). Stratified Morse Theory. Ergebnisse der Mathematik und Ihrer Grenzgebiete (3) [Results in Mathematics and Related Areas (3)] 14. Springer, Berlin.
• Hartigan, J. A. (1981). Consistency of single linkage for high-density clusters. J. Amer. Statist. Assoc. 76 388–394.
• Kent, B. P. (2013). Level set trees for applied statistics. Ph.D. thesis, Carnegie Mellon Univ., Pittsburgh, PA.
• Klemelä, J. (2004). Visualization of multivariate density estimates with level set trees. J. Comput. Graph. Statist. 13 599–620.
• Klemelä, J. (2006). Visualization of multivariate density estimates with shape trees. J. Comput. Graph. Statist. 15 372–397.
• Klemelä, J. (2009). Smoothing of Multivariate Data: Density Estimation and Visualization. Wiley, Hoboken, NJ.
• Kpotufe, S. and Luxburg, U. V. (2011). Pruning nearest neighbor cluster trees. In Proceedings of the 28th International Conference on Machine Learning (ICML-11) (L. Getoor and T. Scheffer, eds.) 225–232. International Machine Learning Society, Madison, WI.
• Laloe, T. and Servien, R. (2013). Nonparametric estimation of regression level sets. J. Korean Statist. Soc.
• Lee, J. M. (2013). Introduction to Smooth Manifolds, 2nd ed. Graduate Texts in Mathematics 218. Springer, New York.
• Mammen, E. and Polonik, W. (2013). Confidence regions for level sets. J. Multivariate Anal. 122 202–214.
• Mason, D. M. and Polonik, W. (2009). Asymptotic normality of plug-in level set estimates. Ann. Appl. Probab. 19 1108–1142.
• Mattila, P. (1995). Geometry of Sets and Measures in Euclidean Spaces: Fractals and Rectifiability. Cambridge Studies in Advanced Mathematics 44. Cambridge Univ. Press, Cambridge.
• Milnor, J. (1963). Morse Theory. Based on Lecture Notes by M. Spivak and R. Wells. Annals of Mathematics Studies, No. 51. Princeton Univ. Press, Princeton, NJ.
• Molchanov, I. S. (1990). Empirical estimation of quantiles of distributions of random closed sets. Teor. Veroyatn. Primen. 35 586–592.
• Morse, M. (1925). Relations between the critical points of a real function of $n$ independent variables. Trans. Amer. Math. Soc. 27 345–396.
• Morse, M. (1930). The foundations of a theory of the calculus of variations in the large in $m$-space. II. Trans. Amer. Math. Soc. 32 599–631.
• Polonik, W. (1995). Measuring mass concentrations and estimating density contour clusters—An excess mass approach. Ann. Statist. 23 855–881.
• Preiss, D. (1987). Geometry of measures in $\textbf{R}^{n}$: Distribution, rectifiability, and densities. Ann. of Math. (2) 125 537–643.
• Rinaldo, A. and Wasserman, L. (2010). Generalized density clustering. Ann. Statist. 38 2678–2722.
• Rinaldo, A., Singh, A., Nugent, R. and Wasserman, L. (2012). Stability of density-based clustering. J. Mach. Learn. Res. 13 905–948.
• Scott, D. W. (2015). Multivariate Density Estimation: Theory, Practice, and Visualization, 2nd ed. Wiley, Hoboken, NJ.
• Singh, A., Scott, C. and Nowak, R. (2009). Adaptive Hausdorff estimation of density level sets. Ann. Statist. 37 2760–2782.
• Steinwart, I. (2011). Adaptive density level set clustering. In COLT 703–738.
• Stuetzle, W. (2003). Estimating the cluster type of a density by analyzing the minimal spanning tree of a sample. J. Classification 20 25–47.
• Tsybakov, A. B. (1997). On nonparametric estimation of density level sets. Ann. Statist. 25 948–969.
• Tu, L. W. (2008). An Introduction to Manifolds. Springer, New York.
• Walther, G. (1997). Granulometric smoothing. Ann. Statist. 25 2273–2299.
• Wasserman, L. (2006). All of Nonparametric Statistics. Springer Texts in Statistics. Springer, New York.
• Wasserman, L. (2018). Topological data analysis. Ann. Rev. Stat. Appl. 5 501–535.

#### Supplemental materials

• Supplementary proofs: Generalized cluster trees and singular measures. This document contains all proofs to the theorems and lemmas in this paper.