Abstract
In cancer research, clustering techniques are widely used for exploratory analyses, playing a critical role in the identification of novel cancer subtypes and in patient management. As the volume of data collected by multiple research groups grows, it is increasingly feasible to investigate the replicability of clustering procedures, that is, their ability to consistently recover biologically meaningful clusters across several data sets. In this paper, we review methods for assessing the replicability of clustering analyses, and discuss a novel framework for evaluating cross-study clustering replicability, useful when two or more studies are available. Our approach can be applied to any clustering algorithm and can employ different measures of similarity between partitions to quantify replicability, both globally (i.e., for the whole sample) and locally (i.e., for individual clusters). Using experiments on synthetic and real gene expression data, we illustrate the usefulness of our procedure for evaluating whether the same clusters are identified consistently across a collection of data sets.
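To make the idea of global and local replicability concrete, the following is a minimal sketch and not the procedure developed in the paper. It assumes two synthetic studies, a plain k-means written in NumPy as the clustering algorithm, the Rand index as the global similarity between partitions, and a best-match Jaccard overlap as the local, per-cluster score. All function names (`kmeans`, `assign`, `rand_index`, `best_jaccard`) and the simulated data are hypothetical choices made for illustration.

```python
import numpy as np

def assign(X, centroids):
    """Label each row of X with the index of its nearest centroid."""
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means (illustrative stand-in for any clustering algorithm)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        labels = assign(X, centroids)
        for j in range(k):
            if np.any(labels == j):  # leave a centroid in place if its cluster is empty
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, assign(X, centroids)

def rand_index(a, b):
    """Global score: fraction of sample pairs on which two partitions agree."""
    same_a = a[:, None] == a[None, :]
    same_b = b[:, None] == b[None, :]
    iu = np.triu_indices(len(a), k=1)
    return float((same_a[iu] == same_b[iu]).mean())

def best_jaccard(a, b):
    """Local score: for each cluster of a, its best Jaccard overlap with a cluster of b."""
    scores = {}
    for ca in np.unique(a):
        in_a = a == ca
        scores[int(ca)] = max(
            float((in_a & (b == cb)).sum() / (in_a | (b == cb)).sum())
            for cb in np.unique(b)
        )
    return scores

# Two simulated "studies" drawn from the same three-component mixture (an assumption).
rng = np.random.default_rng(1)
means = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]])
study_a = np.vstack([m + rng.normal(size=(40, 2)) for m in means])
study_b = np.vstack([m + rng.normal(size=(40, 2)) for m in means])

# Cluster each study on its own, then transfer study A's centroids to study B.
centroids_a, _ = kmeans(study_a, k=3)
_, own_b = kmeans(study_b, k=3)
transferred_b = assign(study_b, centroids_a)

global_score = rand_index(own_b, transferred_b)    # whole-sample agreement
local_scores = best_jaccard(own_b, transferred_b)  # per-cluster agreement
print(global_score, local_scores)
```

Both scores compare two partitions of the same samples, so neither depends on how the clusters happen to be labeled. The Rand index is used here only because it is simple to write down; a chance-corrected measure such as the adjusted Rand index could be substituted without changing the structure of the sketch.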
Funding Statement
The fifth author has been supported by NIH Grant 5R01LM013352-02 and NSF Grant 2113707.
Citation
Lorenzo Masoero, Emma Thomas, Giovanni Parmigiani, Svitlana Tyekucheva, Lorenzo Trippa. "Cross-Study Replicability in Cluster Analysis." Statist. Sci. 38 (2) 303-316, May 2023. https://doi.org/10.1214/22-STS871