Abstract
Case-control experiments are essential to the scientific method, as they allow researchers to test biological hypotheses by looking for differences in outcome between cases and controls. It is then of interest to characterize variation that is enriched in a “foreground” (case) dataset relative to a “background” (control) dataset. For example, in a genomics context, the goal is to identify low-dimensional transcriptional structure unique to patients with a certain disease (cases) vs. those without that disease (controls). In this work we propose probabilistic contrastive principal component analysis (PCPCA), a probabilistic dimension reduction method designed for case-control data. We describe inference in PCPCA through a contrastive likelihood and show that our model generalizes PCA, probabilistic PCA, and contrastive PCA. We discuss how to set the tuning parameter in theory and in practice, and we show several of PCPCA’s advantages over related methods in the analysis of case-control data, including greater interpretability, uncertainty quantification and principled inference, robustness to noise and missing data, and the ability to generate “foreground-enriched” data from the model. We demonstrate PCPCA’s performance on case-control data through a series of simulations, and we successfully identify variation specific to case data in genomic case-control experiments across several data modalities, including gene expression, protein expression, and images.
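To make the foreground/background idea concrete, the following is a minimal sketch of contrastive PCA, one of the methods the abstract says PCPCA generalizes. It is not the PCPCA model or its contrastive likelihood. It assumes the common formulation of contrastive PCA as the leading eigenvectors of the foreground covariance minus a weighted background covariance. The function name `contrastive_pca`, the weight name `alpha`, and the synthetic data are illustrative assumptions, not taken from the paper.

```python
import numpy as np


def contrastive_pca(foreground, background, alpha=1.0, n_components=2):
    """Illustrative contrastive PCA sketch (hypothetical helper, not PCPCA).

    Returns the leading eigenvectors and eigenvalues of
    C_fg - alpha * C_bg, where C_fg and C_bg are sample covariances.
    Rows are samples, columns are features.
    """
    # Center each dataset on its own mean.
    fg = foreground - foreground.mean(axis=0)
    bg = background - background.mean(axis=0)

    # Sample covariance matrices (features x features).
    c_fg = fg.T @ fg / fg.shape[0]
    c_bg = bg.T @ bg / bg.shape[0]

    # Contrast matrix: foreground variance minus weighted background variance.
    contrast = c_fg - alpha * c_bg

    # eigh returns eigenvalues in ascending order for symmetric matrices.
    eigvals, eigvecs = np.linalg.eigh(contrast)
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, order], eigvals[order]


# Synthetic example (assumed data): both datasets share large variance along
# feature 0, while only the foreground has extra variance along feature 1.
rng = np.random.default_rng(0)
background = rng.normal(size=(2000, 3)) * np.array([5.0, 1.0, 1.0])
foreground = rng.normal(size=(2000, 3)) * np.array([5.0, 4.0, 1.0])

# alpha = 0 ignores the background, so this is ordinary PCA on the foreground.
pca_dirs, _ = contrastive_pca(foreground, background, alpha=0.0, n_components=1)

# alpha = 1 subtracts the shared variance, leaving the foreground-specific axis.
cpca_dirs, _ = contrastive_pca(foreground, background, alpha=1.0, n_components=1)

print("PCA direction: ", np.round(np.abs(pca_dirs[:, 0]), 2))
print("cPCA direction:", np.round(np.abs(cpca_dirs[:, 0]), 2))
```

In this toy setting, setting the weight to zero recovers ordinary PCA, which follows the shared high-variance feature, while a positive weight shifts the leading direction toward the feature that varies only in the foreground. This mirrors the role of the tuning parameter mentioned in the abstract, without reproducing the paper's probabilistic treatment.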
Funding Statement
DL was funded by NIH/NCATS award UL1 TR002489, NIH/NHLBI award R01 HL149683, and NIH/NIEHS award P30 ES010126. AJ and BEE were funded by Helmsley Trust grant AWD1006624, NIH NCI 5U2CCA233195, NIH NHGRI R01 HG012967, and NSF CAREER AWD1005627. BEE is a CIFAR Fellow in the Multiscale Human Program.
Acknowledgments
DL and AJ contributed equally to this paper. BEE is also affiliated with the Department of Biomedical Data Science and, by courtesy, Statistics at Stanford University.
Citation
Didong Li, Andrew Jones, Barbara Engelhardt. "Probabilistic contrastive dimension reduction for case-control study data." Ann. Appl. Stat. 18 (3) 2207-2229, September 2024. https://doi.org/10.1214/24-AOAS1877
Information