Stabilizing variable selection and regression
Niklas Pfister, Evan G. Williams, Jonas Peters, Ruedi Aebersold, Peter Bühlmann
Ann. Appl. Stat. 15(3): 1220-1246 (September 2021). DOI: 10.1214/21-AOAS1487

Abstract

We consider regression in which one predicts a response Y with a set of predictors X across different experiments or environments. This is a common setup in many data-driven scientific fields, and we argue that statistical inference can benefit from an analysis that takes into account the distributional changes across environments. In particular, it is useful to distinguish between stable and unstable predictors, that is, predictors which have a fixed or a changing functional dependence on the response, respectively. We introduce stabilized regression, which explicitly enforces stability and thus improves generalization performance to previously unseen environments. Our work is motivated by an application in systems biology. Using multiomic data, we demonstrate how hypothesis generation about gene function can benefit from stabilized regression. We believe that a similar line of argument for exploiting heterogeneity in data can be powerful for many other applications as well. We draw a theoretical connection between multi-environment regression and causal models, which allows us to graphically characterize stable versus unstable functional dependence on the response. Formally, we introduce the notion of a stable blanket, which is a subset of the predictors that lies between the direct causal predictors and the Markov blanket. We prove that this set is optimal in the sense that a regression based on these predictors minimizes the mean squared prediction error, given that the resulting regression generalizes to unseen new environments.
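The distinction between stable and unstable predictors can be illustrated with a small simulation. The sketch below is not the stabilized regression procedure from the article; it is a toy example under assumed settings (the data-generating model, the coefficients, and all function names are hypothetical) that fits a simple least-squares slope per environment and compares how much that slope varies across environments.

```python
import random


def ols_slope(x, y):
    """Least-squares slope of y on x (simple regression with intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def simulate_environment(rng, n, unstable_coef):
    """Toy environment (assumed model): Y = 2*X1 + unstable_coef*X2 + noise.

    X1 keeps the same coefficient in every environment (stable), while
    the coefficient of X2 differs between environments (unstable).
    """
    x1 = [rng.gauss(0, 1) for _ in range(n)]
    x2 = [rng.gauss(0, 1) for _ in range(n)]
    y = [2.0 * a + unstable_coef * b + rng.gauss(0, 0.1)
         for a, b in zip(x1, x2)]
    return x1, x2, y


def slope_range_across_envs(envs, index):
    """Range (max minus min) of the per-environment slope of Y on one predictor."""
    slopes = [ols_slope(env[index], env[2]) for env in envs]
    return max(slopes) - min(slopes)


rng = random.Random(0)
# Three environments in which only the X2 coefficient changes.
envs = [simulate_environment(rng, 2000, c) for c in (-1.0, 0.5, 2.0)]

spread_x1 = slope_range_across_envs(envs, 0)  # small: dependence is fixed
spread_x2 = slope_range_across_envs(envs, 1)  # large: dependence changes
print(f"slope range for X1: {spread_x1:.3f}")
print(f"slope range for X2: {spread_x2:.3f}")
```

In this toy setting the slope for X1 stays close to 2 in every environment, whereas the slope for X2 tracks the environment-specific coefficient, so a model relying on X2 would be expected to generalize poorly to an unseen environment.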

Funding Statement

The first and fifth authors were supported by the European Research Council (CausalStats: AdG grant 786461). The second author was supported by an NIH F32 Ruth Kirschstein Fellowship (F32GM119190). The third author was supported by VILLUM FONDEN research grant 18968 and the Carlsberg Foundation. The fourth author was supported by the European Research Council (Proteomics4D: AdG grant 670821; Proteomics v3.0: AdG grant 233226) and by the Swiss National Science Foundation (31003A-140780, 31003A-143914, CSRII3-136201, 310030E_173572 and 310030B_185390).

Acknowledgements

We thank Yuansi Chen and Nicolai Meinshausen for helpful discussions. Most of this work was done while NP and EGW were at ETH Zürich.

Citation

Niklas Pfister, Evan G. Williams, Jonas Peters, Ruedi Aebersold, Peter Bühlmann. "Stabilizing variable selection and regression." Ann. Appl. Stat. 15(3): 1220-1246, September 2021. https://doi.org/10.1214/21-AOAS1487

Information

Received: 1 November 2019; Revised: 1 January 2021; Published: September 2021
First available in Project Euclid: 23 September 2021

MathSciNet: MR4317406
zbMATH: 1478.62060
Digital Object Identifier: 10.1214/21-AOAS1487

Keywords: causality, multiomic data, regression, variable selection

Rights: Copyright © 2021 Institute of Mathematical Statistics

JOURNAL ARTICLE
27 PAGES
