## Electronic Journal of Statistics

### Marginal integration for nonparametric causal inference

#### Abstract

We consider the problem of inferring the total causal effect of an intervention on a single continuous variable on a response variable of interest. We propose a marginal integration regression technique for a very general class of potentially nonlinear structural equation models (SEMs) with known structure, or at least a known superset of the adjustment variables; we call the procedure S-mint regression. We show that it achieves the convergence rate of nonparametric regression: for example, single variable intervention effects can be estimated at rate $n^{-2/5}$ assuming twice differentiable functions. This result can also be viewed as a major robustness property with respect to model misspecification, going far beyond the notion of double robustness. Furthermore, when the structure of the SEM is not known, we can estimate (the equivalence class of) the directed acyclic graph corresponding to the SEM and then apply S-mint based on these estimates. We empirically compare S-mint regression with more classical approaches and argue that the former is indeed more robust, more reliable, and substantially simpler.
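The idea behind marginal integration over adjustment variables can be sketched as follows: fit a nonparametric regression $m(x, s) = E[Y \mid X = x, S = s]$ on the treatment $X$ and the adjustment variables $S$, then average $m(x, S_i)$ over the empirical distribution of $S$ to estimate $E[Y \mid do(X = x)]$. The Python sketch below is purely illustrative and is not the authors' implementation: the Nadaraya–Watson smoother, the bandwidths, the function names `nw_regression` and `s_mint`, and the simulated SEM are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated nonlinear SEM: S confounds the treatment X and the response Y.
n = 2000
S = rng.normal(size=n)
X = np.sin(S) + 0.5 * rng.normal(size=n)
Y = X ** 2 + S + 0.3 * rng.normal(size=n)

def nw_regression(x, s, X_obs, S_obs, Y_obs, hx=0.3, hs=0.3):
    """Nadaraya-Watson estimate of m(x, s) = E[Y | X = x, S = s]."""
    w = np.exp(-0.5 * (((X_obs - x) / hx) ** 2 + ((S_obs - s) / hs) ** 2))
    return np.sum(w * Y_obs) / np.sum(w)

def s_mint(x, X_obs, S_obs, Y_obs):
    """Marginal integration: average m(x, S_i) over the empirical
    distribution of S to estimate E[Y | do(X = x)]."""
    return np.mean([nw_regression(x, s_i, X_obs, S_obs, Y_obs) for s_i in S_obs])

# In this SEM, E[Y | do(X = x)] = x**2 + E[S] = x**2, so the estimate
# at x = 1 should be close to 1.
effect_at_1 = s_mint(1.0, X, S, Y)
```

Note that only the joint regression $m(x, s)$ and the marginal averaging step are needed; no propensity-type model for $X$ given $S$ enters, which is one way to see the robustness to misspecification discussed in the abstract.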

#### Article information

**Source**
Electron. J. Statist., Volume 9, Number 2 (2015), 3155–3194.

**Dates**
First available in Project Euclid: 25 January 2016

https://projecteuclid.org/euclid.ejs/1453730084

**Digital Object Identifier**
doi:10.1214/15-EJS1075

**Mathematical Reviews number (MathSciNet)**
MR3453973

**Zentralblatt MATH identifier**
1330.62171

**Subjects**
Primary: 62G05: Estimation; 62H12: Estimation

#### Citation

Ernest, Jan; Bühlmann, Peter. Marginal integration for nonparametric causal inference. Electron. J. Statist. 9 (2015), no. 2, 3155–3194. doi:10.1214/15-EJS1075. https://projecteuclid.org/euclid.ejs/1453730084

#### References

• Bang, H. and Robins, J. (2005). Doubly robust estimation in missing data and causal inference models. Biometrics, 61:962–972.
• Bollen, K. A. (1998). Structural equation models. Wiley Online Library.
• Bühlmann, P. (2013). Causal statistical inference in high dimensions. Mathematical Methods of Operations Research, 77:357–370.
• Bühlmann, P. and Hothorn, T. (2007). Boosting algorithms: regularization, prediction and model fitting (with discussion). Statistical Science, 22:477–505.
• Bühlmann, P., Peters, J., and Ernest, J. (2014). CAM: Causal additive models, high-dimensional order search and penalized regression. Annals of Statistics, 42:2526–2556.
• Bühlmann, P. and Yu, B. (2003). Boosting with the $L_2$ loss: regression and classification. Journal of the American Statistical Association, 98:324–339.
• Chickering, D. (2002). Optimal structure identification with greedy search. Journal of Machine Learning Research, 3:507–554.
• Colombo, D., Maathuis, M., Kalisch, M., and Richardson, T. (2012). Learning high-dimensional directed acyclic graphs with latent and selection variables. Annals of Statistics, 40:294–321.
• Dawid, A. P. (2000). Causal inference without counterfactuals. Journal of the American Statistical Association, 95:407–424.
• Editorial (2010). Cause and effect. Nature Methods, 7:243.
• Fan, J., Härdle, W., and Mammen, E. (1998). Direct estimation of low-dimensional components in additive models. Annals of Statistics, 26:943–971.
• Friedman, J. (2001). Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29:1189–1232.
• Friedman, N. (2004). Inferring cellular networks using probabilistic graphical models. Science, 303:799–805.
• Greenland, S., Pearl, J., and Robins, J. M. (1999). Causal diagrams for epidemiologic research. Epidemiology, 10:37–48.
• Hall, P. and Marron, J. (1987). Estimation of integrated squared density derivatives. Statistics & Probability Letters, 6:109–115.
• Hampel, F., Ronchetti, E., Rousseeuw, P., and Stahel, W. (2011). Robust statistics: the approach based on influence functions. John Wiley & Sons.
• Hauser, A. and Bühlmann, P. (2012). Characterization and greedy learning of interventional Markov equivalence classes of directed acyclic graphs. Journal of Machine Learning Research, 13:2409–2464.
• Hauser, A. and Bühlmann, P. (2014). Two optimal strategies for active learning of causal models from interventional data. International Journal of Approximate Reasoning, 55:926–939.
• Hauser, A. and Bühlmann, P. (2015). Jointly interventional and observational data: estimation of interventional Markov equivalence classes of directed acyclic graphs. Journal of the Royal Statistical Society, Series B, 77:291–318.
• He, Y.-B. and Geng, Z. (2008). Active learning of causal networks with intervention experiments and optimal designs. Journal of Machine Learning Research, 9:2523–2547.
• Horowitz, J., Klemelä, J., and Mammen, E. (2006). Optimal estimation in additive regression models. Bernoulli, 12:271–298.
• Hoyer, P., Janzing, D., Mooij, J., Peters, J., and Schölkopf, B. (2009). Nonlinear causal discovery with additive noise models. In Advances in Neural Information Processing Systems 21, 22nd Annual Conference on Neural Information Processing Systems (NIPS 2008), pages 689–696.
• Husmeier, D. (2003). Sensitivity and specificity of inferring genetic regulatory interactions from microarray experiments with dynamic Bayesian networks. Bioinformatics, 19:2271–2282.
• Imoto, S., Goto, T., and Miyano, S. (2002). Estimation of genetic networks and functional structures between genes by using Bayesian network and nonparametric regression. In Proceedings of the Pacific Symposium on Biocomputing (PSB-2002), volume 7, pages 175–186.
• Kalisch, M. and Bühlmann, P. (2007). Estimating high-dimensional directed acyclic graphs with the PC-algorithm. Journal of Machine Learning Research, 8:613–636.
• Koller, D. and Friedman, N. (2009). Probabilistic graphical models: principles and techniques. MIT Press.
• Lauritzen, S. and Spiegelhalter, D. (1988). Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society, Series B, 50:157–224.
• Li, L., Tchetgen, E. T., van der Vaart, A., and Robins, J. (2011). Higher order inference on a treatment effect under low regularity conditions. Statistics & Probability Letters, 81:821–828.
• Linton, O. and Nielsen, J. P. (1995). A kernel method of estimating structured nonparametric regression based on marginal integration. Biometrika, 82:93–100.
• Loh, P. and Bühlmann, P. (2014). High-dimensional learning of linear causal networks via inverse covariance estimation. Journal of Machine Learning Research, 15:3065–3105.
• Maathuis, M., Colombo, D., Kalisch, M., and Bühlmann, P. (2010). Predicting causal effects in large-scale systems from observational data. Nature Methods, 7:247–248.
• Maathuis, M., Kalisch, M., and Bühlmann, P. (2009). Estimating high-dimensional intervention effects from observational data. Annals of Statistics, 37:3133–3164.
• Marzio, M. D. and Taylor, C. (2008). On boosting kernel regression. Journal of Statistical Planning and Inference, 138:2483–2498.
• Meinshausen, N. and Bühlmann, P. (2010). Stability selection (with discussion). Journal of the Royal Statistical Society, Series B, 72:417–473.
• Nowzohour, C. and Bühlmann, P. (2015). Score-based causal learning in additive noise models. Statistics. Published online, doi:10.1080/02331888.2015.1060237.
• Pearl, J. (2000). Causality: models, reasoning and inference. Cambridge Univ. Press.
• Peters, J. and Bühlmann, P. (2014). Identifiability of Gaussian structural equation models with equal error variances. Biometrika, 101:219–228.
• Peters, J., Mooij, J., Janzing, D., and Schölkopf, B. (2014). Causal discovery with continuous additive noise models. Journal of Machine Learning Research, 15:2009–2053.
• Polzehl, J. and Spokoiny, V. (2000). Adaptive weights smoothing with applications to image restoration. Journal of the Royal Statistical Society, Series B, 62:335–354.
• Robins, J., Rotnitzky, A., and Zhao, L. (1994). Estimation of regression coefficients when some of the regressors are not always observed. Journal of the American Statistical Association, 89:846–866.
• Robins, J., Tchetgen, E. T., Li, L., and van der Vaart, A. (2009). Semiparametric minimax rates. Electronic Journal of Statistics, 3:1305–1321.
• Rosenbaum, P. and Rubin, D. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70:41–55.
• Rubin, D. B. (2005). Causal inference using potential outcomes. Journal of the American Statistical Association, 100:322–331.
• Scharfstein, D., Rotnitzky, A., and Robins, J. (1999). Adjusting for nonignorable drop-out using semiparametric nonresponse models (with discussion). Journal of the American Statistical Association, 94:1096–1146.
• Schmidt, M., Niculescu-Mizil, A., and Murphy, K. (2007). Learning graphical model structure using L1-regularization paths. In Proceedings of the National Conference on Artificial Intelligence, volume 22, page 1278. AAAI Press.
• Shimizu, S., Hoyer, P., Hyvärinen, A., and Kerminen, A. (2006). A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7:2003–2030.
• Shojaie, A. and Michailidis, G. (2010). Penalized likelihood methods for estimation of sparse high-dimensional directed acyclic graphs. Biometrika, 97:519–538.
• Shpitser, I., Richardson, T. S., and Robins, J. M. (2011). An efficient algorithm for computing interventional distributions in latent variable causal models. In Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence (UAI), pages 661–670.
• Smith, V. A., Jarvis, E. D., and Hartemink, A. J. (2002). Evaluating functional network inference using simulations of complex biological systems. Bioinformatics, 18(suppl 1):S216–S224.
• Song, L., Fukumizu, K., and Gretton, A. (2013). Kernel embeddings of conditional distributions: A unified kernel framework for nonparametric inference in graphical models. IEEE Signal Processing Magazine, 30:98–111.
• Spirtes, P. (2010). Introduction to causal inference. Journal of Machine Learning Research, 11:1643–1662.
• Spirtes, P., Glymour, C., and Scheines, R. (2000). Causation, Prediction, and Search. MIT Press, second edition.
• Stekhoven, D., Moraes, I., Sveinbjörnsson, G., Hennig, L., Maathuis, M., and Bühlmann, P. (2012). Causal stability ranking. Bioinformatics, 28:2819–2823.
• Teyssier, M. and Koller, D. (2005). Ordering-based search: a simple and effective algorithm for learning Bayesian networks. In Proceedings of the 21st Conference on Uncertainty in Artificial Intelligence (UAI), pages 584–590, Edinburgh, Scotland, UK.
• van de Geer, S. (2014). On the uniform convergence of empirical norms and inner products, with application to causal inference. Electronic Journal of Statistics, 8:543–574.
• van de Geer, S. and Bühlmann, P. (2013). $\ell_0$-penalized maximum likelihood for sparse directed acyclic graphs. Annals of Statistics, 41:536–567.
• van der Laan, M. J. and Robins, J. M. (2003). Unified methods for censored longitudinal data and causality. Springer.
• van der Laan, M. J. and Rose, S. (2011). Targeted Learning. Causal Inference for Observational and Experimental Data. Springer, New York.
• Wille, A., Zimmermann, P., Vranová, E., Fürholz, A., Laule, O., Bleuler, S., Hennig, L., Prelic, A., von Rohr, P., Thiele, L., et al. (2004). Sparse graphical Gaussian modeling of the isoprenoid gene network in Arabidopsis thaliana. Genome Biology, 5(11):R92.
• Wood, S. (2006). Generalized Additive Models: An Introduction with R. Chapman and Hall/CRC.
• Wood, S. N. (2003). Thin-plate regression splines. Journal of the Royal Statistical Society, Series B, 65:95–114.
• Yu, J., Smith, V. A., Wang, P. P., Hartemink, A. J., and Jarvis, E. D. (2004). Advances to Bayesian network inference for generating causal networks from observational biological data. Bioinformatics, 20:3594–3603.
• Zhang, J. (2008). On the completeness of orientation rules for causal discovery in the presence of latent confounders and selection bias. Artificial Intelligence, 172:1873–1896.