Statistical Science

Automated versus Do-It-Yourself Methods for Causal Inference: Lessons Learned from a Data Analysis Competition

Vincent Dorie, Jennifer Hill, Uri Shalit, Marc Scott, and Dan Cervone

Full-text: Open access


Statisticians have made great progress in creating methods that reduce our reliance on parametric assumptions. However, this explosion of research has produced a breadth of inferential strategies that both creates opportunities for more reliable inference and complicates the choices an applied researcher must make and defend. Relatedly, researchers advocating for a new method typically compare it to at best two or three other causal inference strategies, using simulations that may or may not be designed to tease out flaws in all the competing methods equally. The causal inference data analysis challenge, “Is Your SATT Where It’s At?”, launched as part of the 2016 Atlantic Causal Inference Conference, sought to make progress on both of these issues: the researchers who created the data testing grounds were distinct from the researchers who submitted the methods to be evaluated. Results from 30 competitors across the two versions of the competition (black-box algorithms and do-it-yourself analyses) are presented, along with post hoc analyses that reveal how characteristics of causal inference strategies and settings affect performance. The most consistent conclusion is that methods that flexibly model the response surface perform better overall than methods that do not. Finally, new methods are proposed that combine features of several of the top-performing submissions.
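The abstract's central finding concerns methods that flexibly model the response surface, i.e., the conditional mean of the outcome given treatment and covariates. A minimal sketch of that idea, using simulated data and scikit-learn's gradient boosting in place of the competition submissions (the data-generating process and all variable names below are illustrative, not from the competition):

```python
# Sketch: estimating the SATT (sample average treatment effect on the treated)
# by flexibly modeling the response surface, then imputing counterfactuals.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 3))
# Treatment assignment depends on a covariate, inducing confounding.
p = 1 / (1 + np.exp(-X[:, 0]))
z = rng.binomial(1, p)
# Nonlinear response surface with a constant treatment effect of 2.
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + 2 * z + rng.normal(scale=0.5, size=n)

# Fit the outcome on (treatment, covariates) jointly with a flexible learner.
model = GradientBoostingRegressor(random_state=0)
model.fit(np.column_stack([z, X]), y)

# For each treated unit, predict both potential outcomes and average the gap.
treated = X[z == 1]
y1 = model.predict(np.column_stack([np.ones(len(treated)), treated]))
y0 = model.predict(np.column_stack([np.zeros(len(treated)), treated]))
satt = float(np.mean(y1 - y0))
print(round(satt, 2))  # close to the true effect of 2
```

A rigid parametric model (e.g., linear regression without the quadratic and sine terms) fit to the same data would misattribute part of the nonlinear covariate signal to the treatment; the flexible learner largely avoids this, which is the pattern the competition results highlight.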

Article information

Statist. Sci., Volume 34, Number 1 (2019), 43–68.

First available in Project Euclid: 12 April 2019


Keywords: causal inference; competition; machine learning; automated algorithms; evaluation


Dorie, Vincent; Hill, Jennifer; Shalit, Uri; Scott, Marc; Cervone, Dan. Automated versus Do-It-Yourself Methods for Causal Inference: Lessons Learned from a Data Analysis Competition. Statist. Sci. 34 (2019), no. 1, 43--68. doi:10.1214/18-STS667.



  • Abadie, A. and Imbens, G. W. (2006). Large sample properties of matching estimators for average treatment effects. Econometrica 74 235–267.
  • The H2O.ai team (2016). h2o: R Interface for H2O. R package version
  • Athanasopoulos, G. and Hyndman, R. J. (2011). The value of feedback in forecasting competitions. Int. J. Forecast. 27 845–849.
  • Athey, S. and Imbens, G. (2016). Recursive partitioning for heterogeneous causal effects. Proc. Natl. Acad. Sci. USA 113 7353–7360.
  • Austin, P. C., Grootendorst, P. and Anderson, G. M. (2007). A comparison of the ability of different propensity score models to balance measured variables between treated and untreated subjects: A Monte Carlo study. Stat. Med. 26 734–753.
  • Bang, H. and Robins, J. M. (2005). Doubly robust estimation in missing data and causal inference models. Biometrics 61 962–972.
  • Barnow, B. S., Cain, G. G. and Goldberger, A. S. (1980). Issues in the analysis of selectivity bias. In Evaluation Studies (E. Stromsdorfer and G. Farkas, eds.) 5 42–59. Sage, San Francisco, CA.
  • Breiman, L. (2001). Random forests. Mach. Learn. 45 5–32.
  • Carpenter, J. (2011). May the best analyst win. Science 331 698–699.
  • Chipman, H. A., George, E. I. and McCulloch, R. E. (2010). BART: Bayesian additive regression trees. Ann. Appl. Stat. 4 266–298.
  • Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. J. Abnorm. Soc. Psychol. 65 145–163.
  • Crump, R. K., Hotz, V. J., Imbens, G. W. and Mitnik, O. A. (2009). Dealing with limited overlap in estimation of average treatment effects. Biometrika 96 187–199.
  • Cuturi, M. (2013). Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems 2292–2300.
  • Dietterich, T. G. (2000). Ensemble methods in machine learning. In International Workshop on Multiple Classifier Systems 1–15. Springer, Berlin.
  • Ding, P. and Miratrix, L. (2014). To adjust or not to adjust? Sensitivity analysis of M-bias and butterfly-bias. J. Causal Inference 3 41–57.
  • Dorie, V., Harada, M., Carnegie, N. B. and Hill, J. (2016). A flexible, interpretable framework for assessing sensitivity to unmeasured confounding. Stat. Med. 35 3453–3470.
  • Dorie, V., Hill, J., Shalit, U., Scott, M. and Cervone, D. (2019). Supplement to “Automated versus do-it-yourself methods for causal inference: Lessons learned from a data analysis competition.” DOI:10.1214/18-STS667SUPP.
  • Enders, C. K. and Tofighi, D. (2007). Centering predictor variables in cross-sectional multilevel models: A new look at an old issue. Psychol. Methods 12 121–138.
  • Feller, A. and Holmes, C. C. (2009). Beyond toplines: Heterogeneous treatment effects in randomized experiments. Unpublished manuscript, Oxford Univ.
  • Gelman, A. and Hill, J. (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge Univ. Press, New York.
  • Greenland, S. and Robins, J. M. (1986). Identifiability, exchangeability, and epidemiological confounding. Int. J. Epidemiol. 15 413–419.
  • Greenland, S., Robins, J. M. and Pearl, J. (1999). Confounding and collapsibility in causal inference. Statist. Sci. 14 29–46.
  • Guyon, I., Aliferis, C. F., Cooper, G. F., Elisseeff, A., Pellet, J.-P., Spirtes, P. and Statnikov, A. R. (2008). Design and analysis of the causation and prediction challenge. In WCCI Causation and Prediction Challenge 1–33.
  • Haberman, S. J. (1984). Adjustment by minimum discriminant information. Ann. Statist. 12 971–988.
  • Hahn, J. (1998). On the role of the propensity score in efficient semiparametric estimation of average treatment effects. Econometrica 66 315–331.
  • Hahn, P. R., Murray, J. S. and Carvalho, C. (2017). Bayesian regression tree models for causal inference: Regularization, confounding, and heterogeneous effects. Preprint. Available at arXiv:1706.09523.
  • Hartman, E., Grieve, R., Ramsahai, R. and Sekhon, J. S. (2015). From sample average treatment effect to population average treatment effect on the treated: Combining experimental with observational studies to estimate population treatment effects. J. Roy. Statist. Soc. Ser. A 178 757–778.
  • Hill, J. (2008). Discussion of research using propensity-score matching: Comments on ‘A critical appraisal of propensity-score matching in the medical literature between 1996 and 2003’ by Peter Austin. Stat. Med. 27 2055–2061.
  • Hill, J. L. (2011). Bayesian nonparametric modeling for causal inference. J. Comput. Graph. Statist. 20 217–240.
  • Hill, J. L., Reiter, J. P. and Zanutto, E. L. (2004). A comparison of experimental and observational data analyses. In Applied Bayesian Modeling and Causal Inference from Incomplete-Data Perspectives (X.-L. Meng and A. Gelman, eds.) 49–60. Wiley, Chichester.
  • Hill, J. and Su, Y.-S. (2013). Assessing lack of common support in causal inference using Bayesian nonparametrics: Implications for evaluating the effect of breastfeeding on children’s cognitive outcomes. Ann. Appl. Stat. 7 1386–1420.
  • Hirano, K. and Imbens, G. W. (2001). Estimation of causal effects using propensity score weighting: An application of data on right ear catheterization. Health Serv. Outcomes Res. Methodol. 1 259–278.
  • Hirano, K., Imbens, G. W. and Ridder, G. (2003). Efficient estimation of average treatment effects using the estimated propensity score. Econometrica 71 1161–1189.
  • Imai, K. and Ratkovic, M. (2014). Covariate balancing propensity score. J. Roy. Statist. Soc. Ser. B 76 243–263.
  • Imbens, G. (2004). Nonparametric estimation of average treatment effects under exogeneity: A review. Rev. Econ. Stat. 86 4–29.
  • Kern, H. L., Stuart, E. A., Hill, J. L. and Green, D. P. (2016). Assessing methods for generalizing experimental impact estimates to target samples. J. Res. Educ. Eff. 9 103–127.
  • Kurth, T., Walker, A. M., Glynn, R. J., Chan, K. A., Gaziano, J. M., Berger, K. and Robins, J. M. (2006). Results of multivariable logistic regression, propensity matching, propensity adjustment, and propensity-based weighting under conditions of non-uniform effect. Am. J. Epidemiol. 163 262–270.
  • LaLonde, R. and Maynard, R. (1987). How precise are evaluations of employment and training programs: Evidence from a field experiment. Eval. Rev. 11 428–451.
  • Lechner, M. (2001). Identification and estimation of causal effects of multiple treatments under the conditional independence assumption. In Econometric Evaluation of Labour Market Policies (M. Lechner and F. Pfeiffer, eds.). ZEW Economic Studies 13 43–58. Physica-Verlag, Heidelberg.
  • LeDell, E. (2016). h2oEnsemble: H2O Ensemble Learning. R package version 0.1.8.
  • Lee, B. K., Lessler, J. and Stuart, E. A. (2010). Improving propensity score weighting using machine learning. Stat. Med. 29 337–346.
  • Little, R. J. (1988). Missing-data adjustments in large surveys. J. Bus. Econom. Statist. 6 287–296.
  • Middleton, J., Scott, M., Diakow, R. and Hill, J. (2016). Bias amplification and bias unmasking. Polit. Anal. 24 307–323.
  • Niswander, K. R. and Gordon, M. (1972). The Collaborative Perinatal Study of the National Institute of Neurological Diseases and Stroke: The Women and Their Pregnancies. W.B. Saunders, Philadelphia, PA.
  • Paulhamus, B., Ebaugh, A., Boylls, C., Bos, N., Hider, S. and Giguere, S. (2012). Crowdsourced cyber defense: Lessons from a large-scale, game-based approach to threat identification on a live network. In International Conference on Social Computing, Behavioral–Cultural Modeling, and Prediction 35–42. Springer, Berlin.
  • Pearl, J. (2009a). Causality: Models, Reasoning, and Inference, 2nd ed. Cambridge Univ. Press, Cambridge.
  • Pearl, J. (2009b). Causal inference in statistics: An overview. Stat. Surv. 3 96–146.
  • Pearl, J. (2010). On a class of bias-amplifying variables that endanger effect estimates. In Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence 425–432.
  • Polley, E., LeDell, E., Kennedy, C. and van der Laan, M. (2016). SuperLearner: Super Learner Prediction. R package version 2.0-21.
  • Ranard, B. L., Ha, Y. P., Meisel, Z. F., Asch, D. A., Hill, S. S., Becker, L. B., Seymour, A. K. and Merchant, R. M. (2014). Crowdsourcing—Harnessing the masses to advance health and medicine, a systematic review. J. Gen. Intern. Med. 29 187–203.
  • Rasmussen, C. E. and Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA.
  • Robins, J. M. (1999). Association, causation, and marginal structural models. Synthese 121 151–179.
  • Robins, J. M. and Rotnitzky, A. (2001). Comment on ‘Inference for semiparametric models: Some questions and an answer,’ by P. J. Bickel and J. Kwon. Statist. Sinica 11 920–936.
  • Rokach, L. (2009). Taxonomy for characterizing ensemble methods in classification tasks: A review and annotated bibliography. Comput. Statist. Data Anal. 53 4046–4072.
  • Rosenbaum, P. R. (1987). Model-based direct adjustment. J. Amer. Statist. Assoc. 82 387–394.
  • Rosenbaum, P. R. (2002). Observational Studies, 2nd ed. Springer, New York.
  • Rosenbaum, P. R. and Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika 70 41–55.
  • Rosenbaum, P. R. and Rubin, D. B. (1984). Reducing bias in observational studies using subclassification on the propensity score. J. Amer. Statist. Assoc. 79 516–524.
  • Rosenthal, R. (1979). The file drawer problem and tolerance for null results. Psychol. Bull. 86 638–641.
  • Rubin, D. B. (1978). Bayesian inference for causal effects: The role of randomization. Ann. Statist. 6 34–58.
  • Rubin, D. B. (2006). Matched Sampling for Causal Effects. Cambridge Univ. Press, Cambridge.
  • Scharfstein, D. O., Rotnitzky, A. and Robins, J. M. (1999). Adjusting for nonignorable drop-out using semiparametric nonresponse models. J. Amer. Statist. Assoc. 94 1096–1146.
  • Sekhon, J. S. (2007). Multivariate and propensity score matching software with automated balance optimization: The matching package for R. J. Stat. Softw.
  • Shadish, W. R., Clark, M. H. and Steiner, P. M. (2008). Can nonrandomized experiments yield accurate answers? A randomized experiment comparing random and nonrandom assignments. J. Amer. Statist. Assoc. 103 1334–1343.
  • Steiner, P. and Kim, Y. (2016). The mechanics of omitted variable bias: Bias amplification and cancellation of offsetting biases. J. Causal Inference 4.
  • Stuart, E. A. (2010). Matching methods for causal inference: A review and a look forward. Statist. Sci. 25 1–21.
  • Taddy, M., Gardner, M., Chen, L. and Draper, D. (2016). A nonparametric Bayesian analysis of heterogenous treatment effects in digital experimentation. J. Bus. Econom. Statist. 34 661–672.
  • Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B 58 267–288.
  • Vanschoren, J., Van Rijn, J. N., Bischl, B. and Torgo, L. (2014). OpenML: Networked science in machine learning. ACM SIGKDD Explor. Newsl. 15 49–60.
  • van der Laan, M. J. and Robins, J. M. (2003). Unified Methods for Censored Longitudinal Data and Causality. Springer, New York.
  • van der Laan, M. J. and Rubin, D. (2006). Targeted maximum likelihood learning. Int. J. Biostat. 2 Art. 11, 40.
  • Wager, S. and Athey, S. (2015). Estimation and inference of heterogeneous treatment effects using random forests. Preprint. Available at arXiv:1510.04342.
  • Westreich, D., Lessler, J. and Funk, M. J. (2010). Propensity score estimation: Neural networks, support vector machines, decision trees (CART), and meta-classifiers as alternatives to logistic regression. J. Clin. Epidemiol. 63 826–833.
  • Wind, D. K. and Winther, O. (2014). Model selection in data analysis competitions. In MetaSel@ ECAI 55–60.
  • Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B. Stat. Methodol. 67 301–320.

See also

  • Comment: Spherical Cows in a Vacuum: Data Analysis Competitions for Causal Inference.
  • Comment: Will Competition-Winning Methods for Causal Inference Also Succeed in Practice?.
  • Comment: Strengthening Empirical Evaluation of Causal Inference Methods.
  • Comment on "Automated Versus Do-It-Yourself Methods for Causal Inference: Lessons Learned from a Data Analysis Competition".
  • Comment: Causal Inference Competitions: Where Should We Aim?.
  • Comment: Contributions of Model Features to BART Causal Inference Performance Using ACIC 2016 Competition Data.
  • Rejoinder: Response to Discussions and a Look Ahead.

Supplemental materials

  • Supplement to “Automated versus Do-It-Yourself Methods for Causal Inference: Lessons Learned from a Data Analysis Competition”. The online supplement contains the full set of parameters used to generate the simulations, the metrics used to analyze simulations for difficulty, and the names and institutions of those who submitted. They have our deepest gratitude.