Statistical Science

Choice as an Alternative to Control in Observational Studies

Paul R. Rosenbaum



In a randomized experiment, the investigator creates a clear and relatively unambiguous comparison of treatment groups by exerting tight control over the assignment of treatments to experimental subjects, ensuring that comparable subjects receive alternative treatments. In an observational study, the investigator lacks control of treatment assignments and must seek a clear comparison in other ways. Care in the choice of circumstances in which the study is conducted can greatly influence the quality of the evidence about treatment effects. This is illustrated in detail using three observational studies that use choice effectively, one each from economics, clinical psychology and epidemiology. Other studies are discussed more briefly to illustrate specific points. The design choices include (i) the choice of research hypothesis, (ii) the choice of treated and control groups, (iii) the explicit use of competing theories, rather than merely null and alternative hypotheses, (iv) the use of internal replication in the form of multiple manipulations of a single dose of treatment, (v) the use of undelivered doses in control groups, (vi) design choices to minimize the need for stability analyses, (vii) the duration of treatment and (viii) the use of natural blocks.

Article information

Statist. Sci., Volume 14, Number 3 (1999), 259-304.

First available in Project Euclid: 24 December 2001


Keywords: causal effects; control groups; internal replication; observational studies; sensitivity analysis; stability analysis; treatment effects; undelivered doses


Rosenbaum, Paul R. Choice as an Alternative to Control in Observational Studies. Statist. Sci. 14 (1999), no. 3, 259--304. doi:10.1214/ss/1009212410.



  • Altonji, J. G. and Dunn, T. A. (1996). Using siblings to estimate the effect of school quality on wages. Review of Economics and Statistics 77 665-671.
  • Angrist, J. D. and Lavy, V. (1997). Using Maimonides' rule to estimate the effect of class size on scholastic achievement. Working Paper 5888, National Bureau of Economic Research, Cambridge, MA.
  • Angrist, J. D., Imbens, G. and Rubin, D. B. (1996). Identification of causal effects using instrumental variables (with discussion). J. Amer. Statist. Assoc. 91 444-469.
  • Angrist, J. D. and Krueger, A. B. (1998). Empirical strategies in labor economics. Handbook of Labor Economics. (Working paper 401, Industrial Relations Section, Princeton Univ.) To appear.
  • Ashenfelter, O. A. and Krueger, A. B. (1994). Estimates of the economic return to schooling from a new sample of twins. American Economic Review 84 1157-1173.
  • Behrman, J., Rosenzweig, M. and Taubman, P. (1996). College choice and wages: estimates using data on female twins. The Review of Economics and Statistics 78 672-685.
  • Boruch, R. (1997). Randomized Experiments for Planning and Evaluation. Sage, Thousand Oaks, CA.
  • Box, G. E. P. (1966). The use and abuse of regression. Technometrics 8 625-629.
  • Bronars, S. G. and Grogger, J. (1994). The economic consequences of unwed motherhood: using twin births as a natural experiment. American Economic Review 84 1141-1156.
  • Campbell, D. T. (1988). Can we be scientific in applied social science? In Methodology and Epistemology for Social Science: Selected Papers [Originally published in Evaluation Studies Review Annual 9 (1984) 26-48.] 315-333. Univ. Chicago Press.
  • Campbell, D. T. and Stanley, J. C. (1963). Experimental and Quasi-Experimental Designs for Research. Rand McNally, Chicago.
  • Card, D. (1990). The impact of the Mariel boatlift on the Miami labor market. Industrial and Labor Relations Review 43 245-257.
  • Card, D. and Krueger, A. (1994). Minimum wages and employment: a case study of the fast-food industry in New Jersey and Pennsylvania. American Economic Review 84 772-793.
  • Card, D. and Krueger, A. (1995). Myth and Measurement: The New Economics of the Minimum Wage. Princeton Univ. Press.
  • Copas, J. B. and Li, H. G. (1997). Inference for non-random samples (with discussion). J. Roy. Statist. Soc. Ser. B 59 55-96.
  • Cox, D. R. (1958). The Planning of Experiments. Wiley, New York.
  • Cornfield, J., Haenszel, W., Hammond, E., Lilienfeld, A., Shimkin, M. and Wynder, E. (1959). Smoking and lung cancer: recent evidence and a discussion of some questions. Journal of the National Cancer Institute 22 173-203.
  • Dawes, R. (1996). The purpose of experiments: ecological validity versus comparing hypotheses. Behavioral and Brain Sciences 19 20.
  • Dawid, A. P. (1979). Conditional independence in statistical theory (with discussion). J. Roy. Statist. Soc. Ser. B 41 1-31.
  • Deere, D., Murphy, K. and Welch, F. (1995). Employment and the 1990-1991 minimum-wage hike. American Economic Review 85 232-237.
  • Diggle, P. J., Liang, K. Y. and Zeger, S. L. (1994). Analysis of Longitudinal Data. Oxford Univ. Press.
  • Duff, C. (1996). New minimum wage makes few waves: employers offset 50-cent raise with minor shifts. Wall Street Journal 20 November 1996 2-4.
  • Engle, R., Hendry, D. and Richard, J. (1983). Exogeneity. Econometrica 51 277-304.
  • Feyerabend, P. (1968). How to be a good empiricist-a plea for tolerance in matters epistemological. In The Philosophy of Science (P. H. Nidditch, ed.) 12-39. Oxford Univ. Press.
  • Feyerabend, P. (1975). Against Method. Verso, London.
  • Fisher, R. A. (1935). The Design of Experiments. Oliver and Boyd, Edinburgh.
  • Freedman, D. (1997). From association to causation via regression. Adv. in Appl. Math. 18 59-110.
  • Friedman, M. (1953). The methodology of positive economics. In Essays in Positive Economics 3-43. Univ. Chicago Press.
  • Gastwirth, J. L. (1992). Methods for assessing the sensitivity of statistical comparisons used in Title VII cases to omitted variables. Jurimetrics 33 19-34.
  • Gastwirth, J. L., Krieger, A. M. and Rosenbaum, P. R. (1998). Cornfield's inequality. In Encyclopedia of Biostatistics 952- 955. Wiley, New York.
  • Greenhouse, S. (1982). Jerome Cornfield's contributions to epidemiology. Biometrics (Suppl.) 38 33-45.
  • Hedges, L. and Olkin, I. (1985). Statistical Methods for Meta-Analysis. Academic Press, New York.
  • Holland, P. (1993). Which comes first, cause or effect? In A Handbook for Data Analysis in the Behavioral Sciences: Methodological Issues (G. Keren and C. Lewis, eds.) 273-282. Erlbaum, Hillsdale, NJ.
  • Holland, P. and Rubin, D. (1983). On Lord's paradox. In Principles of Psychological Measurement: A Festschrift for Frederic Lord (H. Wainer and S. Messick, eds.) 3-25. Erlbaum, Hillsdale, NJ.
  • Joffe, M., Hoover, D., Jacobson, L., Kingsley, L., Chmiel, J., Visscher, B. and Robins, J. (1997). Estimating the effect of zidovudine on Kaposi's sarcoma from observational data using a rank preserving structural failure-time model. Statistics in Medicine. To appear.
  • Kempthorne, O. (1952). Design and Analysis of Experiments. Wiley, New York. [Reprinted (1973) by Krieger, Malabar, FL.]
  • Lakatos, I. (1970). Falsification and the methodology of scientific research programs. In Criticism and the Growth of Knowledge (I. Lakatos and A. Musgrave, eds.) 91-196. Cambridge Univ. Press. [Reprinted in I. Lakatos, Philosophical Papers 1 8-101. Cambridge Univ. Press 1978.]
  • Lakatos, I. (1981). History of science and its rational reconstructions. In Scientific Revolutions (I. Hacking, ed.) 107-127. Oxford Univ. Press. [Reprinted from Boston Studies in the Philosophy of Science 8 (1970).]
  • Lehman, D., Wortman, C. and Williams, A. (1987). Long-term effects of losing a spouse or a child in a motor vehicle crash. Journal of Personality and Social Psychology 52 218-231.
  • Lehmann, E. L. (1975). Nonparametrics: Statistical Methods Based on Ranks. Holden-Day, San Francisco.
  • Lichtenstein, P., Gatz, M., Pedersen, N., Berg, S. and McClearn, G. (1996). A co-twin-control study of response to widowhood. Journal of Gerontology: Psychological Sciences 51B 279-289.
  • Manski, C. (1990). Nonparametric bounds on treatment effects. American Economic Review 80 319-323.
  • Manski, C. (1995). Identification Problems in the Social Sciences. Harvard Univ. Press.
  • Marcus, S. (1997). Using omitted variable bias to assess uncertainty in the estimation of an AIDS education treatment effect. Journal of Educational and Behavioral Statistics 22 193-202.
  • Meehl, P. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology 46 806-834. [Reprinted in P. Meehl, Selected Philosophical and Methodological Papers 1-42. Univ. Minnesota Press (1991).]
  • Meyer, B. D. (1995). Natural and quasi-experiments in economics. J. Bus. Econom. Statist. 13 151-161.
  • Meyer, M. and Fienberg, S., eds. (1992). Assessing Evaluation Studies: The Case of Bilingual Education Strategies. National Academy Press, Washington, DC.
  • Morton, D., Saah, A., Silberg, S., Owens, W., Roberts, M. and Saah, M. (1982). Lead absorption in children of employees in a lead-related industry. American Journal of Epidemiology 115 549-555.
  • Neumark, D. and Wascher, W. (1992). Employment effects of minimum and subminimum wages: panel data on state minimum wage laws. Industrial and Labor Relations Review 46 55-81. [See also (1993) 47, 487-512 for discussion by Card, Katz and Krueger and a reply by Neumark and Wascher.]
  • Neyman, J. (1923). On the application of probability theory to agricultural experiments. Essay on principles. Section 9. Roczniki Nauk Rolniczych 10 1-51. (In Polish.) [Reprinted in English (1990) Statist. Sci. 5 463-480, with discussion by T. Speed and D. Rubin.]
  • Peirce, C. S. (1903). On selecting hypotheses. [Reprinted (1960) in Collected Papers of Charles Sanders Peirce (C. Hartshorne and P. Weiss, eds.) 5 413-422. Harvard Univ. Press.]
  • Peto, R., Pike, M., Armitage, P., Breslow, N., Cox, D., Howard, S., Mantel, N., McPherson, K., Peto, J. and Smith, P. (1976). Design and analysis of randomized clinical trials requiring prolonged observation of each patient. I. Introduction and design. British Journal of Cancer 34 585-612.
  • Platt, J. (1964). Strong inference. Science 146 347-352.
  • Polanyi, M. (1964). Science, Faith and Society. Univ. Chicago Press. [Originally published (1946) by Oxford Univ. Press.]
  • Popper, K. (1968). The Logic of Scientific Discovery. Harper and Row, New York. (English translation of Popper's 1934 Logik der Forschung.)
  • Popper, K. (1965). Conjectures and Refutations. Harper and Row, New York.
  • Putnam, H. (1995). Pragmatism. Blackwell, Oxford, UK.
  • Quine, W. (1951). Two dogmas of empiricism. Philosophical Review. [Reprinted in W. Quine, From a Logical Point of View 20-46. Harvard Univ. Press (1980).]
  • Robins, J. (1989). The control of confounding by intermediate variables. Statistics in Medicine 8 679-701.
  • Robins, J. (1992). Estimation of the time-dependent accelerated failure time model in the presence of confounding factors. Biometrika 79 321-334.
  • Robins, J., Rotnitzky, A. and Zhao, L. (1995). Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. J. Amer. Statist. Assoc. 90 106-121.
  • Robins, J. (1998). Correction for non-compliance in equivalence trials. Statistics in Medicine 17 269-302.
  • Rosenbaum, P. R. (1984). The consequences of adjustment for a concomitant variable that has been affected by the treatment. J. Roy. Statist. Soc. Ser. A 147 656-666.
  • Rosenbaum, P. R. (1986). Dropping out of high school in the United States: an observational study. Journal of Educational Statistics 11 207-224.
  • Rosenbaum, P. R. (1987). Sensitivity analysis for certain permutation inferences in matched observational studies. Biometrika 74 13-26.
  • Rosenbaum, P. R. (1991). Some poset statistics. Ann. Statist. 19 1091-1097.
  • Rosenbaum, P. R. (1993). Hodges-Lehmann point estimates of treatment effect in observational studies. J. Amer. Statist. Assoc. 88 1250-1253.
  • Rosenbaum, P. R. (1995a). Observational Studies. Springer, New York.
  • Rosenbaum, P. R. (1997a). Discussion of a paper by Copas and Li. J. Roy. Statist. Soc. Ser. B 59 90.
  • Rosenbaum, P. R. (1997b). Signed rank statistics for coherent predictions. Biometrics 53 556-566.
  • Rosenbaum, P. R. (1999a). Using combined quantile averages in matched observational studies. J. Roy. Statist. Soc. Ser. C 48 63-78.
  • Rosenbaum, P. R. (1999b). Reduced sensitivity to hidden bias at upper quantiles in observational studies with dilated effects. Biometrics 55 560-564.
  • Rosenbaum, P. R. and Rubin, D. B. (1983a). The central role of the propensity score in observational studies for causal effects. Biometrika 70 41-55.
  • Rosenbaum, P. R. and Rubin, D. B. (1983b). Assessing sensitivity to an unobserved binary covariate in an observational study with binary outcome. J. Roy. Statist. Soc. Ser. B 45 212-218.
  • Rosenzweig, M. and Wolpin, K. (1980). Testing the quantity- quality fertility model: the use of twins as a natural experiment. Econometrica 48 227-240.
  • Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology 66 688-701.
  • Rubin, D. B. (1977). Randomization on the basis of a covariate. Journal of Educational Statistics 2 1-26.
  • Rubin, D. B. (1978). Bayesian inference for causal effects: the role of randomization. Ann. Statist. 6 34-48.
  • Rubin, D. B. (1991). Practical implications of modes of statistical inference for causal inference and the critical role of the assignment mechanism. Biometrics 47 1213-1234.
  • Samuelson, P. (1947). Foundations of Economic Analysis. Harvard Univ. Press. [Reprinted (1983) by Harvard Univ. Press.]
  • Shafer, G. (1996). The Art of Causal Conjecture. MIT Press.
  • Smith, H. L. (1997). Matching with multiple controls to estimate treatment effects in observational studies. Sociological Methodology 27 325-353.
  • Sobel, M. (1995). Causal inference in the social and behavioral sciences. In Handbook of Statistical Modelling for the Social and Behavioral Sciences (G. Arminger, C. Clogg and M. Sobel, eds.) 1-38. Plenum, New York.
  • Sowell, T. (1995). Repealing the law of gravity. Forbes 22 May 1995, 82.
  • Susser, M. (1987). Falsification, verification and causal inference in epidemiology: reconsideration in the light of Sir Karl Popper's philosophy. In Epidemiology, Health and Society: Selected Papers (M. Susser, ed.) 82-93. Oxford Univ. Press, New York.
  • analysis of treatment choice (Manski, 1998, 1999). I find it useful to suppose that a planner must choose a treatment rule which assigns a treatment to each member of a heterogeneous population of interest. The planner might, for example, be a physician choosing medical treatments for each member of a population of patients or a judge deciding sentences for each member of a population of convicted offenders. The planner observes certain covariates for each person: demographic attributes, medical or criminal records and so on. Each member of the population has a response function which maps treatments into a real-valued outcome of interest, perhaps a measure of health status or recidivism in the examples above. I suppose that the planner wants to choose a treatment rule that maximizes the population mean outcome. (In economic terms, the planner wants to maximize a utilitarian social welfare function.) An optimal treatment rule assigns to each member of the population a treatment that maximizes mean outcome conditional on the person's observed covariates (Manski, 1998, 1999). This motivates interest in the empirical study of treatment effects. We want to infer treatment effects to help the planner select a good treatment rule.
  • from X0 to its complement (Manski, 1996). Now consider an observational study performed on the entire population. It is well known that an observational study can be used to identify treatment effects if sufficiently strong prior information is available. Unfortunately, the prior information needed to achieve identification is so strong that it is rarely credible in practice (Manski, 1995, Chapter 2). This being the case, my research program has sought to determine what may be learned about treatment response when observational studies are combined with weak prior information that may be deemed credible in practice. The findings generally take the form of bounds on the conditional mean outcomes E[r(t) | x], t ∈ T, x ∈ X. The starting point is the worst-case analysis of Manski (1990); see Manski, Sandefur, McLanahan and Powers (1992) for an empirical application. This shows that if the outcome variable r is itself bounded, then an observational study reveals informative bounds on E[r(t) | x], t ∈ T, for x ∈ X, even in the absence of prior information. However, the bounds for different treatments necessarily intersect, implying that prior information is necessary if the planner is to rank treatments. Consequently, in Manski (1990, 1994, 1995, 1997a) and Manski and Pepper (2000), I have investigated the identifying power of various nonparametric restrictions on the distribution of treatment response or on the process of treatment selection in an observational study. Such prior restrictions enable the planner to tighten the worst-case bounds on E[r(t) | x], t ∈ T, for x ∈ X. Some forms of prior information may yield nonintersecting bounds on mean outcomes under different treatments. When this happens, the planner can use an observational study to rank treatments. In particular, the instrumental variable assumptions used by economists in observational studies for over 50 years have this important property.
See Manski (1990, 1994), Manski and Pepper (2000) and related work on experiments with imperfect compliance by Robins (1989b), Robins and Greenland (1996) and Balke and Pearl (1997). See Manski and Nagin (1998) for an empirical application.
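As a rough numerical illustration of the worst-case bounds just described, the following sketch assumes an outcome known to lie in [0, 1]; the function name and all example numbers are ours, not Manski's:

```python
def manski_bounds(mean_y_given_t, p_t, lo=0.0, hi=1.0):
    """Worst-case (no-assumption) bounds on E[r(t)] when the outcome is
    known to lie in [lo, hi]: for the fraction of the population not
    observed under treatment t, the counterfactual mean could be anywhere
    in [lo, hi]."""
    return (mean_y_given_t * p_t + lo * (1.0 - p_t),
            mean_y_given_t * p_t + hi * (1.0 - p_t))

# Hypothetical observational data: 40% received treatment 1 with mean
# outcome 0.7; the remaining 60% received treatment 0 with mean outcome 0.5.
b1 = manski_bounds(0.7, 0.4)   # bounds on E[r(1)], approx. (0.28, 0.88)
b0 = manski_bounds(0.5, 0.6)   # bounds on E[r(0)], approx. (0.30, 0.70)
print(b1)
print(b0)
# With two exhaustive treatments the interval widths sum to hi - lo, so the
# intervals always overlap: without prior information the planner cannot
# rank the treatments, exactly as the text asserts.
```

This makes concrete why the text says prior restrictions are needed before an observational study can rank treatments: the no-assumption intervals intersect by construction.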
  • study of short-acting exposures (Maclure, 1991). Such designs have been rather successful in confirming that unusually vigorous physical exertion and sexual intercourse are triggers for myocardial infarction, by demonstrating a higher than normal incidence of the hypothesized triggering activity in the hour or two before onset of chest pain. The bouts of "treatment" are not assigned at random, and so confounding factors, such as time of day, day of the week, recency of a major meal, and so on, need to be controlled for in the analysis. In the minimum wage study of Rosenbaum's Section 2.2, I would have some concern that between-state differences in employment could fluctuate
  • Pearl and Robins (1999). We first provide an informal discussion. Formal justification will be given in the Appendix. Informally, a causal graph for the measured variables in a study is a directed acyclic graph (DAG) in which the vertices (nodes) of the graph represent variables measured at specific times, the directed edges (arrows) represent direct causal relations and there are no directed cycles, because no variable can cause itself. The variables represented on the graph include the measured variables and additional unmeasured variables, such that if any two variables on the graph have a cause in common, that common cause is itself included as a variable on the graph. For example, in Figure 1, E and D are the measured variables; U represents all unmeasured common causes of E and D. We have made the arrow from E to D dotted to represent the fact that the purpose of data collection is to determine whether E causes D (i.e., whether the arrow from E to D is actually present). Suppose our goal is to use the assumptions encoded in our causal graph to determine whether the association OR_ED between E and D represents the causal effect of E on D as measured on an odds ratio scale. To do so, we proceed as follows. We begin by pretending that we know that the null hypothesis of no causal effect of E on D is true by removing the arrow from E to D. If, under this null hypoth
  • due to Pearl (1988). Suppose E encodes whether it is raining, D encodes whether a sprinkler is on and C is the indicator of whether the grass is wet. Then if it rains at random times of the day and the sprinkler is set to go on at times that do not depend on whether it is raining, clearly E and D will be independent, even though they both cause the grass to be wet, C. If we condition on the fact that the grass is wet (C = 1), and I tell you that it is not raining (E = 0), then you will know for certain that the sprinkler is on (D = 1). But if I tell you that it is raining, the probability that the sprinkler is on will not be increased above its marginal probability. An extension of this last example provides an explanation of the well-known adage that one must not adjust for variables affected by treatment. To see why, consider the graph in Figure 5, in which the exposure E has a direct causal effect on C, and C and D have an unmeasured common cause U. Under the causal null with the arrow from E to D removed, E and D will be unassociated because they do not have an unmeasured common cause. Thus, the marginal association OR_ED will represent causation. However, the conditional associations OR_ED|C=1 and OR_ED|C=0 will be biased for the conditional causal effect within levels of C. This reflects the fact that, under the causal null, E and U will be associated once we condition on their common effect C. Thus, because U itself is correlated
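The sprinkler example can be checked by exact enumeration. The probabilities below are made up for illustration; the calculation just confirms that rain and sprinkler are marginally independent but become dependent once we condition on their common effect, wet grass:

```python
from itertools import product

# Toy version of Pearl's sprinkler example (probabilities are invented):
# rain R and sprinkler S are independent causes of wet grass W = R or S.
p_r, p_s = 0.3, 0.4

def prob(event):
    """P(event) by exact enumeration over (R, S), with W = R or S."""
    total = 0.0
    for r, s in product((0, 1), (0, 1)):
        w = 1 if (r or s) else 0
        p = (p_r if r else 1.0 - p_r) * (p_s if s else 1.0 - p_s)
        if event(r, s, w):
            total += p
    return total

def cond(event, given):
    """Conditional probability P(event | given)."""
    return prob(lambda r, s, w: event(r, s, w) and given(r, s, w)) / prob(given)

# Marginally, learning about rain tells us nothing about the sprinkler ...
assert abs(cond(lambda r, s, w: s == 1, lambda r, s, w: r == 1) - p_s) < 1e-9
# ... but conditioning on the common effect W = 1 makes them dependent:
print(cond(lambda r, s, w: s == 1, lambda r, s, w: w == 1 and r == 0))  # 1.0
print(cond(lambda r, s, w: s == 1, lambda r, s, w: w == 1 and r == 1))  # ≈ p_s
```

Given wet grass and no rain, the sprinkler must be on (probability 1), while given wet grass and rain, the sprinkler probability stays at its marginal value, just as the text describes.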
  • E, because conditioning on all common causes of causally unconnected variables renders them independent. But one can check from Table 1 that, among subjects with E = 1, D and E are correlated. Now Figure 6 is isomorphic to Figure 4, with E playing the role of C. Thus, as in Figure 4, we conclude that the marginal association OR_ED is causal but the conditional association OR_ED|E will differ from the conditional causal effect of exposure and disease within strata of E. Mistakenly interpreting OR_ED|E = 3 as causal could in principle lead to poor public health decisions, as would occur if a cost-benefit analysis determines that a conditional causal odds ratio of 2.9 is the cutoff point above which the risks of congenital malformation outweigh the benefits to the mother of treatment with E. Finally, a possibility that we have not considered is that those mothers who develop, say, a subclinical infection in the first trimester are both at increased risk of a second trimester congenital malformation and of worsening arthritis, which they may then treat with the drug E. In that case, we would need to add to our causal graph an unmeasured common cause U of both E and D that represents subclinical first-trimester infection, in which case OR_ED would be confounded.
  • E, which would imply that OR_ED|E has a causal interpretation. In contrast, the conditional association OR_ED|E = 0.6 represents not a protective
  • Pearl, 1997). Because E was randomly assigned, it has no arrows into it. However, given assignment, both the decision to comply and the outcome D may well depend on underlying health status U; E has no direct arrow to D, because, by assumption, E causally influences D only through its effect on E. We observe that, under the causal null in which the arrow from E to D is removed, E and D will be associated due to their common cause U, both marginally and within levels of E. Hence, neither OR_ED nor OR_ED|E will have a causal interpretation. However, under the causal null, E and D will be independent, because they have no unmeasured common cause. Hence we can test for the absence of an arrow between E and D (i.e., lack of causality) by testing whether E and D are ind
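The point of this test, that randomized assignment stays independent of the outcome under the causal null even when compliance is confounded, can be illustrated by exact enumeration in a toy model. All numbers and variable names here are hypothetical:

```python
from itertools import product

# Hypothetical noncompliance model: Z is randomized assignment, U is an
# unmeasured health status, treatment received T depends on both (only
# healthy subjects comply), and under the causal null the outcome D
# depends on U alone, never on T or Z.
p_u = {1: 0.5, 0: 0.5}                   # P(U = u)
p_z = {1: 0.5, 0: 0.5}                   # P(Z = z), randomized
p_d_given_u = {1: 0.8, 0: 0.2}           # P(D = 1 | U = u), causal null

def received(z, u):
    return z if u == 1 else 0            # noncompliance driven by U

def p_d1_given(condition):
    """P(D = 1 | condition), by exact enumeration over (Z, U)."""
    num = den = 0.0
    for z, u in product((0, 1), (0, 1)):
        w = p_z[z] * p_u[u]
        if condition(z, u):
            num += w * p_d_given_u[u]
            den += w
    return num / den

# Assignment Z is independent of D under the causal null ...
assert p_d1_given(lambda z, u: z == 1) == p_d1_given(lambda z, u: z == 0)
# ... but received treatment T is spuriously associated with D:
print(p_d1_given(lambda z, u: received(z, u) == 1))   # 0.8
print(p_d1_given(lambda z, u: received(z, u) == 0))   # ≈ 0.4
```

Comparing outcomes by treatment received suggests a large "effect" (0.8 versus about 0.4) that is entirely due to U, while the comparison by assignment correctly shows nothing, which is why testing assignment against outcome is a valid test of the causal null.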
  • tients (Robins, 1997). AZT inhibits the AIDS virus. Aerosolized pentamidine prevents pneumocystis pneumonia (PCP), a common opportunistic infection of AIDS patients. The trial was conducted as follows. Each of 32,000 subjects was randomized with probability 0.5 to AZT (A0 = 1) or placebo (A0 = 0) at time t0. All subjects survived to time t1. At time t1, it was determined whether a subject had had an episode of PCP (L1 = 1) or had been free of PCP (L1 = 0) in the interval (t0, t1]. Because PCP is a potentially life-threatening illness, all subjects with L1 = 1 were treated with aerosolized pentamidine (AP) therapy (A1 = 1) at time t1. Among subjects who were free of PCP (L1 = 0), one-half were randomized to receive AP at t1 and one-half were randomized to placebo (A1 = 0). At time t2, the vital status was recorded for each subject, with Y = 1 if alive and Y = 0 if deceased. We view (A0, L1, A1, Y) as random variables with realizations (a0, l1, a1, y). All investigators agreed that the data supported a beneficial effect of treatment with AP (A1 = 1) because, among the 8,000 subjects with A0 = 1 and L1 = 0, AP was assigned at random and the survival rates were greater among those given AP:

P(Y = 1 | A1 = 1, L1 = 0, A0 = 1) - P(Y = 1 | A1 = 0, L1 = 0, A0 = 1) = 3/4 - 1/4 = 1/2.   (3.1)
  • gorithm of Robins (1986). We show that there is no direct causal effect of AZT controlling for AP. That is, given that all subjects take AP, whether or not AZT is also taken is immaterial to the survival rate in the study population. Suppose, however, the data from our trial were as in Figure 11. We shall show in the next subsection that when the data in Figure 11 are appropriately analyzed using the G-computation algorithm, the analysis reveals a direct AZT effect.
  • algorithm formula or functional (Robins, 1986). Equation (3.9) states that the marginal density of Y under regime a is obtained from the joint distribution of the observables by taking a weighted average of the f(y | l_K, a_K) with weights proportional to
  • orem proved in Robins (1986).
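To make the G-computation step concrete, here is a minimal numerical sketch. The conditional probabilities are invented for illustration (the paper's Figures 10 and 11 are not reproduced in this excerpt), but the averaging structure is the g-formula for a two-stage trial of the kind described above: standardize the stratum-specific survival probabilities by the distribution of L1 given A0, not by anything downstream of A1:

```python
# Hypothetical conditional probabilities in the style of the two-stage
# AZT/AP trial described above; all numbers are made up for illustration.
# (In the actual trial, subjects with L1 = 1 were always given AP, so the
# L1 = 1, A1 = 0 entry would not be identified without extra assumptions.)
p_l1 = {0: 0.5, 1: 0.5}                   # P(L1 = l1 | A0 = 1)
p_y = {(0, 1): 0.75, (0, 0): 0.25,        # P(Y = 1 | A0 = 1, L1 = l1, A1 = a1)
       (1, 1): 0.50, (1, 0): 0.50}

def g_formula(a1_rule):
    """G-computation of E[Y] under A0 = 1 and the regime A1 = a1_rule(l1):
    average stratum-specific survival over the distribution of L1 given A0,
    which is NOT conditioned on A1 (L1 precedes A1 in time)."""
    return sum(p_l1[l1] * p_y[(l1, a1_rule(l1))] for l1 in (0, 1))

# Regime "always give AP at t1" vs. regime "never give AP at t1":
print(g_formula(lambda l1: 1))   # 0.625
print(g_formula(lambda l1: 0))   # 0.375
```

The key design point is visible in `g_formula`: the weights depend only on variables preceding A1, which is what distinguishes the g-formula from a naive adjustment for the post-treatment variable L1.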
  • V, X and Y are conditionally independent given Z (i.e., X ⊥ Y | Z) if X and Y are d-separated by Z on G (written (X ⊥ Y | Z)_G). D-separation is a purely graphical criterion, described by Pearl (1995) as follows.

Definition (d-separation). Let X, Y and Z be three disjoint subsets of nodes in a directed acyclic graph G, and let p be any path between a node in X and a node in Y, where by "path" we mean any succession of arcs, regardless of their directions. Then Z is said to block p if there is a node w on p satisfying one of the following two conditions: (i) w has converging arrows along p, and neither w nor any of its descendants is in Z; (ii) w does not have converging arrows along p, and w is in Z. Further, Z is said to d-separate X from Y in G, written (X ⊥ Y | Z)_G, if and only if Z blocks every path from a node in X to a node in Y.

Definition of causal graphs. For any subset of variables X ⊆ V, let V_m(x) be the random variable encoding the value the variable V_m would have had, possibly contrary to fact, had X been set to x. Note here we have assumed that the variables X are manipulable and that the counterfactuals V_m(x) are well defined.
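Because d-separation is a purely graphical criterion, it can be checked mechanically. Below is a brute-force sketch for small DAGs; the encoding (a graph as a set of directed edges) and the function names are ours, not the paper's:

```python
def descendants(node, edges):
    """All nodes reachable from `node` via directed edges (excluding itself)."""
    out, stack = set(), [node]
    while stack:
        n = stack.pop()
        for a, b in edges:
            if a == n and b not in out:
                out.add(b)
                stack.append(b)
    return out

def all_paths(x, y, edges):
    """All undirected paths (node sequences) from x to y, no repeated nodes."""
    nbrs = {}
    for a, b in edges:
        nbrs.setdefault(a, set()).add(b)
        nbrs.setdefault(b, set()).add(a)
    paths, stack = [], [[x]]
    while stack:
        p = stack.pop()
        if p[-1] == y:
            paths.append(p)
            continue
        for n in nbrs.get(p[-1], ()):
            if n not in p:
                stack.append(p + [n])
    return paths

def d_separated(x, y, z, edges):
    """Pearl's d-separation criterion for single nodes x, y and a set z."""
    for path in all_paths(x, y, edges):
        blocked = False
        for i in range(1, len(path) - 1):
            a, w, b = path[i - 1], path[i], path[i + 1]
            collider = (a, w) in edges and (b, w) in edges  # converging arrows at w
            if collider:
                # (i) a collider blocks unless it or a descendant is conditioned on
                if w not in z and not (descendants(w, edges) & z):
                    blocked = True
                    break
            elif w in z:
                # (ii) a non-collider blocks when it is conditioned on
                blocked = True
                break
        if not blocked:
            return False
    return True

# The wet-grass collider R -> W <- S from the earlier example:
edges = {("R", "W"), ("S", "W")}
print(d_separated("R", "S", set(), edges))    # True: the collider blocks the path
print(d_separated("R", "S", {"W"}, edges))    # False: conditioning opens it
```

This enumerates every path, so it is only practical for toy graphs, but it follows conditions (i) and (ii) of the definition literally.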
  • Definitions (Robins, 1986, pages 1419-1423). We say the following: (a) the complete DAG G is a finest causal graph if (i) V_1 and all one-step-ahead counterfactuals V_m(v_{m-1}) exist for m > 1 and (ii) the observed variables V and, for any subset X contained in V, the counterfactual variables V_m(x) are obtained by recursive substitution, for example, V_3 = V_3(V_1, V_2(V_1)), V_3(v_1) = V_3(v_1, V_2(v_1)), V_3(v_2) = V_3(V_1, v_2); (b) DAG G is a causal graph if the complete DAG is a finest causal graph and V_m(v_{m-1}) = V_m(pa_m), that is, V_m(v_{m-1}) depends on v_{m-1} only through V_m's parents pa_m on G; (c) a finest causal graph is a finest fully randomized causal graph if, for all m, (V_{m+1}(v_m), ..., V_M(v_{M-1})) ⊥ V_m | V_{m-1} = v_{m-1}.

Definition. We simply say G is a causal graph if it is a finest fully randomized causal graph. Pearl (1995) originally gave an alternative, but equivalent, definition of a causal graph as a nonparametric structural equations model. It is easy to show that if G is a causal graph over variables V, then the density f_V(v) factorizes as in (3.1). Furthermore, if we let V = (A_K, L_{K+1}), with A_K, L_{K+1} as defined in Section 3.1.2, so that the variables in A_k and L_{k+1} are nondescendants of A_{k+1}, then G being a causal graph implies that the assumption (3.7) of no unmeasured confounders holds.
  • Theorem A.1 (Pearl and Robins, 1995). If (A_k ⊥ Y | O_k, A_{k-1})_{G_ak} for k ≤ K, then (3.7) implies (A.2). Here (A ⊥ B | C)_{G_ak} stands for d-separation of A and B given C in G_ak (Pearl, 1995). Note that checking d-separation is a purely graphical (i.e., visual) procedure. Pearl (1995) had earlier proved Theorem A.1 for non-sequential treatments (i.e., in the case K = 0) and named it the no-back-door path criterion.
  • 1997). Equation (A.3) is true if and only if, for k ≤ K, U_k = (U_{ak}, U_{bk}) for possibly empty, mutually exclusive sets U_{ak}, U_{bk} satisfying (i) (U_{bk} ⊥ A_k | O_k, A_{k-1})_{G_{a·k}} and (ii) (U_{ak} ⊥ Y | O_k, A_k, U_{bk})_{G_{a·k}}.
  • bell, 1999). So this commentary is less a rejoinder to Rosenbaum and more a sympathetic extension of his thinking, and a request to make even more systematic the theoretical foundations of quasi-experimentation. Our key theme is reflected in our title's double entendre: in the inevitable give-and-take between statistical analysis and experimental design, it is design that should rule if quality causal inferences are to result; so we need theories of quasi-experimentation that describe which rules for designing quasi-experiments provide better justification for such inferences. Researchers in economics and statistics are rediscovering this message, but for different reasons. For economists, the rediscovery stems from the failure of elegant statistical selection bias models to deliver predictably accurate effect estimates, despite diverse efforts to do this over several decades. (To quasi-experimentalists, some of these efforts seemed doomed to failure from the start because they involved contrasting a local program group of substantive interest with a purportedly matched control group abstracted from some national record-keeping system.) In any event, many economists eventually realized that these models might work better when supported by stronger design; so Heckman and Todd (1996, page 60) noted their selection-adjustment methods may "perform best when (1) comparison group members come from the same local labor markets as participants, (2) they answer the same survey questionnaires, and (3) when data on key determinants of program participation is (sic) available." For statisticians, the need to rediscover quasi-experimental design stems from the lack of strong statistical theory for nonrandomized experiments that parallels the elegant theoretical grounding of randomization in the work of R. A. Fisher and his successors.
Lacking such a hammer, statisticians swung at the quasi-experimental nail less often than at more congenial targets that suited their tool box better. One of many merits to Rubin's causal model (RCM) is that it provides such a ha
  • Cook and Campbell (1979). How do various design features diagnose or reduce the doubts (i.e., rule out threats to validity)? Table 1 provides some examples of how this question might be addressed, and we can add others. For example, in a controlled study, the decision to compare treatment to a placebo rather than to a no-treatment control group is primarily made to clarify the construct validity of the treatment: to what extent should the intended treatment be characterized as the causal agent as opposed to receiving a placebo? This differential choice between placebo or no-treatment control is not made to assess whether anything causal happened in a study, which is the domain of internal validity. Similarly, when reason exists to think that pretreatment trends in an outcome may mimic treatment effects in the treatment group (e.g., participants were getting better by themselves anyway), adding pretest observations on the same measurement instrument at multiple prior time points helps diagnose this possibility. The use of multiple design features in one study can help to create a complex pattern of evidence that functions in a manner similar to Rosenbaum's discussion of creating a complex pattern of evidence by making design choices that stem from a broad theory. The tradition within which we work has focused attention on creating a theory of quasi-experimentation that would constructively deal with issues like those just broached. This theory is not yet fully developed, and it may not be possible to formulate any theory of quasi-experimentation in a fully axiomatized way with the precision that many statisticians prefer and that has been approached for experiments with random assignment. However, we suspect that increased interchange between the design tradition we represent and the emergent statistical tradition Rosenbaum shares will improve the judgments he calls for in Section 1.2 about assessing the quality of evidence.
  • 1986). Such judgments operate in randomized experiments where, before we can conclude that a treatment is effective, we have to judge such issues as whether any violations of normality were sufficient to question the validity of statistical tests,
  • Cook and Campbell, 1999). We should remember that the kind of generalization sampling promotes is not the only kind of generalization scientists do. For instance, we routinely generalize from the specific operations used in a study to the general terms we want to attach to the operations (construct validity). Such construct generalizations are sometimes given a formal statistical rationale, as with the domain
  • (1996). Scientific theories make broad claims about the past, present and future of the natural world, claims that can be refuted but not sampled. "Pure observation lends only negative evidence, by refuting an observation categorical that a proposed theory implies" (Quine, 1992, page 13).
  • (1996). Both studies contrasted the same two theories, one theory claiming that bereavement has only short-term effects on mental functioning and the other that it has long-term effects. The first study looked at sudden deaths of close relatives in car crashes and the second at the loss of a spouse by one of a pair of twins. Both studies are deliberately unrepresentative: car crashes are an atypical cause of bereavement, and twins are also atypical. If one wanted to generalize from a study population to a natural population, one would look for the typical, not the atypical, situation. But, for reasons discussed in the paper, car crashes and twins provide a sharper contrast of the two theories than typical situations permit.
  • and Campbell (1999). Once in a while, one sees the issues that arise in quasi-experimentation described in terms of dichotomies determined with certainty. Either a specific threat to validity is present or it is not. If a corresponding design feature is in place, then one is completely protected from this threat to validity, but if it is not, then the study is invalid. If a design is described in this way, then two questions go unasked, hence unanswered. First, does the design feature have the ability, given the sample size and variability, to address the specific threat to validity if it is present in a magnitude sufficient to affect the study? When the design features are the use of multiple control groups or unaffected outcomes (i.e., nonequivalent dependent variables), this question can be answered in conventional statistical terms, such as the unbiasedness and monotonicity of the power function of tests for hidden bias, and the impact of these design features on confidence intervals that allow for hidden bias (Rosenbaum, 1987b, 1989a, b, 1992). Second, if a hidden bias is present or is suspected, is it of a magnitude sufficient to alter the conclusions of the study? One may be unable to rule out hidden bias, but biases of plausible size may or may not be able to account for the ostensible effects of the treatment. For instance, smokers and nonsmokers are known and suspected to differ in many ways not controlled in observational studies, and yet biases of plausible size cannot account for the extremely strong association between heavy smoking and lung cancer; see Cornfield et al. (1959) for a sensitivity analysis. The answers to these two questions seem to me to be part of the answer to the question: "How are we to judge the quality of evidence in quasi-experimentation?"
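To make the Cornfield et al. (1959) point concrete, here is a minimal numerical sketch (not part of the original discussion; the function name and the prevalence figures are illustrative). Cornfield's inequality says that for an unmeasured binary confounder to account fully for an observed relative risk, its prevalence among the exposed must exceed its prevalence among the unexposed by at least that relative risk:

```python
def cornfield_condition(rr_obs, prev_exposed, prev_unexposed):
    """Necessary condition from Cornfield et al. (1959): an unmeasured
    binary confounder can fully account for an observed relative risk
    rr_obs only if its prevalence among the exposed is at least rr_obs
    times its prevalence among the unexposed."""
    if prev_unexposed == 0:
        return True  # the prevalence ratio is unbounded in this case
    return prev_exposed / prev_unexposed >= rr_obs

# Heavy smoking and lung cancer: the observed relative risk is roughly 9.
# A hypothetical confounder carried by 30% of smokers and 20% of
# nonsmokers has a prevalence ratio of only 1.5, so it fails the
# condition and cannot account for the association.
print(cornfield_condition(9.0, 0.30, 0.20))   # False
print(cornfield_condition(1.4, 0.30, 0.20))   # True
```

A bias of this plausible size (prevalence ratio 1.5) could explain a weak association but not a roughly ninefold risk, which is precisely the sense in which the smoking conclusion is insensitive to moderate hidden bias.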
  • Balke, A. and Pearl, J. (1997). Bounds on treatment effects from studies with imperfect compliance. J. Amer. Statist. Assoc. 92 1171-1177.
  • Campbell, D. T. (1957). Factors relevant to the validity of experiments in social settings. Psychological Bulletin 54 297-312.
  • Cook, T. D. (1993). A quasi-sampling theory of the generalization of causal relationships. In Understanding Causes and Generalizing About Them (L. Sechrest and A. G. Scott, eds.) 39-82. Jossey-Bass, San Francisco.
  • Cook, T. D. (1999). Towards a practical theory of external validity. In Validity and Social Experiments: Donald Campbell's Legacy 1 (L. Bickman, ed.). Sage, Thousand Oaks, CA. To appear.
  • Cook, T. D. and Campbell, D. T. (1979). Quasi-Experimentation: Design and Analysis Issues for Field Settings. Rand-McNally, Chicago.
  • Cordray, D. W. (1986). Quasi-experimental analysis: a mixture of methods and judgment. In Advances in Quasi-Experimental Design and Analysis (W. M. K. Trochim, ed.) 9-27. Jossey-Bass, San Francisco.
  • Cornfield, J., Haenszel, W., Hammond, E., Lilienfeld, A., Shimkin, M. and Wynder, E. (1959). Smoking and lung cancer: recent evidence and a discussion of some questions. Journal of the National Cancer Institute 22 173-203.
  • Corrin, W. J. and Cook, T. D. (1998). Design elements of quasi-experimentation. Advances in Educational Productivity 7 35-57.
  • Gail, M. (1972). Does cardiac transplantation prolong life? A reassessment. Annals of Internal Medicine 76 815-817.
  • Geiger, D., Verma, T. and Pearl, J. (1990). The logic of influence diagrams (with comments). Influence Diagrams, Belief Nets and Decision Analysis 67-87.
  • Greenland, S., Pearl, J. and Robins, J. M. (1999). Causal diagrams for epidemiologic research. Epidemiology 10 37-48.
  • Hansen, L. P. (1982). Large sample properties of generalized method of moments estimators. Econometrica 50 1029-1054.
  • Hausman, J. A. (1978). Specification tests in econometrics. Econometrica 46 1251-1271.
  • Heckman, J. J. and Todd, P. E. (1996). Assessing the performance of alternative estimators of program impacts: a study of adult men and women in JTPA. Unpublished manuscript. (Available from the author, Dept. Economics, Univ. Chicago).
  • Hedges, L. V. (1997). The role of construct validity in causal generalization: the concept of total causal inference error. In Causality in Crisis? Statistical Methods and the Search for Causal Knowledge in the Social Sciences (V. R. McKim and S. P. Turner, eds.) 325-341. Univ. Notre Dame Press.
  • Keiding, N., Filiberti, M., Esbjerg, S., Robins, J. M. and Jacobsen, N. (1999). The graft versus leukemia effect after bone marrow transplantation: a case study using structural nested failure time models. Biometrics 55 23-28.
  • Kish, L. (1987). Statistical Design for Research. Wiley, New York.
  • Lavori, P. W., Louis, T. A., Bailar, J. C. and Polansky, H. (1986). Designs for experiments: parallel comparisons of treatment. In Medical Uses of Statistics (J. C. Bailar and F. Mosteller, eds.). New England Journal of Medicine, Waltham, MA.
  • Light, R. J., Singer, J. D. and Willett, J. B. (1990). By Design: Planning Research in Higher Education. Harvard Univ. Press.
  • Maclure, M. (1991). The case-crossover design: a method for studying transient effects on risk of acute events. American Journal of Epidemiology 133 144-153.
  • Manski, C. (1994). The selection problem. In Advances in Econometrics, Sixth World Congress (C. Sims, ed.) 143-170. Cambridge Univ. Press.
  • Manski, C. (1996). Learning about treatment effects from experiments with random assignment of treatments. Journal of Human Resources 31 707-733.
  • Manski, C. (1997a). Monotone treatment response. Econometrica 65 1311-1334.
  • Manski, C. (1997b). The mixing problem in programme evaluation. Rev. Econom. Stud. 64 537-553.
  • Manski, C. (1998). Treatment choice in heterogeneous populations using experiments without covariate data. In Uncertainty in Artificial Intelligence, Proceedings of the Fourteenth Conference (G. Cooper and S. Moral, eds.) 379-385. Morgan Kaufmann, San Francisco.
  • Manski, C. (1999). Identification problems and decisions under ambiguity: empirical analysis of treatment response and normative analysis of treatment choice. J. Econometrics. To appear.
  • Manski, C. and Nagin, D. (1998). Bounding disagreements about treatment effects: a case study of sentencing and recidivism. Sociological Methodology 28 99-137.
  • Manski, C. and Pepper, J. (2000). Monotone instrumental variables: with an application to the returns to schooling. Econometrica. To appear.
  • Manski, C., Sandefur, G., McLanahan, S. and Powers, D. (1992). Alternative estimates of the effect of family structure during adolescence on high school graduation. J. Amer. Statist. Assoc. 87 25-37.
  • Mittleman, M. A., Maldonado, G., Gerberich, S. G., Smith, G. S. and Sorock, G. S. (1997). Alternative approaches to analytical designs in occupational injury epidemiology. American Journal of Industrial Medicine 32 129-141.
  • Newey, W. K. (1985). Generalized method of moments specification testing. Journal of Econometrics 29 229-256.
  • Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Francisco.
  • Pearl, J. (1995). Causal diagrams for empirical research. Biometrika 82 669-688.
  • Pearl, J. and Robins, J. M. (1995). Probabilistic evaluation of sequential plans from causal models with hidden variables. In Uncertainty in Artificial Intelligence. Proceedings of the 11th Conference on Artificial Intelligence 444-453. Morgan Kaufmann, San Francisco.
  • Pearl, J. and Verma, T. (1991). A theory of inferred causation. In Principles of Knowledge Representation and Reasoning: Proceedings of the 2nd International Conference (J. A. Allen, R. Fikes and E. Sandewall, eds.) 441-452. Morgan Kaufmann, San Francisco.
  • Piantadosi, S. (1997). Clinical Trials: A Methodologic Perspective. Wiley, New York.
  • Popper, K. R. (1990). A World of Propensities. Thoemmes, Bristol, UK.
  • Quine, W. (1992). Pursuit of Truth. Harvard Univ. Press.
  • Robins, J. M. (1986). A new approach to causal inference in mortality studies with sustained exposure periods-application to control of the healthy worker survivor effect. Mathematical Modelling 7 1393-1512.
  • Robins, J. M. (1987). Addendum to "A new approach to causal inference in mortality studies with sustained exposure periods-application to control of the healthy worker survivor effect." Computers and Mathematics with Applications 14 923-945.
  • Robins, J. (1989a). The analysis of randomized and nonrandomized AIDS treatment trials using a new approach to causal inference in longitudinal studies. In Health Service Research Methodology: A Focus on AIDS (L. Sechrest, H. Freeman and A. Mulley, eds.). NCHSR, U.S. Public Health Service.
  • Robins, J. M. (1994). Confounding and DAGS. Technical report, Dept. Epidemiology, Harvard School of Public Health.
  • Robins, J. M. (1997). Causal inference from complex longitudinal data. In Latent Variable Modeling and Applications to Causality. Lecture Notes in Statist. 120 69-117. Springer, New York.
  • Robins J. M. (1999). Marginal structural models versus structural nested models as tools for causal inference. In Statistical Models in Epidemiology (E. Halloran, ed.) Springer, New York. To appear.
  • Robins, J. and Greenland, S. (1996). Comment on Angrist, Imbens, and Rubin's "Identification of causal effects using instrumental variables." J. Amer. Statist. Assoc. 91 456-458.
  • Rosenbaum, P. R. (1987a). The role of a second control group in an observational study (with discussion). Statist. Sci. 2 292-316.
  • Rosenbaum, P. R. (1989a). On permutation tests for hidden biases in observational studies. Ann. Statist. 17 643-653.
  • Rosenbaum, P. R. (1989b). The role of known effects in observational studies. Biometrics 45 557-569.
  • Rosenbaum, P. R. (1992). Detecting bias with confidence in observational studies. Biometrika 79 367-374.
  • Rosenbaum, P. R. (1995b). Quantiles in nonrandom samples and observational studies. J. Amer. Statist. Assoc. 90 1424-1431.
  • Sackett, D. L. (1979). Bias in analytic research. Journal of Chronic Diseases 32 51-63.
  • Salzberg, A. (1999). Removable selection bias in quasi-experiments. Amer. Statist. 53 103-107.
  • Shadish, W. R. (1999). The empirical program of quasi-experimentation. In Validity and Social Experimentation: Donald Campbell's Legacy (L. Bickman, ed.). Sage, Thousand Oaks, CA.
  • Shadish, W. R., Cook, T. D. and Campbell, D. T. (1999). Experimental and Quasi-Experimental Designs for Generalized Causal Inference. Houghton-Mifflin, Boston. To appear.
  • Shadish, W. R., Cook, T. D. and Leviton, L. C. (1995). Foundations of Program Evaluation. Sage, Thousand Oaks, CA.
  • Spirtes, P., Glymour, C. and Scheines, R. (1993). Causation, Prediction, and Search. Springer, New York.
  • Stanton, B. (1997). Editorial: good news for everyone? American Journal of Public Health 87 1917-1919.
  • Triplett, N. (1898). The dynamogenic factors in pacemaking and competition. American Journal of Psychology 9 507-533.
  • Toedter, L. J., Lasker, J. N. and Campbell, D. T. (1990). The comparison group problem in bereavement studies and the retrospective pretest. Evaluation Review 14 75-90.
  • Wittgenstein, L. (1972). On Certainty. Harper and Row, New York.
  • Yu, E., Xie, Q., Zhang, K., Lu, P. and Chan, L. (1996). HIV infection and AIDS in China. American Journal of Public Health 86 1116-1122.