class: center, middle, inverse, title-slide # Introduction to causal inference ### Dr. Olanrewaju Michael Akande ### Oct 31, 2019 --- ## Announcements - Reminder: project presentations tomorrow at 10:00am. - Everyone should be in class by 9:55 at the latest. ## Outline - Association vs. causation; confounding - Potential outcomes framework - Causal estimands - Assignment mechanisms - Randomization vs. observational studies --- class: center, middle # Association vs. causation; confounding --- ## Causality --- <i class="fa fa-quote-left fa-2x fa-pull-left fa-border" aria-hidden="true"></i> <i class="fa fa-quote-right fa-2x fa-pull-right fa-border" aria-hidden="true"></i> We do not have knowledge of a thing until we have grasped its why, that is to say, <font color="red">its cause</font>. -- Aristotle, Physics -- </br> - Over the next few classes, we will discuss causal inference; specifically, we will focus on measuring the .hlight[effects of causes]. -- - For today's class, we will simply lay the foundations for causal inference. We will get more into the actual methods in the next class. --- ## Association vs. causation - In the models we have covered so far, our focus has been on inferring .hlight[associations] using samples drawn from our population of interest. -- - For example, we have been asking questions such as, do people who receive job training tend to earn higher wages than people who do not? -- - Causal inference goes further as we try to infer aspects of the actual data generating process, that is, .hlight[causation]. -- - For example, does receiving job training actually cause one to earn higher wages than they would have without the training? -- - The additional information needed to move from association to causation is often provided by .hlight[causal assumptions] (often untestable). -- - Note: in most cases, .hlight[association does not imply causation]! --- ## Confounding - Why is it that association often does not imply causation? 
.hlight[confounding variables or confounders]! -- - Causal relationship
-- - Confounding
--- ## Examples of confounding - Ice cream consumption and number of people who drowned. .hlight[Confounder: temperature]; people tend to consume more ice cream and also swim more when it is hot. -- - Medical treatment and patient outcome. .hlight[Confounders: age, sex, other complications] -- - Education and income. .hlight[Confounder: socio-economic status of family] -- - An extreme example of confounding is Simpson’s paradox, where a confounder reverses the sign of the correlation between treatment and outcome. --- ## Simpson's paradox - Example: kidney stone treatment (Charig et al., BMJ, 1986). + Compare the success rates of two treatments for kidney stones + Treatment A: open surgery; treatment B: small puncture -- <br />

| | Treatment A | Treatment B |
| :----------- | :--------- | :--------- |
| Small stones | <b>93%</b> (81/87) | 87% (234/270) |
| Large stones | <b>73%</b> (192/263) | 69% (55/80) |
| Both | 78% (273/350) | <b>83%</b> (289/350) |

-- + What is the confounder here? Severity of the case/type of stones. --- ## Simpson's paradox or Yule-Simpson effect - Simpson’s paradox: a trend appears in different groups of data but disappears or reverses when these groups are combined. <img src="img/SimpsonsParadox.png" height="250px" style="display: block; margin: auto;" /> -- - Mathematically, it is about conditioning. -- - Another well-known example is the Berkeley admission gender bias (Bickel et al., Science, 1975). --- ## General notation - .hlight[W]: Treatment (e.g. job training); we will focus on binary treatments. - .hlight[Y]: Outcome (e.g. annual wages). - .hlight[X]: Observed predictors or confounders (e.g. age, education, etc.). - .hlight[U]: Unobserved predictors or confounders. </br> -- - Examples of causal questions: + Causal effect of exposure to a disease. + Comparative effectiveness research such as in clinical trials: whether one drug or medical procedure is better than the other. + Program evaluation in economics and policy. 
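---

## Simpson's paradox: a numerical check

The reversal in the kidney stone table can be checked directly from the counts. The sketch below is only an illustration: the counts are copied from the table, while the names `counts`, `success_rate`, `stratum_rates`, and `pooled_rates` are made up for this example.

```python
# (successes, total) per stone size and treatment, copied from the kidney stone table
counts = {
    "small": {"A": (81, 87), "B": (234, 270)},
    "large": {"A": (192, 263), "B": (55, 80)},
}


def success_rate(successes, total):
    """Proportion of successful treatments."""
    return successes / total


# Success rates within each stratum (stone size)
stratum_rates = {
    size: {trt: success_rate(*cell) for trt, cell in by_trt.items()}
    for size, by_trt in counts.items()
}

# Success rates after pooling over stone size
pooled_rates = {}
for trt in ("A", "B"):
    successes = sum(counts[size][trt][0] for size in counts)
    total = sum(counts[size][trt][1] for size in counts)
    pooled_rates[trt] = success_rate(successes, total)

for size, rates in stratum_rates.items():
    print(size, {trt: round(r, 2) for trt, r in rates.items()})
print("pooled", {trt: round(r, 2) for trt, r in pooled_rates.items()})
```

Within each stone size, treatment A has the higher success rate, yet the pooled rate favors treatment B. The counts show why: A was used mostly on large stones (263 of 350 cases), while B was used mostly on small stones (270 of 350 cases).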
--- class: center, middle # Potential outcomes framework --- ## Potential outcomes framework - The .hlight[potential outcomes framework] or the counterfactual framework or the Rubin Causal Model (RCM) is arguably the most widely used framework across many disciplines, e.g., medicine, health care, policy, social sciences. -- - Under this framework, causal inference is viewed as a problem of missing data with explicit mathematical modeling of the assignment mechanism as a process for revealing the observed data. -- - Rooted in the statistical work on randomized experiments by Fisher (1918, 1925) and Neyman (1923), as extended by Rubin (1974, 1976, 1977, 1978, 1990). --- ## Potential outcomes framework - For a binary treatment, each individual gets exactly one of the two options, and we observe the corresponding response for that option. -- - Conceptually, under the potential outcomes framework, we think about what each individual's response would have been had they gotten the other treatment option instead of the one they actually got. -- - The individual causal effect then is the difference between the two "potential" outcomes, only one of which is observed. -- - Clearly, we never observe both potential outcomes for any individual, making it natural to think of this as a missing data problem. --- ## Potential outcomes framework --- <i class="fa fa-quote-left fa-2x fa-pull-left fa-border" aria-hidden="true"></i> <i class="fa fa-quote-right fa-2x fa-pull-right fa-border" aria-hidden="true"></i> If a person eats of a particular dish, and dies in consequence, that is, would not have died if he had not eaten of it, people would be apt to say that eating of that dish was the source of his death. -- Philosopher and economist Mill (1973, p. 327) --- ## Potential outcomes framework Applying the potential outcomes framework to this quotation: + There are two treatments: 1="eat dish" and 0="not eat dish". 
-- + There are therefore two potential outcomes `\(Y(1)\)` and `\(Y(0)\)` for the same person. -- + The observed outcome here for `\(Y(1)\)` is "death", so that `\(Y(0)\)` is, in fact, unobserved. -- + Mill appears to posit that if the alternate potential outcome `\(Y(0)\)` is "not death", then one could conclude that eating the dish was the "cause" of the death. --- ## Potential outcomes framework - No causation without manipulation: a "cause" must be (hypothetically speaking) something we can manipulate, e.g., an intervention, action, or treatment. -- - Thus, gender, time, and age are not well-defined "causes" under the RCM. -- - Three integral components of the potential outcomes framework: + .hlight[potential outcomes] corresponding to the various levels of a treatment. + .hlight[assignment mechanisms], that is, the treatment indicator for all observations. + a .hlight[model] for the science (the potential outcomes and covariates). --- ## Potential outcomes framework: basic concepts - .hlight[Unit]: The person, place, or thing upon which a treatment will operate, at a particular time (note: A single person, place, or thing at two different times comprises two different units). -- - .hlight[Treatment]: An intervention, the effects of which (on some particular measurement of the units) the investigator wishes to assess relative to no intervention (i.e., the control). -- - .hlight[Potential Outcomes]: The values of a unit’s measurement of interest after (a) application of the treatment and (b) non-application of the treatment (i.e., under control). -- - .hlight[Causal Effect]: For each unit, the comparison of the potential outcome under treatment and the potential outcome under control. --- ## Causal effects - For a single unit, let `\(Y(0)\)` denote the outcome given the control treatment and `\(Y(1)\)`, the outcome given the active treatment. 
-- - For example, suppose `\(Y\)` denotes a score (level of severity) for headache, then for a single unit, we could have -- <table> <caption>Raw scores</caption> <tr> <th>Unit</th> <th>Initial headache</th> <th height="30px" colspan="2">Potential outcomes</th> <th>Causal effect</th> </tr> <tr> <th> </th> <td height="30px" style="text-align:center" width="100px"> X </td> <td style="text-align:center" width="100px"> Y(asp) </td> <td style="text-align:center" width="100px"> Y(not) </td> <td style="text-align:center" width="250px"> Y(asp) - Y(not) </td> </tr> <tr> <td height="30px" style="text-align:center" width="80px"> you</td> <td style="text-align:center"> 80 </td> <td style="text-align:center"> 25 </td> <td style="text-align:center"> 75 </td> <td style="text-align:center"> -50 </td> </tr> </table> -- <table> <caption>Gain scores</caption> <tr> <th>Unit</th> <th>Initial headache</th> <th height="30px" colspan="2">Potential outcomes</th> <th>Causal effect</th> </tr> <tr> <th> </th> <td height="30px" style="text-align:center" width="100px"> X </td> <td style="text-align:center" width="100px"> Y(asp) - X </td> <td style="text-align:center" width="100px"> Y(not) - X </td> <td style="text-align:center" width="250px"> [Y(asp) - X] - [Y(not) - X] </td> </tr> <tr> <td height="30px" style="text-align:center" width="80px"> you</td> <td style="text-align:center"> 80 </td> <td style="text-align:center"> -55 </td> <td style="text-align:center"> -5 </td> <td style="text-align:center"> -50 </td> </tr> </table> --- ## The fundamental problem of causal inference As mentioned before, - The fundamental problem of causal inference: we can observe at most one of the potential outcomes for each unit. -- - Causal inference under the potential outcome framework is essentially a missing data problem. 
-- - To identify causal effects from observed data, under any mathematical framework, one must make assumptions (structural and/or stochastic). -- - Since we can at most observe a single potential outcome, we must rely on multiple units (and a lot of assumptions) to make causal inferences. --- ## The stable unit treatment value assumption (SUTVA) To quote the Causal Inference book by Imbens and Rubin, - In many situations, it may be reasonable to assume that treatments applied to one unit do not affect the outcome of another unit. -- - For example, if we are in different locations and have no contact with each other, it would appear reasonable to assume that whether you take an aspirin has no effect on the status of my headache. -- - .hlight[SUTVA] incorporates both this idea that units do not interfere with one another and the concept that for each unit, there is only a single version of each treatment (ruling out, in this case, that a particular individual could take aspirin tablets of varying efficacy). -- - Formally, we have.... --- ## SUTVA - .hlight[SUTVA] .block[The potential outcomes for any unit do not vary with the treatments assigned to other units, and for each unit, there are no different forms or versions of each treatment level, which lead to different potential outcomes.] -- - Mathematically, SUTVA `\(\Rightarrow\)` + If `\(W_i = 1\)`, then `\(Y_i = Y_i(1)\)` + If `\(W_i = 0\)`, then `\(Y_i = Y_i(0)\)` This seems trivial but may not hold. -- - SUTVA includes two assumptions: (1) no interference, (2) no different versions of a treatment <!-- -- --> <!-- - Also know as .hlight[consistency] --> --- ## SUTVA: no interference - .hlight[Interference]: the potential outcome `\(Y_i(w)\)` where `\(w = 0,1\)` for an individual `\(i\)` depends on what treatment other people receive. -- - SUTVA assumes we can't have this problem. -- - Examples: infectious diseases, social networks, agricultural experiments. 
-- - That is, there are lots of possible `\(Y_i(w)\)` *depending on what happens to other people*. -- - In the presence of interference, other assumptions are required for causal inference (e.g., Rosenbaum 2007; Hudgens and Halloran 2008). --- ## SUTVA: no hidden/multiple variations of treatment - .hlight[Multiple variations of treatment]: Sometimes the treatment `\(W\)` doesn't have a clear meaning, as it has many versions. -- - SUTVA also assumes we can't have this problem. -- - For example, suppose two people need to take one aspirin tablet each. -- - However, suppose that one of the tablets is old and no longer contains an effective dose, whereas the other is new and at full strength. -- - In that case, each person may now have three treatments available (no aspirin, the old tablet, or the new tablet), and we can now think of there being three potential outcomes for each person. -- - There are lots of possible `\(Y_i(w)\)` values depending on what version of `\(W\)` gets selected. --- ## Basic setup - .hlight[Target population]: a well-defined population of individuals whose outcomes are going to be compared. -- - .hlight[Data]: a random sample of `\(N\)` units from a target population. -- - A treatment with two levels: `\(w = 0,1\)`. -- - For each unit `\(i\)`, we observe + the binary treatment status `\(W_i \in \{0,1\}\)`, + a vector of `\(p\)` predictors/covariates `\(X_i = (X_{i1}, \ldots, X_{ip})\)`, and + an outcome `\(Y_i^{\text{obs}}\)`. --- ## Basic setup - For each unit `\(i\)`, there are two potential outcomes `\((Y_i(0),Y_i(1))\)`. -- - That is, the outcomes under the two values of the treatment, at most one of which is observed. 
-- - Potential outcomes and assignments jointly determine the values of the observed outcomes .block[ $$ Y_i^{\text{obs}} \equiv Y_i(W_i) = W_i \cdot Y_i(1) + (1-W_i) \cdot Y_i(0) $$ ] -- and the missing outcomes: .block[ $$ Y_i^{\text{mis}} \equiv Y_i(1-W_i) = (1-W_i) \cdot Y_i(1) + W_i \cdot Y_i(0) $$ ] --- ## Causal estimands - The .hlight[average treatment effect (ATE)]: .block[ $$ \tau = \mathbb{E}[Y_i(1) - Y_i(0)]. $$ ] -- - The .hlight[average treatment effect for the treated (ATT)]: .block[ $$ \tau = \mathbb{E}[Y_i(1) - Y_i(0) | W_i = 1]. $$ ] -- - The .hlight[average treatment effect for the control (ATC)]: .block[ $$ \tau = \mathbb{E}[Y_i(1) - Y_i(0) | W_i = 0]. $$ ] -- - For binary outcomes, the .hlight[causal odds ratio (OR) or risk ratio (RR)]; for example, the causal OR is: .block[ $$ \tau = \dfrac{\mathbb{Pr}[Y_i(1) = 1]/\mathbb{Pr}[Y_i(1) = 0]}{\mathbb{Pr}[Y_i(0) = 1]/\mathbb{Pr}[Y_i(0) = 0]}. $$ ] - These estimands are not identifiable without further assumptions. --- ## Assignment mechanism - The assignment mechanism describes how each individual came to receive the treatment level they actually received. -- - Usually, this is the key piece of information; thus, most key identifying assumptions in causal inference are on the assignment mechanism. -- - .hlight[Assignment mechanism]: the probabilistic rule that decides which unit gets assigned to which treatment. -- - In .hlight[randomized experiments], the assignment mechanism is usually .hlight[known] and .hlight[controlled] by investigators. -- - In .hlight[observational studies], the assignment mechanism is usually .hlight[unknown] and .hlight[uncontrolled]. 
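---

## Causal estimands: a small numerical sketch

The observed and missing outcome formulas and the ATE, ATT, and ATC definitions can be traced by hand on a tiny data set. The sketch below is only an illustration: the potential outcomes `y0` and `y1` reuse the values of the four-patient example in these slides, while the assignment vector `w` and all variable names are hypothetical choices made for this example.

```python
from statistics import mean

# Potential outcomes for four units (values from the four-patient example)
y0 = [1, 6, 1, 8]  # Y_i(0)
y1 = [7, 5, 5, 7]  # Y_i(1)

# Hypothetical assignment vector W (an assumption for illustration)
w = [1, 0, 1, 0]

# Observed and missing outcomes, following the two formulas above
y_obs = [wi * a + (1 - wi) * b for wi, a, b in zip(w, y1, y0)]
y_mis = [(1 - wi) * a + wi * b for wi, a, b in zip(w, y1, y0)]

# Unit-level causal effects; computable here only because both
# potential outcomes are known by construction
effects = [a - b for a, b in zip(y1, y0)]

ate = mean(effects)
att = mean([e for e, wi in zip(effects, w) if wi == 1])
atc = mean([e for e, wi in zip(effects, w) if wi == 0])

# Naive comparison that uses only the observed data
naive = mean([y for y, wi in zip(y_obs, w) if wi == 1]) - mean(
    [y for y, wi in zip(y_obs, w) if wi == 0]
)

print("ATE:", ate, "ATT:", att, "ATC:", atc, "naive difference:", naive)
```

With this particular `w`, the naive difference in observed means has the opposite sign from the ATE, which is one way to see why the estimands are not identified from the observed data without assumptions on the assignment mechanism.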
--- ## Example <table> <caption>Medical example with two treatments, four units and SUTVA: Surgery (S) and Drug Treatment (D)</caption> <tr> <th> Unit </th> <th height="30px" colspan="2">Potential Outcomes</th> <th> Causal Effect </th> </tr> <tr> <th> </th> <td height="30px" style="text-align:center" width="150px"> Y<sub>i</sub>(0) </td> <td style="text-align:center" width="150px"> Y<sub>i</sub>(1) </td> <td style="text-align:center" width="250px"> Y<sub>i</sub>(1) - Y<sub>i</sub>(0) </td> </tr> <tr> <td height="30px" style="text-align:center" width="130px"> Patient #1 </td> <td style="text-align:center"> 1 </td> <td style="text-align:center"> 7 </td> <td style="text-align:center"> 6 </td> </tr> <tr> <td height="30px" style="text-align:center" width="100px"> Patient #2 </td> <td style="text-align:center"> 6 </td> <td style="text-align:center"> 5 </td> <td style="text-align:center"> -1 </td> </tr> <tr> <td height="30px" style="text-align:center" width="100px"> Patient #3 </td> <td style="text-align:center"> 1 </td> <td style="text-align:center"> 5 </td> <td style="text-align:center"> 4 </td> </tr> <tr> <td height="30px" style="text-align:center" width="100px">Patient #4 </td> <td style="text-align:center"> 8 </td> <td style="text-align:center"> 7 </td> <td style="text-align:center"> -1 </td> </tr> <tr> <td height="30px" style="text-align:center" width="100px"> </td> <td style="text-align:center"> </td> <td style="text-align:center"> </td> <td style="text-align:center"> </td> </tr> <tr> <td height="30px" style="text-align:center" width="100px"> Average </td> <td style="text-align:center"> 4 </td> <td style="text-align:center"> 6 </td> <td style="text-align:center"> 2 </td> </tr> </table> --- ## Example <table> <tr> <th> </th> <th height="30px" colspan="2">Potential Outcomes</th> <th> </th> <th height="30px" colspan="3">Observed Data</th> </tr> <tr> <th> </th> <td height="30px" style="text-align:center" width="100px"> Y(0) </td> <td style="text-align:center" 
width="100px"> Y(1) </td> <td style="text-align:center" width="100px"> </td> <td style="text-align:center" width="100px"> W </td> <td height="30px" style="text-align:center" width="100px"> Y(0) </td> <td style="text-align:center" width="100px"> Y(1) </td> </tr> <tr> <td height="30px" style="text-align:center" width="100px"> </td> <td style="text-align:center"> 13 </td> <td style="text-align:center"> 14 </td> <td style="text-align:center"> </td> <td style="text-align:center"> 1 </td> <td style="text-align:center"> ? </td> <td style="text-align:center"> 14 </td> </tr> <tr> <td height="30px" style="text-align:center" width="100px"> </td> <td style="text-align:center"> 6 </td> <td style="text-align:center"> 0 </td> <td style="text-align:center"> </td> <td style="text-align:center"> 0 </td> <td style="text-align:center"> 6 </td> <td style="text-align:center"> ? </td> </tr> <tr> <td height="30px" style="text-align:center" width="100px"> </td> <td style="text-align:center"> 4 </td> <td style="text-align:center"> 1 </td> <td style="text-align:center"> </td> <td style="text-align:center"> 0 </td> <td style="text-align:center"> 4 </td> <td style="text-align:center"> ? </td> </tr> <tr> <td height="30px" style="text-align:center" width="100px"> </td> <td style="text-align:center"> 5 </td> <td style="text-align:center"> 2 </td> <td style="text-align:center"> </td> <td style="text-align:center"> 0 </td> <td style="text-align:center"> 5 </td> <td style="text-align:center"> ? </td> </tr> <tr> <td height="30px" style="text-align:center" width="100px"> </td> <td style="text-align:center"> 6 </td> <td style="text-align:center"> 3 </td> <td style="text-align:center"> </td> <td style="text-align:center"> 0 </td> <td style="text-align:center"> 6 </td> <td style="text-align:center"> ? 
</td> </tr> <tr> <td height="30px" style="text-align:center" width="100px"> </td> <td style="text-align:center"> 6 </td> <td style="text-align:center"> 1 </td> <td style="text-align:center"> </td> <td style="text-align:center"> 0 </td> <td style="text-align:center"> 6 </td> <td style="text-align:center"> ? </td> </tr> <tr> <td height="30px" style="text-align:center" width="100px"> </td> <td style="text-align:center"> 8 </td> <td style="text-align:center"> 10 </td> <td style="text-align:center"> </td> <td style="text-align:center"> 1 </td> <td style="text-align:center"> ? </td> <td style="text-align:center"> 10 </td> </tr> <tr> <td height="30px" style="text-align:center" width="100px"> </td> <td style="text-align:center"> 8 </td> <td style="text-align:center"> 9 </td> <td style="text-align:center"> </td> <td style="text-align:center"> 1 </td> <td style="text-align:center"> ? </td> <td style="text-align:center"> 9 </td> </tr> <tr> <td height="30px" style="text-align:center" width="100px"> True averages </td> <td style="text-align:center"> 7 </td> <td style="text-align:center"> 5 </td> <td style="text-align:center"> Observed averages </td> <td style="text-align:center"> </td> <td style="text-align:center"> 5.4 </td> <td style="text-align:center"> 11 </td> </tr> </table> --- ## Properties of assignment mechanisms - We will not cover all the mathematical details on assignment mechanisms in this course. -- - We use bold font to denote the vector of a variable across all sample units, e.g. `\(\boldsymbol{W} = (W_1, \ldots, W_N)\)`, `\(\boldsymbol{Y}(1) = (Y_1(1), \ldots, Y_N(1))\)`, and so on. -- - .hlight[Probabilistic assignment]: an assignment mechanism is probabilistic if for all `\(i\)`, and all `\(\boldsymbol{X}\)`, `\(\boldsymbol{Y}(0)\)` and `\(\boldsymbol{Y}(1)\)`, the probability of assignment is strictly between 0 and 1. That is, .block[ .small[ $$ 0 < \mathbb{Pr}[W_i = 1 | X_i, Y_i(0), Y_i(1)] < 1. 
$$ ] ] -- - .hlight[Local independence]: assignment probabilities do not depend on pre-treatment variables or potential outcomes for other units, that is, suppose `\(\boldsymbol{X}^\star = \boldsymbol{X}_{-i}\)`, `\(\boldsymbol{Y}^\star(0) = \boldsymbol{Y}_{-i}(0)\)`, and `\(\boldsymbol{Y}^\star(1) = \boldsymbol{Y}_{-i}(1)\)`, .block[ .small[ $$ \mathbb{Pr}[W_i = 1 | X_i, Y_i(0), Y_i(1), \boldsymbol{X}^\star,\boldsymbol{Y}^\star(0),\boldsymbol{Y}^\star(1)] = \mathbb{Pr}[W_i = 1 | X_i, Y_i(0), Y_i(1)]. $$ ] ] --- ## Properties of assignment mechanisms - .hlight[Ignorable assignment]: an assignment mechanism is ignorable if the assignment mechanism does not depend on the missing outcomes: .block[ .small[ $$ \mathbb{Pr}[W_i = 1 | X_i, Y_i(0), Y_i(1)] = \mathbb{Pr}[W_i = 1 | X_i, Y_i^{\text{obs}}]. $$ ] ] -- - .hlight[Unconfounded assignment]: an assignment mechanism is unconfounded if the assignment mechanism does not depend on the potential outcomes: .block[ .small[ $$ \mathbb{Pr}[W_i = 1 | X_i, Y_i(0), Y_i(1)] = \mathbb{Pr}[W_i = 1 | X_i]. $$ ] ] -- - Note that: + An unconfounded assignment is ignorable + An ignorable assignment may be confounded --- ## Classification of assignment mechanisms - .hlight[Randomized experiments]: a randomized experiment is an assignment mechanism such that + the assignment mechanism is ignorable + the assignment mechanism is probabilistic + the assignment mechanism is a known function of its arguments -- - .hlight[Observational studies]: an assignment mechanism corresponds to an observational study if it is an unknown function of its arguments. 
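---

## Unconfounded vs. confounded assignment: a simulation sketch

To make the unconfoundedness condition concrete, the sketch below simulates one set of potential outcomes and applies two assignment mechanisms to it: one where `\(\mathbb{Pr}[W_i = 1]\)` is 0.5 for every unit, and one where it depends on `\(Y_i(0)\)`. Everything here is an assumption made for illustration: the normal distribution for `\(Y_i(0)\)`, the constant effect of 2, the assignment probabilities 0.8 and 0.2, and all names in the code.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the run is reproducible
N = 20000
TRUE_EFFECT = 2.0  # hypothetical constant unit-level effect, so the ATE is 2

# Simulated potential outcomes (illustrative distributional choice)
y0 = [rng.gauss(0, 1) for _ in range(N)]
y1 = [y + TRUE_EFFECT for y in y0]


def diff_in_means(w):
    """Difference in observed mean outcomes between treated and control units."""
    treated = [y1[i] for i in range(N) if w[i] == 1]
    control = [y0[i] for i in range(N) if w[i] == 0]
    return mean(treated) - mean(control)


# Unconfounded: Pr[W_i = 1] = 0.5 regardless of the potential outcomes
w_random = [1 if rng.random() < 0.5 else 0 for _ in range(N)]

# Confounded: Pr[W_i = 1] depends on Y_i(0), violating unconfoundedness
w_confounded = [1 if rng.random() < (0.8 if y > 0 else 0.2) else 0 for y in y0]

print("true ATE:", TRUE_EFFECT)
print("unconfounded assignment:", round(diff_in_means(w_random), 2))
print("confounded assignment:", round(diff_in_means(w_confounded), 2))
```

Under the first mechanism the difference in observed means should land close to the true effect, while under the second it is pulled away from it, because units with larger `\(Y_i(0)\)` are more likely to be treated.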
--- ## Classification of assignment mechanisms - .hlight[Regular assignment mechanisms]: an assignment mechanism is regular if + the assignment mechanism is ignorable + the assignment is probabilistic + the assignment mechanism is locally independent -- - .hlight[Irregular assignment mechanisms]: examples of irregular assignment mechanisms include + instrumental variables (latent regular design) + regression discontinuity (the probabilistic assignment assumption is violated) + before/after control/treatment group designs or difference-in-differences methods --- ## The role of randomization - In randomized experiments, the assignment mechanism is known and controlled by investigators. -- - Randomization gives us the following: + balanced observed covariates: `\(W \perp \!\!\! \perp X\)`. + balanced unobserved covariates: `\(W \perp \!\!\! \perp U\)`. + balanced potential outcomes, i.e. guarantees ignorability or unconfoundedness (Rubin 1978) `\(W \perp \!\!\! \perp (Y(1),Y(0))\)`. where the `\(\perp \!\!\! \perp\)` symbol represents independence. -- - In other words, randomization ensures that the treated and control groups are similar on average in both observed and unobserved predictors, so that whatever difference we see in their response values can be attributed to the treatment. --- ## The role of randomization - Under randomization, causal effects are (nonparametrically) identified, because we can show .block[ .small[ $$ \mathbb{Pr}[Y_i(w)] = \mathbb{Pr}[Y_i^\textrm{obs}|W_i=w], \ \ \ \ w= 0,1. $$ ] ] -- - Thus, under randomization, .hlight[association does imply causation] (of course within the potential outcome framework and with assumptions), for example, .block[ .small[ $$ \textrm{ATE} = \mathbb{E}[Y_i^\textrm{obs}|W_i=1] - \mathbb{E}[Y_i^\textrm{obs}|W_i=0]. 
$$ ] ] -- since .block[ .small[ $$ `\begin{split} \textrm{ATE} & = \mathbb{E}[Y_i(1) - Y_i(0)] = \mathbb{E}[Y_i(1)] - \mathbb{E}[Y_i(0)] \\ & = \mathbb{E}[Y_i^\textrm{obs}|W_i=1] - \mathbb{E}[Y_i^\textrm{obs}|W_i=0], \end{split}` $$ ] ] using the identification result above. -- - In randomized experiments, ATE is equivalent to ATT (and ATC) because treatment and control groups are comparable in expectation. --- ## Observational studies - In observational studies, we do not control or know the assignment mechanism. -- - Presence of measured and unmeasured confounders: unbalanced between groups. -- - Some structural (often untestable) assumptions must be made, e.g. on the treatment assignment, for identifying causal effects. -- - Model assumptions are also made. -- - In observational studies, ATE is usually different from ATT and ATC. --- ## Acknowledgements These slides contain materials adapted from courses taught by Dr. Fan Li.