class: center, middle, inverse, title-slide

# Observational studies: regression, stratification and matching

### Dr. Olanrewaju Michael Akande

### Nov 5, 2019

---

## Announcements

- Reminder: project proposals are due at 11:59pm today.

- Regrade requests for all assignments should be submitted via Gradescope.

- Modified office hours for Chenxi (this week only): 4:30-6:30 on Wednesday.

- HW4 later today.

- Second class survey later today.

---

## Outline

- Observational studies

- Regression-based estimation

- Balancing covariates

- Stratification

- Matching

---

class: center, middle

# Observational studies: setup, estimands and assumptions

---

## Observational studies

- We will not focus on randomized experiments, since most of the data you will have to analyze in practice actually come from observational studies.

--

- Recall that in observational studies, we do not control or know the assignment mechanism.

--

- In addition, the presence of measured and unmeasured confounders can create imbalance between the groups.

--

- To do causal inference, we often have to make structural (often untestable) assumptions, e.g., on the treatment assignment, in order to identify causal effects.

--

- Once we have those general assumptions, we usually also have to make modeling assumptions to do the actual estimation.

---

## Estimands

Once again, we will focus on the following estimands:

- The .hlight[average treatment effect (ATE)]:
.block[
.small[
$$
\tau = \mathbb{E}[Y_i(1) - Y_i(0)].
$$
]
]

--

- The .hlight[average treatment effect for the treated (ATT)]:
.block[
.small[
$$
\tau = \mathbb{E}[Y_i(1) - Y_i(0) | W_i = 1].
$$
]
]

--

- The .hlight[average treatment effect for the control (ATC)]:
.block[
.small[
$$
\tau = \mathbb{E}[Y_i(1) - Y_i(0) | W_i = 0].
$$
]
]

--

- For binary outcomes, the .hlight[causal odds ratio (OR) or risk ratio (RR)]:
.block[
.small[
$$
\tau = \dfrac{\mathbb{Pr}[Y_i(1) = 1]/\mathbb{Pr}[Y_i(1) = 0]}{\mathbb{Pr}[Y_i(0) = 1]/\mathbb{Pr}[Y_i(0) = 0]}.
$$
]
]

---

## Estimands

- The relationship between ATE, ATT and ATC is given by
.block[
.small[
$$
\textrm{ATE} = \mathbb{Pr}[W_i = 1] \cdot \textrm{ATT} + \mathbb{Pr}[W_i = 0] \cdot \textrm{ATC}.
$$
]
]

--

- In randomized experiments, ATT is equivalent to ATC because the treatment and control groups are similar/comparable.

--

- ATE is then also equivalent to ATT (and ATC).

--

- In observational studies, ATE is usually different from ATT and ATC.

--

- The above relation does not hold for ratio estimands.

---

## Assumptions: unconfoundedness

We will need two major assumptions (in addition to SUTVA).

--

Assumption 1: .hlight[Unconfoundedness]
.block[
.small[
$$
\{Y_i(0), Y_i(1)\} \perp W_i | X_i,
$$
]
]

or, using the equivalent form from last class,
.block[
.small[
$$
\mathbb{Pr}[W_i = 1 | X_i, Y_i(0), Y_i(1)] = \mathbb{Pr}[W_i = 1 | X_i].
$$
]
]

--

- Assumes that within subpopulations defined by values of the observed covariates, the treatment assignment is random.

--

- Rules out unobserved confounders (the assumption of "no unmeasured confounders").

--

- Randomized experiments satisfy unconfoundedness.

--

- Untestable in most observational studies, but sensitivity to it can be assessed.

---

## Implications of unconfoundedness

- Under unconfoundedness, it turns out that
.block[
.small[
$$
\mathbb{Pr}[Y(w) | X] = \mathbb{Pr}[Y^{\text{obs}} | X, W = w], \ \ \ \ w = 0,1.
$$
]
]

--

- That is, the observed distribution of `\(Y\)` in treatment arm `\(W = w\)` equals the distribution of the potential outcomes `\(Y(w)\)`.
--

<div class="question">
Why does this matter or how does this help us?
</div>

--

- Well, the causal estimands are essentially expectations and probabilities.

--

- And to calculate such expectations and probabilities, we need the distribution of the potential outcomes `\(Y(w)\)`.

---

## Implications of unconfoundedness

- Under unconfoundedness, we are therefore able to estimate the causal estimands from the observed data.

--

- Recall again that ATE is
.block[
.small[
$$
\textrm{ATE} = \mathbb{E}[Y_i(1) - Y_i(0)].
$$
]
]

--

- ATE can then be estimated from the observed data using
.block[
.small[
$$
\textrm{ATE} = \mathbb{E}_X\left(\mathbb{E}[Y^{\text{obs}} | X, W = 1] - \mathbb{E}[Y^{\text{obs}} | X, W = 0]\right).
$$
]
]

--

- Note that we need to average over the distribution of `\(X\)`, since the original formula for ATE does not depend on any `\(X\)`.

---

## Assumptions: overlap

Assumption 2: .hlight[Overlap (or positivity)]
.block[
.small[
$$
0 < \mathbb{Pr}[W_i = 1 | X_i] < 1, \ \ \ \textrm{for all}\ \ \ i.
$$
]
]

--

- Notice that this is the probabilistic assignment from last class, that is,
.block[
.small[
$$
0 < \mathbb{Pr}[W_i = 1 | X_i, Y_i(0), Y_i(1)] < 1.
$$
]
]

--

- However, we can now exclude `\(\{Y_i(0), Y_i(1)\}\)` because of the unconfoundedness assumption.

--

- The quantity
.block[
.small[
$$
e(x) = \mathbb{Pr}[W_i = 1 | X_i = x]
$$
]
]
is usually called the .hlight[propensity score].

---

## Implications of overlap

- Overlap implies that, in large samples, for all possible values of the covariates, there are both treated and control units.

--

- This is important within the potential outcomes (or counterfactual) framework both conceptually and operationally (lack of overlap inflates variances).

--

- Unlike unconfoundedness, overlap can be checked directly from the data, often using the estimated propensity scores.

--

- Unconfoundedness and positivity jointly define the .hlight[strong ignorability] assumption.

---

class: center, middle

# Regression-based estimation

---

## Estimation: regression-based

- With those assumptions, we can move on to estimation.

--

- Clearly, we need to adjust for any difference in the outcomes due to differences in pre-treatment characteristics.

--

- This is commonly done via a regression model for the potential outcomes given the covariates.

--

- However,
  1. the validity of the analysis relies critically on the validity of the unconfoundedness assumption (which, remember, is untestable); and
  2. usually, model parameters do not directly correspond to the causal estimand of interest.

---

## Estimation: regression-based

- For example, consider two regressions, one for each potential outcome. Write the mean functions as
.block[
.small[
`$$\mathbb{E}[Y(1) | X=x] = \mu_1(x), \ \ \ \mathbb{E}[Y(0) | X=x] = \mu_0(x).$$`
]
]
Notice that these need not be two completely separate regressions, but could be a single regression with `\(W\)` included as a predictor.

--

- Let `\(\hat{\mu}_w(X_i)\)` denote the fitted potential outcome for `\(Y_i(w)\)` based on the regression models.

--

- For ATE, the covariate-adjusted estimator is then
.block[
.small[
`$$\hat{\tau}_{\textrm{adj}} = \sum^N_{i=1} \dfrac{W_i (Y_i^{\text{obs}} - \hat{\mu}_0(X_i)) + (1-W_i) (\hat{\mu}_1(X_i) - Y_i^{\text{obs}}) }{N}.$$`
]
]

--

- Unlike in randomized experiments, the estimator is .hlight[not consistent] if the linear model is misspecified.

---

## Estimation: regression-based

- The variance can be estimated using the bootstrap.

--

- Note that regression itself does not take the lack of overlap into account.
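---

## Estimation: regression-based

- As a concrete illustration, here is a minimal R sketch of the covariate-adjusted estimator `\(\hat{\tau}_{\textrm{adj}}\)` and its bootstrap standard error. It assumes a hypothetical data frame `df` with outcome `Y`, binary treatment `W` (0/1) and covariates `X1`, `X2`; the linear specification is purely illustrative.

```r
# Fit an outcome model in each treatment arm (could equivalently be a single
# regression with W as a predictor); linear models are only for illustration.
fit1 <- lm(Y ~ X1 + X2, data = df[df$W == 1, ])
fit0 <- lm(Y ~ X1 + X2, data = df[df$W == 0, ])

mu1 <- predict(fit1, newdata = df)  # fitted Y(1) for every unit
mu0 <- predict(fit0, newdata = df)  # fitted Y(0) for every unit

# Covariate-adjusted ATE: impute each unit's missing potential outcome, then average
tau_adj <- mean(ifelse(df$W == 1, df$Y - mu0, mu1 - df$Y))

# Bootstrap standard error
boot_tau <- replicate(500, {
  d  <- df[sample(nrow(df), replace = TRUE), ]
  m1 <- predict(lm(Y ~ X1 + X2, data = d[d$W == 1, ]), newdata = d)
  m0 <- predict(lm(Y ~ X1 + X2, data = d[d$W == 0, ]), newdata = d)
  mean(ifelse(d$W == 1, d$Y - m0, m1 - d$Y))
})
c(estimate = tau_adj, boot_se = sd(boot_tau))
```

---

## Estimation: regression-based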
--

- If the imbalance of the covariates between the two groups is large, the model-based results rely heavily on extrapolation in the non-overlap region, which is sensitive to the model specification.

--

- .hlight[Take away]: Regression (or any model) here comes with a package of assumptions. You need to know and acknowledge what assumptions, explicit or implicit, come with that model.

---

## Strategies for mitigating model dependence

- To mitigate model dependence in the case of linear regression, there are two general strategies:
  1. Attempt to fix the design, i.e., balance the covariates.
  2. Use more flexible models for the analysis.

--

- The best strategy is actually to use both jointly: first balance the covariates in the design stage, then use flexible models in the analysis stage.

--

- However, in this class, we will not cover the kind of flexible models that would help, so we will focus on balancing the predictors/covariates instead.

---

## Strategies for mitigating model dependence

- .hlight[Covariate balance] (our focus)
  + Stratification
  + Matching
  + Propensity score methods

--

- .hlight[Flexible models] (we won't cover these)
  + Semiparametric models (e.g., power series)
  + Machine learning methods (e.g., tree-based methods (CART, random forests), boosting)
  + Bayesian non- and semiparametric models (e.g., Gaussian processes, BART, Dirichlet process mixtures)

---

class: center, middle

# Covariate balance

---

## Covariate balance

- Under strong ignorability, valid causal inference can be obtained by comparing the observed distributions of `\(Y\)` under treatment and control if the covariates are "balanced".

--

- Thus, a good practice is always to first check balance. That is, how similar are the two groups?

--

- What metric to use? The most common one is the .hlight[absolute standardized difference (ASD)]:
.block[
.small[
`$$\textrm{ASD}_1 = \dfrac{\left| \dfrac{\sum_{i=1}^N X_iW_i}{N_1} - \dfrac{\sum_{i=1}^N X_i(1-W_i)}{N_0} \right|}{\sqrt{\dfrac{s^2_1}{N_1} + \dfrac{s^2_0}{N_0}}},$$`
]
]
where `\(s^2_w\)` is the sample variance of the covariate in group `\(w\)` for `\(w = 0,1\)`, `\(N_1 = \sum_{i=1}^N W_i\)`, and `\(N_0 = \sum_{i=1}^N (1-W_i)\)`.

---

## Covariate balance

- For a continuous covariate, `\(\textrm{ASD}_1\)` is the standard two-sample t-statistic, and the threshold is based on a t- or z-test (e.g., 1.96).

--

- There is some debate on whether `\(N_1\)` and `\(N_0\)` should be in the denominator.

--

- In some disciplines, the ASD is defined as
.block[
.small[
`$$\textrm{ASD}_2 = \dfrac{\left| \dfrac{\sum_{i=1}^N X_iW_i}{N_1} - \dfrac{\sum_{i=1}^N X_i(1-W_i)}{N_0} \right|}{\sqrt{s^2_1 + s^2_0}}.$$`
]
]

--

- The common threshold is 0.1.

--

- Limitation of the ASD: it is based only on the difference in means (first moments), so it cannot capture differences in higher-order moments or interactions.

---

## Covariate balance

- More general, multivariate balance metrics are available.

--

- R package for balance assessment: `cobalt`.

--

- `cobalt` generates customizable balance tables and plots (marginal distribution and Love plots) for covariates, along with balance metrics.

--

- Besides checking marginal balance, it is always good to also check higher-order terms and interactions.

--

- However, most of the time the ASD is still the only balance metric checked in practice...

---

## The minimum wage analysis

- In 1992, New Jersey decided to raise its minimum wage from $4.25 an hour to $5.05 an hour.

--

- What was the .hlight[causal effect] of this decision on employment in the fast food industry?
--

- To study this, economists from Princeton collected data from fast food restaurants along the New Jersey-Pennsylvania border, with the Pennsylvania restaurants acting as a control group for the New Jersey restaurants.

--

- They also collected data on several covariates for the restaurants.

--

- The outcome is employment at each restaurant after the minimum wage was raised in New Jersey.

--

- For more information, see the NY Times article [Supersize My Wage](https://www.nytimes.com/2013/12/22/magazine/supersize-my-wage.html?pagewanted=all&_r=0).

---

## The minimum wage analysis

- The data is in the file `MinimumWageData.csv` on Sakai.

.small[
Variables | Description
:------------- | :------------
NJ.PA | indicator for which state the restaurant is in (1 if NJ, 0 if PA)
EmploymentPre | measures employment for each restaurant before the minimum wage raise in NJ
EmploymentPost | measures employment for each restaurant after the minimum wage raise in NJ
WagePre | measures the hourly wage for each restaurant before the minimum wage raise
BurgerKing | indicator for Burger King
KFC | indicator for KFC
Roys | indicator for Roys
Wendys | indicator for Wendys
]

---

## The minimum wage analysis

```r
MinWage <- read.csv("data/MinimumWageData.csv",header=T,
                    colClasses=c("factor","numeric","numeric","numeric",
                                 "factor","factor","factor","factor"))
str(MinWage)
```

```
## 'data.frame':   372 obs. of  8 variables:
##  $ NJ.PA         : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ EmploymentPost: num  18 29.5 24 30.5 9 6.5 13.5 25 26.5 23 ...
##  $ EmploymentPre : num  30 19 67.5 18.5 6 7 12.5 55 21.5 25.5 ...
##  $ WagePre       : num  5 5.5 5 5 5.25 5 5 5 5 5.5 ...
##  $ BurgerKing    : Factor w/ 2 levels "0","1": 1 1 2 2 1 1 1 2 2 2 ...
##  $ KFC           : Factor w/ 2 levels "0","1": 1 1 1 1 2 2 1 1 1 1 ...
##  $ Roys          : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 1 1 ...
##  $ Wendys        : Factor w/ 2 levels "0","1": 2 2 1 1 1 1 1 1 1 1 ...
```

```r
head(MinWage)
```

```
##   NJ.PA EmploymentPost EmploymentPre WagePre BurgerKing KFC Roys Wendys
## 1     0           18.0          30.0    5.00          0   0    0      1
## 2     0           29.5          19.0    5.50          0   0    0      1
## 3     0           24.0          67.5    5.00          1   0    0      0
## 4     0           30.5          18.5    5.00          1   0    0      0
## 5     0            9.0           6.0    5.25          0   1    0      0
## 6     0            6.5           7.0    5.00          0   1    0      0
```

---

## The minimum wage analysis

```r
summary(MinWage[,c(2:4)])
```

```
##  EmploymentPost  EmploymentPre      WagePre
##  Min.   : 0.00   Min.   : 3.00   Min.   :4.250
##  1st Qu.:11.25   1st Qu.:11.38   1st Qu.:4.250
##  Median :17.00   Median :16.38   Median :4.500
##  Mean   :17.33   Mean   :17.65   Mean   :4.611
##  3rd Qu.:22.50   3rd Qu.:21.00   3rd Qu.:4.890
##  Max.   :55.50   Max.   :80.00   Max.   :5.750
```

```r
summary(MinWage[,-c(2:4)])
```

```
##  NJ.PA   BurgerKing KFC     Roys    Wendys
##  0: 73   0:218      0:295   0:280   0:323
##  1:299   1:154      1: 77   1: 92   1: 49
```

---

## The minimum wage analysis

Let's examine covariate balance. First, summarize covariates by NJ and PA.

```r
summary(MinWage[MinWage$NJ.PA == 0, 3:8]) #first PA
```

```
##  EmploymentPre     WagePre      BurgerKing KFC    Roys   Wendys
##  Min.   : 4.5   Min.   :4.250   0:40       0:63   0:56   0:60
##  1st Qu.:12.5   1st Qu.:4.250   1:33       1:10   1:17   1:13
##  Median :17.0   Median :4.500
##  Mean   :20.1   Mean   :4.629
##  3rd Qu.:25.0   3rd Qu.:5.000
##  Max.   :67.5   Max.   :5.500
```

```r
summary(MinWage[MinWage$NJ.PA == 1, 3:8]) #now NJ
```

```
##  EmploymentPre      WagePre      BurgerKing KFC     Roys    Wendys
##  Min.   : 3.00   Min.   :4.250   0:178      0:232   0:224   0:263
##  1st Qu.:11.00   1st Qu.:4.250   1:121      1: 67   1: 75   1: 36
##  Median :15.75   Median :4.500
##  Mean   :17.05   Mean   :4.606
##  3rd Qu.:20.38   3rd Qu.:4.870
##  Max.   :80.00   Max.   :5.750
```
---

## The minimum wage analysis

Using the `bal.tab` function in the `cobalt` package, we have

```r
bal.tab(list(treat=MinWage$NJ.PA,covs=MinWage[,3:8],estimand="ATE"))
```

```
## Balance Measures
##                  Type Diff.Un
## EmploymentPre Contin.  0.2937
## WagePre       Contin.  0.0645
## BurgerKing     Binary  0.0474
## KFC            Binary -0.0871
## Roys           Binary -0.0180
## Wendys         Binary  0.0577
## 
## Sample sizes
##     Control Treated
## All      73     299
```

--

The distribution of prior employment is not well balanced across groups; the other variables are pretty close, but we might be able to do better.

---

## The minimum wage analysis

We can also use `love.plot` instead.

```r
love.plot(list(treat=MinWage$NJ.PA,covs=MinWage[,3:8],estimand="ATE"),stars = "std")
```

<img src="19-causal-inference-II_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />

---

## Balancing covariates: small number of covariates

- When the number of covariates is small, the adjustment can be achieved by exact matching or stratification.

--

- Exact matching: for each treated subject, find a control with exactly the same values of the covariates (easier for categorical covariates).

--

- Exact matching ensures the distributions of covariates in the treatment and control groups are exactly the same, thus eliminating bias due to differences in `\(X\)`.

--

- Exact matching is usually infeasible, even with low-dimensional covariates.

---

## Stratification

- Suppose we have a single covariate `\(X\)` with `\(k\)` levels (e.g., race), where `\(X\)` makes the assignment mechanism ignorable.

--

- Even better if we assume unconfoundedness. Again, remember that unconfoundedness means that
.block[
.small[
$$
\{Y_i(0), Y_i(1)\} \perp W_i | X_i.
$$
]
]

--

- Suppose we want to estimate ATE.

--

- Let
  + `\(n_k\)` be the number of observations with `\(X_i = k\)`; and
  + `\(\bar{Y}_{k,w}\)` be the sample average of all `\(Y_i\)` values among observations in cell `\(X_i = k\)` and `\(W_i = w\)`.

--

- Once again, recall that ATE is `\(\tau = \mathbb{E}[Y_i(1) - Y_i(0)]\)`.

---

## Stratification

- Then we have
.block[
.small[
$$
\mathbb{E}[Y_i(1)] = \sum_k \mathbb{E}[Y_i | X_i = k, W_i = 1] \cdot \mathbb{Pr}[X_i = k],
$$
]
]

--

and
.block[
.small[
$$
\mathbb{E}[Y_i(0)] = \sum_k \mathbb{E}[Y_i | X_i = k, W_i = 0] \cdot \mathbb{Pr}[X_i = k].
$$
]
]

--

- We can then estimate `\(\mathbb{E}[Y_i(1)]\)` using the consistent estimator `\(\sum_k \bar{Y}_{k,1} \dfrac{n_k}{N}\)`. We can use a similar estimator for `\(\mathbb{E}[Y_i(0)]\)`.

--

- Therefore, the ATE `\(\tau\)` can be estimated by
.block[
.small[
`$$\hat{\tau} = \sum_k \left(\bar{Y}_{k,1} - \bar{Y}_{k,0} \right) \dfrac{n_k}{N}.$$`
]
]

---

## Stratification

- What if `\(X\)` is continuous?

--

- Stratification (subclassification): split `\(X\)` into `\(k\)` classes.

--

- Then, for class `\(k\)`, define `\(n_k\)` and `\(\bar{Y}_{k,w}\)` as before.

--

- An estimator of `\(\tau\)` is then once again
.block[
.small[
`$$\hat{\tau}^k = \sum_k \left(\bar{Y}_{k,1} - \bar{Y}_{k,0} \right) \dfrac{n_k}{N}.$$`
]
]

--

- `\(\hat{\tau}^k\)` is generally biased for `\(\tau\)`; however, stratification into 5 or more blocks can remove about 90% of the bias!

---

## Nearest-neighbor matching

- Regression estimators impute the missing potential outcomes using the estimated regression function.

--

- Matching estimators also impute the missing potential outcomes, but do so using only the outcomes of nearest neighbors of the opposite treatment group.
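---

## Nearest-neighbor matching

- For intuition, here is a minimal by-hand sketch of 1-nearest-neighbor matching with replacement for the ATT. It assumes a hypothetical data frame `df` with outcome `Y`, binary treatment `W` (0/1) and a single covariate `X`, so the number of matches is one and the distance is simply `\(|X_i - X_j|\)`.

```r
treated  <- which(df$W == 1)
controls <- which(df$W == 0)

# For each treated unit, find the control whose X is closest
match_id <- sapply(treated, function(i) {
  controls[which.min(abs(df$X[controls] - df$X[i]))]
})

# Impute Y_i(0) for each treated unit with its matched control's outcome,
# then average the within-pair differences to estimate the ATT
tau_att_hat <- mean(df$Y[treated] - df$Y[match_id])
tau_att_hat
```

---

## Nearest-neighbor matching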
--

- They have often (but not exclusively) been applied in settings where
  + the interest is in the ATT; and
  + there is a large reservoir of potential controls.

  This allows matching each treated unit to one or more distinct controls (matching without replacement).

--

- More general settings: both treated and control units are (potentially) matched, and matching is done with replacement.

---

## Matching (fixed number of matches)

- Let `\(\cal{M}_i\)` be the set of indices of the `\(M\)` closest matches of unit `\(i\)`, using a distance metric that depends on `\(X\)`.

--

- Let
.block[
.small[
$$
`\begin{split}
\hat{Y}_i(0) & = \sum_{j \in \cal{M}_i} \dfrac{Y_j}{M} \ \ \ \textrm{if} \ \ \ W_i = 1 \ \ \ \textrm{and} \ \ \ \hat{Y}_i(0) = Y_i \ \ \ \textrm{if} \ \ \ W_i = 0; \\
\hat{Y}_i(1) & = Y_i \ \ \ \textrm{if} \ \ \ W_i = 1 \ \ \ \textrm{and} \ \ \ \hat{Y}_i(1) = \sum_{j \in \cal{M}_i} \dfrac{Y_j}{M} \ \ \ \textrm{if} \ \ \ W_i = 0.
\end{split}`
$$
]
]

--

- Then, the treatment effect within a pair is estimated as the difference in outcomes, and we can average these within-pair differences.

--

- That is,
.block[
.small[
$$
`\begin{split}
\hat{\tau}^{\textrm{ATE}} & = \sum_i \dfrac{\hat{Y}_i(1) - \hat{Y}_i(0) }{N}; \\
\hat{\tau}^{\textrm{ATT}} & = \sum_i \dfrac{\left(Y_i - \hat{Y}_i(0)\right)W_i }{N_1}. \\
\end{split}`
$$
]
]

---

## Matching (fixed number of matches)

- Pros: matching estimators that ensure good balance in covariates between groups are generally robust.

--

- Cons: with a fixed number of matches and matching with replacement, the NN matching estimators are generally biased.

--

- Matching estimators are generally not efficient; estimators combining matching and regression adjustment are usually more efficient.

--

- There can be residual imbalance in matching.

--

- Perform bias correction via regression on the matched sample.

---

## Matching: Tuning

- Matching involves lots of tuning
  + distance metric
  + fixed or varying number of matches
  + for fixed `\(M\)`, the number of matches
  + with or without replacement

--

- Tuning for matching is an art, with some theory and general guidelines available...

---

## Matching: Tuning

- Distance metric: Mahalanobis distance, propensity score, tree-based.

--

- Fixed `\(M\)` or varying `\(M\)`? For varying `\(M\)`:
  + matching with a caliper: define a caliper (say 0.1), and all units within that caliper are matches
  + `\(M\)` increases with sample size.

--

- For fixed `\(M\)`, the choice of `\(M\)` (number of matches per unit) has a bias-variance trade-off:
  + smaller `\(M \Rightarrow\)` smaller bias but larger variance
  + larger `\(M \Rightarrow\)` larger bias but smaller variance.

--

- It also depends on the proportion of treated versus control units: when there is a much larger control group, we can use one-to-many matching.

---

## Matching: Tuning

- Matching with replacement:
  + Pros:
    1. computationally easier
    2. both controls and treated can be matched, though with higher variances
    3. not order-dependent
  + Cons: some units (especially ones with extreme propensity scores) can be matched many times and thus heavily influence the overall estimates.

--

- What about matching ties? What should we do about them?

--

- Matching is a vast topic, and there are many matching methods.

--

- Implementation in R: `MatchIt`, `Matching`, and many more.

---

## Balancing covariates: large number of covariates

- What if we have a large number of covariates?

--

- With just 20 binary covariates, there are `\(2^{20}\)` or about a million covariate patterns!
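---

## Matching: implementation in R

- Before returning to the problem of many covariates, here is a hedged sketch of how one of the packages just mentioned (`Matching`) might be applied to the minimum wage data; the specific choices below (one-to-one matching with replacement, regression bias adjustment) are only for illustration, and next class's propensity score analysis is a separate approach.

```r
library(Matching)

# Numeric covariate matrix; Wendys is dropped since the four chain
# indicators are collinear (they sum to one)
X  <- model.matrix(~ EmploymentPre + WagePre + BurgerKing + KFC + Roys,
                   data = MinWage)[, -1]
Tr <- as.numeric(MinWage$NJ.PA == "1")  # 1 = NJ (treated), 0 = PA (control)
Y  <- MinWage$EmploymentPost

# One-to-one nearest-neighbor matching with replacement, targeting the ATT,
# with regression-based bias adjustment on the matched sample
m.out <- Match(Y = Y, Tr = Tr, X = X, estimand = "ATT",
               M = 1, replace = TRUE, BiasAdjust = TRUE)
summary(m.out)  # matched ATT estimate and standard error
```

---

## Balancing covariates: large number of covariates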
--

- Direct matching or stratification is nearly impossible.

--

- We need dimension reduction: the propensity score.

--

- We will focus on .hlight[propensity score methods] in the next class and use them to analyze the minimum wage data.

---

## Acknowledgements

These slides contain materials adapted from courses taught by Dr. Fan Li.
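---

## Appendix: checking overlap in the minimum wage data

- As an optional backup sketch, recall that overlap can be checked using estimated propensity scores, `\(e(x) = \mathbb{Pr}[W_i = 1 | X_i = x]\)`. The logistic regression below is only illustrative (Wendys is again dropped to avoid collinearity among the chain indicators); estimated scores very close to 0 or 1 would signal potential overlap problems.

```r
ps_fit <- glm(NJ.PA ~ EmploymentPre + WagePre + BurgerKing + KFC + Roys,
              family = binomial, data = MinWage)
MinWage$pscore <- fitted(ps_fit)

# Compare the distributions of estimated propensity scores by group
summary(MinWage$pscore[MinWage$NJ.PA == "1"])  # NJ (treated)
summary(MinWage$pscore[MinWage$NJ.PA == "0"])  # PA (control)
boxplot(pscore ~ NJ.PA, data = MinWage,
        xlab = "NJ.PA", ylab = "Estimated propensity score")
```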