Reminder: please respond to survey II.
No more HW5; more time to work on individual projects.
Final lab next Friday; will cover causal inference.
Propensity scores: definition and properties
Estimation
PS stratification
PS matching
PS regression
The propensity score (PS) is defined as the conditional probability of receiving a treatment given pre-treatment covariates X.
That is,
e(X) = \mathbb{Pr}[W = 1 \mid X] = \mathbb{E}[W \mid X],
where X = (X_1, \ldots, X_p) is the vector of p covariates/predictors.
The propensity score is a probability, analogous to a summary statistic of the covariates.
The propensity score has several properties that make it attractive within our causal inference framework.
Property 1. The propensity score e(X) balances the distribution of all X between the treatment groups:
W \perp X \mid e(X).
Equivalently,
\mathbb{Pr}[W_i = 1 \mid X_i, e(X_i)] = \mathbb{Pr}[W_i = 1 \mid e(X_i)].
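To see why Property 1 holds, use iterated expectations: conditioning on X fixes e(X), so
\mathbb{Pr}[W = 1 \mid X, e(X)] = \mathbb{Pr}[W = 1 \mid X] = e(X),
while
\mathbb{Pr}[W = 1 \mid e(X)] = \mathbb{E}\big[\mathbb{Pr}[W = 1 \mid X] \mid e(X)\big] = \mathbb{E}[e(X) \mid e(X)] = e(X).
Both probabilities equal e(X), which is exactly the equivalent statement above.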
The propensity score is not the only balancing score. Generally, a balancing score b(X) is a function of the covariates such that
W \perp X \mid b(X).
Rosenbaum and Rubin (1983) show that all balancing scores are a function of e(X).
If a subclass of units or a matched treated-control pair is homogeneous in e(X), then the treated and control units within it have the same distribution of X.
The balancing property is a statement about the distribution of X, NOT about the assignment mechanism or the potential outcomes.
Property 2. If W is unconfounded given X, then W is unconfounded given e(X). That is, if
\{Y_i(0), Y_i(1)\} \perp W_i \mid X_i
holds, then
\{Y_i(0), Y_i(1)\} \perp W_i \mid e(X_i)
also holds.
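Property 2 follows by the same iterated-expectations argument: under unconfoundedness given X,
\mathbb{Pr}[W_i = 1 \mid Y_i(0), Y_i(1), e(X_i)] = \mathbb{E}\big[\mathbb{Pr}[W_i = 1 \mid Y_i(0), Y_i(1), X_i] \mid Y_i(0), Y_i(1), e(X_i)\big] = \mathbb{E}[e(X_i) \mid Y_i(0), Y_i(1), e(X_i)] = e(X_i),
which is free of the potential outcomes, so treatment is unconfounded given e(X_i).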
Given a vector of covariates that ensure unconfoundedness, adjustment for differences in propensity scores removes all biases associated with differences in the covariates.
e(X) can be viewed as a summary score of the observed covariates.
This is great because causal inference can then be drawn through stratification, matching, regression, etc., using the scalar e(X) instead of the high-dimensional covariates.
The propensity score balances the observed covariates, but does not generally balance unobserved covariates.
In most observational studies, the propensity score e(X) is unknown and thus needs to be estimated.
Propensity score analysis (in observational studies) typically involves two stages:
Stage 1. Estimate the propensity score, e.g., by logistic regression or machine learning methods.
Stage 2. Given the estimated propensity score, estimate the causal effects through stratification, matching, weighting, or regression.
The main purpose of estimating the propensity score is to ensure overlap and balance of covariates between treatment groups, not to find a "perfect fit" for the propensity score.
As long as the important covariates are balanced, overfitting the model is not a concern; underfitting, however, can be a problem.
Essentially any balancing score (not necessarily propensity score) would be good enough for practical use.
A standard procedure for estimating propensity scores includes:
fit an initial model;
discard outliers (units with very large or very small propensity scores);
check covariate balance; and
re-fit if necessary.
Step 1. Estimate the propensity score using a logistic regression:
W_i \mid X_i \sim \textrm{Bernoulli}(\pi_i); \ \ \ \ \log\left(\dfrac{\pi_i}{1-\pi_i}\right) = X_i\boldsymbol{\beta}.
Include all covariates in this initial model, or do a stepwise selection on the covariates and interactions, to get an initial estimate of the propensity scores. That is,
\hat{e}^0(X_i) = \dfrac{e^{X_i\hat{\boldsymbol{\beta}}}}{1 + e^{X_i\hat{\boldsymbol{\beta}}}}.
Can also use machine learning methods.
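As a concrete illustration, a minimal R sketch of Step 1, assuming a data frame `df` with a binary treatment `W` and covariates `X1`, `X2` (all hypothetical names):

```r
# Step 1: fit the propensity score model by logistic regression
ps_fit <- glm(W ~ X1 + X2, data = df, family = binomial())

# Initial estimated propensity scores \hat{e}^0(X_i): the fitted probabilities
df$e_hat <- predict(ps_fit, type = "response")
```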
Step 2. Check overlap of propensity score between treatment groups. If necessary, discard the observations with non-overlapping propensity scores.
Step 3. Assess balance given by initial model in Step 1.
Step 4. If one or more covariates are seriously unbalanced, include some of their higher order terms and/or interactions to re-fit the propensity score model and repeat Steps 1-3, until most covariates are balanced.
Note: There are situations where some important covariates will still not be completely balanced after repeated trials. Then they should be taken into account in Stage 2 (outcome stage) of propensity score analysis.
In practice, balance checking in the PS estimation stage can be done via sub-classification/stratification, matching or weighting.
The workflow is the same: fit initial model, check balance (sub-classification, matching or weighting), refit.
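A sketch of the overlap and balance checks (Steps 2-3), continuing with the hypothetical `df`, `W`, and `e_hat` from the sketch above:

```r
# Step 2: visual overlap check of the estimated propensity scores
hist(df$e_hat[df$W == 1], col = rgb(1, 0, 0, 0.4), xlim = c(0, 1),
     main = "Propensity score overlap", xlab = "Estimated e(X)")
hist(df$e_hat[df$W == 0], col = rgb(0, 0, 1, 0.4), add = TRUE)

# Step 3: standardized mean difference (SMD) for one covariate;
# a common rule of thumb is |SMD| < 0.1 for acceptable balance
smd <- function(x, w) {
  (mean(x[w == 1]) - mean(x[w == 0])) /
    sqrt((var(x[w == 1]) + var(x[w == 0])) / 2)
}
smd(df$X1, df$W)
```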
Given the estimated propensity score, we can estimate the causal estimands through sub-classification/stratification, weighting or matching.
Let's start with stratification.
Recall the classical result (Cochran, 1968) that stratifying on 5 strata of a single covariate removes about 90% of the bias due to that covariate.
Stratification using the propensity score as the summary score should have approximately the same effect.
Divide the subjects into K strata by the corresponding quantiles of the estimated propensity scores.
ATE: estimate the ATE within each stratum and then average, weighting by block size. That is,
\hat{\tau}^{ATE} = \sum_{k=1}^K \left(\bar{Y}_{k,1} - \bar{Y}_{k,0} \right) \dfrac{N_{k,1}+N_{k,0}}{N},
with N_{k,1} and N_{k,0} being the numbers of treated and control units in class k, respectively.
ATT: weight the within-block ATEs by the proportion of treated units, N_{k,1}/N_1 (in symbols below).
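That is, mirroring the ATE estimator above:
\hat{\tau}^{ATT} = \sum_{k=1}^K \left(\bar{Y}_{k,1} - \bar{Y}_{k,0} \right) \dfrac{N_{k,1}}{N_1}.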
A variance estimator for \hat{\tau}^{ATE} is
\mathbb{Var}\left[\hat{\tau}^{ATE}\right] = \sum_{k=1}^K \left(\mathbb{Var}[\bar{Y}_{k,1}] + \mathbb{Var}[\bar{Y}_{k,0}] \right) \left(\dfrac{N_{k,1}+N_{k,0}}{N}\right)^2,
since the variances of the (independent) within-stratum means add; alternatively, use the bootstrap.
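A minimal sketch of propensity score stratification, assuming `df` also contains an observed outcome `Y` (hypothetical), with `e_hat` estimated as above:

```r
# Divide units into K = 5 strata by quintiles of the estimated PS
K <- 5
df$stratum <- cut(df$e_hat,
                  breaks = quantile(df$e_hat, probs = seq(0, 1, length.out = K + 1)),
                  include.lowest = TRUE, labels = FALSE)

# Within-stratum mean differences, variance terms, and stratum sizes
per_stratum <- sapply(split(df, df$stratum), function(d) {
  c(diff = mean(d$Y[d$W == 1]) - mean(d$Y[d$W == 0]),
    v = var(d$Y[d$W == 1]) / sum(d$W == 1) +
        var(d$Y[d$W == 0]) / sum(d$W == 0),
    n = nrow(d))
})

wts     <- per_stratum["n", ] / nrow(df)      # block-size weights
tau_ate <- sum(per_stratum["diff", ] * wts)   # stratified ATE estimate
var_ate <- sum(per_stratum["v", ] * wts^2)    # plug-in variance estimate
```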
Five blocks are usually not enough; consider a larger number, such as 10.
Stratification is a coarsened version of matching.
Empirical results from real applications: stratification usually does not perform as well as matching or weighting.
Good for cases with extreme outliers (smoothing): less sensitive, but also less efficient.
Can be combined with regression: first estimate causal effects using regression within each block and then average the within-subclass estimates.
In propensity score matching, potential matches are compared using the (estimated) propensity score.
1-to-n nearest neighbor matching is common when the control group is large compared to the treatment group.
In most software packages, the default is actually 1-to-1 nearest neighbor matching.
Pros: robust; produces matched pairs (so you can do within-pair analysis).
Sometimes dimension reduction via the propensity score may be too drastic; recent methods advocate matching on the multivariate covariates directly.
Nonetheless, propensity score matching is what we will focus on for our minimum wage data.
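A minimal sketch of 1-to-1 nearest neighbor propensity score matching using the MatchIt package, again with the hypothetical `df`, `W`, `Y`, `X1`, `X2`:

```r
library(MatchIt)

# 1-to-1 nearest neighbor matching on the propensity score
# (MatchIt estimates the score by logistic regression by default)
m_out <- matchit(W ~ X1 + X2, data = df, method = "nearest", ratio = 1)
summary(m_out)                      # covariate balance before/after matching

# ATT estimate: mean outcome difference in the matched sample
matched <- match.data(m_out)
with(matched, mean(Y[W == 1]) - mean(Y[W == 0]))
```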
Remember the key propensity score property:
\{Y_i(0), Y_i(1)\} \perp W_i \mid X_i \ \ \Rightarrow \ \ \{Y_i(0), Y_i(1)\} \perp W_i \mid e(X_i).
Idea: in a regression estimator, adjust for e(X) instead of the whole X; that is, in regression models of Y(w), use e(X) as the single predictor.
Clearly, modeling \mathbb{Pr}(Y(w)|\hat{e}(X)) is simpler than modeling \mathbb{Pr}(Y(w)|X); the dimension reduction effectively leaves more data to estimate the essential parameters.
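A minimal sketch of this single-predictor adjustment in the hypothetical running example, using a linear outcome model for simplicity:

```r
# Outcome regression adjusting only for the estimated propensity score
fit_ps_only <- lm(Y ~ W + e_hat, data = df)
coef(fit_ps_only)["W"]   # treatment effect estimate under this model
```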
However,
we lose interpretation of the effects of individual covariates, e.g. age, sex; and
reduction to the one-dimensional propensity score may be too drastic.
Idea: instead of using the estimated \hat{e}(X) as the single predictor, use it as an additional predictor in the model. That is, \mathbb{Pr}(Y(w)|X,\hat{e}(X)).
It turns out that \mathbb{Pr}(Y(w)|X,\hat{e}(X)) gives both efficiency and robustness.
Also, if we are unable to achieve full balance on some of the predictors, using \mathbb{Pr}(Y(w)|X,\hat{e}(X)) will help further control for those unbalanced predictors.
Empirical evidence (e.g., from simulation studies) supports this.
Why does it work? It is a continuous version of regression after stratification.
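A sketch of this combined adjustment under the same hypothetical setup:

```r
# Outcome regression on the covariates plus the estimated propensity score
fit_both <- lm(Y ~ W + X1 + X2 + e_hat, data = df)
coef(fit_both)["W"]   # treatment effect estimate
```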
Now let's actually see how this works with the minimum wage example from last class.
| Variable | Description |
|---|---|
| NJ.PA | indicator for which state the restaurant is in (1 if NJ, 0 if PA) |
| EmploymentPre | measures employment for each restaurant before the minimum wage raise in NJ |
| EmploymentPost | measures employment for each restaurant after the minimum wage raise in NJ |
| WagePre | measures the hourly wage for each restaurant before the minimum wage raise |
| BurgerKing | indicator for Burger King |
| KFC | indicator for KFC |
| Roys | indicator for Roys |
| Wendys | indicator for Wendys |
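As a preview of Stage 1 on this data, a hedged sketch assuming the data frame is named `minwage` (hypothetical; the in-class R script carries out the actual analysis). One chain indicator (`Wendys`) is omitted as the baseline, since the four indicators sum to one:

```r
# Propensity of being in NJ (the treated state) given pre-treatment covariates;
# Wendys is the omitted baseline chain to avoid collinearity
ps_mw <- glm(NJ.PA ~ EmploymentPre + WagePre + BurgerKing + KFC + Roys,
             data = minwage, family = binomial())
minwage$e_hat <- predict(ps_mw, type = "response")
```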
In-class analysis: move to the R script here.
These slides contain materials adapted from courses taught by Dr. Fan Li.