Due: 11:59pm, Nov 18, 2019
The purpose of this lab is to help you gain some experience doing causal inference.
Right Heart Catheterization (RHC) is a procedure for directly measuring how well the heart is pumping blood to the lungs. The procedure involves inserting a catheter (a hollow tube) into the right side of the heart and through to the pulmonary artery to monitor heart function and blood flow from the heart to the lungs. RHC is often applied to critically ill patients to direct immediate and subsequent treatment. However, administering RHC may cause serious complications, though the risks are usually small. There is some debate about whether the use of RHC actually leads to improved treatment.
In this lab you will work within sub-groups of your MIDS-assigned teams. If your MIDS-assigned team has four members, split the team into two groups of two, so that you get to work in pairs. If your MIDS-assigned team has five members, also split the team into two groups: one group will have three members and the other will have two. You can split your MIDS teams and select the groups as you like; however, NO STUDENT SHOULD WORK ALONE! If you cannot agree on the splits, reach out to me as soon as possible. The names of the two/three students in each group MUST be on the group’s lab report. Each group MUST submit ONLY ONE report for this lab. Gradescope will let you select your group members when submitting, so MAKE SURE TO DO SO. Only one person needs to submit the report on behalf of the group.
You all should have R and RStudio installed on your computers by now. If you do not, first install the latest version of R here: https://cran.rstudio.com (remember to select the right installer for your operating system). Next, install the latest version of RStudio here: https://www.rstudio.com/products/rstudio/download/. Scroll down to the “Installers for Supported Platforms” section and find the right installer for your operating system.
You are required to use R Markdown to type up this lab report. If you do not already know how to use R markdown, here is a very (very!) basic R Markdown template: https://akandelanre.github.io/IDS702_F19/labs/resources/LabReport.Rmd. Refer to the resources tab of the course website (here: https://akandelanre.github.io/IDS702_F19/resources/) for links to help you learn how to use R markdown.
You MUST submit both your .Rmd and .pdf files (again, just one copy to be submitted by only one of you) to the course site on Gradescope here: https://www.gradescope.com/courses/57701/assignments. Make sure to knit to pdf and not html; ask the TA about knitting to pdf if you cannot figure it out. Be sure to submit under the right assignment entry.
Download the data file (named rhc.txt) from Sakai and save it locally to the same directory as your R markdown file. To find the data file on Sakai, go to Resources \(\rightarrow\) Datasets \(\rightarrow\) Lab Datasets \(\rightarrow\) Lab 6. Once you have downloaded the data file into the SAME folder as your R markdown file, load the data and packages needed by using the following R code.
Remember to double-check the dimensions and first few rows of the data to confirm you have the right file. Also, convert the two variables treatment and dth30 to binary variables. You MUST exclude the variable surv2md1 from the analysis, as shown in the code, so that you are left with 51 covariates.
library(ggplot2)
library(cobalt)
library(MatchIt)
library(randomForest)
# Read in the data
RHC <- read.table("rhc.txt", header = TRUE)
# Drop surv2md1, which must be excluded from the analysis
RHC <- RHC[, names(RHC) != "surv2md1"]
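The checks described above (dimensions, first rows, binary conversion) might look like the sketch below. It is only an illustration: it writes a tiny made-up stand-in file so that it runs without rhc.txt, the column name treatment is taken from the instruction above, to_binary is a hypothetical helper, and reading "binary" as 0/1 integers is an assumption.

```r
# Tiny stand-in written to a temporary file so the snippet runs without rhc.txt.
# Column names follow the text above; the values are made up.
tmp <- tempfile(fileext = ".txt")
writeLines(c("treatment dth30 age surv2md1",
             "TRUE FALSE 70.3 0.64",
             "FALSE TRUE 78.2 0.27",
             "TRUE TRUE 46.1 0.45"), tmp)
RHC <- read.table(tmp, header = TRUE)
RHC <- RHC[, names(RHC) != "surv2md1"]

dim(RHC)   # rows x columns; the real file should have 5735 rows
head(RHC)  # first few rows

# Hypothetical helper: TRUE/FALSE (logical or text) -> 1/0 integers
to_binary <- function(x) as.integer(as.logical(x))
RHC$treatment <- to_binary(RHC$treatment)
RHC$dth30     <- to_binary(RHC$dth30)
table(RHC$treatment, RHC$dth30)
```

With the real file, only the lines from dim(RHC) onward would be needed after the loading code above.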
The dataset contains data on 5735 hospitalized adult patients at five medical centers in the U.S. The variable rhc indicates whether RHC was applied within 24 hours of admission (TRUE/FALSE). Each patient was followed up with some treatment procedures that may have been influenced by the RHC result if it was performed on the patient. The outcome variable is dth30, death at 30 days (TRUE: dead; FALSE: alive). Based on information from a panel of experts, a set of 52 variables was identified that are potentially related to both the decision to use RHC and the outcome. The goal is to estimate the treatment effect of RHC on death at 30 days. Let \(\Pr(Y(w) = 1) = p_w\) be the fraction of patients with dth30=1 for treatment group \(w\) (\(w = 0\) or \(1\), that is, rhc=FALSE or rhc=TRUE respectively). The target estimand is the average causal effect \(Q = p_1 - p_0\) for patients with rhc=TRUE. That is, we will focus on estimating the ATT (the average treatment effect for those who actually received RHC) and not the ATE. Because this analysis models death, high probabilities are BAD: if RHC works, the ATT should be negative.
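In potential-outcomes notation the estimand can be written out as follows. The symbol \(W\) for the treatment indicator (\(W = 1\) when rhc=TRUE) is introduced here for illustration and is not part of the lab text.

```latex
\[
Q \;=\; p_1 - p_0
  \;=\; \Pr\bigl(Y(1) = 1 \mid W = 1\bigr) - \Pr\bigl(Y(0) = 1 \mid W = 1\bigr)
  \;=\; \mathbb{E}\bigl[\, Y(1) - Y(0) \mid W = 1 \,\bigr].
\]
```

Only \(Y(1)\) is observed for treated patients, so the second term has to be estimated, either from a model or from matched controls.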
This data is based on the data analysis part of: https://stat.duke.edu/sites/stat.duke.edu/files/site-images/FYE13.pdf (the problem and data are actually slightly different).
Download the code book for the data (named rhc_codebook.pdf) from Sakai; go to the same folder you downloaded the data file from.
1. Are the covariates in this data balanced between the two groups? If not, which covariates are not? How did you assess balance?
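One common way to assess balance is the standardized mean difference (SMD) of each covariate between the two groups. The sketch below computes it by hand on synthetic data; x1 and x2 are hypothetical covariates, smd is a hypothetical helper, and the 0.1 cutoff is a rule of thumb rather than something the lab prescribes.

```r
# Synthetic stand-in for the lab data: x1 and x2 are hypothetical covariates.
set.seed(702)
n   <- 500
x1  <- rnorm(n)
x2  <- rbinom(n, 1, 0.4)
rhc <- rbinom(n, 1, plogis(-0.8 + 0.8 * x1 + 0.5 * x2))  # 1 = received RHC
d   <- data.frame(rhc, x1, x2)

# Standardized mean difference: difference in group means divided by the
# square root of the average of the two group variances.
smd <- function(x, treat) {
  m1 <- mean(x[treat == 1])
  m0 <- mean(x[treat == 0])
  s  <- sqrt((var(x[treat == 1]) + var(x[treat == 0])) / 2)
  (m1 - m0) / s
}

balance <- sapply(d[c("x1", "x2")], smd, treat = d$rhc)
round(balance, 3)
names(balance)[abs(balance) > 0.1]  # covariates flagged as imbalanced
```

The cobalt package loaded above provides bal.tab() and love.plot() for the same purpose, and they also handle categorical covariates, which this hand-rolled version does not.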
2. Fit a logistic regression to the response variable dth30 using the main effects of all covariates and rhc. Using the fitted model, compute the ATT for the patients with rhc=TRUE as follows:
   - Compute the predicted probabilities for the patients with rhc=TRUE using the fitted model. This is an estimate of p1 for those patients. In R, you can use predict(model_object,type="response").
   - Set rhc=FALSE for those same patients, keeping their values for the other covariates unchanged, and generate new predicted probabilities for them. This is an estimate of p0 for those patients.
   - Compute the average of the differences between the two sets of predicted probabilities (for the patients whose rhc equals TRUE). The average is the estimated ATT.
   You really should be answering this question using a confidence interval, so as to quantify uncertainty. This can be done relatively easily via the bootstrap. You should try it on your own.
3. Based on your estimated effect from question 2, are treated patients better or worse off with RHC? Given your answer to question 1, do you trust your conclusions here? Why or why not?
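The three steps of question 2, together with a bootstrap interval, can be sketched as below. This is a minimal illustration on synthetic data, assuming rhc and dth30 are coded 0/1; x1, x2, att_hat and the choice of 500 resamples with a percentile interval are all assumptions made for the example.

```r
# Synthetic stand-in for the lab data (hypothetical covariates x1, x2).
set.seed(702)
n     <- 500
x1    <- rnorm(n)
x2    <- rbinom(n, 1, 0.4)
rhc   <- rbinom(n, 1, plogis(-0.8 + 0.8 * x1 + 0.5 * x2))
dth30 <- rbinom(n, 1, plogis(-0.7 + 0.6 * x1 + 0.3 * rhc))
d     <- data.frame(rhc, dth30, x1, x2)

# Hypothetical helper: model-based ATT for one data set.
att_hat <- function(dat) {
  fit     <- glm(dth30 ~ rhc + x1 + x2, family = binomial, data = dat)
  treated <- dat[dat$rhc == 1, ]
  # Step 1: predicted probabilities under the treatment actually received (p1)
  p1 <- predict(fit, newdata = treated, type = "response")
  # Step 2: same patients with rhc switched off, covariates unchanged (p0)
  untreated <- transform(treated, rhc = 0)
  p0 <- predict(fit, newdata = untreated, type = "response")
  # Step 3: average difference over the treated patients
  mean(p1 - p0)
}

att <- att_hat(d)

# Nonparametric bootstrap: resample rows, refit, recompute the ATT.
boot_att <- replicate(500, att_hat(d[sample(nrow(d), replace = TRUE), ]))
ci <- quantile(boot_att, c(0.025, 0.975))  # percentile interval
att
ci
```

With the lab data, a formula such as dth30 ~ . would bring in all remaining columns as main effects, provided the data frame holds only the outcome, rhc, and the covariates.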
4. Estimate the propensity score e using a logistic regression with all covariates entering the model as main effects. Assess overlap: that is, are there any observations with an estimated propensity score e that is outside the range of e in the other group? If there are only a few such outliers (fewer than 5), keep them; if there are many, discard them and report the number of discarded observations.
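The overlap check in question 4 can be sketched as follows, again on synthetic data with hypothetical covariates x1 and x2. Treating "many" as five or more is one reading of the instruction, not a rule stated in the lab.

```r
# Synthetic stand-in for the lab data (hypothetical covariates x1, x2).
set.seed(702)
n   <- 500
x1  <- rnorm(n)
x2  <- rbinom(n, 1, 0.4)
rhc <- rbinom(n, 1, plogis(-0.8 + 0.8 * x1 + 0.5 * x2))
d   <- data.frame(rhc, x1, x2)

# Propensity score: estimated probability of receiving RHC given covariates.
ps_fit <- glm(rhc ~ x1 + x2, family = binomial, data = d)
d$e    <- fitted(ps_fit)

# Range of the estimated scores within each group.
range_t <- range(d$e[d$rhc == 1])
range_c <- range(d$e[d$rhc == 0])

# An observation lacks overlap if its score falls outside the OTHER group's range.
outside <- ifelse(d$rhc == 1,
                  d$e < range_c[1] | d$e > range_c[2],
                  d$e < range_t[1] | d$e > range_t[2])
n_out <- sum(outside)
n_out

# Assumed reading of the instruction: keep everything if fewer than 5 fall outside.
d_trim <- if (n_out >= 5) d[!outside, ] else d
nrow(d_trim)
```

Plotting the two score distributions side by side (for example with ggplot2) is a useful visual complement to the range check.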
5. Using one-to-one, nearest neighbor matching on the estimated propensity scores, check balance again. Are the covariates balanced now? If not, which ones are not?
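To show what one-to-one nearest neighbor matching does, the sketch below implements a greedy version by hand in base R, without replacement. It is a stand-in for illustration only: the lab loads MatchIt, whose matchit() function with method = "nearest" is the more usual route, and match_nn, smd, x1 and x2 are hypothetical names.

```r
# Synthetic stand-in for the lab data (hypothetical covariates x1, x2).
set.seed(702)
n   <- 500
x1  <- rnorm(n)
x2  <- rbinom(n, 1, 0.4)
rhc <- rbinom(n, 1, plogis(-0.8 + 0.8 * x1 + 0.5 * x2))
d   <- data.frame(rhc, x1, x2)
d$e <- fitted(glm(rhc ~ x1 + x2, family = binomial, data = d))

# Greedy 1:1 nearest neighbor matching without replacement.
# Treated units are processed from the highest score down (an assumed ordering).
match_nn <- function(e, treat) {
  t_idx   <- which(treat == 1)
  t_idx   <- t_idx[order(e[t_idx], decreasing = TRUE)]
  pool    <- which(treat == 0)
  control <- rep(NA_integer_, length(t_idx))
  for (k in seq_along(t_idx)) {
    if (length(pool) == 0) break           # no controls left to match
    best       <- which.min(abs(e[pool] - e[t_idx[k]]))
    control[k] <- pool[best]
    pool       <- pool[-best]              # each control is used at most once
  }
  keep <- !is.na(control)
  data.frame(treated = t_idx[keep], control = control[keep])
}

pairs   <- match_nn(d$e, d$rhc)
matched <- d[c(pairs$treated, pairs$control), ]

# Compare standardized mean differences before and after matching.
smd <- function(x, treat) {
  (mean(x[treat == 1]) - mean(x[treat == 0])) /
    sqrt((var(x[treat == 1]) + var(x[treat == 0])) / 2)
}
smd_before <- sapply(d[c("x1", "x2")], smd, treat = d$rhc)
smd_after  <- sapply(matched[c("x1", "x2")], smd, treat = matched$rhc)
round(rbind(before = smd_before, after = smd_after), 3)
```

Unmatched controls are dropped, which is why the matched sample targets the treated group.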
6. Estimate the average causal effect \(Q\) using the matched sample obtained above. Construct a 95% confidence interval and interpret your findings. Are treated patients better or worse off with RHC?
   Use a normal approximation to construct the 95% confidence interval, where the standard error should be estimated using the formula for the standard error of a difference in proportions (if you are not familiar with this, check page 280 of the OIS book we used for the online summer review).
Note that the estimated average causal effect computed on the matched sample is the ATT we want (where the treated group is rhc = 1) and not the ATE, because we discard observations in the control group for which we cannot find matches. This is something you should keep in mind should you want to estimate the ATE in your own projects and applications in the future.
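The normal-approximation interval for a difference in proportions mentioned in question 6 can be sketched as below. The function name diff_prop_ci and the two outcome vectors are made up for illustration; in the lab they would be the dth30 values of the matched treated and matched control patients.

```r
# Difference in proportions with a normal-approximation confidence interval.
# SE = sqrt(p1 * (1 - p1) / n1 + p0 * (1 - p0) / n0)
diff_prop_ci <- function(y1, y0, level = 0.95) {
  p1 <- mean(y1)
  p0 <- mean(y0)
  se <- sqrt(p1 * (1 - p1) / length(y1) + p0 * (1 - p0) / length(y0))
  z  <- qnorm(1 - (1 - level) / 2)
  c(estimate = p1 - p0,
    lower    = p1 - p0 - z * se,
    upper    = p1 - p0 + z * se)
}

# Made-up outcomes: 40 deaths among 100 treated, 30 among 100 matched controls.
y_treated <- rep(c(1, 0), c(40, 60))
y_control <- rep(c(1, 0), c(30, 70))

res <- diff_prop_ci(y_treated, y_control)
round(res, 3)  # estimate 0.100, lower -0.031, upper 0.231
```

An interval that covers zero, as in this made-up example, would not support a conclusion in either direction.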
7. Now estimate the propensity score e using a random forest with all the covariates. Assess overlap: that is, are there any observations with an estimated propensity score e that is outside the range of e in the other group? If there are only a few such outliers (fewer than 5), keep them; if there are many, discard them and report the number of discarded observations.
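A random forest propensity score could be obtained roughly as follows. The sketch assumes the randomForest package is installed (the code is skipped otherwise), uses synthetic data with hypothetical covariates x1 and x2, and picks ntree = 500 arbitrarily. It uses out-of-bag probabilities, since in-sample forest predictions tend to sit very close to 0 or 1.

```r
# Synthetic stand-in for the lab data (hypothetical covariates x1, x2).
set.seed(702)
n   <- 500
x1  <- rnorm(n)
x2  <- rbinom(n, 1, 0.4)
rhc <- rbinom(n, 1, plogis(-0.8 + 0.8 * x1 + 0.5 * x2))
d   <- data.frame(rhc, x1, x2)

if (requireNamespace("randomForest", quietly = TRUE)) {
  # Classification forest: the response must be a factor.
  rf <- randomForest::randomForest(x = d[, c("x1", "x2")],
                                   y = factor(d$rhc, levels = c(0, 1)),
                                   ntree = 500)
  # predict() without newdata returns out-of-bag class probabilities;
  # the column named "1" is the estimated probability of receiving RHC.
  d$e_rf <- predict(rf, type = "prob")[, "1"]

  # Score ranges by group, as a first look at overlap.
  print(tapply(d$e_rf, d$rhc, range))
}
```

The overlap and matching steps then proceed exactly as before, with e_rf in place of the logistic regression scores.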
8. Using one-to-one, nearest neighbor matching on the new estimated propensity scores from question 7, check balance again. Are the covariates balanced now? If not, which ones are not?
9. Estimate the average causal effect \(Q\) using the new matched sample obtained in question 8. Construct a 95% confidence interval and interpret your findings (use your code from question 6).
10. Using the new matched sample obtained in question 8, fit a logistic regression to the response variable using the main effects of all covariates. Also include the new propensity scores e from question 7 as a predictor. Report the estimated causal odds ratio and its corresponding confidence interval. Interpret your findings. Are your overall findings in line with your findings from question 9?
   Note that this estimated effect is no longer an estimate of Q = p1 - p0, but intuitively we can still draw “roughly” the same conclusions from it.
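The regression on the matched sample in question 10 can be sketched as follows. To stay within base R, this illustration uses logistic regression propensity scores and a hand-rolled greedy match rather than the random forest scores and matching function the lab asks for; the covariates x1 and x2 are hypothetical, and the Wald interval from confint.default() is one choice among several.

```r
# Synthetic stand-in for the lab data (hypothetical covariates x1, x2).
set.seed(702)
n     <- 500
x1    <- rnorm(n)
x2    <- rbinom(n, 1, 0.4)
rhc   <- rbinom(n, 1, plogis(-0.8 + 0.8 * x1 + 0.5 * x2))
dth30 <- rbinom(n, 1, plogis(-0.7 + 0.6 * x1 + 0.3 * rhc))
d     <- data.frame(rhc, dth30, x1, x2)
d$e   <- fitted(glm(rhc ~ x1 + x2, family = binomial, data = d))

# Greedy 1:1 nearest neighbor match on e, without replacement.
t_idx   <- which(d$rhc == 1)
t_idx   <- t_idx[order(d$e[t_idx], decreasing = TRUE)]
pool    <- which(d$rhc == 0)
control <- rep(NA_integer_, length(t_idx))
for (k in seq_along(t_idx)) {
  if (length(pool) == 0) break
  best       <- which.min(abs(d$e[pool] - d$e[t_idx[k]]))
  control[k] <- pool[best]
  pool       <- pool[-best]
}
keep    <- !is.na(control)
matched <- d[c(t_idx[keep], control[keep]), ]

# Outcome model on the matched sample: treatment, covariates, propensity score.
fit <- glm(dth30 ~ rhc + x1 + x2 + e, family = binomial, data = matched)

# Odds ratio for rhc with a 95% Wald interval (exponentiated log-odds scale).
or    <- unname(exp(coef(fit)["rhc"]))
or_ci <- unname(exp(confint.default(fit)["rhc", ]))
or
or_ci
```

An odds ratio above 1 corresponds to higher odds of death at 30 days with RHC, and an interval that includes 1 is compatible with no effect.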
10 points: 1 point for each question