THIS IS AN OPTIONAL HOMEWORK. IF YOU CHOOSE TO SUBMIT IT, THE SCORE WILL REPLACE THE LOWEST SCORE OUT OF THE FOUR PREVIOUS METHODS AND DATA ANALYSIS ASSIGNMENTS, AS LONG AS IT EXCEEDS AT LEAST ONE OF THE SCORES FROM THOSE PREVIOUS ASSIGNMENTS.

Instructions

This assignment involves causal inference. Note that this is an individual assignment, so you must work alone. You can discuss basic details with classmates but your final work must be yours alone! Please type your solutions using R Markdown, LaTeX or any other word processor but be sure to knit or convert the final output file to “.pdf”. Submissions should be made on gradescope: go to Assignments \(\rightarrow\) Methods and Data Analysis 5.

Questions

  1. ASTHMA PATIENTS IN CALIFORNIA.
    The data for this question can be found in the file “Asthma.txt” on Sakai.
    The data set is from a study to compare the quality of services provided by two physician groups for asthma patients in California. Specifically, for patient i, let Yi(w) be the quality of service as judged by the patient (1=satisfactory, 0=not satisfactory), if the patient is served by physician group \(w\), for \(w = 1,2\). The patients who visit the two groups can differ, and so a set of covariates are measured. The variables in the data are:

    Variable Description
    pg (treatment assignment) physician group; values = 1 and 2
    i_age age (continuous)
    i_sex sex (binary)
    i_race race (categorical)
    i_educ education (categorical)
    i_insu insurance status (categorical)
    i_drug drug coverage status (categorical)
    i_seve severity (categorical)
    com_t total number of comorbidity (numeric)
    pcs_sd standard physical comorbidity scale (continuous)
    mcs_sd standard mental comorbidity scale (continuous)
    i_aqoc (outcome) satisfaction status of patient (binary)
    Let pr(Y(w) = 1) = pw be the fraction of patients who would be satisfied with the service provided if all patients were to be served by physician group w (again, w=1 or 2). The target estimand is the average causal effect Q = p2 - p1.
    • Are the covariates in this data balanced between the two groups? If no, which covariates are not? How did you assess balance?
    • Estimate the propensity score e using a logistic regression with all pre-treatment variables entering in the model as main effects.
      • (a) Are there any observations with an estimated propensity score e that is out of the range of e in the other group? If there are only a few such outliers (less than 5), keep them; If many, discard them and report the number of the discarded observations. This is to ensure overlap!
      • (b) Using one-to-one, nearest neighbor matching on the estimated propensity scores, check balance again. Are the covariates balanced now? If no, which ones are not?
      • (c) Estimate the average causal effect \(Q\) using the matched sample obtained above. Also, report a standard error for your estimate (use the formula for computing standard error for difference in proportions; if you are not familiar with this, check page 280 of the OIS book we used for the online summer review). Construct a 95% confidence interval and interpret your findings.
      • (d) Fit a logistic regression to the response variable using the main effects of all pre-treatment variables. Also include the propensity score e as a predictor. Report the estimated causal odds ratio. If it is significant, interpret the effect in context of the problem. Note that this estimated effect is not an estimate of Q = p2 - p1 but intuitively, it still makes sense to look at it.
      • (e) Repeat parts (b) to (d) using one-to-many (five) nearest neighbor matching with replacement, instead of one-to-one nearest neighbor matching. How do your results compare to what you had before?
    • Estimate the propensity score e using Random forests with all pre-treatment variables entering the model. Repeat (a) to (e) above using the new propensity scores. Compare all your previous results to the new ones.
    • Which of the methods do you consider most reliable (or feel most comfortable with) for estimating the causal effect? Why?

    • Notes:
      • You must answer each question directly.
      • You should consider converting the treatment variable pg to a binary variable with values 0 and 1.
      • Center the three variables com_t, pcs_sd, and mcs_sd and use the centered versions for all analyses.
      • The data dictionary does not contain enough details about the predictors (for example, there are no names for the levels of the categorical variables in the data). That is fine here! Do not focus on trying to interpret those predictors, just controlling for them. The two most important variables are pg and i.aqoc, and the meaning of both are very clear here.
      • Relevel i_sex, i_educ and i_seve by doing Data$i_sex <- relevel(factor(Data$i_sex), ref = 1), Data$i_educ <- relevel(factor(Data$i_educ), ref = 5), and Data$i_seve <- relevel(factor(Data$i_seve), ref = 3).
      • Note that the estimated average causal effect computed on the matched sample is actually ATT (where the treated group is pg = 1) and not ATE, because we will discard observations in the control group for which we cannot find matches for.

Grading

20 points.