class: center, middle, inverse, title-slide

# Hierarchical models III
### Dr. Olanrewaju Michael Akande
### Oct 22, 2019

---

## Announcements

- Reminder: submissions for final project proposals will not be graded.

- Full instructions for the proposals will be posted by the end of the week.

- Updated dates for final project presentations.

## Outline

- Questions from last lecture

- Bernoulli versus binomial outcomes

- Multilevel logistic models

- Bayesian approach to multilevel logistic models using the `brms` package

---

## Bernoulli versus binomial outcomes

- Before we get into multilevel logistic models, let's review logistic regression models for Bernoulli versus binomial outcomes.

--

- Recall from the summer online review materials that if `\(Y \sim \textrm{Bin}(n,p)\)` (that is, `\(Y\)` is a random variable that follows a binomial distribution with parameters `\(n\)` and `\(p\)`), then `\(Y\)` follows a `\(\textrm{Bernoulli}(p)\)` distribution when `\(n = 1\)`.

--

- This also implies that if `\(Z_1, \ldots, Z_n \sim \textrm{Bernoulli}(p)\)` independently, then `\(Y = \sum_{i=1}^n Z_i \sim \textrm{Bin}(n,p)\)`.

--

- That is, the sum of `\(n\)` iid `\(\textrm{Bernoulli}(p)\)` random variables gives a random variable with the `\(\textrm{Bin}(n,p)\)` distribution.

--

- The logistic regression model can be used either for Bernoulli data (as we have done so far) or for data summarized as binomial counts (that is, aggregated counts).

--

- This is what you have for one part of team project 2.

---

## Bernoulli versus binomial outcomes

Normally, for individual-level data, we would have

.small[
```
##   y x
## 1 0 3
## 2 0 3
## 3 1 2
## 4 1 1
## 5 1 2
## 6 0 3
```
]

.small[
```r
M1 <- glm(y~x,data=Data,family=binomial); summary(M1)
```

```
## 
## Call:
## glm(formula = y ~ x, family = binomial, data = Data)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.261  -1.105  -1.030   1.251   1.332  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)   0.1942     0.3609   0.538    0.591
## x2           -0.3660     0.4954  -0.739    0.460
## x3           -0.5508     0.5017  -1.098    0.272
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 138.27  on 99  degrees of freedom
## Residual deviance: 137.02  on 97  degrees of freedom
## AIC: 143.02
## 
## Number of Fisher Scoring iterations: 4
```
]

---

## Bernoulli versus binomial outcomes

But we could also do the following with the aggregate-level data instead.

.small[
```
##   x success  n
## 1 1      17 31
## 2 2      16 35
## 3 3      14 34
```
]

.small[
```r
M2 <- glm(cbind(success,n-success)~x,data=Data_agg,family=binomial); summary(M2)
```

```
## 
## Call:
## glm(formula = cbind(success, n - success) ~ x, family = binomial, 
##     data = Data_agg)
## 
## Deviance Residuals: 
## [1]  0  0  0
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)   0.1942     0.3609   0.538    0.591
## x2           -0.3660     0.4954  -0.739    0.460
## x3           -0.5508     0.5017  -1.098    0.272
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1.2524e+00  on 2  degrees of freedom
## Residual deviance: 1.3323e-14  on 0  degrees of freedom
## AIC: 17.868
## 
## Number of Fisher Scoring iterations: 2
```
]

--

Same results overall! The deviance and AIC differ because the two likelihood functions differ. Note that some glm functions use `n` instead of `n-success`.
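---

## Bernoulli versus binomial outcomes

We can verify this equivalence with a quick simulation. This is a sketch with made-up data and names (`Sim`, `Sim_agg`), not part of the course example: the individual-level and aggregated fits should return identical coefficient estimates.

```r
set.seed(42)
Sim <- data.frame(x = factor(sample(1:3, 500, replace = TRUE)))
Sim$y <- rbinom(nrow(Sim), size = 1,
                prob = c(0.55, 0.45, 0.40)[as.numeric(Sim$x)])

#individual-level (Bernoulli) fit
fit_bern <- glm(y ~ x, data = Sim, family = binomial)

#aggregate to one row per level of x, then fit the binomial-count version
Sim_agg <- data.frame(x = levels(Sim$x),
                      success = tapply(Sim$y, Sim$x, sum),
                      n = tapply(Sim$y, Sim$x, length))
fit_binom <- glm(cbind(success, n - success) ~ x,
                 data = Sim_agg, family = binomial)

#the coefficient estimates agree
cbind(bernoulli = coef(fit_bern), binomial = coef(fit_binom))
```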
---

class: center, middle

# Multilevel logistic models

---

## Multilevel linear models

- The same idea and approach used to build multilevel models for normal data can be used to build multilevel logistic (and probit) models for binary outcomes.

--

- Recall that for a .hlight[varying-intercepts linear model] with one individual-level predictor, we have
.block[
.small[
$$
`\begin{split}
y_{ij} & = \beta_{0} + \gamma_{0j} + \beta_1 x_{1ij} + \epsilon_{ij}; \ \ \ i = 1, \ldots, n_j; \ \ \ j = 1, \ldots, J; \\
\epsilon_{ij} & \sim N(0, \sigma^2); \\
\gamma_{0j} & \sim N(0, \tau_0^2)
\end{split}`
$$
]
]

  where `\(x_{1ij}\)` can be replaced with `\(x_{1j}\)` for a group-level predictor.

--

- This model is a compromise between complete pooling across the groups of a grouping variable, such as counties in the radon example from last class (that is, the same intercept for each county), and no pooling (estimating a separate intercept for each county without borrowing information).

--

- The degree of pooling is determined by the amount of information within and between groups.

---

## Multilevel logistic models

- We can use the same idea to build a .hlight[varying-intercepts logistic model].

- That is,
.block[
.small[
$$
`\begin{split}
y_{ij} | x_{ij} & \sim \textrm{Bernoulli}(\pi_{ij}); \ \ \ i = 1, \ldots, n_j; \ \ \ j = 1, \ldots, J; \\
\textrm{log}\left(\dfrac{\pi_{ij}}{1-\pi_{ij}}\right) & = \beta_{0} + \gamma_{0j} + \beta_1 x_{1ij}; \\
\gamma_{0j} & \sim N(0, \sigma_0^2)
\end{split}`
$$
]
]

  where once again, `\(x_{1ij}\)` is an individual-level predictor that can be replaced with `\(x_{1j}\)` for a group-level predictor.

--

- The Gelman and Hill book uses the following notation instead:
.block[
.small[
$$
`\begin{split}
y_i | x_i & \sim \textrm{Bernoulli}(\pi_i); \ \ \ i = 1, \ldots, n; \ \ \ j = 1, \ldots, J; \\
\textrm{log}\left(\dfrac{\pi_i}{1-\pi_i}\right) & = \beta_{0} + \gamma_{0j[i]} + \beta_1 x_{i1}; \\
\gamma_{0j} & \sim N(0, \sigma_0^2).
\end{split}`
$$
]
]

--

- I will use this notation for the rest of today's class.

---

## Varying-intercepts logistic model

Inverse logit functions for varying-intercepts logistic models.

<img src="15-hierarchical-models-III_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />

---

## Multilevel logistic models

- It is easy to extend this model to allow for .hlight[varying-slopes] instead, or for both .hlight[varying-intercepts and varying-slopes], just as we had for multilevel linear models.

--

- The interpretations of the fixed effects in multilevel logistic models follow directly from those for standard logistic models, that is, in terms of log-odds, odds, and odds ratios.

--

- The only difference now is the hierarchy in our data, which allows us to borrow information across groups.

--

- One way to think about this is that we expect odds and odds ratios to be more similar for observations within the same group, but we still allow for some similarity across groups via partial pooling.

---

## Varying-slopes logistic model

Inverse logit functions for varying-slopes logistic models.

<img src="15-hierarchical-models-III_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />

---

## Varying-intercepts, varying-slopes logistic model

Inverse logit functions for varying-intercepts, varying-slopes logistic models.

<img src="15-hierarchical-models-III_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />
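---

## Multilevel logistic models

Curves like those in the last three figures are easy to simulate yourself. This is a sketch, not the code behind the actual figures; the intercept and slope standard deviations are made-up values.

```r
inv_logit <- function(eta) 1 / (1 + exp(-eta))
x <- seq(-6, 6, length.out = 200)

set.seed(1)
J <- 8
gamma0 <- rnorm(J, 0, 1)    #varying intercepts, hypothetical sd
gamma1 <- rnorm(J, 0, 0.4)  #varying slopes, hypothetical sd

par(mfrow = c(1, 3))
plot(x, inv_logit(0.5 + x), type = "n", ylim = c(0, 1),
     main = "Varying intercepts", ylab = expression(pi))
for (j in 1:J) lines(x, inv_logit(0.5 + gamma0[j] + x))
plot(x, inv_logit(0.5 + x), type = "n", ylim = c(0, 1),
     main = "Varying slopes", ylab = expression(pi))
for (j in 1:J) lines(x, inv_logit(0.5 + (1 + gamma1[j]) * x))
plot(x, inv_logit(0.5 + x), type = "n", ylim = c(0, 1),
     main = "Both", ylab = expression(pi))
for (j in 1:J) lines(x, inv_logit(0.5 + gamma0[j] + (1 + gamma1[j]) * x))
```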
---

## 1988 elections analysis

- To illustrate how to fit and interpret the results of multilevel logistic models, we will use the election polls example from the Gelman and Hill textbook.

--

- National opinion polls are conducted by a variety of organizations (e.g., media, polling organizations, campaigns) leading up to elections.

--

- While many of the best opinion polls are conducted at a national level, it is often also of interest to estimate voting opinions and preferences at the state or even local level.

--

- Well-designed polls are generally based on national random samples with corrections for nonresponse based on a variety of demographic factors (e.g., sex, ethnicity, race, age, education).

---

## 1988 elections analysis

- The data is from CBS News surveys conducted during the week before the 1988 election.

--

- Respondents were asked about their preferences for either the Republican candidate (Bush Sr.) or the Democratic candidate (Dukakis).

--

- Both voting turnout and preferences often depend on a complex combination of demographic factors.

--

- In our example dataset, we have demographic factors such as biological sex, race, age, and education, all of which we may want to look at by state, resulting in `\(2 \times 2 \times 4 \times 4 \times 51 = 3264\)` potential categories of respondents.

--

- Clearly, without a very large survey (most political surveys poll around 1000 people), we will need to make assumptions in order to even obtain estimates in each category.

--

- We usually cannot include all interactions; we should therefore select those to explore (through EDA and background knowledge).

---

## 1988 elections analysis

The dataset includes 2193 observations from one of the eight surveys in the original full data (the most recent CBS News survey right before the election).

.small[
Variable | Description
:------------- | :------------
org | cbsnyt = CBS/NYT
bush | 1 = preference for Bush Sr., 0 = otherwise
state | 1-51: 50 states including DC (number 9)
edu | education: 1=No HS, 2=HS, 3=Some College, 4=College Grad
age | 1=18-29, 2=30-44, 3=45-64, 4=65+
female | 1=female, 0=male
black | 1=black, 0=otherwise
region | 1=NE, 2=S, 3=N, 4=W, 5=DC
v_prev | average Republican vote share in the three previous elections (adjusted for home-state and home-region effects in the previous elections)
]

The data is in the file `polls_subset.txt` on Sakai.
--- ## 1988 elections analysis ```r ###### Load the data polls_subset <- read.table("data/polls_subset.txt",header=TRUE) polls_subset$v_prev <- polls_subset$v_prev*100 #rescale polls_subset$region_label <- factor(polls_subset$region,levels=1:5, labels=c("NE","S","N","W","DC")) #we consider DC as a separate region due to its distinctive voting patterns polls_subset$edu_label <- factor(polls_subset$edu,levels=1:4, labels=c("No HS","HS","Some College","College Grad")) polls_subset$age_label <- factor(polls_subset$age,levels=1:4, labels=c("18-29","30-44","45-64","65+")) #the data includes states but without the names, which we will need, #so let's grab that from R datasets data(state) #"state" is an R data file (type ?state from the R command window for info) state.abb #does not include DC, so we will create ours ``` ``` ## [1] "AL" "AK" "AZ" "AR" "CA" "CO" "CT" "DE" "FL" "GA" "HI" "ID" "IL" "IN" ## [15] "IA" "KS" "KY" "LA" "ME" "MD" "MA" "MI" "MN" "MS" "MO" "MT" "NE" "NV" ## [29] "NH" "NJ" "NM" "NY" "NC" "ND" "OH" "OK" "OR" "PA" "RI" "SC" "SD" "TN" ## [43] "TX" "UT" "VT" "VA" "WA" "WV" "WI" "WY" ``` ```r #In the polls data, DC is the 9th "state" in alphabetical order state_abbr <- c (state.abb[1:8], "DC", state.abb[9:50]) polls_subset$state_label <- factor(polls_subset$state,levels=1:51,labels=state_abbr) rm(list = ls(pattern = "state")) #remove unnecessary values in the environment ``` --- ## 1988 elections analysis ```r ###### View properties of the data head(polls_subset) ``` ``` ## org survey bush state edu age female black region v_prev ## 1 cbsnyt 9158 NA 7 3 1 1 0 1 56.66333 ## 2 cbsnyt 9158 1 39 4 2 1 0 1 52.65667 ## 3 cbsnyt 9158 0 31 2 4 1 0 1 56.41667 ## 4 cbsnyt 9158 0 7 3 1 1 0 1 56.66333 ## 5 cbsnyt 9158 1 33 2 2 1 0 1 52.43666 ## 6 cbsnyt 9158 1 33 4 4 1 0 1 52.43666 ## region_label edu_label age_label state_label ## 1 NE Some College 18-29 CT ## 2 NE College Grad 30-44 PA ## 3 NE HS 65+ NJ ## 4 NE Some College 18-29 CT ## 5 NE HS 30-44 NY ## 6 NE College Grad 65+ NY ``` ```r dim(polls_subset) ``` ``` ## [1] 2193 14 ``` --- ## 1988 elections analysis ```r ###### View properties of the data str(polls_subset) ``` ``` ## 'data.frame': 2193 obs. of 14 variables: ## $ org : Factor w/ 1 level "cbsnyt": 1 1 1 1 1 1 1 1 1 1 ... ## $ survey : int 9158 9158 9158 9158 9158 9158 9158 9158 9158 9158 ... ## $ bush : int NA 1 0 0 1 1 1 1 0 0 ... ## $ state : int 7 39 31 7 33 33 39 20 33 40 ... ## $ edu : int 3 4 2 3 2 4 2 2 4 1 ... ## $ age : int 1 2 4 1 2 4 2 4 3 3 ... ## $ female : int 1 1 1 1 1 1 0 1 0 0 ... ## $ black : int 0 0 0 0 0 0 0 0 0 0 ... ## $ region : int 1 1 1 1 1 1 1 1 1 1 ... ## $ v_prev : num 56.7 52.7 56.4 56.7 52.4 ... ## $ region_label: Factor w/ 5 levels "NE","S","N","W",..: 1 1 1 1 1 1 1 1 1 1 ... ## $ edu_label : Factor w/ 4 levels "No HS","HS","Some College",..: 3 4 2 3 2 4 2 2 4 1 ... ## $ age_label : Factor w/ 4 levels "18-29","30-44",..: 1 2 4 1 2 4 2 4 3 3 ... ## $ state_label : Factor w/ 51 levels "AL","AK","AZ",..: 7 39 31 7 33 33 39 20 33 40 ... ``` --- ## 1988 elections analysis - I will not do any substantial EDA here. -- - I expect you to be able to do this yourself! 
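--

- For example, a quick first pass (a sketch, using the variables defined above) might compare support for Bush by sex, race, and state:

```r
#support for Bush by sex and by race (row proportions)
prop.table(table(polls_subset$female, polls_subset$bush), margin = 1)
prop.table(table(polls_subset$black, polls_subset$bush), margin = 1)

#observed support by state (aggregate drops the NAs in bush)
head(aggregate(bush ~ state_label, data = polls_subset, FUN = mean))
```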
--

- Let's just take a look at the amount of data we have for `bush` and the age:edu interaction.

```r
###### Exploratory data analysis
table(polls_subset$bush) #reasonably well split between the two values
```

```
## 
##    0    1 
##  891 1124
```

```r
table(polls_subset$edu,polls_subset$age)
```

```
##    
##       1   2   3   4
##   1  44  42  67  96
##   2 232 283 223 116
##   3 141 205  99  54
##   4 119 285 125  62
```

---

## 1988 elections analysis

- As a start, we will consider a simple model with fixed effects for race and sex and a random effect for state (50 states plus the District of Columbia).
.block[
.small[
$$
`\begin{split}
\textrm{bush}_i | \boldsymbol{x}_i & \sim \textrm{Bernoulli}(\pi_i); \ \ \ i = 1, \ldots, n; \ \ \ j = 1, \ldots, J=51; \\
\textrm{log}\left(\dfrac{\pi_i}{1-\pi_i}\right) & = \beta_{0} + \gamma_{0j[i]} + \beta_1 \textrm{female}_{i} + \beta_2 \textrm{black}_{i}; \\
\gamma_{0j} & \sim N(0, \sigma_{\textrm{state}}^2).
\end{split}`
$$
]
]

--

- We can also write
.block[
.small[
$$
`\begin{split}
\textrm{bush}_i | \boldsymbol{x}_i & \sim \textrm{Bernoulli}(\pi_i); \ \ \ i = 1, \ldots, n; \ \ \ j = 1, \ldots, J=51; \\
\textrm{log}\left(\dfrac{\pi_i}{1-\pi_i}\right) & = \beta_{0} + \gamma_{0j[i]}^{\textrm{state}} + \beta_{\textrm{female}} \textrm{female}_{i} + \beta_{\textrm{black}} \textrm{black}_{i}; \\
\gamma_{0j} & \sim N(0, \sigma_{\textrm{state}}^2).
\end{split}`
$$
]
]

--

- In `R`, we have

```r
library(lme4)
model1 <- glmer(bush ~ black+female+(1|state_label),family=binomial(link="logit"),
                data=polls_subset)
summary(model1)
```

---

## 1988 elections analysis

```
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: bush ~ black + female + (1 | state_label)
##    Data: polls_subset
## 
##      AIC      BIC   logLik deviance df.resid 
##   2666.7   2689.1  -1329.3   2658.7     2011 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.7276 -1.0871  0.6673  0.8422  2.5271 
## 
## Random effects:
##  Groups      Name        Variance Std.Dev.
##  state_label (Intercept) 0.1692   0.4113  
## Number of obs: 2015, groups:  state_label, 49
## 
## Fixed effects:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.44523    0.10139   4.391 1.13e-05 ***
## black       -1.74161    0.20954  -8.312  < 2e-16 ***
## female      -0.09705    0.09511  -1.020    0.308    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##        (Intr) black 
## black  -0.119       
## female -0.551 -0.005
```

---

## 1988 elections analysis

- Looks like we dropped some NAs.

```r
c(sum(complete.cases(polls_subset)),sum(!complete.cases(polls_subset)))
```

```
## [1] 2015  178
```

--

- Not ideal; we'll learn about methods for dealing with missing data soon.

--

- Interpretation of results:
  + For a fixed state, a non-black male respondent has odds of `\(e^{0.45}=1.57\)` of supporting Bush.
  + For a fixed state and sex, a black respondent has `\(e^{-1.74}=0.18\)` times (an 82% decrease) the odds of supporting Bush compared to a non-black respondent; that is, black respondents are much less likely to support Bush than non-black respondents.
  + For a fixed state and race, a female respondent has `\(e^{-0.10}=0.91\)` times (a 9% decrease) the odds of supporting Bush compared to a male respondent. However, this effect is not actually statistically significant!

---

## 1988 elections analysis

- The state-level standard deviation is estimated at 0.41, so the states do vary, but not by much.

--

- <div class="question">
We no longer have a term for residual standard deviation (residual standard error). Why is that?
</div>
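--

- To pull the key quantities directly from the fitted object, something like this should work (a sketch using standard `lme4` accessors):

```r
exp(fixef(model1)) #odds ratios for the fixed effects
VarCorr(model1)    #estimated state-level standard deviation
confint(model1)    #profile confidence intervals (output shown below)
```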
--

- I expect that you will be able to interpret the corresponding confidence intervals.

```
## Computing profile confidence intervals ...
```

```
##                  2.5 %      97.5 %
## .sig01       0.2608567  0.60403426
## (Intercept)  0.2452466  0.64871233
## black       -2.1666001 -1.34322366
## female      -0.2837099  0.08919986
```

---

## 1988 elections analysis

- Let's fit a more sophisticated model that includes other relevant survey factors, such as
  + region,
  + prior vote history (note that this is a state-level predictor),
  + age, education, and the interaction between them.

--

- In `R`, we have

```r
model2 <- glmer(bush ~ black + female + v_prev + edu_label:age_label +
                  (1|state_label) + (1|region_label),
                family=binomial(link="logit"),data=polls_subset)
```

```
## fixed-effect model matrix is rank deficient so dropping 1 column / coefficient
```

```
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = 
## control$checkConv, : Model failed to converge with max|grad| = 0.0122335
## (tol = 0.001, component 1)
```

--

- <div class="question">
Why do we have a rank-deficient model?
</div>

---

## 1988 elections analysis

- Also, it looks like we have a convergence issue. This can happen when dealing with multilevel models. We have many parameters to estimate from the interaction term `edu_label:age_label` (16, in fact), and it looks like that is causing a problem.

--

- It could be that we have too many `\(\textrm{bush}_i = 1\)` or `\(0\)` values for certain combinations. You should check!

--

- Let's treat those as random effects instead. That is,
.block[
.small[
$$
`\begin{split}
\textrm{logit}\left(\Pr[\textrm{bush}_i = 1] \right) & = \beta_{0} + \gamma_{0m[i]}^{\textrm{region}} + \gamma_{0j[i]}^{\textrm{state}} + \gamma_{0k[i],l[i]}^{\textrm{age.edu}} \\
& + \beta_{\textrm{f}} \textrm{female}_{i} + \beta_{\textrm{b}} \textrm{black}_{i} + \beta_{\textrm{v_prev}} \textrm{v_prev}_{j[i]}; \\
\gamma_{0m} & \sim N(0, \sigma_{\textrm{region}}^2), \ \ \ \gamma_{0j} \sim N(0, \sigma_{\textrm{state}}^2), \ \ \ \gamma_{0k,l} \sim N(0, \sigma_{\textrm{age.edu}}^2).
\end{split}`
$$
]
]

- In `R`, we have

```r
model3 <- glmer(bush ~ black + female + v_prev + (1|state_label) +
                  (1|region_label) + (1|edu_label:age_label),
                family=binomial(link="logit"),data=polls_subset)
```

--

- This seems to run fine; we are able to borrow information across groups, which helps.
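--

- To follow up on the sparsity check suggested above, and to inspect the estimated group effects, something like this should work (a sketch):

```r
#0/1 bush counts within the 16 age:edu cells
with(polls_subset, table(edu_label, age_label, bush))

#estimated random intercepts for each grouping factor
ranef(model3)$region_label
head(ranef(model3)$state_label)
```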
---

## 1988 elections analysis

.small[
```
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: 
## bush ~ black + female + v_prev + (1 | state_label) + (1 | region_label) +  
##     (1 | edu_label:age_label)
##    Data: polls_subset
## 
##      AIC      BIC   logLik deviance df.resid 
##   2644.0   2683.3  -1315.0   2630.0     2008 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.8404 -1.0430  0.6478  0.8405  2.7528 
## 
## Random effects:
##  Groups              Name        Variance Std.Dev.
##  state_label         (Intercept) 0.03768  0.1941  
##  edu_label:age_label (Intercept) 0.02993  0.1730  
##  region_label        (Intercept) 0.02792  0.1671  
## Number of obs: 2015, groups:  
## state_label, 49; edu_label:age_label, 16; region_label, 5
## 
## Fixed effects:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.50658    1.03365  -3.392 0.000693 ***
## black       -1.74530    0.21090  -8.275  < 2e-16 ***
## female      -0.09956    0.09558  -1.042 0.297575    
## v_prev       0.07076    0.01853   3.820 0.000134 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##        (Intr) black  female
## black  -0.036              
## female -0.049 -0.004       
## v_prev -0.992  0.027 -0.006
```
]

---

## 1988 elections analysis

- Remember that in the first model, the state-level standard deviation was estimated as 0.41. It looks like we are now able to separate that (for the most part) into state, region, and age-by-education effects.

--

- Interpretation of results:
  + For a fixed state, education level, and age bracket, a non-black male respondent with zero prior average Republican vote share has odds of `\(e^{-3.51}=0.03\)` of supporting Bush (no one really has a `v_prev` of 0, so the intercept is not directly meaningful on its own).
  + For a fixed state, sex, education level, age bracket, and prior average Republican vote share, a black respondent has `\(e^{-1.75}=0.17\)` times (an 83% decrease) the odds of supporting Bush compared to a non-black respondent, which is about the same as before.
  + For each percentage point increase in prior average Republican vote share, residents of a given state, race, sex, education level, and age bracket have `\(e^{0.07}=1.07\)` times the odds of supporting Bush.

---

## 1988 elections analysis: Bayesian approach

- Due to the number of categories, the inference in the frequentist model is not entirely reliable, as
  + it does not fully account for uncertainty in the estimated variance parameters, and
  + it uses an approximation for inference.

--

- We can fit the model under the Bayesian paradigm with the `brms` package, using mildly informative priors, and quantify uncertainty based on the posterior samples.

--

- In-class analysis: move to the R script [here](https://ids-702-f19.github.io/Course-Website/slides/lec-slides/Elections88.R)

--

- Windows users: install Rtools for Windows, then the `rstan` package in R.

- Mac users: install Xcode (try `xcode-select --install` from the Terminal app or the Terminal tab of the RStudio console), then install the `rstan` package in R.
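---

## 1988 elections analysis: Bayesian approach

For reference, here is a sketch of what the `brms` call for the last model might look like. The priors and sampler settings are illustrative placeholders, not necessarily the choices used in the in-class script.

```r
library(brms)

#Bayesian analogue of model3, with mildly informative (illustrative) priors
model3_brms <- brm(bush ~ black + female + v_prev + (1|state_label) +
                     (1|region_label) + (1|edu_label:age_label),
                   data = polls_subset,
                   family = bernoulli(link = "logit"),
                   prior = c(set_prior("normal(0, 5)", class = "b"),
                             set_prior("student_t(3, 0, 2.5)", class = "sd")),
                   chains = 4, iter = 2000, seed = 1988)

summary(model3_brms) #posterior summaries for fixed and random effects
```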