class: center, middle, inverse, title-slide # Hierarchical models II ### Dr. Olanrewaju Michael Akande ### Oct 17, 2019 --- ## Announcements - Piazza forum online ## Outline - Questions from last lecture - The radon analysis - Incorporating group-level predictors - Non-nested models --- ## The radon analysis - To illustrate how to incorporate group-level predictors, we will use the radon example from the Gelman and Hill textbook. -- - The data focuses on the goal of estimating the distribution of radon levels of houses within each of 85 counties in Minnesota. -- - The full data actually includes data for more states but let's focus on just Minnesota like the textbook. -- - The U.S. Environmental Protection Agency and the Surgeon General’s Office have estimated that as many as 20,000 lung cancer deaths are caused each year by exposure to radon (reference [here](https://www.radon.com/radon_facts/)). -- - Radon is a cancer-causing radioactive gas and is the second leading cause of lung cancer. Unfortunately, you cannot see, smell or taste it. The most commonly used device for making short-term radon measurements in homes is the charcoal canister -- - Radon occurs naturally as an indirect decay product of uranium. --- ## The radon analysis There are 919 total observations in the data. The data is in the file `Radon.txt` on Sakai. Variable | Description :------------- | :------------ radon | radon levels for each house log_radon | log(radon) state | state floor | lowest living area of each house: 0 for basement, 1 for first floor countyname | county names countyID | ID for the county names (1-85) fips | state + county fips code uranium | county-level soil uranium log_uranium | log(uranium) --- ## The radon analysis ```r Radon <- read.csv("data/Radon.txt", header = T,sep="") Radon$countyname <- factor(Radon$countyname); Radon$floor <- factor(Radon$floor) dim(Radon) ``` ``` ## [1] 919 9 ``` ```r summary(Radon[,-c(2,7)]) ``` ``` ## radon log_radon floor countyname ## Min. : 0.000 Min. :-2.3026 0:766 ST LOUIS :116 ## 1st Qu.: 1.900 1st Qu.: 0.6419 1:153 HENNEPIN :105 ## Median : 3.600 Median : 1.2809 DAKOTA : 63 ## Mean : 4.768 Mean : 1.2246 ANOKA : 52 ## 3rd Qu.: 6.000 3rd Qu.: 1.7918 WASHINGTON: 46 ## Max. :48.200 Max. : 3.8754 RAMSEY : 32 ## (Other) :505 ## countyID uranium log_uranium ## Min. : 1.00 Min. :0.4140 Min. :-0.88183 ## 1st Qu.:21.00 1st Qu.:0.6221 1st Qu.:-0.47467 ## Median :44.00 Median :0.9080 Median :-0.09652 ## Mean :43.52 Mean :0.9339 Mean :-0.13171 ## 3rd Qu.:70.00 3rd Qu.:1.2011 3rd Qu.: 0.18324 ## Max. :85.00 Max. :1.6956 Max. : 0.52802 ## ``` --- ## The radon analysis .midsmall[ ```r table(Radon$countyname) #we don't have enough data in some counties, so we should look to borrow information across counties. ``` ``` ## ## AITKIN ANOKA BECKER BELTRAMI ## 4 52 3 7 ## BENTON BIG STONE BLUE EARTH BROWN ## 4 3 14 4 ## CARLTON CARVER CASS CHIPPEWA ## 10 6 5 4 ## CHISAGO CLAY CLEARWATER COOK ## 6 14 4 2 ## COTTONWOOD CROW WING DAKOTA DODGE ## 4 12 63 3 ## DOUGLAS FARIBAULT FILLMORE FREEBORN ## 9 6 2 9 ## GOODHUE HENNEPIN HOUSTON HUBBARD ## 14 105 6 5 ## ISANTI ITASCA JACKSON KANABEC ## 3 11 5 4 ## KANDIYOHI KITTSON KOOCHICHING LAC QUI PARLE ## 4 3 7 2 ## LAKE LAKE OF THE WOODS LE SUEUR LINCOLN ## 9 4 5 4 ## LYON MAHNOMEN MARSHALL MARTIN ## 8 1 9 7 ## MCLEOD MEEKER MILLE LACS MORRISON ## 13 5 2 9 ## MOWER MURRAY NICOLLET NOBLES ## 13 1 4 3 ## NORMAN OLMSTED OTTER TAIL PENNINGTON ## 3 23 8 3 ## PINE PIPESTONE POLK POPE ## 6 4 4 2 ## RAMSEY REDWOOD RENVILLE RICE ## 32 5 3 11 ## ROCK ROSEAU SCOTT SHERBURNE ## 2 14 13 8 ## SIBLEY ST LOUIS STEARNS STEELE ## 4 116 25 10 ## STEVENS SWIFT TODD TRAVERSE ## 2 4 3 4 ## WABASHA WADENA WASECA WASHINGTON ## 7 5 4 46 ## WATONWAN WILKIN WINONA WRIGHT ## 3 1 13 13 ## YELLOW MEDICINE ## 2 ``` ] --- ## The radon analysis The raw radon levels can only take on positive values. ```r ggplot(Radon,aes(radon)) + geom_histogram(alpha=.8,fill=rainbow(15),bins=15) ``` <img src="14-hierarchical-models-II_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" /> -- .block[Not a normal distribution.] --- ## The radon analysis Let's look at `log_radon` instead. ```r ggplot(Radon,aes(log_radon)) + geom_histogram(alpha=.8,fill=rainbow(15),bins=15) ``` <img src="14-hierarchical-models-II_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" /> -- .block[Better! Let's go with log radon.] --- ## The radon analysis First, let's focus on predicting the radon levels from `floor`, the only individual-level (different observation for each house) variable we have. ```r qplot(x=floor,y=log_radon,data=Radon,xlab="Floor",ylab="Log Average Radon", geom="boxplot",fill=floor) ``` <img src="14-hierarchical-models-II_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" /> -- .block[Looks like radon levels are higher for houses with the basement as the lowest living area.] --- ## The radon analysis Let's look at the same relationship for a random sample of counties. Not enough data for some counties. ```r set.seed(1000) sample_county <- sample(unique(Radon$countyname),7,replace=F) ggplot(Radon[is.element(Radon$countyname,sample_county),], aes(x=countyname, y=log_radon, fill=floor)) + geom_boxplot() ``` <img src="14-hierarchical-models-II_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" /> --- ## The radon analysis Let's focus on counties with at least 15 houses. ```r sample_county <- which((table(Radon$countyID) > 15) == TRUE) ggplot(Radon[is.element(Radon$countyID,sample_county),], aes(x=countyname, y=log_radon, fill=floor)) + geom_boxplot() ``` <img src="14-hierarchical-models-II_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" /> -- .block[Even though the overall direction is the same, it looks like the actual differences between floor = 0 and floor = 1 differs for some counties.] --- ## The radon analysis - Let's fit a varying-slope, varying-intercept model to the data. -- - Let `\(y_{ij}\)` and `\(x_{1ij}\)` be the log radon level and indicator variable `floor` respectively for house `\(i\)` in county `\(j\)`. -- - Mathematically, we have .block[ .small[ $$ `\begin{split} y_{ij} & = (\beta_{0} + \gamma_{0j}) + (\beta_1 + \gamma_{1j}) x_{1ij} + \epsilon_{ij}; \ \ \ i = 1, \ldots, n_j; \ \ \ j = 1, \ldots, J \\ \epsilon_{ij} & \sim N(0, \sigma^2) \\ (\gamma_{0j},\gamma_{1j}) & \sim N_2(\boldsymbol{0}, \Sigma). \end{split}` $$ ] ] -- - We did not cover this in the last class but `\(\Sigma\)` actually takes the form .block[ .small[ $$ \Sigma = `\begin{bmatrix} \tau_0^2 & \rho \tau_0\tau_1 \\ \rho \tau_0\tau_1 & \tau_1^2 \\ \end{bmatrix}` $$ ] ] where `\(\tau_0^2\)` describes the within county variation attributed to the random intercept, `\(\tau_1^2\)` describes the within county variation attributed to the random slope (that is, floor) and `\(\rho\)` describes the correlation between `\(\gamma_{0j}\)` and `\(\gamma_{1j}\)`. --- ## The radon analysis We already know how to fit the model in R. ```r Model1 <- lmer(log_radon ~ floor + (floor | countyname), data = Radon) summary(Model1) ``` ``` ## Linear mixed model fit by REML ['lmerMod'] ## Formula: log_radon ~ floor + (floor | countyname) ## Data: Radon ## ## REML criterion at convergence: 2168.3 ## ## Scaled residuals: ## Min 1Q Median 3Q Max ## -4.4044 -0.6224 0.0138 0.6123 3.5682 ## ## Random effects: ## Groups Name Variance Std.Dev. Corr ## countyname (Intercept) 0.1216 0.3487 ## floor1 0.1180 0.3436 -0.34 ## Residual 0.5567 0.7462 ## Number of obs: 919, groups: countyname, 85 ## ## Fixed effects: ## Estimate Std. Error t value ## (Intercept) 1.46277 0.05387 27.155 ## floor1 -0.68110 0.08758 -7.777 ## ## Correlation of Fixed Effects: ## (Intr) ## floor1 -0.381 ``` --- ## The radon analysis: interpretation of fixed effects - Intuitively, we have an overall "average" regression line for all houses across all counties in Minnesota which has slope -0.68 and intercept 1.46. -- - That is, the general estimated line for any of the houses in Minnesota is: .block[ .small[ $$ \hat{\textrm{log_radon}}_i = 1.46 - 0.68 \times \textrm{floor}_i $$ ] ] - For any house in Minnesota with a basement as the lowest living area, the baseline radon level is `\(e^{1.46} = 4.31\)`. -- - Then, for any house in Minnesota, having a first floor as the lowest living area instead of a basement, reduces the radon level by a multiplicative effect of `\(e^{-0.68} = 0.51\)`, that is, about a 49% reduction. -- - However, if the house is in Dakota county for example, we also need to add on the random intercepts and slopes for that county. --- ## The radon analysis: interpretation of fixed effects - For Dakota county, we have ```r (ranef(Model1)$countyname)["DAKOTA",] ``` ``` ## (Intercept) floor1 ## DAKOTA -0.1099069 -0.08787551 ``` so that the estimated regression line for Dakota county is actually .block[ .small[ $$ \hat{\textrm{log_radon}}_i = (1.46 - 0.11) + (-0.68-0.09) \times \textrm{floor}_i = 1.35 - 0.77 \times \textrm{floor}_i $$ ] ] -- - Thus, for any house in Dakota county in Minnesota with a basement as the lowest living area, the baseline radon level is actually `\(e^{1.35} = 3.86\)`, which is lower than the overall state wide average. -- - And for any house in Dakota county in Minnesota, having the first floor be the lowest living area then reduces the radon level by a multiplicative effect of `\(e^{-0.77} = 0.46\)`, that is about a 54% reduction, more than the overall state wide effect. --- ## The radon analysis: interpretation of random effects - The estimated standard error `\(\hat{\sigma} = 0.75\)` describes the unexplained within-county variation. -- - The estimated `\(\hat{\tau_0} = 0.35\)` describes the within county variation attributed to the random intercept. -- - The estimated `\(\hat{\tau_1} = 0.34\)` describes the within-county variation attributed to the random slope (the predictor, floor). -- - Those two sources of within county variation are actually quite similar. -- - The estimated correlation between `\(\gamma_{0j}\)` and `\(\gamma_{1j}\)` is `\(\hat{\rho} = -0.34\)`. -- - You can visualize the random effects by typing `dotplot(ranef(Model1, condVar=TRUE))$countyID` in R. --- ## Including group-level predictors - We should also control for uranium since radon occurs naturally as an indirect decay product of uranium. -- - However, since each county has one single value for `uranium`, each house within that county has the exact same value. -- - Turns out that including group-level predictors is quite straightforward in R, as long as the predictor is properly represented in the data as repeated values for all observations in the same group. -- - <div class="question"> One can ask the question: with 85 counties in the dataset, how are we able to fit a regression with 85 different intercepts for each county as well as a county-level coefficient for uranium? </div> -- - .block[ The simple answer is that we are actually using all the observations within each county (along with all observations from other counties in fact), when estimating each random intercept, but yes we only use 85 distinct values to estimate the effect of uranium. ] --- ## The radon analysis: varying-intercepts - Word of caution: be careful when including random slopes. You should really include them if you absolutely have to and if you have enough data to estimate them accurately. -- - `lme4` in R uses the frequentist approach which is not fully reliable here as it uses an approximation for inference and it does not fully account for uncertainty in the estimated variance parameters. Personally, I prefer to use Bayesian models for multilevel regressions. -- - If you want to fit a multilevel model for your final project, I would suggest taking a look at the `brms` package in R for a Bayesian approach. -- - Let's use BIC to see if we can exclude the random slopes. .small[ ```r Model2 <- lmer(log_radon ~ floor + (1 | countyname), data = Radon) AIC(Model2); AIC(Model1) #same overall conclusions using BIC ``` ``` ## [1] 2179.305 ``` ``` ## [1] 2180.325 ``` ] -- - No real difference. We will exclude them going forward. You should be able to interpret the updated coefficients of the new model. --- ## The radon analysis: including uranium Turns out that it also often makes sense to use `log_uranium` instead of `uranium`. .midsmall[ ```r Model3 <- lmer(log_radon ~ floor + log(uranium) + (1 | countyname), data = Radon) ; summary(Model3) ``` ``` ## Linear mixed model fit by REML ['lmerMod'] ## Formula: log_radon ~ floor + log(uranium) + (1 | countyname) ## Data: Radon ## ## REML criterion at convergence: 2134.2 ## ## Scaled residuals: ## Min 1Q Median 3Q Max ## -4.9673 -0.6117 0.0274 0.6555 3.3848 ## ## Random effects: ## Groups Name Variance Std.Dev. ## countyname (Intercept) 0.02446 0.1564 ## Residual 0.57523 0.7584 ## Number of obs: 919, groups: countyname, 85 ## ## Fixed effects: ## Estimate Std. Error t value ## (Intercept) 1.46576 0.03794 38.633 ## floor1 -0.66824 0.06880 -9.713 ## log(uranium) 0.72027 0.09176 7.849 ## ## Correlation of Fixed Effects: ## (Intr) floor1 ## floor1 -0.357 ## log(uranim) 0.145 -0.009 ``` ] -- For any house in Minnesota with a basement as the lowest living area, every unit increase in log(uranium) increases radon levels by a multiplicative effect of `\(e^{0.72} = 2.05\)`. --- ## How much data and how many groups? - When `\(J\)`, that is the number of groups, is small, it is difficult to estimate the between-group variation; multi-level modeling often adds little in such scenarios. -- - However, it should not do any worse than including the grouping variable as a factor variable, and it can still be easier to interpret since we need not drop any level as baseline. -- - Small sample sizes within the groups can be enough to fit a multilevel model when only the intercept is varying. With varying slopes, one can easily run into convergence issues using frequentist approaches. -- - When groups do not have that many data points, the random intercepts and slopes may not be estimated accurately but the data within each group will still provide information that allows estimation of fixed effects and overall variance parameters. --- ## Extra nested levels - It is easy to envision applications where there might be more than one level of hierarchy. -- - For example + students within schools within counties within states + patients within hospitals within states + voters within voting districts within states -- - In those applications, it is straightforward to extend these ideas and create extra levels of hierarchy in the multi-level models. -- - When that is the case, I once again prefer to rely on Bayesian methods to fit those models. --- ## Non-nested models - In other applications, there can be complicated grouping structures, where observations fall into two or more different non-nested grouping variables. -- - For example + patients within `\(J\)` hospitals receiving `\(K\)` different treatments + students within `\(J\)` schools taking classes based on `\(K\)` different teaching techniques. -- - Once again, it is straightforward to incorporate these within the context of multi-level models. --- ## Non-nested models - Suppose we want to fit a multi-level model with varying-intercepts by each grouping variable but with a fixed slope for one predictor, we would have .block[ .small[ $$ `\begin{split} y_{ijk} & = (\beta_{0} + \gamma_{0j} + \eta_{0k}) + \beta_1 x_{1ijk} + \epsilon_{ijk} \\ \gamma_{0j} & \sim N(0, \tau_{\gamma(0)}^2) \\ \eta_{0k} & \sim N(0, \tau_{\eta(0)}^2) \\ \epsilon_{ij} & \sim N(0, \sigma^2) \\ i & = 1, \ldots, n_{jk}; \ \ \ j = 1, \ldots, J; \ \ \ k = 1, \ldots, K. \\ \end{split}` $$ ] ] -- - In R, we can fit the model above as follows: ```r M1 <- lmer(y ~ x + (1 | GroupVar1) + (1 | GroupVar2)) ; summary(M1) ``` -- - Adding more predictors is trivial. -- - It is easy to add more group variables but it can be hard to fit the model without enough data points.