class: center, middle, inverse, title-slide

# Multinomial logistic regression
### Dr. Olanrewaju Michael Akande
### Oct 1, 2019

---

## Announcements

- If you are not enrolled in Practical Data Science 1 and want to attend the "Intro to GitHub" class, fill this out: https://forms.gle/U2wr8wASm2jwgdJW9

- You will use GitHub for your second team project, so you should attend the class if you are not familiar with GitHub.

---

## Outline

- Questions from last lecture

- Introduction to generalized linear models

- Multinomial logistic regression

<!-- - Proportional odds model -->

---

class: center, middle

# Generalized linear models

---

## Introduction to generalized linear models

- As we've seen over the last few classes, we often need to work with outcome variables that are not continuous.

--

- Clearly, standard linear regression will not suffice in those situations.

--

- Specifically, we saw how to use logistic regression to handle binary response variables.

--

- In other scenarios, however, our outcome variable will not be binary either.

--

- How should we handle that?

---

## Introduction to generalized linear models

- For example, we may want to predict
  + Whether someone prefers product A, B, or C (nominal)
  + Political ideology on an ordered three-level scale, such as "very liberal", "moderate", "very conservative" (ordinal)
  + The number of times an event happens (counts)

--

- The classes of models we can use to handle these types of responses are referred to as .hlight[generalized linear models (GLMs)].

--

- Note that GLMs include the linear and logistic regressions we have already covered.

--

- Today we will focus on multinomial logistic regression. Over the next few classes, we will touch on other GLMs.

---

## Components of GLMs

Generally, GLMs have three major components:

--

1. The .hlight[random component] describes the randomness of the outcome variable `\(Y\)` through a pdf or pmf `\(f\)`, with parameter `\(\theta_i\)`. That is,
.block[
.small[
`$$y_i \sim f(y_i|\theta_i) \ \ \ \textrm{OR} \ \ \ y_i \sim f(y_i;\theta_i) \ \ \ \textrm{OR} \ \ \ y_i \sim f(\theta_i)$$`
]
]

--

2. The .hlight[systematic component] defines a linear combination of the predictors. That is, for each observation `\(i\)`,
.block[
.small[
`$$\eta_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_p x_{ip}$$`
]
]

--

3. The .hlight[link function] `\(g\)` connects the random and systematic components through `\(\mu_i = \mathbb{E}[Y_i]\)`, that is,
.block[
.small[
`$$\eta_i = g(\mu_i)$$`
]
]
where `\(g\)` is a monotonic and differentiable function (for those with some maths background).

--

In standard linear regression, `\(g\)` is the .hlight[identity link] `\(\eta_i=g(\mu_i)=\mu_i\)`, whereas in logistic regression, `\(g\)` is the logit function.

---

## Recall logistic regression

- Recall that for logistic regression, we had
.block[
.small[
$$y_i | x_i \sim \textrm{Bernoulli}(\pi_i); \ \ \ \textrm{log}\left(\dfrac{\pi_i}{1-\pi_i}\right) = \beta_0 + \beta_1 x_i$$
]
]
for each observation `\(i = 1, \ldots, n\)`.

--

- To get `\(\pi_i\)`, we solved the logit equation above to get
.block[
.small[
`$$\pi_i = \dfrac{e^{\beta_0 + \beta_1 x_i}}{1 + e^{\beta_0 + \beta_1 x_i}}$$`
]
]

--

- Consider `\(Y=0\)` a baseline category. Suppose `\(\Pr[y_i = 1 | x_i] = \pi_{i1}\)` and `\(\Pr[y_i = 0 | x_i] = \pi_{i0}\)`. Then, the logit expression is essentially
.block[
.small[
`$$\textrm{log}\left(\dfrac{\pi_{i1}}{\pi_{i0}}\right) = \beta_0 + \beta_1 x_i$$`
]
]

--

- `\(e^{\beta_1}\)` is thus the (multiplicative) change in the odds of `\(y = 1\)` over the baseline `\(y = 0\)` when increasing `\(x\)` by one unit (see the small simulated example on the next slide).
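---

## Recall logistic regression

The sketch below is a minimal check with simulated data (not one of our course datasets): after fitting a logistic regression with `glm`, exponentiating the slope gives the multiplicative change in the odds of `\(y = 1\)` per one-unit increase in `\(x\)`.

```r
# Minimal sketch with simulated data: exp(beta_1) as an odds ratio
set.seed(42)
n <- 500
x <- rnorm(n)
pi_true <- exp(-0.5 + 1.0 * x) / (1 + exp(-0.5 + 1.0 * x))  # inverse logit
y <- rbinom(n, size = 1, prob = pi_true)

fit <- glm(y ~ x, family = binomial)
coef(fit)        # estimates of beta_0 and beta_1 on the log-odds scale
exp(coef(fit))   # exp(beta_1): odds ratio for a one-unit increase in x
```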
---

## Multinomial logistic regression

- Suppose we have a nominal-scale response variable `\(Y\)` with `\(J\)` categories. First, for the .hlight[random component], we need a distribution to describe `\(Y\)`.

--

- A standard option for this is the .hlight[multinomial distribution], which is essentially a generalization of the binomial distribution. Read about the multinomial distribution [here](https://akandelanre.github.io/STA111-Summer2018-Course-Wesbite/Lectures/Lecture6.pdf) and [here](https://en.wikipedia.org/wiki/Multinomial_distribution).

--

- The .hlight[multinomial distribution] gives us a way to characterize
.block[
.small[
`$$\Pr[y_i = 1] = \pi_1, \ \Pr[y_i = 2] = \pi_2, \ \ldots, \ \Pr[y_i = J] = \pi_J, \ \ \ \textrm{where} \ \ \ \sum^J_{j=1} \pi_j = 1.$$`
]
]

--

- When there are no predictors, the best guess for each `\(\pi_j\)` is the sample proportion of cases with `\(y_i = j\)`, that is,
.block[
.small[
`$$\hat{\pi}_j = \dfrac{\sum^n_{i=1} \mathbf{1}[y_i = j]}{n}$$`
]
]

--

- When we have predictors, we instead want
.block[
.small[
`$$\Pr[y_i = 1 | \boldsymbol{x}_i] = \pi_{i1}, \ \Pr[y_i = 2 | \boldsymbol{x}_i] = \pi_{i2}, \ \ldots, \ \Pr[y_i = J | \boldsymbol{x}_i] = \pi_{iJ}.$$`
]
]

---

## Multinomial logistic regression

- That is, we want the `\(\pi_j\)`'s to be functions of the predictors, like in logistic regression.

--

- It turns out we can use the same .hlight[link function], that is, the logit function, if we set one of the levels as the baseline.

--

- Pick a baseline outcome level, say `\(Y=1\)`.

--

- Then, the multinomial logistic regression is defined as a set of logistic regression models, one for each probability `\(\pi_j\)` compared to the baseline, where `\(j\geq 2\)`. That is,
.block[
.small[
`$$\textrm{log}\left(\dfrac{\pi_{ij}}{\pi_{i1}}\right) = \beta_{0j} + \beta_{1j} x_{i1} + \beta_{2j} x_{i2} + \ldots + \beta_{pj} x_{ip},$$`
]
]
where `\(j\geq 2\)`.

--

- We therefore have `\(J-1\)` separate logistic regressions in this setup.

---

## Multinomial logistic regression

- The equation for each `\(\pi_{ij}\)` is given by
.block[
.small[
`$$\pi_{ij} = \dfrac{e^{\beta_{0j} + \beta_{1j} x_{i1} + \beta_{2j} x_{i2} + \ldots + \beta_{pj} x_{ip}}}{1 + \sum^J_{k=2} e^{\beta_{0k} + \beta_{1k} x_{i1} + \beta_{2k} x_{i2} + \ldots + \beta_{pk} x_{ip}}} \ \ \ \textrm{for} \ \ \ j > 1$$`
]
]
and
.block[
.small[
`$$\pi_{i1} = 1-\sum^J_{j=2} \pi_{ij}$$`
]
]

--

- Also, we can extract the log odds for comparing other pairs of the response categories `\(j\)` and `\(j^\star\)` (illustrated numerically on the next slide), since
.block[
.small[
$$
`\begin{split}
\textrm{log}\left(\dfrac{\pi_{ij}}{\pi_{ij^\star}}\right) & = \textrm{log}\left(\pi_{ij}\right) - \textrm{log}\left(\pi_{ij^\star}\right) \\
& = \textrm{log}\left(\pi_{ij}\right) - \textrm{log}\left(\pi_{i1}\right) - \textrm{log}\left(\pi_{ij^\star}\right) + \textrm{log}\left(\pi_{i1}\right) \\
& = \left[ \textrm{log}\left(\pi_{ij}\right) - \textrm{log}\left(\pi_{i1}\right) \right] - \left[ \textrm{log}\left(\pi_{ij^\star}\right) - \textrm{log}\left(\pi_{i1}\right) \right] \\
& = \textrm{log}\left(\dfrac{\pi_{ij}}{\pi_{i1}}\right) - \textrm{log}\left(\dfrac{\pi_{ij^\star}}{\pi_{i1}}\right).
\end{split}`
$$
]
]
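---

## Multinomial logistic regression

The sketch below is a minimal numerical illustration of the probability and log-odds formulas on the previous two slides, using made-up coefficients, `\(J = 3\)` categories, one predictor, and category 1 as the baseline: the probabilities sum to one and the log odds recover the linear predictors.

```r
# Hypothetical coefficients; J = 3 categories, one predictor x, category 1 is the baseline
x    <- 2
eta2 <- 0.5 + 0.8 * x     # beta_02 + beta_12 * x
eta3 <- -1.0 + 1.2 * x    # beta_03 + beta_13 * x

denom <- 1 + exp(eta2) + exp(eta3)
pi1 <- 1 / denom          # baseline probability
pi2 <- exp(eta2) / denom
pi3 <- exp(eta3) / denom

c(pi1, pi2, pi3)          # the three probabilities; they sum to 1
log(pi2 / pi1)            # recovers eta2 (log odds of category 2 vs the baseline)
log(pi2 / pi3)            # equals eta2 - eta3 (log odds for a non-baseline pair)
```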
---

## Multinomial logistic regression

- Each coefficient has to be interpreted relative to the baseline.

--

- That is, for a continuous predictor,
  + `\(\beta_{1j}\)` is the increase (or decrease) in the log-odds of `\(Y=j\)` versus `\(Y=1\)` when increasing `\(x_1\)` by one unit.
  + `\(e^{\beta_{1j}}\)` is the multiplicative increase (or decrease) in the odds of `\(Y=j\)` versus `\(Y=1\)` when increasing `\(x_1\)` by one unit.

--

- For a binary predictor, on the other hand,
  + `\(\beta_{1j}\)` is the difference in the log-odds of `\(Y=j\)` versus `\(Y=1\)` between the group with `\(x_1 = 1\)` and the group with `\(x_1 = 0\)`.
  + `\(e^{\beta_{1j}}\)` is the ratio of the odds of `\(Y=j\)` versus `\(Y=1\)` for the group with `\(x_1 = 1\)` compared to the group with `\(x_1 = 0\)`.

--

- Exponentiate confidence intervals from the log-odds scale to get intervals on the odds scale.

---

## Multinomial logistic regression: significance tests

- For multinomial logistic regression (and the other GLMs we will touch on, such as Poisson regression), use the change in deviance test to compare models and test significance, just as we did for logistic regression.

--

- Fit the model with and without some predictor `\(x_k\)`.

--

- Perform a change in deviance test to compare the two models.

--

- Interpret the p-value as evidence about whether the coefficients excluded from the smaller model are equal to zero.

---

## Multinomial logistic regression: model diagnostics

- Use binned residuals, like in logistic regression.

--

- Each outcome level has its own raw residual. For each outcome level `\(j\)`,
  + make an indicator variable equal to one whenever `\(Y = j\)` and equal to zero otherwise,
  + compute the predicted probability that `\(Y=j\)` for each record (using the `fitted` command),
  + compute the raw residual = indicator value - predicted probability.

--

- For each outcome level, make bins of predictor values and plot the average value of the predictor versus the average raw residual. Look for patterns.

--

- Notice that we can still compute .hlight[accuracy] just like we did for logistic regression. ROC, on the other hand, is not so straightforward; we can draw a different ROC curve for each level of the response variable. We can also draw pairwise ROC curves.

---

## Multinomial logistic regression in R

- Install the package .hlight[nnet] from CRAN.

- Load the library: `library(nnet)`.

- The command for running the multinomial logistic regression in R looks like:

```r
Modelfit <- multinom(response ~ x_1 + x_2 + ... + x_p, data = Data)
```

- Use `fitted(Modelfit)` to get predicted probabilities for observed cases (a small simulated sketch of this workflow, including the change in deviance test, follows on the next slide).
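---

## Multinomial logistic regression in R

The sketch below uses simulated data with placeholder names (not the Sesame Street data) to show the full workflow: fit nested models with `multinom`, look at the fitted probabilities, and compare the models with a change in deviance (likelihood ratio) test via `anova`.

```r
library(nnet)

# Simulated placeholder data; the response must be a factor for multinom()
set.seed(123)
n <- 300
x_1 <- rnorm(n)
x_2 <- rnorm(n)
response <- factor(sample(c("A", "B", "C"), n, replace = TRUE))
Data <- data.frame(response, x_1, x_2)

fit_full    <- multinom(response ~ x_1 + x_2, data = Data)
fit_reduced <- multinom(response ~ x_1, data = Data)         # drops x_2

head(fitted(fit_full))                        # predicted probabilities, one column per level
anova(fit_reduced, fit_full, test = "Chisq")  # change in deviance test for x_2
```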
---

## Analysis of Sesame Street Data

- The television series Sesame Street is concerned mainly with teaching preschool skills to children aged 3-5, with special emphasis on reaching economically disadvantaged children.

--

- In the early 1970s, researchers at Educational Testing Service (the company that runs the SAT) ran a study to evaluate Sesame Street.

--

- To ensure the study contained a group of children who watched Sesame Street regularly, they randomly assigned children either to receive encouragement to watch Sesame Street or not to receive encouragement.

--

- Those assigned to encouragement were given promotional materials and received weekly visits and phone calls from ETS staff. Those assigned not to receive encouragement did not get this attention.

--

- The children were tested on a variety of cognitive variables, including knowledge of body parts, knowledge about letters, knowledge about numbers, etc., both before and after viewing the series.

--

- .hlight[Let's predict how often the kids watch Sesame Street, with a focus on whether encouragement pushes them toward more viewing.]

---

## Analysis of Sesame Street Data

The data is in the file `sesame.txt` on Sakai. The R script for this analysis is [here](https://ids-702-f19.github.io/Course-Website/slides/lec-slides/Sesame.R).

.mini[
Variable | Description
:------------- | :------------
viewcat | 1=rarely watched the show <br/> 2=once or twice a week <br/> 3=three to five times a week <br/> 4=watched the show on average more than 5 times a week
viewenc | 1=child encouraged to watch, 2=child not encouraged to watch
site | 1=Three to five year old disadvantaged children from inner city areas in various parts of the country. <br/> 2=Four year old advantaged suburban children. <br/> 3=Advantaged rural children. <br/> 4=Disadvantaged rural children. <br/> 5=Disadvantaged Spanish speaking children.
sex | male=1, female=2
age | age in months
setting | setting in which Sesame Street was viewed, 1=home, 2=school
prebody | pretest on knowledge of body parts (scores range from 0-32)
prelet | pretest on letters (scores range from 0-58)
preform | pretest on forms (scores range from 0-20)
prenumb | pretest on numbers (scores range from 0-54)
prerelat | pretest on relational terms (scores range from 0-17)
preclasf | pretest on classification skills
]

---

class: center, middle

# In-class analysis: move to the R script

---

class: center, middle

# Interpreting the results of multinomial logistic regression

---

## Interpreting the results

Using the coefficients of `viewenc` in the final model from the R script (the next slide sketches how to compute these odds ratios and intervals in R):

- For a child who is encouraged to watch Sesame Street, the odds of watching Sesame Street once or twice a week versus watching it rarely are 18.5 times higher (95% CI: 6.3 to 55.0) than the corresponding odds for a child not encouraged to watch Sesame Street.

--

- For a child who is encouraged to watch Sesame Street, the odds of watching Sesame Street three to five times a week versus watching it rarely are 12.8 times higher (95% CI: 4.3 to 38.0) than the corresponding odds for a child not encouraged to watch Sesame Street.

--

- For a child who is encouraged to watch Sesame Street, the odds of watching Sesame Street more than five times a week versus watching it rarely are 10.4 times higher (95% CI: 3.4 to 31.4) than the corresponding odds for a child not encouraged to watch Sesame Street.

--

.block[Try to interpret the other significant predictors the same way.]
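---

## Interpreting the results

One way to obtain numbers like those above is sketched below. It assumes the final model from the R script has been stored as a `multinom` object called `Modelfit` (a hypothetical name), with `viewcat = 1` ("rarely watched") as the baseline level; the approximate 95% intervals are Wald intervals built from the estimated standard errors.

```r
# Sketch only: Modelfit is assumed to be the fitted multinom object from the R script
s   <- summary(Modelfit)
est <- s$coefficients        # one row per non-baseline level of viewcat
se  <- s$standard.errors     # matching standard errors

exp(est)                     # odds ratios versus the baseline level
exp(est - 1.96 * se)         # lower bounds of approximate 95% CIs (odds scale)
exp(est + 1.96 * se)         # upper bounds of approximate 95% CIs (odds scale)
```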