class: center, middle, inverse, title-slide # Logistic regression: model assessment and validation ### Dr. Olanrewaju Michael Akande ### Sept 19, 2019 --- ## Announcements - Check the course schedule for reading materials on logistic regression. - Homework 3 is now online. It is due next Friday at 11:59pm. ## Outline - Questions from last lecture - Checking model fit: + Binned residuals + Confusion matrix + ROC curves - Logistic regression with multiple predictors --- class: center, middle # Checking model fit --- ## Predicting nba wins - Let's fit a logistic regression with one predictor to NBA data for 4 seasons from the 2014/2015 season to the 2017/2018 season. - Suppose we want to see how the amount of points a team let's the opponents score affects their odds of winning. - For this simple example, we will focus on data from four teams: GSW (Golden State Warriors), SAS (San Antonio Spurs), CAV (Cleveland Cavaliers), and TOR (Toronto Raptors). - The data is in the file `nba_games_stats_reduced.csv` on Sakai. - Ideally, we should use more information (and that data is actually available) to predict wins but let's continue for illustrative purposes. --- ## Predicting nba wins ```r nba <- read.csv("data/nba_games_stats_reduced.csv",header=T) colnames(nba)[3] <- "Opp" nba$win <- rep(0,nrow(nba)); nba$win[nba$WINorLOSS=="W"] <- 1 head(nba); dim(nba) ``` ``` ## Team WINorLOSS Opp win ## 1 CLE L 95 0 ## 2 CLE W 108 1 ## 3 CLE L 101 0 ## 4 CLE L 102 0 ## 5 CLE W 101 1 ## 6 CLE W 111 1 ``` ``` ## [1] 1312 4 ``` ```r summary(nba[,-4]) ``` ``` ## Team WINorLOSS Opp ## CLE:328 L:391 Min. : 68.0 ## GSW:328 W:921 1st Qu.: 93.0 ## SAS:328 Median :101.0 ## TOR:328 Mean :101.5 ## 3rd Qu.:109.2 ## Max. :148.0 ``` --- ## Predicting nba wins .block[ .small[ $$ \textrm{win}_i | \textrm{Opp}_i \sim \textrm{Bernoulli}(\pi_i); \ \ \ \textrm{log}\left(\dfrac{\pi_i}{1-\pi_i}\right) = \beta_0 + \beta_1 \textrm{Opp}_i $$ ] ] ```r nbareg <- glm(win~Opp,family=binomial(link=logit),data=nba); summary(nbareg) ``` ``` ## ## Call: ## glm(formula = win ~ Opp, family = binomial(link = logit), data = nba) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -2.3377 -0.9406 0.5255 0.8033 2.0770 ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) 9.545835 0.667832 14.29 <2e-16 *** ## Opp -0.083911 0.006323 -13.27 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for binomial family taken to be 1) ## ## Null deviance: 1598.5 on 1311 degrees of freedom ## Residual deviance: 1373.1 on 1310 degrees of freedom ## AIC: 1377.1 ## ## Number of Fisher Scoring iterations: 4 ``` .block[For every additional point an opponent scores in a game, the odds of winning decreases by 8%: exp(-0.08) = 0.92.] --- ## Residuals There are various options for residuals when working with glms. For logistic regression in particular, we have - .hlight[Response residuals] .block[ .small[ `$$e_i = y_i - \hat{\pi}_i.$$` ] ] - .hlight[Pearson residuals] .block[ .small[ `$$e_i^P = \dfrac{y_i - \hat{\pi}_i}{\hat{\pi}_i(1-\hat{\pi}_i)}$$` ] ] which should be more uniform. - .hlight[Deviance residuals] .block[ .small[ `$$e_i^D = \textrm{sign}(y_i - \hat{\pi}_i) \times 2\left(y_i \textrm{log}\dfrac{y_i}{\hat{\pi}_i} + (1-y_i) \textrm{log}\dfrac{1-y_i}{1-\hat{\pi}_i} \right).$$` ] ] They are the default in R when using the .hlight[residuals()] function. We will talk a bit more about deviance later. --- ## Checking the fit of model - Unlike what we had for linear regression, just looking at the residuals does not work well here. + They are always positive when `\(Y=1\)` and always negative when `\(Y=0\)`. + Also, constant variance is not an assumption of logistic regression. <div class="question"> Why is that the case? </div> Because the variance of the Bernoulli distribution in this setup depends on each observation `\(i\)`: `\(\mathbb{V}[Y_i] = \pi_i(1-\pi_i)\)`. + We also do not have normality of residuals to work with either. - What we can do is check to see if the function of predictors is well specified using .hlight[binned residuals]. - We can also see how well our model predicts using + Confusion matrix + ROC curves --- ## Binned residuals - Compute raw (response) residuals for fitted logistic regression. - Order observations by values of predicted probabilities (or predictor values) from the fitted regression. - Using ordered data, form `\(g\)` bins of (approximately) equal size. Default: `\(g = \sqrt{n}\)`. - Compute average residual in each bin. - Plot average residual versus average predicted probability (or average predictor value) for each bin. - Use the .hlight[arm] package in R. --- ## NBA analysis ```r plot(nbareg,which=1) ``` <img src="08-logistic-reg-model-assess_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" /> The residuals are the deviance residuals, while the predicted values are on the linear (logit) scale, that is, `\(\beta_0 + \beta_1 x_i\)`. Look to see which cases have large absolute values for cases that don't fit well, but not too useful otherwise. --- ## NBA analysis Plot binned raw residuals versus predicted probabilities. Load the .hlight[arm] package. Useful as a "one-stop shopping" plot; especially with many predictors and you want an initial look at model adequacy. Mostly good, although model seems to struggle for fitted values over 0.9. ```r binnedplot(fitted(nbareg),residuals(nbareg,"resp"),xlab="Pred. probabilities",col.int="red4", ylab="Avg. residuals",main="Binned residual plot",col.pts="navy") ``` <img src="08-logistic-reg-model-assess_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" /> <!-- The grey lines represent `\(\pm 2\)` SE bands, which we would expect to contain about 95% of the observations. Too few points here to draw any conclusions! --> --- ## NBA analysis Plot binned raw residuals versus individual predictors. <!-- Useful as a "one-stop shopping" plot; especially with many predictors and you want an initial look at model adequacy. --> Mostly good, although model again seems to struggle for low values of opponent's points. Also, too many points outside the bands. We know some important predictors are missing by construction. ```r binnedplot(nba$Opp,residuals(nbareg,"resp"),xlab="Opponent's points (centered)", col.int="red4",ylab="Avg. residuals",main="Binned residual plot",col.pts="navy") ``` <img src="08-logistic-reg-model-assess_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" /> <!-- Again, too few points here to draw any conclusions! You want at least 100 data points before these plots start being useful. --> --- ## Confusion matrix - Estimated probabilities can be used to predict outcomes. - For example, we could decide to predict `\(Y=1\)` when the predicted probability exceeds `\(0.5\)` and predict `\(Y=0\)` otherwise. - We then can determine how many cases we classify correctly and incorrectly. - Resulting `\(2 \times 2\)` table called the .hlight[confusion matrix]. - When mis-classification rates are high, model may not be an especially good fit to the data. --- ## Confusion matrix <table> <tr> <th> </th> <th> </th> <th colspan="2">Observed</th> </tr> <tr> <th colspan="2"></th> <td style="text-align:center">Y=1</td> <td style="text-align:center">Y=0</td> </tr> <tr> <th rowspan="2">Predicted</th> <td height="50px">Y=1</td> <td style="text-align:center">TP (True Positives)</td> <td style="text-align:center">FP (False Positives)</td> </tr> <tr> <td height="50px">Y=0</td> <td style="text-align:center">FN (False Negatives)</td> <td style="text-align:center">TN (True Negatives)</td> </tr> </table> - .hlight[True positive rate (TPR)] = `\(\dfrac{TP}{TP+FN}\)` (also known as .hlight[sensitivity]) - .hlight[False negative rate (FNR)] = `\(\dfrac{FN}{TP+FN}\)` - .hlight[True negative rate (TNR)] = `\(\dfrac{TN}{FP+TN}\)` (also known as .hlight[specificity]) - .hlight[False positive rate (FPR)] = `\(\dfrac{FP}{FP+TN}\)` (1 - .hlight[specificity]) --- ## ROC Curves - .block[We want high values of sensitivity and low values of (1 - specificity)!] - The .hlight[receiver operating characteristic (ROC)] curve plots + Sensitivity on Y axis + 1 - specificity on X axis - Evaluated at lots of different values for the threshold - Good fitting logistic regression curves toward the upper left corner, with .hlight[area under the curve (AUC)] near one. - Make ROC curves in R using the pROC package. - By the way, we also often define .hlight[accuracy] as `\(\dfrac{TP + TN}{TP+FN+FP+TN}\)`. This estimates how well the model predicts correctly overall. --- ## NBA analysis Let's look at the confusion matrix for the nba data. Load the .hlight[arm] and .hlight[e1071] packages. ```r Conf_mat <- confusionMatrix(as.factor(ifelse(fitted(nbareg) >= 0.5, "W","L")), nba$WINorLOSS,positive = "W") Conf_mat$table ``` ``` ## Reference ## Prediction L W ## L 126 86 ## W 265 835 ``` ```r Conf_mat$overall["Accuracy"]; ``` ``` ## Accuracy ## 0.7324695 ``` ```r Conf_mat$byClass[c("Sensitivity","Specificity")] ``` ``` ## Sensitivity Specificity ## 0.9066232 0.3222506 ``` .hlight[confusionMatrix] produces a lot of output. Print the .hlight[Conf_mat] object to see all of them. --- ## NBA analysis ```r invisible(roc(nba$win,fitted(nbareg),plot=T,print.thres=c(0.3,0.5,0.7),legacy.axes=T, print.auc =T,col="red3")) ``` <img src="08-logistic-reg-model-assess_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" /> --- ## NBA analysis ```r invisible(roc(nba$win,fitted(nbareg),plot=T,print.thres="best",legacy.axes=T, print.auc =T,col="red3")) ``` <img src="08-logistic-reg-model-assess_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" /> --- class: center, middle # Logistic regression with multiple predictors --- ## Logistic regression with multiple predictors: motivating example - In many developing countries, people get their drinking water from wells. -- - Sometimes these wells are contaminated with the chemical arsenic, which when consumed in high concentrations causes skin and bladder cancer, as well as cardiovascular disease. -- - Fortunately, in many cases people living near contaminated wells have the opportunity to get water from nearby uncontaminated wells. --- ## The contaminated wells analysis - In one study, several researchers measured the concentrations of arsenic in wells in a particular region of Bangladesh. -- - They labeled wells as safe or unsafe based on the measurements. -- - The researchers encouraged people drinking from unsafe wells to switch to safe wells. -- - Several years later, the researchers returned to the area with the goal of seeing who had switched from unsafe to safe wells. -- - They recorded information on a sample of 3020 individuals who had wells at their homes that were unsafe. -- - Let's address the question: what predicts why people switch wells? -- - The data is in the file `arsenic.csv` on Sakai. --- ## The contaminated wells analysis Data description Variable | Description :------------- | :------------ Switch | 1 = if respondent switched to a safe well <br /> 0 = if still using own unsafe well Arsenic | amount of arsenic in well at respondent's home (100s of micro-grams per liter) Dist | distance in meters to the nearest known safe well Assoc | 1 = if any members of household are active in community organizations <br /> 0 = otherwise Educ | years of schooling of the head of household Treat switch as the response variable and others as potential explanatory variables --- ## Logistic regression with multiple predictors: We can then formally extend the .hlight[logistic regression model] we had before to allow for multiple predictors. We still have .block[ .small[ `$$\Pr[y_i = 1 | x_i] = \pi_i \ \ \textrm{and} \ \ \Pr[y_i = 0 | x_i] = 1-\pi_i,$$` ] ] or .block[ .small[ `$$y_i | x_i \sim \textrm{Bernoulli}(\pi_i)$$` ] ] as before, but with .block[ .small[ `$$\textrm{log}\left(\dfrac{\pi_i}{1-\pi_i}\right) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_p x_{ip}$$` ] ] now in both cases. --- ## The contaminated wells analysis: EDA ```r arsenic <- read.csv("data/arsenic.csv",header=T, colClasses=c("numeric","numeric","numeric","factor","numeric")) head(arsenic) ``` ``` ## switch arsenic dist assoc educ ## 1 1 2.36 16.826 0 0 ## 2 1 0.71 47.322 0 0 ## 3 0 2.07 20.967 0 10 ## 4 1 1.15 21.486 0 12 ## 5 1 1.10 40.874 1 14 ## 6 1 3.90 69.518 1 9 ``` ```r summary(arsenic[,-1]) ``` ``` ## arsenic dist assoc educ ## Min. :0.510 Min. : 0.387 0:1743 Min. : 0.000 ## 1st Qu.:0.820 1st Qu.: 21.117 1:1277 1st Qu.: 0.000 ## Median :1.300 Median : 36.761 Median : 5.000 ## Mean :1.657 Mean : 48.332 Mean : 4.828 ## 3rd Qu.:2.200 3rd Qu.: 64.041 3rd Qu.: 8.000 ## Max. :9.650 Max. :339.531 Max. :17.000 ``` ```r table(arsenic$switch) ``` ``` ## ## 0 1 ## 1283 1737 ``` --- ## The contaminated wells analysis: EDA ```r par(mfrow=c(1,3)) boxplot(arsenic~switch,data=arsenic,ylab="Amount of arsenic in well",pch=25,xaxt='n', xlab="Switched to safe well?",col=c("red3","yellow3"),cex = 0.85,main ="All") axis(1,at=c(1,2),labels=c("No","Yes")) boxplot(arsenic~switch,data=arsenic,subset= assoc==1,ylab="Amount of arsenic in well", xlab="Switched to safe well?",col=c("red3","yellow3"),xaxt='n', pch = 25, cex = 0.85,main ="Active in community") axis(1,at=c(1,2),labels=c("No","Yes")) boxplot(arsenic~switch,data=arsenic,subset= assoc==0,ylab="Amount of arsenic in well", xlab="Switched to safe well?",col=c("red3","yellow3"),xaxt='n', pch = 25, cex = 0.85,main ="Not active in community") axis(1,at=c(1,2),labels=c("No","Yes")) ``` <img src="08-logistic-reg-model-assess_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" /> --- ## The contaminated wells analysis: EDA ```r par(mfrow=c(1,3)) boxplot(dist~switch,data=arsenic,ylab="Distance to nearest safe well",pch=25,xaxt='n', xlab="Switched to safe well?",col=c("red3","yellow3"),cex = 0.85,main ="All") axis(1,at=c(1,2),labels=c("No","Yes")) boxplot(dist~switch,data=arsenic,subset= assoc==1,ylab="Distance to nearest safe well", xlab="Switched to safe well?",col=c("red3","yellow3"),xaxt='n', pch = 25, cex = 0.85,main ="Active in community") axis(1,at=c(1,2),labels=c("No","Yes")) boxplot(dist~switch,data=arsenic,subset= assoc==0,ylab="Distance to nearest safe well", xlab="Switched to safe well?",col=c("red3","yellow3"),xaxt='n', pch = 25, cex = 0.85,main ="Not active in community") axis(1,at=c(1,2),labels=c("No","Yes")) ``` <img src="08-logistic-reg-model-assess_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" /> --- ## The contaminated wells analysis: EDA ```r par(mfrow=c(1,3)) boxplot(educ~switch,data=arsenic,ylab="Years of schooling of head",pch=25,xaxt='n', xlab="Switched to safe well?",col=c("red3","yellow3"),cex = 0.85,main ="All") axis(1,at=c(1,2),labels=c("No","Yes")) boxplot(educ~switch,data=arsenic,subset= assoc==1,ylab="Years of schooling of head", xlab="Switched to safe well?",col=c("red3","yellow3"),xaxt='n', pch = 25, cex = 0.85,main ="Active in community") axis(1,at=c(1,2),labels=c("No","Yes")) boxplot(educ~switch,data=arsenic,subset= assoc==0,ylab="Years of schooling of head", xlab="Switched to safe well?",col=c("red3","yellow3"),xaxt='n', pch = 25, cex = 0.85,main ="Not active in community") axis(1,at=c(1,2),labels=c("No","Yes")) ``` <img src="08-logistic-reg-model-assess_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" /> --- ## The contaminated wells analysis: EDA For *switch* vs *assoc*, let's look at a table. ```r table(arsenic[,c("switch","assoc")])/sum(table(arsenic[,c("switch","assoc")])) ``` ``` ## assoc ## switch 0 1 ## 0 0.2364238 0.1884106 ## 1 0.3407285 0.2344371 ``` We can also do a chi-squared test for independence. ```r chisq.test(table(arsenic[,c("switch","assoc")])) ``` ``` ## ## Pearson's Chi-squared test with Yates' continuity correction ## ## data: table(arsenic[, c("switch", "assoc")]) ## X-squared = 3.7497, df = 1, p-value = 0.05282 ``` Not enough evidence to show that *assoc* influences *switch*. --- ## The inverse logit function The relationship between `\(\pi_i\)` and `\(x_i\)` is defined by the .hlight[inverse logit function] .block[ .small[ $$ \pi_i = \dfrac{e^{\beta_0 + \beta_1 x_i}}{1 + e^{\beta_0 + \beta_1 x_i}} $$ ] ] <!-- Let's compare this to the original logit function. --> ```r curve(invlogit((x)),xlim=c(-5,5),col="blue3") curve(invlogit((2*x)),xlim=c(-5,5),ylim=c(0,1),col="red3",add=T) curve(invlogit((0.5*x)),xlim=c(-5,5),ylim=c(0,1),col="yellow3",add=T) curve(invlogit((-x)),xlim=c(-5,5),ylim=c(0,1),col="green3",add=T) ``` <img src="08-logistic-reg-model-assess_files/figure-html/unnamed-chunk-15-1.png" style="display: block; margin: auto;" /> Notice the symmetry at the ends. --- ## The contaminated wells analysis: EDA .hlight[Binned plots] are also quite informative when visualizing binary responses vs continuous predictors. First *switch* vs *arsenic*. ```r binnedplot(y=arsenic$switch,arsenic$arsenic,xlab="Arsenic",ylim=c(0,1),col.pts="navy", ylab ="Switched to safe well?",main="Binned Arsenic and Switch cases",col.int="white") ``` <img src="08-logistic-reg-model-assess_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" /> Not really an inverse logit curve. We may need to transform *arsenic* or add a higher order term. --- ## The contaminated wells analysis: EDA Now, *switch* vs *dist*. ```r binnedplot(y=arsenic$switch,arsenic$dist,xlab="Distance",ylim=c(0,1),col.pts="navy", ylab ="Switched to safe well?",main="Binned Distance and Switch cases",col.int="white") ``` <img src="08-logistic-reg-model-assess_files/figure-html/unnamed-chunk-17-1.png" style="display: block; margin: auto;" /> Linear relationships are fine and this looks mostly linear. We will re-examine using residual plots. --- ## The contaminated wells analysis: EDA Next, *switch* vs *educ*. ```r binnedplot(y=arsenic$switch,arsenic$educ,xlab="Education",ylim=c(0,1),col.pts="navy", ylab ="Switched to safe well?",main="Binned Education and Switch cases",col.int="white") ``` <img src="08-logistic-reg-model-assess_files/figure-html/unnamed-chunk-18-1.png" style="display: block; margin: auto;" /> This also looks fine, although, we might consider collapsing some levels of education. --- ## The contaminated wells analysis Now let's try a logistic regression that has a main effect for every variable and linear predictors. Let's center the continuous predictors. ```r arsenic$arsenic_c = arsenic$arsenic - mean(arsenic$arsenic) arsenic$dist_c = arsenic$dist - mean(arsenic$dist) arsreg1 <- glm(switch~arsenic_c+dist_c+assoc+educ,data=arsenic,family=binomial); summary(arsreg1) ``` ``` ## ## Call: ## glm(formula = switch ~ arsenic_c + dist_c + assoc + educ, family = binomial, ## data = arsenic) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -2.5942 -1.1976 0.7541 1.0632 1.6739 ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) 0.184004 0.068243 2.696 0.00701 ** ## arsenic_c 0.467022 0.041602 11.226 < 2e-16 *** ## dist_c -0.008961 0.001046 -8.569 < 2e-16 *** ## assoc1 -0.124300 0.076966 -1.615 0.10631 ## educ 0.042447 0.009588 4.427 9.55e-06 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for binomial family taken to be 1) ## ## Null deviance: 4118.1 on 3019 degrees of freedom ## Residual deviance: 3907.8 on 3015 degrees of freedom ## AIC: 3917.8 ## ## Number of Fisher Scoring iterations: 4 ``` --- ## The contaminated wells analysis Let's do some model assessment ```r binnedplot(fitted(arsreg1),residuals(arsreg1,"resp"),xlab="Pred. probabilities", col.int="red4",ylab="Avg. residuals",main="Binned residual plot",col.pts="navy") ``` <img src="08-logistic-reg-model-assess_files/figure-html/unnamed-chunk-20-1.png" style="display: block; margin: auto;" /> -- <div class="question"> What do you think? </div> Looks like the residuals are well spread out, with no obvious trend. --- ## The contaminated wells analysis Plot binned raw residuals versus individual predictors. ```r binnedplot(arsenic$arsenic_c,residuals(arsreg1,"resp"),xlab="Arsenic centered", col.int="red4",ylab="Avg. residuals",main="Binned residual plot",col.pts="navy") ``` <img src="08-logistic-reg-model-assess_files/figure-html/unnamed-chunk-21-1.png" style="display: block; margin: auto;" /> -- <div class="question"> What do you think? </div> Not good! There's a pattern here we need to account for! --- ## The contaminated wells analysis ```r binnedplot(arsenic$dist_c,residuals(arsreg1,"resp"),xlab="Distance centered", col.int="red4",ylab="Avg. residuals",main="Binned residual plot",col.pts="navy") ``` <img src="08-logistic-reg-model-assess_files/figure-html/unnamed-chunk-22-1.png" style="display: block; margin: auto;" /> -- <div class="question"> What do you think? </div> There are too many observations outside the SD lines, but not much else to worry about. --- ## The contaminated wells analysis ```r binnedplot(arsenic$educ,residuals(arsreg1,"resp"),xlab="Education", col.int="red4",ylab="Avg. residuals",main="Binned residual plot",col.pts="navy") ``` <img src="08-logistic-reg-model-assess_files/figure-html/unnamed-chunk-23-1.png" style="display: block; margin: auto;" /> -- <div class="question"> What do you think? </div> Looks like there might be different trends for different sections of education. Maybe we should collapse the levels? --- ## The contaminated wells analysis ```r Conf_mat <- confusionMatrix(as.factor(ifelse(fitted(arsreg1) >= 0.5, "1","0")), as.factor(arsenic$switch),positive = "1") Conf_mat$table ``` ``` ## Reference ## Prediction 0 1 ## 0 470 346 ## 1 813 1391 ``` ```r Conf_mat$overall["Accuracy"]; ``` ``` ## Accuracy ## 0.6162252 ``` ```r Conf_mat$byClass[c("Sensitivity","Specificity")] ``` ``` ## Sensitivity Specificity ## 0.8008060 0.3663289 ``` .hlight[confusionMatrix] produces a lot of output. Print the .hlight[Conf_mat] object to see all of them. Clearly, we can calculate these out-of-sample as well, which can be useful. For out-of-sample, you should replace .hlight[fitted(arsreg1)] with .hlight[predict(arsreg1,newdata,type="response)]. --- ## The contaminated wells analysis ```r invisible(roc(arsenic$switch,fitted(arsreg1),plot=T,print.thres="best",legacy.axes=T, print.auc =T,col="red3")) ``` <img src="08-logistic-reg-model-assess_files/figure-html/unnamed-chunk-25-1.png" style="display: block; margin: auto;" /> We can also do this for out-of-sample predictions. --- ## Change in deviance statistic - Suppose we want to test whether a particular subset of `\(q\)` of the coefficients are zero. That is, .block[ .small[ $$ `\begin{split} H_0 & : \beta_{p-q+1} = \beta_{p-q+2} = \ldots = \beta_p = 0\\ H_A & : \textrm{At least one } \beta_{j} \neq 0; \textrm{ where } j = p-q+1, \ldots, p. \end{split}` $$ ] ] - In linear regression we use the nested F-test to compare two nested models. - In logistic regression we use the .hlight[change in deviance test]. - Expression for change in deviance on next slide for those interested, but formula not essential for our course. - Can use the anova command in R! Hooray!! --- ## Change in deviance statistic - Let `\(y_i = 1\)` when the outcome variable for record `\(i\)` equals 1, and `\(y_i = 0\)` when the outcome variable for record `\(i\)` equals 0. - Let `\(\hat{\pi}_{1i}\)` and `\(\hat{\pi}_{2i}\)` be the be the estimated probabilities that `\(y_i = 1\)` for Model 1 and Model 2, respectively, where Model 1 is nested within Model 2. - For a logistic regression, the change in deviance test statistic is defined as: .block[ .small[ `$$D = 2 \left(\sum_{i=1}^n \left[y_i \textrm{log}(\hat{\pi}_{1i}) + (1-y_i) \textrm{log}(1-\hat{\pi}_{1i})\right] - \sum_{i=1}^n \left[y_i \textrm{log}(\hat{\pi}_{2i}) + (1-y_i) \textrm{log}(1-\hat{\pi}_{2i})\right] \right)$$` ] ] - In large samples, `\(D\)` has approximately a chi-squared distribution with degrees of freedom equal to the difference in the number of predictors in Model 1 and Model 2. --- ## The contaminated wells analysis Suppose we want to test the interaction between *arsenic* and *assoc*. ```r arsreg2 <- glm(switch~arsenic_c+dist_c+assoc+educ+arsenic_c:assoc,data=arsenic,family=binomial) anova(arsreg1, arsreg2, test= "Chisq") ``` ``` ## Analysis of Deviance Table ## ## Model 1: switch ~ arsenic_c + dist_c + assoc + educ ## Model 2: switch ~ arsenic_c + dist_c + assoc + educ + arsenic_c:assoc ## Resid. Df Resid. Dev Df Deviance Pr(>Chi) ## 1 3015 3907.8 ## 2 3014 3905.0 1 2.7895 0.09488 . ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ``` ```r AIC(arsreg1); AIC(arsreg2) ``` ``` ## [1] 3917.826 ``` ``` ## [1] 3917.036 ``` At the 0.05 level, we can conclude that the interaction term is not significant, confirming our EDA. The AIC values also confirms that. --- ## The contaminated wells analysis Let's compare ROC curves and AUC values for the two models. ```r invisible(roc(arsenic$switch,fitted(arsreg1),plot=T,print.thres="best",legacy.axes=T, print.auc =T,col="red1")) invisible(roc(arsenic$switch,fitted(arsreg2),plot=T,print.thres="best",legacy.axes=T, print.auc =T,col="blue4",add=T)) ``` <img src="08-logistic-reg-model-assess_files/figure-html/unnamed-chunk-27-1.png" style="display: block; margin: auto;" /> --- ## The contaminated wells analysis ```r Conf_mat <- confusionMatrix(as.factor(ifelse(fitted(arsreg1) >= 0.5, "1","0")), as.factor(arsenic$switch),positive = "1") Conf_mat$overall["Accuracy"]; Conf_mat$byClass[c("Sensitivity","Specificity")] ``` ``` ## Accuracy ## 0.6162252 ``` ``` ## Sensitivity Specificity ## 0.8008060 0.3663289 ``` ```r Conf_mat2 <- confusionMatrix(as.factor(ifelse(fitted(arsreg2) >= 0.5, "1","0")), as.factor(arsenic$switch),positive = "1") Conf_mat2$overall["Accuracy"]; Conf_mat2$byClass[c("Sensitivity","Specificity")] ``` ``` ## Accuracy ## 0.6155629 ``` ``` ## Sensitivity Specificity ## 0.8008060 0.3647701 ``` These results again confirm our EDA and the test. We still need to take a step back and fix the relationship with arsenic. We will do so in the next class and build our final model a bit more carefully.