class: center, middle, inverse, title-slide

# MLR: model assessment, diagnostics and validation I
### Dr. Olanrewaju Michael Akande
### Sept 3, 2019

---

## Outline

- Questions from last lecture

- Checking model assumptions

- Questions

---

## Introduction

- In the last class, we started our introduction to multiple linear regression.

--

- We saw how to write down and fit MLR models. However, we ignored a very important part of fitting any model: model assessment and validation.

--

- We need to assess whether or not the assumptions of the model actually hold for the data at hand. That's exactly what we will dive into today.

--

- In the next class, we will cover model validation via in-sample and out-of-sample predictions.

--

- Later on, we will also touch on transformations, multicollinearity, and heteroscedasticity.

---

class: center, middle

# Assumptions for MLR

---

## Assumptions for MLR

- Inference (CIs, p-values, or predictions) can only be meaningful when the regression assumptions are plausible.

--

- <div class="question">
Can you list the assumptions for MLR?
</div>

--

- The main assumptions are:
  1. Linearity
  2. Independence of errors
  3. Equal variance
  4. Normality

---

## First, think about validity of the model

- Validity is about whether the data and model are even suitable for answering the research question.

--

- To quote the Gelman and Hill book,
  .block[
"Optimally, this means that the outcome measure should accurately reflect the phenomenon of interest, the model should include all relevant predictors, and the model should generalize to cases to which it will be applied. For example, a model of earnings will not necessarily tell you about patterns of total assets, nor will a model of test scores necessarily tell you about child intelligence or cognitive development."
  ]

--

- You should always keep in mind the types of questions you can and cannot answer reliably from both the data and model.

---

## Checking assumptions

- Checking all four assumptions usually requires examining the residuals after model fitting.

--

- For .hlight[linearity], a plot of the residuals versus each predictor will often do.

--

<div class="question">
What should we look out for in such a plot?
</div>

--

- Note that the residuals contain information about the response variable `\(y\)` that has not been explained by the covariates in the model.

--

- Thus, when we plot the residuals against each predictor, we should NOT expect to see any pattern.

--

- .block[
Some pattern in any of the plots is usually an indication of a relationship (often nonlinear) between `\(y\)` and that predictor, which has not yet been captured in the model.
]

---

## Checking assumptions

- To check both .hlight[independence] of the errors and the .hlight[equal variance] assumption, a plot of the residuals versus the fitted values will usually do.

--

- .block[
The points in the plot should look random (for independence) and be "roughly" equally spread out around zero (for equal variance).
]

--

- To check .hlight[normality], it is often sufficient to look at a qq-plot (quantile-quantile plot), which compares the distribution of the standardized residuals to the theoretical quantiles of a standard normal distribution.

--

- .block[
Clustering of the points around the 45-degree line of the qq-plot usually implies the normality assumption is not violated.
]

--

- One can also look at a histogram of the residuals, but it is much harder to judge deviations from normality through histograms.
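---

## Checking assumptions

For reference, here is a minimal sketch of how these diagnostic plots can be produced in R. The `fit` object and the built-in `cars` data are placeholders, not part of our analysis.

```r
# A minimal sketch of the usual residual diagnostics for a fitted lm object.
# `fit` and the cars data are placeholders; substitute your own model.
fit <- lm(dist ~ speed, data = cars)

plot(x = fitted(fit), y = resid(fit),   # residuals vs fitted values
     xlab = "Fitted values", ylab = "Residual")
abline(h = 0, lty = 2)

qqnorm(rstandard(fit))                  # qq-plot of standardized residuals
qqline(rstandard(fit))

hist(resid(fit), xlab = "Residual",     # histogram of residuals
     main = "Histogram of residuals")
```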
---

## Checking assumptions

Let's revisit the same data from last class. Recall the model we fit to the data and the results:
.small[
`$$\textrm{bsal}_i = \beta_0 + \beta_1 \textrm{sex}_i + \beta_2 \textrm{senior}_i + \beta_3 \textrm{age}_i + \beta_4 \textrm{educ}_i + \beta_5 \textrm{exper}_i + \epsilon_i$$`
]

```
## 
## Call:
## lm(formula = bsal ~ sex + senior + age + educ + exper, data = wages)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1217.36  -342.83   -55.61   297.10  1575.53 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6277.8934   652.2713   9.625 2.36e-15 ***
## sexFemale   -767.9127   128.9700  -5.954 5.39e-08 ***
## senior       -22.5823     5.2957  -4.264 5.08e-05 ***
## age            0.6310     0.7207   0.876 0.383692    
## educ          92.3060    24.8635   3.713 0.000361 ***
## exper          0.5006     1.0553   0.474 0.636388    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 508.1 on 87 degrees of freedom
## Multiple R-squared:  0.5152,	Adjusted R-squared:  0.4873 
## F-statistic: 18.49 on 5 and 87 DF,  p-value: 1.811e-12
```

---

## Checking linearity

Now, let's plot the residuals against each predictor. First, let's look at .hlight[senior].

```r
plot(y=regwage$residual, x = wages$senior, xlab = "Seniority", ylab = "Residual")
abline(0,0)
```

<img src="03-reg-model-assess-I_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />

--

<div class="question">
Do you see any clear violations of the linearity assumption?
</div>

---

## Checking linearity

Next, let's look at .hlight[age].

```r
plot(y=regwage$residual, x = wages$age, xlab = "Age", ylab = "Residual")
abline(0,0)
```

<img src="03-reg-model-assess-I_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />

--

<div class="question">
Do you see any clear violations of the linearity assumption?
</div>

---

## Checking linearity

Next, let's look at .hlight[educ].

```r
plot(y=regwage$residual, x = wages$educ, xlab = "Education", ylab = "Residual")
abline(0,0)
```

<img src="03-reg-model-assess-I_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />

--

<div class="question">
Do you see any clear violations of the linearity assumption?
</div>

---

## Checking linearity

Next, let's look at .hlight[exper].

```r
plot(y=regwage$residual, x = wages$exper, xlab = "Experience", ylab = "Residual")
abline(0,0)
```

<img src="03-reg-model-assess-I_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />

--

<div class="question">
Do you see any clear violations of the linearity assumption?
</div>

---

## Checking independence and equal variance

```r
plot(regwage,which=1)
```

<img src="03-reg-model-assess-I_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />

--

<div class="question">
Do you see any clear violations of the independence and equal variance assumptions?
</div>

---

## Checking normality

```r
plot(regwage,which=2)
```

<img src="03-reg-model-assess-I_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" />

--

<div class="question">
Do you see any clear violations of the normality assumption?
</div>
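---

## Aside: smoothing the residual plots

As an optional aid (a sketch, not part of the original analysis), overlaying a loess smoother on the residual plots with `scatter.smooth()` can make subtle trends, such as a quadratic pattern, easier to spot.

```r
# Optional: residuals vs the suspect predictors, with a loess smoother overlaid
# (regwage and wages are the model and data used throughout these slides)
scatter.smooth(x = wages$age, y = regwage$residuals,
               xlab = "Age", ylab = "Residual")
abline(h = 0, lty = 2)

scatter.smooth(x = wages$exper, y = regwage$residuals,
               xlab = "Experience", ylab = "Residual")
abline(h = 0, lty = 2)
```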
---

## Takeaways from residual plots

- Looks like we may have to worry about the assumption of linearity being violated for age and experience.

--

- There appears to be some quadratic trend for both variables, and possibly non-constant variance, so let's improve the model by adding a squared term for each variable.

--

- First, let's mean-center the continuous predictors to improve interpretation of the outputs (especially the intercept).

--

- Centering does not really improve model fit, but it does help with interpretability.

--

- Intercepts especially are often hard to interpret, because they represent the value of the response when all predictors equal zero. This may be an unrealistic or uninteresting case.

---

## Centering

- Instead, for each continuous predictor, we can subtract its mean from every value, and use these mean-centered predictors in the regression.

--

- The intercept can now be interpreted as the average value of `\(Y\)` at the average value of `\(X\)`, which is much more interpretable.

--

- Centering can be especially useful in models with interactions (which we have yet to explore).

--

- Centering can also help with multicollinearity (which we will dive into properly in the next class).

--

- Essentially, a transformed variable `\(x_j^2\)` may be highly correlated with its untransformed counterpart `\(x_j\)`, which we want to avoid. Centering `\(x_j\)` before squaring helps with that.

--

- From now on in this example, we will mean-center the continuous predictors.

---

## Centering

```r
wages$agec <- c(scale(wages$age,scale=F))
wages$seniorc <- c(scale(wages$senior,scale=F))
wages$experc <- c(scale(wages$exper,scale=F))
wages$educc <- c(scale(wages$educ,scale=F))
regwagec <- lm(bsal ~ sex + seniorc + agec + educc + experc, data = wages)
summary(regwagec)
```

```
## 
## Call:
## lm(formula = bsal ~ sex + seniorc + agec + educc + experc, data = wages)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1217.36  -342.83   -55.61   297.10  1575.53 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5924.0072    99.6588  59.443  < 2e-16 ***
## sexFemale   -767.9127   128.9700  -5.954 5.39e-08 ***
## seniorc      -22.5823     5.2957  -4.264 5.08e-05 ***
## agec           0.6310     0.7207   0.876 0.383692    
## educc         92.3060    24.8635   3.713 0.000361 ***
## experc         0.5006     1.0553   0.474 0.636388    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 508.1 on 87 degrees of freedom
## Multiple R-squared:  0.5152,	Adjusted R-squared:  0.4873 
## F-statistic: 18.49 on 5 and 87 DF,  p-value: 1.811e-12
```

---

## Centering

- Notice that the coefficients for the predictors have not changed, but the intercept has.

--

- We interpret the intercept as the average baseline salary for male employees who are 474 months old, have 82 months of seniority, 12.5 years of education, and 101 months of experience.

```r
colMeans(wages[,c("age","senior","educ","exper")])
```

```
##       age    senior      educ     exper 
## 474.39785  82.27957  12.50538 100.92742
```

--

- Now, back to the model diagnostics and refinement.

--

- Let's add the squared terms of the centered age and centered experience predictors to the dataset and refit the model.

```r
wages$agec2 <- wages$agec^2
wages$experc2 <- wages$experc^2
```
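--

- As an aside (a sketch, not in the original slides), the same squared terms can be created directly inside the formula with `I()`, without adding new columns to the data; this is equivalent to the model fit on the next slide.

```r
# Hypothetical equivalent fit: square the centered predictors inside the formula via I()
regwage_alt <- lm(bsal ~ sex + seniorc + agec + I(agec^2) + educc + experc + I(experc^2),
                  data = wages)
```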
---

## Re-fitting the model

```r
regwagecsquares <- lm(bsal~sex+seniorc+agec+agec2+educc+experc+experc2,data=wages)
summary(regwagecsquares)
```

```
## 
## Call:
## lm(formula = bsal ~ sex + seniorc + agec + agec2 + educc + experc + 
##     experc2, data = wages)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1086.84  -267.84    -8.71   304.92  1642.44 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.098e+03  1.123e+02  54.313  < 2e-16 ***
## sexFemale   -7.684e+02  1.211e+02  -6.343 1.04e-08 ***
## seniorc     -1.764e+01  5.265e+00  -3.351  0.00120 ** 
## agec        -3.473e-01  7.814e-01  -0.444  0.65783    
## agec2        7.195e-04  4.045e-03   0.178  0.85925    
## educc        7.561e+01  2.406e+01   3.142  0.00231 ** 
## experc       4.035e+00  1.479e+00   2.729  0.00772 ** 
## experc2     -2.298e-02  7.592e-03  -3.027  0.00326 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 477 on 85 degrees of freedom
## Multiple R-squared:  0.5825,	Adjusted R-squared:  0.5481 
## F-statistic: 16.94 on 7 and 85 DF,  p-value: 8.011e-14
```

---

## How well does the model fit the data?

The first model: .hlight[NO quadratic term]

<img src="03-reg-model-assess-I_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" />

---

## How well does the model fit the data?

Now the latest model: .hlight[add quadratic term]

<img src="03-reg-model-assess-I_files/figure-html/unnamed-chunk-14-1.png" style="display: block; margin: auto;" />

---

## Interpreting the new model

Clearly, .hlight[experc2] is an important predictor given all other predictors. However, it can be tough to interpret its effect using the coefficient alone. Instead, let's visualize the effect of changing experience.

```r
# First, make a grid of experience values (0 to 400 months, in steps of 5) to examine
newexper <- seq(from=0,to=400,by=5)
newexperc <- newexper - mean(wages$exper)
newexperc2 <- newexperc^2
newdata <- data.frame(matrix(0, nrow=length(newexper), ncol=7))
names(newdata) <- c("sex","seniorc","agec","agec2","educc","experc","experc2")
newdata$experc <- newexperc; newdata$experc2 <- newexperc2; newdata$sex <- "Male"
# Since we use mean-centered predictors, the rows in the new dataset correspond to
# people with average values of seniority, age, and education.
preds_male <- predict(regwagecsquares,newdata,interval="confidence"); preds_male[1:3,]
```

```
##        fit      lwr      upr
## 1 5457.022 4983.617 5930.426
## 2 5499.822 5049.929 5949.715
## 3 5541.473 5114.091 5968.854
```

```r
newdata$sex <- "Female"; preds_female <- predict(regwagecsquares,newdata,interval="confidence"); preds_female[1:3,]
```

```
##        fit      lwr      upr
## 1 4688.582 4269.895 5107.269
## 2 4731.382 4337.545 5125.219
## 3 4773.033 4403.057 5143.009
```

---

## Interpreting the new model

```r
plot(y=preds_male[,"fit"],x=newexper,xlab="Experience",ylab="Predicted Wages",
     main="Expected Change in B.Wages with Experience",col="darkblue",ylim=c(4700,6300))
points(y=preds_female[,"fit"], x=newexper,col="orange")
legend("bottomright",c("Male","Female"),col=c("darkblue","orange"),lty=c(2,2))
# Remember that this is with average values of other predictors.
```

<img src="03-reg-model-assess-I_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" />

--

<div class="question">
Why do we have the exact same trend for male and female?
</div>
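---

## Interpreting the new model

- A quick check (a sketch, not in the original slides): the model contains no sex-by-experience interaction, so the male and female curves differ only by the `sexFemale` coefficient, i.e., by a constant vertical shift at every value of experience.

```r
# The male-female gap in predicted wages is constant across experience and equals
# the negative of the sexFemale coefficient, since there is no interaction term
head(preds_male[,"fit"] - preds_female[,"fit"])
coef(regwagecsquares)["sexFemale"]
```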
---

## Interpreting the new model

```r
# If you want to get the 95% confidence bands on the plot as well, you can do the following
plot(y=preds_male[,"fit"], x=newexper, xlab="Experience", ylab="Predicted Wages",
     main="Expected Change in B.Wages with Experience",col="darkblue",ylim=c(4200,6700))
points(y=preds_female[,"fit"], x=newexper,col="orange")
legend("bottomright",c("Male","Female"),col=c("darkblue","orange"),lty=c(1,1))
lines(y=preds_male[,"lwr"],x=newexper,col="darkblue",lty=2)
lines(y=preds_male[,"upr"],x=newexper,col="darkblue",lty=2)
lines(y=preds_female[,"lwr"],x=newexper,col="orange",lty=2)
lines(y=preds_female[,"upr"],x=newexper,col="orange",lty=2)
```

<img src="03-reg-model-assess-I_files/figure-html/unnamed-chunk-17-1.png" style="display: block; margin: auto;" />

---

## Final notes

- .block[
Generally, it is a good idea to start with exploratory data analysis (which we did a bit of in the last class) rather than jumping right into modeling.
]

--

- After fitting your model, model assessment is A MUST!

--

- In this class and outside of it, you should always assess your models!

--

- In the next class, we will continue with model assessment by exploring leverage, influence, and standardized residuals.

--

- We will also dive into model validation.