class: center, middle, inverse, title-slide

# MLR: model assessment, diagnostics and validation I
### Dr. Olanrewaju Michael Akande
### Sept 3, 2019

---

## Outline

- Questions from last lecture

- Checking model assumptions

- Questions

---

## Introduction

- In the last class, we started our introduction to multiple linear regression.

--

- We saw how to write down and fit MLR models. However, we ignored a very important part of fitting any model: model assessment and validation.

--

- We need to assess whether or not the assumptions of the model actually hold for the data at hand. That's exactly what we will dive into today.

--

- In the next class, we will cover model validation via in-sample and out-of-sample predictions.

--

- Later on, we will also touch on transformations, multicollinearity, and heteroscedasticity.

---

class: center, middle

# Assumptions for MLR

---

## Assumptions for MLR

- Inference (CIs, p-values, or predictions) can only be meaningful when the regression assumptions are plausible.

--

- <div class="question">
Can you list the assumptions for MLR?
</div>

--

- The main assumptions are:
  1. Linearity
  2. Independence of errors
  3. Equal variance
  4. Normality

---

## First, think about validity of the model

- Validity is about whether the data and model are even suitable for answering the research question.

--

- To quote the Gelman and Hill book,
  .block[
"Optimally, this means that the outcome measure should accurately reflect the phenomenon of interest, the model should include all relevant predictors, and the model should generalize to cases to which it will be applied. For example, a model of earnings will not necessarily tell you about patterns of total assets, nor will a model of test scores necessarily tell you about child intelligence or cognitive development."
  ]

--

- You should always keep in mind the types of questions you can and cannot answer reliably from both the data and model.

---

## Checking assumptions

- Checking all four assumptions usually requires examining the residuals after model fitting.

--

- For .hlight[linearity], a plot of the residuals versus each predictor will often do.

--

<div class="question">
What should we look out for in such a plot?
</div>

--

- Note that the residuals contain information about the response variable `\(y\)` that has not been explained by the covariates in the model.

--

- Thus, when we plot the residuals against each predictor, we should NOT expect to see any pattern.

--

- .block[
Some pattern in any of the plots is usually an indication of a relationship (often nonlinear) between `\(y\)` and that predictor, which has not yet been captured in the model.
]

---

## Checking assumptions

- To check both .hlight[independence] of the errors and the .hlight[equal variance] assumption, a plot of the residuals versus the fitted values will usually do.

--

- .block[
The points in the plot should look random (for independence) and be "roughly" equally spread out around zero (for equal variance).
]

--

- To check .hlight[normality], it is often sufficient to look at a qq-plot (quantile-quantile plot), which compares the distribution of the standardized residuals to the theoretical quantiles of a standard normal distribution.

--

- .block[
Clustering of the points around the 45-degree line of the qq-plot usually implies the normality assumption is not violated.
]

--

- One can also look at a histogram of the residuals, but it is much harder to judge deviations from normality through histograms.
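---

## Checking assumptions

For reference, here is a minimal sketch of how these diagnostic plots can be produced in R. The `fit` object and the built-in `cars` data are placeholders, not part of our analysis.

```r
# A minimal sketch of the usual residual diagnostics for a fitted lm object.
# `fit` and the cars data are placeholders; substitute your own model.
fit <- lm(dist ~ speed, data = cars)

plot(x = fitted(fit), y = resid(fit),   # residuals vs fitted values
     xlab = "Fitted values", ylab = "Residual")
abline(h = 0, lty = 2)

qqnorm(rstandard(fit))                  # qq-plot of standardized residuals
qqline(rstandard(fit))

hist(resid(fit), xlab = "Residual",     # histogram of residuals
     main = "Histogram of residuals")
```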
---

## Checking assumptions

Let's revisit the same data from last class. Recall the model we fit to the data and the results:
.small[
`$$\textrm{bsal}_i = \beta_0 + \beta_1 \textrm{sex}_i + \beta_2 \textrm{senior}_i + \beta_3 \textrm{age}_i + \beta_4 \textrm{educ}_i + \beta_5 \textrm{exper}_i + \epsilon_i$$`
]

```
## 
## Call:
## lm(formula = bsal ~ sex + senior + age + educ + exper, data = wages)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1217.36  -342.83   -55.61   297.10  1575.53 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6277.8934   652.2713   9.625 2.36e-15 ***
## sexFemale   -767.9127   128.9700  -5.954 5.39e-08 ***
## senior       -22.5823     5.2957  -4.264 5.08e-05 ***
## age            0.6310     0.7207   0.876 0.383692    
## educ          92.3060    24.8635   3.713 0.000361 ***
## exper          0.5006     1.0553   0.474 0.636388    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 508.1 on 87 degrees of freedom
## Multiple R-squared:  0.5152,	Adjusted R-squared:  0.4873 
## F-statistic: 18.49 on 5 and 87 DF,  p-value: 1.811e-12
```

---

## Checking linearity

Now, let's plot the residuals against each predictor. First, let's look at .hlight[senior].

```r
plot(y=regwage$residual, x = wages$senior, xlab = "Seniority", ylab = "Residual")
abline(0,0)
```

<img src="03-reg-model-assess-I_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />

--

<div class="question">
Do you see any clear violations of the linearity assumption?
</div>

---

## Checking linearity

Next, let's look at .hlight[age].

```r
plot(y=regwage$residual, x = wages$age, xlab = "Age", ylab = "Residual")
abline(0,0)
```

<img src="03-reg-model-assess-I_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />

--

<div class="question">
Do you see any clear violations of the linearity assumption?
</div>

---

## Checking linearity

Next, let's look at .hlight[educ].

```r
plot(y=regwage$residual, x = wages$educ, xlab = "Education", ylab = "Residual")
abline(0,0)
```

<img src="03-reg-model-assess-I_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />

--

<div class="question">
Do you see any clear violations of the linearity assumption?
</div>

---

## Checking linearity

Next, let's look at .hlight[exper].

```r
plot(y=regwage$residual, x = wages$exper, xlab = "Experience", ylab = "Residual")
abline(0,0)
```

<img src="03-reg-model-assess-I_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />

--

<div class="question">
Do you see any clear violations of the linearity assumption?
</div>

---

## Checking independence and equal variance

```r
plot(regwage,which=1)
```

<img src="03-reg-model-assess-I_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />

--

<div class="question">
Do you see any clear violations of the independence and equal variance assumptions?
</div>

---

## Checking normality

```r
plot(regwage,which=2)
```

<img src="03-reg-model-assess-I_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" />

--

<div class="question">
Do you see any clear violations of the normality assumption?
</div>
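---

## Aside: smoothing the residual plots

As an optional aid (a sketch, not part of the original analysis), overlaying a loess smoother on the residual plots with `scatter.smooth()` can make subtle trends, such as a quadratic pattern, easier to spot.

```r
# Optional: residuals vs the suspect predictors, with a loess smoother overlaid
# (regwage and wages are the model and data used throughout these slides)
scatter.smooth(x = wages$age, y = regwage$residuals,
               xlab = "Age", ylab = "Residual")
abline(h = 0, lty = 2)

scatter.smooth(x = wages$exper, y = regwage$residuals,
               xlab = "Experience", ylab = "Residual")
abline(h = 0, lty = 2)
```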
---

## Takeaways from residual plots

- Looks like we may have to worry about the assumption of linearity being violated for age and experience.

--

- There appears to be some quadratic trend for both variables, and possibly non-constant variance, so let's improve the model by adding a squared term for each variable.

--

- First, let's mean-center the continuous predictors to improve interpretation of the outputs (especially the intercept).

--

- Centering does not really improve model fit, but it does help with interpretability.

--

- Intercepts especially are often hard to interpret, because they represent the value of the response when all predictors equal zero. This may be an unrealistic or uninteresting case.

---

## Centering

- Instead, for each continuous predictor, we can subtract its mean from every value, and use these mean-centered predictors in the regression.

--

- The intercept can now be interpreted as the average value of `\(Y\)` at the average value of `\(X\)`, which is much more interpretable.

--

- Centering can be especially useful in models with interactions (which we have yet to explore).

--

- Centering can also help with multicollinearity (which we will dive into properly in the next class).

--

- Essentially, a transformed variable `\(x_j^2\)` may be highly correlated with its untransformed counterpart `\(x_j\)`, which we want to avoid. Centering `\(x_j\)` before squaring helps with that.

--

- From now on in this example, we will mean-center the continuous predictors.

---

## Centering

```r
wages$agec <- c(scale(wages$age,scale=F))
wages$seniorc <- c(scale(wages$senior,scale=F))
wages$experc <- c(scale(wages$exper,scale=F))
wages$educc <- c(scale(wages$educ,scale=F))
regwagec <- lm(bsal ~ sex + seniorc + agec + educc + experc, data = wages)
summary(regwagec)
```

```
## 
## Call:
## lm(formula = bsal ~ sex + seniorc + agec + educc + experc, data = wages)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1217.36  -342.83   -55.61   297.10  1575.53 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5924.0072    99.6588  59.443  < 2e-16 ***
## sexFemale   -767.9127   128.9700  -5.954 5.39e-08 ***
## seniorc      -22.5823     5.2957  -4.264 5.08e-05 ***
## agec           0.6310     0.7207   0.876 0.383692    
## educc         92.3060    24.8635   3.713 0.000361 ***
## experc         0.5006     1.0553   0.474 0.636388    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 508.1 on 87 degrees of freedom
## Multiple R-squared:  0.5152,	Adjusted R-squared:  0.4873 
## F-statistic: 18.49 on 5 and 87 DF,  p-value: 1.811e-12
```

---

## Centering

- Notice that the coefficients for the predictors have not changed, but the intercept has.

--

- We interpret the intercept as the average baseline salary for male employees who are 474 months old, have 82 months of seniority, 12.5 years of education, and 101 months of experience.

```r
colMeans(wages[,c("age","senior","educ","exper")])
```

```
##       age    senior      educ     exper 
## 474.39785  82.27957  12.50538 100.92742
```

--

- Now, back to the model diagnostics and refinement.

--

- Let's add the squared terms of the centered age and centered experience predictors to the dataset and refit the model.

```r
wages$agec2 <- wages$agec^2
wages$experc2 <- wages$experc^2
```
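--

- As an aside (a sketch, not in the original slides), the same squared terms can be created directly inside the formula with `I()`, without adding new columns to the data; this is equivalent to the model fit on the next slide.

```r
# Hypothetical equivalent fit: square the centered predictors inside the formula via I()
regwage_alt <- lm(bsal ~ sex + seniorc + agec + I(agec^2) + educc + experc + I(experc^2),
                  data = wages)
```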
---

## Re-fitting the model

```r
regwagecsquares <- lm(bsal~sex+seniorc+agec+agec2+educc+experc+experc2,data=wages)
summary(regwagecsquares)
```

```
## 
## Call:
## lm(formula = bsal ~ sex + seniorc + agec + agec2 + educc + experc + 
##     experc2, data = wages)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1086.84  -267.84    -8.71   304.92  1642.44 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.098e+03  1.123e+02  54.313  < 2e-16 ***
## sexFemale   -7.684e+02  1.211e+02  -6.343 1.04e-08 ***
## seniorc     -1.764e+01  5.265e+00  -3.351  0.00120 ** 
## agec        -3.473e-01  7.814e-01  -0.444  0.65783    
## agec2        7.195e-04  4.045e-03   0.178  0.85925    
## educc        7.561e+01  2.406e+01   3.142  0.00231 ** 
## experc       4.035e+00  1.479e+00   2.729  0.00772 ** 
## experc2     -2.298e-02  7.592e-03  -3.027  0.00326 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 477 on 85 degrees of freedom
## Multiple R-squared:  0.5825,	Adjusted R-squared:  0.5481 
## F-statistic: 16.94 on 7 and 85 DF,  p-value: 8.011e-14
```

---

## How well does the model fit the data?

The first model: .hlight[NO quadratic term]

<img src="03-reg-model-assess-I_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" />

---

## How well does the model fit the data?

Now the latest model: .hlight[add quadratic term]

<img src="03-reg-model-assess-I_files/figure-html/unnamed-chunk-14-1.png" style="display: block; margin: auto;" />

---

## Interpreting the new model

Clearly, .hlight[experc2] is an important predictor given all other predictors. However, it can be tough to interpret its effect using the coefficient alone. Instead, let's visualize the effect of changing experience.

```r
# First, make a grid of experience values (0 to 400 months, in steps of 5) to examine
newexper <- seq(from=0,to=400,by=5)
newexperc <- newexper - mean(wages$exper)
newexperc2 <- newexperc^2
newdata <- data.frame(matrix(0, nrow=length(newexper), ncol=7))
names(newdata) <- c("sex","seniorc","agec","agec2","educc","experc","experc2")
newdata$experc <- newexperc; newdata$experc2 <- newexperc2; newdata$sex <- "Male"
# Since we use mean-centered predictors, the rows in the new dataset correspond to
# people with average values of seniority, age, and education.
preds_male <- predict(regwagecsquares,newdata,interval="confidence"); preds_male[1:3,]
```

```
##        fit      lwr      upr
## 1 5457.022 4983.617 5930.426
## 2 5499.822 5049.929 5949.715
## 3 5541.473 5114.091 5968.854
```

```r
newdata$sex <- "Female"; preds_female <- predict(regwagecsquares,newdata,interval="confidence"); preds_female[1:3,]
```

```
##        fit      lwr      upr
## 1 4688.582 4269.895 5107.269
## 2 4731.382 4337.545 5125.219
## 3 4773.033 4403.057 5143.009
```

---

## Interpreting the new model

```r
plot(y=preds_male[,"fit"],x=newexper,xlab="Experience",ylab="Predicted Wages",
     main="Expected Change in B.Wages with Experience",col="darkblue",ylim=c(4700,6300))
points(y=preds_female[,"fit"], x=newexper,col="orange")
legend("bottomright",c("Male","Female"),col=c("darkblue","orange"),lty=c(2,2))
# Remember that this is with average values of other predictors.
```

<img src="03-reg-model-assess-I_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" />

--

<div class="question">
Why do we have the exact same trend for male and female?
</div>
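---

## Interpreting the new model

- A quick check (a sketch, not in the original slides): the model contains no sex-by-experience interaction, so the male and female curves differ only by the `sexFemale` coefficient, i.e., by a constant vertical shift at every value of experience.

```r
# The male-female gap in predicted wages is constant across experience and equals
# the negative of the sexFemale coefficient, since there is no interaction term
head(preds_male[,"fit"] - preds_female[,"fit"])
coef(regwagecsquares)["sexFemale"]
```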
---

## Interpreting the new model

```r
# If you want to get the 95% confidence bands on the plot as well, you can do the following
plot(y=preds_male[,"fit"], x=newexper, xlab="Experience", ylab="Predicted Wages",
     main="Expected Change in B.Wages with Experience",col="darkblue",ylim=c(4200,6700))
points(y=preds_female[,"fit"], x=newexper,col="orange")
legend("bottomright",c("Male","Female"),col=c("darkblue","orange"),lty=c(1,1))
lines(y=preds_male[,"lwr"],x=newexper,col="darkblue",lty=2)
lines(y=preds_male[,"upr"],x=newexper,col="darkblue",lty=2)
lines(y=preds_female[,"lwr"],x=newexper,col="orange",lty=2)
lines(y=preds_female[,"upr"],x=newexper,col="orange",lty=2)
```

<img src="03-reg-model-assess-I_files/figure-html/unnamed-chunk-17-1.png" style="display: block; margin: auto;" />

---

## Final notes

- .block[
Generally, it is a good idea to start with exploratory data analysis (which we did a bit of in the last class) rather than jumping right into modeling.
]

--

- After fitting your model, model assessment is A MUST!

--

- In this class and outside of it, you should always assess your models!

--

- In the next class, we will continue with model assessment by exploring leverage, influence, and standardized residuals.

--

- We will also dive into model validation.