class: center, middle, inverse, title-slide # MLR: special predictors and multicollinearity ### Dr. Olanrewaju Michael Akande ### Sept 10, 2019 --- ## Announcements - Due dates for assignments will now be specified on an assignment-to-assignment basis. - Homework 2 is online; due date is 11:59pm next Tuesday, September 17. - Submit on Gradescope. --- ## Outline - Questions from last lecture - Special predictors: higher order terms, indicator/dummy variables, and interaction terms - Multicollinearity - In-class example --- ## Introduction - In the last class, we covered model assessment and validation. -- - Today, we will move on to controlling for and interpreting special predictors in regression models. -- - Specifically, we will talk about higher order terms, indicator/dummy variables and interaction terms. -- - We have already had some introduction to the first two, so we will simply formalize the concepts. -- - This will lead us to start talking about model selection. -- - In the next class, we will dive properly into model building, selection and comparison for MLR models. --- class: center, middle # Special predictors --- ## Special predictors: higher order terms - From the Harris Trust and Savings Bank example, we have already seen that the relationships between a response variable and some of the predictors can potentially be nonlinear. -- - Sometimes our outcome of interest can appear to have quadratic or even higher order polynomial trends with some predictors. -- - Whenever this is the case, we should look to include squared terms or higher order powers of predictors to capture the trends. -- - In that particular example, we included squared terms for both age and experience. -- - .block[General practice: include all lower order terms when including higher order ones (even if the lower order terms are not significant). This aids interpretation. ] -- - As we have seen before, the best way to present results when including quadratic/polynomial trends is to plot the predicted average of `\(Y\)` for different values of `\(X\)`; see the sketch on the next slide.
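---
## Special predictors: higher order terms

Here is a rough sketch of how fitting a squared term and plotting the predicted average might look in R. The data frame `wages` is a placeholder name (not necessarily the one from the Harris Bank lab), and `fsex` is assumed to be a 0/1 indicator with 1 = female.

```r
# Sketch only: data frame and variable names are placeholders.
wages$agec   <- wages$age - mean(wages$age)      # mean center before squaring
wages$experc <- wages$exper - mean(wages$exper)
fit <- lm(bsal ~ agec + I(agec^2) + experc + I(experc^2) + fsex, data = wages)

# Predicted average of bsal over a grid of ages, holding the other
# predictors at fixed values (average experience, males).
grid <- data.frame(agec   = seq(min(wages$agec), max(wages$agec), length.out = 100),
                   experc = 0,
                   fsex   = 0)
plot(grid$agec + mean(wages$age), predict(fit, newdata = grid), type = "l",
     xlab = "Age", ylab = "Predicted average bsal")
```

The same idea extends to cubic or higher order terms; just remember to keep all the lower order terms in the model.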
--- ## Special predictors: indicator/dummy variables - From the Harris Trust and Savings Bank example, we have also seen how to include binary variables in a MLR model with the variable `sex`. -- - In the example, we could actually have used the variable `fsex` (where 1=female and 0=male) instead of `sex` to give us the same exact results. -- - That means that we also could have made a variable equal to one for all males and zero for all females, instead. -- - The value of that coefficient would be `\(767\)` instead of `\(-767\)` like we had. All other statistics stay the same (SE, t-stat, p-value). Other coefficients also remain the same. -- - It turns out that we cannot include indicator variables for the two values of the same binary variable when we also include the intercept. --- ## Special predictors: indicator/dummy variables - It is not possible to estimate all three of these parameters in the same model uniquely. -- - The exact same problem arises for any set of predictors such that one is an exact linear combination of the others. -- - Example: Consider a regression model with dummy variables for both males and females, plus an intercept. .block[ .small[ `$$y_i = \beta_0 + \beta_1 \textrm{M}_i + \beta_2 \textrm{F}_i + \epsilon_i = \beta_0*1 + \beta_1 \textrm{M}_i + \beta_2 \textrm{F}_i + \epsilon_i$$` ] ] -- - Note that `\(\textrm{M}_i + \textrm{F}_i = 1\)` for all cases. Thus, .block[ .small[ `$$y_i = \beta_0*(\textrm{M}_i + \textrm{F}_i) + \beta_1 \textrm{M}_i + \beta_2 \textrm{F}_i + \epsilon_i = (\beta_0+\beta_1) \textrm{M}_i + (\beta_0+\beta_2) \textrm{F}_i + \epsilon_i.$$` ] ] We can estimate `\((\beta_0+\beta_1)\)` and `\((\beta_0+\beta_2)\)` but not all three uniquely. -- - .block[Side note: there is no need to mean center dummy variables, since they have a natural interpretation at zero.] --- ## Special predictors: indicator/dummy variables - What if a categorical variable has `\(k > 2\)` levels? -- - Make `\(k\)` dummy variables, one for each level. -- - Use only `\(k-1\)` of the levels in the regression model, since we cannot uniquely estimate all `\(k\)` at once if we also include an intercept (see previous slide). -- - The excluded level is called the baseline. -- - R will actually do this for you automatically; that is, make the `\(k-1\)` dummy variables and set the first level as the baseline. -- - The coefficients of the dummy variables are interpreted as changes in average `\(Y\)` relative to the baseline. -- - We will go through an in-class example today. --- ## Special predictors: interaction terms - Sometimes the relationship of a predictor with `\(Y\)` depends on the values of other predictors. This is called an .hlight[interaction effect]. -- - Sometimes, the question we wish to answer would require including interactions in the model, even though they might not be significant. -- - An example of an interaction effect in the Harris Bank dataset would be if the effect of age on starting salary were different for males versus females. -- - That is, what if older males are paid higher starting salaries than younger males but the reverse is actually the case for females? -- - How do we account for such interaction effects? Make an interaction predictor: multiply one predictor by the other predictor. -- - .block[General practice is to include all main effects (each variable without interaction) when including interactions.] --- ## Testing if groups of coefficients are equal to zero - With so many variables (polynomial terms, dummy variables and interactions) in a linear model, we may want to test if multiple coefficients are equal to zero or not. -- - We can do so using an F test (a nested F test in this case). -- - First, we fit a MLR model with all `\(p\)` predictors. That is, .block[ .small[ `$$\textrm{M}_1: \ y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_p x_{ip} + \epsilon_i; \ \ \epsilon_i \overset{iid}{\sim} N(0, \sigma^2).$$` ] ] -- - We can compute the sum of squares of the errors `\((\textrm{SSE}_1)\)` or residual sum of squares `\((\textrm{RSS}_1)\)` for the FULL model, that is, .block[ .small[ `$$\textrm{RSS}_1 = \sum^n_{i=1} \left(y_i - \hat{y}_i \right)^2.$$` ] ] --- ## Testing if groups of coefficients are equal to zero - Now suppose we want to test that a particular subset of `\(q\)` of the coefficients are zero. .block[ .small[ `$$H_0: \beta_{p-q+1} = \beta_{p-q+2} = \ldots = \beta_p = 0.$$` ] ] -- - We fit a reduced model that uses all the variables except the last `\(q\)`, that is, .block[ .small[ `$$\textrm{M}_{0}: \ y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_{p-q} x_{i(p-q)} + \epsilon_i; \ \ \epsilon_i \overset{iid}{\sim} N(0, \sigma^2).$$` ] ] -- - Let's call the residual sum of squares for that model `\(\textrm{RSS}_0\)`. -- <div class="question"> Which of the two RSS values would be larger? Why? </div> -- - Then the appropriate F-statistic is .block[ .small[ `$$F = \dfrac{(\textrm{RSS}_0 - \textrm{RSS}_1)/q}{\textrm{RSS}_1/(n-p-1)}.$$` ] ] --- ## Testing if groups of coefficients are equal to zero - To calculate the p-value, look for the area under the `\(F\)` curve with `\(q\)` degrees of freedom in the numerator, and `\((n-p-1)\)` degrees of freedom in the denominator. -- - Guess what? As is the case with pretty much everything else we do in this class, this is so easy to do in R (see the sketch on the next slide)! -- <img src="img/surprised-baby.jpeg" height="300px" style="display: block; margin: auto;" />
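---
## Testing if groups of coefficients are equal to zero

Here is a minimal sketch of the nested F test in R, with placeholder names (`dat`, `y`, `x1`, ..., `x4`): the full model has `\(p = 4\)` predictors and we test the last `\(q = 2\)` coefficients.

```r
# Sketch only: data frame and variable names are placeholders.
full    <- lm(y ~ x1 + x2 + x3 + x4, data = dat)  # M1: all p = 4 predictors
reduced <- lm(y ~ x1 + x2, data = dat)            # M0: drops the last q = 2

anova(reduced, full)  # nested F test; R reports the F-statistic and its p-value

# Or "by hand", using the formula from the previous slide:
RSS0 <- sum(resid(reduced)^2)
RSS1 <- sum(resid(full)^2)
q <- 2; p <- 4; n <- nrow(dat)
Fstat <- ((RSS0 - RSS1) / q) / (RSS1 / (n - p - 1))
pf(Fstat, df1 = q, df2 = n - p - 1, lower.tail = FALSE)  # p-value
```

We will use exactly this kind of `anova()` comparison in the diamonds example later today.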
--- class: center, middle # Multicollinearity --- ## The problem of multicollinearity - Just like we had with the dummy variables, you cannot include two variables with a perfect linear association as predictors in regression. -- - Example: suppose the true population line is .block[ .small[ `$$\textrm{Avg. y} = 3 + 4x.$$` ] ] -- - Suppose we try to include `\(x\)` and `\(z = x/10\)` as predictors in our own model, that is, .block[ .small[ `$$\textrm{Avg. y} = \beta_0 + \beta_1 x + \beta_2 z,$$` ] ] and estimate all coefficients. Since `\(z = x/10\)`, we have .block[ .small[ `$$\textrm{Avg. y} = \beta_0 + \beta_1 x + \beta_2 \dfrac{x}{10} = \beta_0 + \left( \beta_1 + \dfrac{\beta_2}{10} \right) x$$` ] ] -- - We could set `\(\beta_1\)` and `\(\beta_2\)` to ANY two numbers such that `\(\beta_1 + \beta_2/10 = 4\)`. The data cannot choose among the possible combinations. --- ## The problem of multicollinearity - In real data, when we get “close” to perfect collinearities we see standard errors inflate, sometimes massively. -- - When might we get close? + Very high correlations `\((|\rho| > 0.9)\)` among two (or more) predictors in modest sample sizes. + When one or more variables are nearly a linear combination of the others. + Including quadratic terms as predictors without first mean centering the values before squaring. + Including interactions involving continuous variables. --- ## The problem of multicollinearity - How to diagnose: + Look at a correlation matrix of all the predictors (including dummy variables). Look for values near -1 or 1. + If you are suspicious that some predictor is a near linear combination of others, run a regression of that predictor on all other predictors (not including Y) to see if R squared is near 1. + If the R squared is near 1, you should think about centering your variables or maybe even excluding that variable from your regression in some cases. + Take a look at the .hlight[variance inflation factor]. + The variance inflation factor measures how much the multicollinearity between a variable and the other variables in the model inflates the variance of the regression coefficient for that variable. --- ## Variance inflation factor - .block[ .small[ `$$\textrm{VIF}_j = \dfrac{1}{1-R^2_{X_j | X_{-j}}}$$` ] ] where `\(R^2_{X_j | X_{-j}}\)` is the R-squared from the regression of predictor `\(X_j\)` on all other predictors `\((X_1, \ldots,X_{j-1},X_{j+1},\ldots, X_{p})\)`. -- - Since R-squared always lies between 0 and 1, + the denominator `\(1-R^2_{X_j | X_{-j}} \leq 1\)`, + which implies that `\(\textrm{VIF} \geq 1\)`. -- - Generally, VIF of + 1 = not correlated. Why? + between 1 and 5 = moderately correlated. + greater than 5 = highly correlated. - Typically, we start to get worried when VIF > 10 (see the sketch on the next slide).
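---
## Variance inflation factor

A minimal sketch of checking VIFs in R, with placeholder names (`dat`, `y`, `x1`, `x2`, `x3`): either compute `\(R^2_{X_j | X_{-j}}\)` directly, or use the .hlight[vif] function in the .hlight[rms] package, which we use in today's example.

```r
# Sketch only: data frame and variable names are placeholders.
fit <- lm(y ~ x1 + x2 + x3, data = dat)

# VIF for x1 "by hand": regress x1 on the other predictors.
r2_x1 <- summary(lm(x1 ~ x2 + x3, data = dat))$r.squared
1 / (1 - r2_x1)  # VIF for x1

# Or get all the VIFs at once.
library(rms)
vif(fit)
```

A VIF well above 10 for `x1` would suggest it is close to a linear combination of the other predictors.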
--- ## We see multicollinearity... so what? - Multicollinearity is really only a problem if standard errors for the involved coefficients are too large to be useful for interpretation, and you actually care about interpreting those coefficients. -- - In the Harris Bank example, + The main coefficient of interest is the one for `sex`. + The remaining variables are really just "control variables". That is, those variables may be correlated with both `bsal` and `sex`, and so we want to account for their effects in our model. + Recall that the correlation between `age` and `exper` was actually 0.8. + Even with this correlation, it is still okay to keep both in the model since we want to simply account for them but do not care about interpreting either. - Another scenario is prediction: including highly correlated predictors can increase prediction uncertainty. --- ## What to do about multicollinearity? - What if you do want to interpret the coefficients involved in the multicollinearity, and the SEs are inflated substantially because of it? - Easiest remedy: remove one of the "offending" predictors. - Keep the one that is easiest to explain or that has the largest t-statistic. -- - Better remedies: + Mean center (or scale) your variables. It helps but may not always solve the problem. + Use a Bayesian regression model with an informative prior distribution on the parameters (take STA602!). + Get more data! Multicollinearity tends to be unimportant in large sample sizes. --- class: center, middle # Transformations --- ## Log transformation - You have already seen through the homework and lab that transforming variables can help with linearity and normality. - The most common transformation is the natural logarithm. That is, `\(\textrm{log}_e(y)\)` or `\(\textrm{ln}(y)\)`. - This is often because it is the easiest to interpret. - Suppose .block[ .small[ `$$\textrm{ln}(y_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_p x_{ip} + \epsilon_i.$$` ] ] - Then it is easy to see that .block[ .small[ `$$y_i = e^{(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_p x_{ip} + \epsilon_i)} = e^{\beta_0} \times e^{\beta_1 x_{i1}} \times e^{\beta_2 x_{i2}} \times \ldots \times e^{\beta_p x_{ip}} \times e^{\epsilon_i}.$$` ] ] - That is, the predictors actually have a "multiplicative" effect on `\(y\)`. --- ## Log transformation - The estimated `\(\beta_j\)`'s can be interpreted in terms of approximate proportional differences. - For example, suppose `\(\beta_1 = 0.10\)`; then `\(e^{\beta_1} \approx 1.105\)`. - Thus, a difference of 1 unit in `\(x_1\)` corresponds to an expected positive difference of about 10% in `\(y\)`. - Similarly, `\(\beta_1 = -0.10\)` means a difference of 1 unit in `\(x_1\)` corresponds to an expected negative difference of about 10% in `\(y\)`. - When making predictions using the regression of the transformed variable, remember to transform back to the original scale to make your predictions more meaningful (see the sketch on the next slide).
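---
## Log transformation

A minimal sketch of back-transforming predictions in R, with placeholder names (`dat`, `newdat`, `y`, `x1`, `x2`): fit the model on the log scale, then exponentiate to get back to the original scale of `\(y\)`.

```r
# Sketch only: data frame and variable names are placeholders.
fit <- lm(log(y) ~ x1 + x2, data = dat)

pred_log <- predict(fit, newdata = newdat)  # predictions on the log scale
exp(pred_log)                               # back-transformed to the original scale of y

# Coefficients as approximate proportional differences:
exp(coef(fit)["x1"])  # e.g. exp(0.10) is about 1.105, roughly a +10% difference per unit of x1
```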
--- class: center, middle # Example --- ## Example: multiple regression of diamonds data - A diamond's value is often determined using four factors known as the 4Cs: color, clarity, cut (certification) and carat weight. + Color: evaluation based on absence of color; how pure the diamond is. + Clarity: evaluation based on absence of blemishes. + Certification: how well the diamond is cut; how well a diamond's facets interact with light. + Carat weight: measures how much the diamond weighs. - We will use some data to draw inference about how these factors affect a diamond's price. - The data is in the file `diamonds.csv` on Sakai. - You can read more about the 4Cs [here](https://4cs.gia.edu/en-us/4cs-diamond-quality/). --- ## Multiple regression of diamonds data ```r diamonds <- read.csv("data/diamonds.csv", header= T, colClasses = c("numeric","factor","factor","factor","numeric")) dim(diamonds) ``` ``` ## [1] 308 5 ``` ```r head(diamonds) ``` ``` ## Carats Color Clarity Certification Price ## 1 0.30 D VS2 GIA 1302 ## 2 0.30 E VS1 GIA 1510 ## 3 0.30 G VVS1 GIA 1510 ## 4 0.30 G VS1 GIA 1260 ## 5 0.31 D VS1 GIA 1641 ## 6 0.31 E VS1 GIA 1555 ``` ```r summary(diamonds) ``` ``` ## Carats Color Clarity Certification Price ## Min. :0.1800 D:16 IF :44 GIA:151 Min. : 638 ## 1st Qu.:0.3500 E:44 VS1 :81 HRD: 79 1st Qu.: 1625 ## Median :0.6200 F:82 VS2 :53 IGI: 78 Median : 4215 ## Mean :0.6309 G:65 VVS1:52 Mean : 5019 ## 3rd Qu.:0.8500 H:61 VVS2:78 3rd Qu.: 7446 ## Max. :1.1000 I:40 Max. :16008 ``` --- ## Multiple regression of diamonds data Let's do exploratory data analysis. First, the distribution of `price`. ```r hist(diamonds$Price,xlab="Price",main="Distribution of price",col=rainbow(10)) ``` <img src="05-reg-model-transformation_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" /> -- <div class="question"> Is this distribution normal? </div> --- ## Multiple regression of diamonds data Next, explore the relationship between `price` and `carats` ```r plot(diamonds$Price~diamonds$Carats,xlab="Carats",ylab="Price",col='cyan4') abline(lm(Price~Carats,data=diamonds),col='red3',lty=10,lwd=2) ``` <img src="05-reg-model-transformation_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" /> -- <div class="question"> What do you think of this plot? </div> --- ## Multiple regression of diamonds data Since the distribution of `price` is not really normal, let's transform the variable. This is the natural log; you can try other bases on your own. ```r hist(log(diamonds$Price),xlab="Log Price",main="Distribution of log price",col=rainbow(10)) ``` <img src="05-reg-model-transformation_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" /> -- <div class="question"> Is this distribution normal? </div> --- ## Multiple regression of diamonds data Now the relationship between `log(price)` and `carats` ```r plot(log(diamonds$Price)~diamonds$Carats,xlab="Carats",ylab="Log Price",col='cyan4') abline(lm(log(Price)~Carats,data=diamonds),col='red3',lty=10,lwd=2) ``` <img src="05-reg-model-transformation_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" /> -- <div class="question"> Is this more "linear" than what we had before? </div> --- ## Multiple regression of diamonds data Next, `price` and `color`. ```r boxplot(Price~Color,data=diamonds,ylab="Price",xlab="Color",col=rainbow(15)) ``` <img src="05-reg-model-transformation_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" /> -- <div class="question"> What do you think of this plot? </div> --- ## Multiple regression of diamonds data How about, `log(price)` and `color` instead ```r boxplot(log(Price)~Color,data=diamonds,ylab="Log Price",xlab="Color",col=rainbow(15)) ``` <img src="05-reg-model-transformation_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" /> -- <div class="question"> How about this plot? </div> --- ## Multiple regression of diamonds data Next, `price` and `clarity`.
```r boxplot(Price~Clarity,data=diamonds,ylab="Price",xlab="Clarity",col=rainbow(15)) ``` <img src="05-reg-model-transformation_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" /> -- <div class="question"> What do you think of this plot? </div> --- ## Multiple regression of diamonds data How about, `log(price)` and `clarity` instead ```r boxplot(log(Price)~Clarity,data=diamonds,ylab="Log Price",xlab="Clarity",col=rainbow(15)) ``` <img src="05-reg-model-transformation_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" /> -- <div class="question"> How about this plot? </div> --- ## Multiple regression of diamonds data Next, `price` and `Certification`. ```r boxplot(Price~Certification,data=diamonds,ylab="Price",xlab="Certification",col=rainbow(15)) ``` <img src="05-reg-model-transformation_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" /> -- <div class="question"> What do you think of this plot? </div> --- ## Multiple regression of diamonds data How about, `log(price)` and `Certification` instead ```r boxplot(log(Price)~Certification,data=diamonds,ylab="Log Price",xlab="Certification", col=rainbow(15)) ``` <img src="05-reg-model-transformation_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" /> -- <div class="question"> How about this plot? </div> --- ## Multiple regression of diamonds data Let's make some plots to explore interactions. First, `log(price)` and `Carats` by `Color` ```r xyplot(log(Price)~Carats|Color,data=diamonds,group=Color,type=c("p","r"),col.line="red4") ``` <img src="05-reg-model-transformation_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" /> -- <div class="question"> Is there an interaction effect? </div> --- ## Multiple regression of diamonds data Next, `log(price)` and `Carats` by `Clarity` ```r xyplot(log(Price)~Carats|Clarity,data=diamonds,group=Clarity,type=c("p","r"),col.line="red4") ``` <img src="05-reg-model-transformation_files/figure-html/unnamed-chunk-14-1.png" style="display: block; margin: auto;" /> -- <div class="question"> Is there an interaction effect? </div> --- ## Multiple regression of diamonds data Next, `log(price)` and `Carats` by `Certification` ```r xyplot(log(Price)~Carats|Certification,data=diamonds,group=Certification,type=c("p","r"), col.line="red4") ``` <img src="05-reg-model-transformation_files/figure-html/unnamed-chunk-15-1.png" style="display: block; margin: auto;" /> -- <div class="question"> Is there an interaction effect? </div> --- ## Multiple regression of diamonds data We also can examine interactions among categorical variables ```r bwplot(log(Price)~Clarity|Color,data = diamonds) ``` <img src="05-reg-model-transformation_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" /> -- <div class="question"> Is there an interaction effect? </div> --- ## Multiple regression of diamonds data We also can examine interactions among categorical variables ```r bwplot(log(Price)~Certification|Color,data = diamonds) ``` <img src="05-reg-model-transformation_files/figure-html/unnamed-chunk-17-1.png" style="display: block; margin: auto;" /> -- <div class="question"> Is there an interaction effect? 
</div> --- ## Multiple regression of diamonds data We also can examine interactions among categorical variables ```r bwplot(log(Price)~Certification|Clarity,data = diamonds) ``` <img src="05-reg-model-transformation_files/figure-html/unnamed-chunk-18-1.png" style="display: block; margin: auto;" /> -- <div class="question"> Is there an interaction effect? </div> --- ## Multiple regression of diamonds data - We see some evidence of non-normality, non-linearity (with `Carats`) and non-constant variance overall. - Taking the log transformation helped with those. - We might need a quadratic term for `Carats`. Maybe? - We might also consider interactions between `Clarity` and `Color`. - Let's mean center the numerical predictors (just `Carat`) to avoid multicollinearity. ```r diamonds$CaratsCent <- diamonds$Carats - mean(diamonds$Carats) diamonds$CaratsCent2 <- diamonds$CaratsCent^2 ``` - Based on our EDA, our candidate model is .block[ .small[ `$$\boldsymbol{y} = \boldsymbol{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}; \ \ \boldsymbol{\epsilon} \sim N(0, \sigma^2 \boldsymbol{I}).$$` ] ] where `\(\boldsymbol{X}\)` contains Carats, `\(\textrm{Carats}^2\)`, Color, Clarity, Certification, and Color:Clarity. --- ## Diamonds data: basic MLR First, a MLR model with only main effects .large[ ```r Model1 <- lm(Price~CaratsCent+Color+Clarity+Certification,data=diamonds);summary(Model1) ``` ``` ## ## Call: ## lm(formula = Price ~ CaratsCent + Color + Clarity + Certification, ## data = diamonds) ## ## Residuals: ## Min 1Q Median 3Q Max ## -1740.0 -428.8 -128.3 314.3 3634.1 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 8223.61 232.20 35.417 < 2e-16 *** ## CaratsCent 12766.40 190.02 67.183 < 2e-16 *** ## ColorE -1439.09 207.98 -6.919 2.83e-11 *** ## ColorF -1841.69 195.23 -9.433 < 2e-16 *** ## ColorG -2176.67 200.39 -10.862 < 2e-16 *** ## ColorH -2747.15 202.91 -13.538 < 2e-16 *** ## ColorI -3313.10 212.71 -15.575 < 2e-16 *** ## ClarityVS1 -1474.57 159.67 -9.235 < 2e-16 *** ## ClarityVS2 -1792.01 171.19 -10.468 < 2e-16 *** ## ClarityVVS1 -689.29 159.93 -4.310 2.23e-05 *** ## ClarityVVS2 -1191.16 148.76 -8.007 2.73e-14 *** ## CertificationHRD 15.23 107.25 0.142 0.887 ## CertificationIGI 141.26 128.26 1.101 0.272 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 710.4 on 295 degrees of freedom ## Multiple R-squared: 0.9581, Adjusted R-squared: 0.9564 ## F-statistic: 562.5 on 12 and 295 DF, p-value: < 2.2e-16 ``` ] --- ## Multiple regression of diamonds data By the way, we can change the baseline levels for the categorical variables. .large[ ```r Model1 <- lm(Price~CaratsCent+relevel(Color,ref="E")+Clarity+Certification,data=diamonds);summary(Model1) ``` ``` ## ## Call: ## lm(formula = Price ~ CaratsCent + relevel(Color, ref = "E") + ## Clarity + Certification, data = diamonds) ## ## Residuals: ## Min 1Q Median 3Q Max ## -1740.0 -428.8 -128.3 314.3 3634.1 ## ## Coefficients: ## Estimate Std. 
Error t value Pr(>|t|) ## (Intercept) 6784.53 178.80 37.944 < 2e-16 *** ## CaratsCent 12766.40 190.02 67.183 < 2e-16 *** ## relevel(Color, ref = "E")D 1439.09 207.98 6.919 2.83e-11 *** ## relevel(Color, ref = "E")F -402.61 133.69 -3.011 0.00283 ** ## relevel(Color, ref = "E")G -737.59 140.73 -5.241 3.04e-07 *** ## relevel(Color, ref = "E")H -1308.06 142.92 -9.152 < 2e-16 *** ## relevel(Color, ref = "E")I -1874.02 158.44 -11.828 < 2e-16 *** ## ClarityVS1 -1474.57 159.67 -9.235 < 2e-16 *** ## ClarityVS2 -1792.01 171.19 -10.468 < 2e-16 *** ## ClarityVVS1 -689.29 159.93 -4.310 2.23e-05 *** ## ClarityVVS2 -1191.16 148.76 -8.007 2.73e-14 *** ## CertificationHRD 15.23 107.25 0.142 0.88719 ## CertificationIGI 141.26 128.26 1.101 0.27163 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 710.4 on 295 degrees of freedom ## Multiple R-squared: 0.9581, Adjusted R-squared: 0.9564 ## F-statistic: 562.5 on 12 and 295 DF, p-value: < 2.2e-16 ``` ] --- ## Diamonds data: model assessment Now some model assessment ```r plot(Model1$residual~diamonds$CaratsCent,ylab="Residual",col='navy') abline(0,0,col="red4") ``` <img src="05-reg-model-transformation_files/figure-html/unnamed-chunk-22-1.png" style="display: block; margin: auto;" /> -- .block[ Clearly, some problems! ] --- ## Diamonds data: model assessment ```r boxplot(Model1$residual~diamonds$Color,ylab="Residual",col='cyan') abline(0,0,col="red4") ``` <img src="05-reg-model-transformation_files/figure-html/unnamed-chunk-23-1.png" style="display: block; margin: auto;" /> -- <div class="question"> What do you think? </div> --- ## Diamonds data: model assessment ```r boxplot(Model1$residual~diamonds$Clarity,ylab="Residual",col='cyan') abline(0,0,col="red4") ``` <img src="05-reg-model-transformation_files/figure-html/unnamed-chunk-24-1.png" style="display: block; margin: auto;" /> -- <div class="question"> What do you think? </div> --- ## Diamonds data: model assessment ```r boxplot(Model1$residual~diamonds$Certification,ylab="Residual",col='cyan') abline(0,0,col="red4") ``` <img src="05-reg-model-transformation_files/figure-html/unnamed-chunk-25-1.png" style="display: block; margin: auto;" /> -- <div class="question"> What do you think? </div> --- ## Diamonds data: better MLR Let's fit our EDA suggested model instead but without the interactions. .small[ ```r Model2 <- lm(log(Price)~CaratsCent+CaratsCent2+Clarity+Color+Certification,data=diamonds);summary(Model2) ``` ``` ## ## Call: ## lm(formula = log(Price) ~ CaratsCent + CaratsCent2 + Clarity + ## Color + Certification, data = diamonds) ## ## Residuals: ## Min 1Q Median 3Q Max ## -0.15411 -0.04120 -0.00911 0.04543 0.14158 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 8.815934 0.020227 435.844 < 2e-16 *** ## CaratsCent 3.017111 0.016455 183.350 < 2e-16 *** ## CaratsCent2 -2.102922 0.058022 -36.243 < 2e-16 *** ## ClarityVS1 -0.244470 0.013358 -18.301 < 2e-16 *** ## ClarityVS2 -0.320183 0.014279 -22.424 < 2e-16 *** ## ClarityVVS1 -0.094008 0.013574 -6.926 2.74e-11 *** ## ClarityVVS2 -0.176701 0.012592 -14.032 < 2e-16 *** ## ColorE -0.079247 0.017387 -4.558 7.59e-06 *** ## ColorF -0.155991 0.016328 -9.554 < 2e-16 *** ## ColorG -0.245033 0.016734 -14.643 < 2e-16 *** ## ColorH -0.339098 0.016969 -19.983 < 2e-16 *** ## ColorI -0.442606 0.017742 -24.947 < 2e-16 *** ## CertificationHRD -0.006223 0.008938 -0.696 0.4868 ## CertificationIGI -0.025413 0.011536 -2.203 0.0284 * ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1 ## ## Residual standard error: 0.0592 on 294 degrees of freedom ## Multiple R-squared: 0.9949, Adjusted R-squared: 0.9947 ## F-statistic: 4445 on 13 and 294 DF, p-value: < 2.2e-16 ``` ] --- ## Diamonds data: model assessment Now some model assessment on the new model. ```r plot(Model2$residual~diamonds$CaratsCent,ylab="Residual",col='navy') abline(0,0,col="red4") ``` <img src="05-reg-model-transformation_files/figure-html/unnamed-chunk-27-1.png" style="display: block; margin: auto;" /> -- <div class="question"> What do you think? </div> --- ## Diamonds data: model assessment ```r boxplot(Model2$residual~diamonds$Color,ylab="Residual",col='cyan') abline(0,0,col="red4") ``` <img src="05-reg-model-transformation_files/figure-html/unnamed-chunk-28-1.png" style="display: block; margin: auto;" /> -- <div class="question"> What do you think? </div> --- ## Diamonds data: model assessment ```r boxplot(Model2$residual~diamonds$Clarity,ylab="Residual",col='cyan') abline(0,0,col="red4") ``` <img src="05-reg-model-transformation_files/figure-html/unnamed-chunk-29-1.png" style="display: block; margin: auto;" /> -- <div class="question"> What do you think? </div> --- ## Diamonds data: model assessment ```r boxplot(Model2$residual~diamonds$Certification,ylab="Residual",col='cyan') abline(0,0,col="red4") ``` <img src="05-reg-model-transformation_files/figure-html/unnamed-chunk-30-1.png" style="display: block; margin: auto;" /> -- <div class="question"> What do you think? </div> --- ## Checking independence and equal variance ```r plot(Model2,which=1) ``` <img src="05-reg-model-transformation_files/figure-html/unnamed-chunk-31-1.png" style="display: block; margin: auto;" /> -- <div class="question"> Do you see any clear violations of the independence and equal variance assumptions? </div> --- ## Checking independence and equal variance ```r plot(Model2,which=3) ``` <img src="05-reg-model-transformation_files/figure-html/unnamed-chunk-32-1.png" style="display: block; margin: auto;" /> -- <div class="question"> Do you see any clear violations of the independence and equal variance assumptions? </div> --- ## Checking normality ```r plot(Model2,which=2) ``` <img src="05-reg-model-transformation_files/figure-html/unnamed-chunk-33-1.png" style="display: block; margin: auto;" /> -- <div class="question"> Do you see any clear violations of the normality assumption? </div> --- ## Outliers and influential points ```r plot(Model2,which=5) ``` <img src="05-reg-model-transformation_files/figure-html/unnamed-chunk-34-1.png" style="display: block; margin: auto;" /> -- <div class="question"> Are there any potential outliers or influential points? </div> --- ## Diamonds data: nested F test - Let's do a nested `\(F\)`-test to see if all the interaction terms between `Clarity` and `Color` are important. -- - I won't print the results because there are too many interactions but you should run `summary(Model2_inter)` yourself to see. ```r Model2_inter <- lm(log(Price)~CaratsCent+CaratsCent2+Clarity*Color+Certification, data=diamonds); anova(Model2,Model2_inter) ``` ``` ## Analysis of Variance Table ## ## Model 1: log(Price) ~ CaratsCent + CaratsCent2 + Clarity + Color + Certification ## Model 2: log(Price) ~ CaratsCent + CaratsCent2 + Clarity * Color + Certification ## Res.Df RSS Df Sum of Sq F Pr(>F) ## 1 294 1.03042 ## 2 274 0.83108 20 0.19934 3.286 5.361e-06 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ``` -- <div class="question"> What do we conclude? 
</div> -- - This confirms our EDA. --- ## Diamonds data: nested F test How about a nested `\(F\)`-test for interaction terms between `Clarity` and `Certification`? ```r Model2_inter3 <- lm(log(Price)~CaratsCent+CaratsCent2+Color+Clarity*Certification, data=diamonds); anova(Model2,Model2_inter3) ``` ``` ## Analysis of Variance Table ## ## Model 1: log(Price) ~ CaratsCent + CaratsCent2 + Clarity + Color + Certification ## Model 2: log(Price) ~ CaratsCent + CaratsCent2 + Color + Clarity * Certification ## Res.Df RSS Df Sum of Sq F Pr(>F) ## 1 294 1.0304 ## 2 286 1.0023 8 0.028117 1.0029 0.4339 ``` -- <div class="question"> What do we conclude? </div> -- This also confirms our EDA. --- ## Diamonds data: multicollinearity Should we be worried about multicollinearity for the model with main effects? Use the .hlight[vif] function in the .hlight[rms] package. ```r Model2 <- lm(log(Price)~CaratsCent+CaratsCent2+Clarity+Color+Certification,data=diamonds) vif(Model2) ``` ``` ## CaratsCent CaratsCent2 ClarityVS1 ClarityVS2 ## 1.822328 1.317592 3.039532 2.552504 ## ClarityVVS1 ClarityVVS2 ColorE ColorF ## 2.272172 2.635256 3.252945 4.576620 ## ColorG ColorH ColorI CertificationHRD ## 4.097442 4.019112 3.125784 1.338819 ## CertificationIGI ## 2.211713 ``` -- <div class="question"> Should we be worried? </div>