class: center, middle, inverse, title-slide

# Introduction to multiple linear regression

### Dr. Olanrewaju Michael Akande

### Aug 29, 2019

---

## Announcements

- Review the key to the conceptual questions.

- Take a look at the assigned (optional) readings for multiple linear regression.

- Homework 1 is now online. The due date is 10:05am next Thursday, September 5.

- Submit on Gradescope on or before 10:05am!

---

## Outline

- Motivating example

- Multiple linear regression

- Interpretation of coefficients

- Hypothesis tests for coefficients

- Confidence intervals for coefficients

- Predictions

---

## Introduction

- In the stats module of the MIDS summer online review, you were introduced to simple linear regression.

--

- From the review materials, you should know how to fit a simple linear regression model and assess whether or not the model assumptions are violated.

--

- We will use those ideas as building blocks for the models we will explore throughout this class!

--

- Today, we begin our series of lectures on multiple linear regression.

--

- For now, we will restrict our attention to model interpretation, testing, and basic predictions under multiple linear regression.

--

- In the next class, we will start to explore model specification, assessment, and validation.

---

class: center, middle

# Motivating Example

---

## Sex discrimination

- In the 1970s, Harris Trust and Savings Bank was sued for discrimination on the basis of sex.

--

- An analysis of the salaries of employees of one type (skilled, entry-level clerical) was presented as evidence by the defense.

--

- The data is in the file `wagediscrim.txt` on Sakai.

--

- We are interested in answering the question: **did female employees tend to receive lower base/starting salaries than similarly qualified and experienced male employees?**

--

<div class="question">
Which statistical tests can we use to probe the question above?
</div>

---

## Data

93 employees in the data file (61 female, 32 male).

Variable | Description
:------------- | :------------
bsal | annual salary at time of hire
sal77 | annual salary in 1977
educ | years of education
exper | months of work experience prior to hire at the bank
fsex | 1 if female, 0 if male
senior | months worked at the bank since hire
age | age in months

---

## Data

Let's get a sense of the data: how many rows, how many columns?

```r
wages <- read.csv("data/wagediscrim.txt", header = T)
dim(wages)
```

```
## [1] 93  8
```

Let's take a look at the first few rows of the data.

```r
head(wages)
```

```
##   bsal sal77  sex senior age educ exper fsex
## 1 5040 12420 Male     96 329   15  14.0    0
## 2 6300 12060 Male     82 357   15  72.0    0
## 3 6000 15120 Male     67 315   15  35.5    0
## 4 6000 16320 Male     97 354   12  24.0    0
## 5 6000 12300 Male     66 351   12  56.0    0
## 6 6840 10380 Male     92 374   15  41.5    0
```

---

## Exploratory data analysis (EDA)

How about quick summaries of each variable, so we know the range of data we are working with?

```r
wages$sex <- factor(wages$sex, levels = c("Male", "Female"))
wages$fsex <- factor(wages$fsex)
summary(wages)
```

```
##       bsal          sal77           sex         senior     
##  Min.   :3900   Min.   : 7860   Male  :32   Min.   :65.00  
##  1st Qu.:4980   1st Qu.: 9000   Female:61   1st Qu.:74.00  
##  Median :5400   Median :10020               Median :84.00  
##  Mean   :5420   Mean   :10393               Mean   :82.28  
##  3rd Qu.:6000   3rd Qu.:11220               3rd Qu.:90.00  
##  Max.   :8100   Max.   :16320               Max.   :98.00  
##       age            educ           exper       fsex  
##  Min.   :280.0   Min.   : 8.00   Min.   :  0.0   0:32  
##  1st Qu.:349.0   1st Qu.:12.00   1st Qu.: 35.5   1:61  
##  Median :468.0   Median :12.00   Median : 70.0         
##  Mean   :474.4   Mean   :12.51   Mean   :100.9         
##  3rd Qu.:590.0   3rd Qu.:15.00   3rd Qu.:144.0         
##  Max.   :774.0   Max.   :16.00   Max.   :381.0         
```
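We can also compute quick numerical summaries by group. Here is a minimal sketch using base R's .hlight[tapply] (not in the original slides); the group means should match the "sample estimates" in the t-test output coming up shortly.

```r
# Mean and SD of base salary by sex; the means should match the
# sample estimates reported in the t-test output later on.
tapply(wages$bsal, wages$sex, mean)
tapply(wages$bsal, wages$sex, sd)
```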
---

## EDA

Since we only care about comparing starting salaries for male and female employees, let's look at a boxplot of .hlight[bsal] by .hlight[sex].

```r
boxplot(bsal ~ sex, data = wages, ylab = "Sex", horizontal = TRUE,
        xlab = "Base Salary")
```

<img src="02-intro-reg_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />

--

<div class="question">
What do you think? What can you infer from this plot?
</div>

---

## t-test?

We could go further and try a t-test for the hypotheses

.small[
`$$H_0: \mu_{\textrm{male}} - \mu_{\textrm{female}} \leq 0 \ \ \textrm{vs.} \ \ H_A: \mu_{\textrm{male}} - \mu_{\textrm{female}} > 0$$`
]

```r
t.test(bsal ~ sex, data = wages, alternative = "greater")
```

```
## 
##  Welch Two Sample t-test
## 
## data:  bsal by sex
## t = 5.83, df = 51.329, p-value = 1.855e-07
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  582.9857      Inf
## sample estimates:
##   mean in group Male mean in group Female 
##             5956.875             5138.852
```

--

<div class="question">
Is a t-test sufficient here? Any concerns?
</div>

---

## Simple linear regression (SLR)?

How about fitting a simple linear regression to the two variables?

.small[
$$\textrm{bsal}_i = \beta_0 + \beta_1 \textrm{sex}_i + \epsilon_i; \ \ \epsilon_i \overset{iid}{\sim} N(0, \sigma^2).$$
]

```r
model1 <- lm(bsal ~ sex, data = wages); summary(model1)
```

```
## 
## Call:
## lm(formula = bsal ~ sex, data = wages)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1336.88  -338.85    43.12   261.15  2143.12 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5956.9      105.3  56.580  < 2e-16 ***
## sexFemale     -818.0      130.0  -6.293 1.08e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 595.6 on 91 degrees of freedom
## Multiple R-squared:  0.3032,	Adjusted R-squared:  0.2955 
## F-statistic:  39.6 on 1 and 91 DF,  p-value: 1.076e-08
```

--

<div class="question">
What can we infer from these results?
</div>
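By the way, the slope test in this SLR is equivalent to a pooled (equal-variance) two-sample t-test, which is why its t-value (-6.293 on 91 df) differs slightly from the Welch t-test we ran earlier. A quick sketch to verify (not in the original slides):

```r
# Pooled two-sample t-test: assumes equal variances in the two groups;
# its t-value matches the SLR slope test (up to sign) on the same 91 df.
t.test(bsal ~ sex, data = wages, var.equal = TRUE)
```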
---

## EDA

- .block[
The t-test shows men started at higher salaries than women `\((t=5.83, p < .0001)\)`; we reach the same conclusion from the regression.
]

- But one could argue this is so because both methods **do not** control for other characteristics. Indeed, we have ignored the other variables.

- There are other variables that are correlated with .hlight[bsal]. Here's the correlation matrix of all numerical variables, computed using the .hlight[cor] function in R.
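The code behind the table below is hidden in the original slides; a minimal sketch of the computation (the table styling via .hlight[kable] is an assumption):

```r
# Pairwise correlations of the numerical variables, rounded to 2 dp;
# pass the result to knitr::kable() to get a styled table like the one below.
round(cor(wages[, c("bsal", "sal77", "senior", "age", "educ", "exper")]), 2)
```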
&nbsp; | bsal | sal77 | senior | age | educ | exper
:----- | ----: | ----: | -----: | ----: | ----: | ----:
bsal | 1.00 | 0.42 | -0.29 | 0.03 | 0.41 | 0.17
sal77 | 0.42 | 1.00 | 0.13 | -0.55 | 0.42 | -0.37
senior | -0.29 | 0.13 | 1.00 | -0.18 | 0.06 | -0.07
age | 0.03 | -0.55 | -0.18 | 1.00 | -0.23 | 0.80
educ | 0.41 | 0.42 | 0.06 | -0.23 | 1.00 | -0.10
exper | 0.17 | -0.37 | -0.07 | 0.80 | -0.10 | 1.00

---

## EDA

- So, let's take a look at scatter plots of all the variables.

- First, recall the description of all the variables.

Variable | Description
:------------- | :------------
bsal | annual salary at time of hire
sal77 | annual salary in 1977
educ | years of education
exper | months of work experience prior to hire at the bank
fsex | 1 if female, 0 if male
senior | months worked at the bank since hire
age | age in months

---

## EDA

```r
pairs(wages[, !is.element(colnames(wages), c("sal77", "sex", "fsex"))])
```

<img src="02-intro-reg_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" />

---

## EDA

The plot looks a bit busy! Let's take a closer look, one variable at a time. First, .hlight[bsal] vs. .hlight[senior].

```r
library(lattice) # load the lattice package for xyplot
xyplot(bsal ~ senior, groups = wages$sex, data = wages,
       xlab = "Seniority", ylab = "Base Salary",
       panel = function(x, y, ...) {
         panel.xyplot(x, y, ...)
         panel.abline(lm(y ~ x))
       },
       auto.key = list(x = 0.75, y = 0.95, text = c("Male", "Female"),
                       points = F, lines = T))
```

<img src="02-intro-reg_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" />

---

## EDA

Next, .hlight[bsal] vs. .hlight[age].

```r
xyplot(bsal ~ age, groups = wages$sex, data = wages,
       xlab = "Age", ylab = "Base Salary",
       panel = function(x, y, ...) {
         panel.xyplot(x, y, ...)
         panel.abline(lm(y ~ x))
       },
       auto.key = list(x = 0.75, y = 0.95, text = c("Male", "Female"),
                       points = F, lines = T))
```

<img src="02-intro-reg_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" />

---

## EDA

.hlight[bsal] vs. .hlight[educ].

```r
xyplot(bsal ~ educ, groups = wages$sex, data = wages,
       xlab = "Education", ylab = "Base Salary",
       panel = function(x, y, ...) {
         panel.xyplot(x, y, ...)
         panel.abline(lm(y ~ x))
       },
       auto.key = list(x = 0.05, y = 0.95, text = c("Male", "Female"),
                       points = F, lines = T))
```

<img src="02-intro-reg_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" />

---

## EDA

Finally, .hlight[bsal] vs. .hlight[exper].

```r
xyplot(bsal ~ exper, groups = wages$sex, data = wages,
       xlab = "Experience", ylab = "Base Salary",
       panel = function(x, y, ...) {
         panel.xyplot(x, y, ...)
         panel.abline(lm(y ~ x))
       },
       auto.key = list(x = 0.75, y = 0.95, text = c("Male", "Female"),
                       points = F, lines = T))
```

<img src="02-intro-reg_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" />

---

## EDA

- Clearly, there are other variables that may be relevant in explaining baseline salary.

--

- We need statistical methods beyond the t-test and simple linear regression.

--

- We need methods that can explore the relationship between baseline salary and sex while also controlling for the other variables that clearly may be relevant.

--

- This brings us to multiple linear regression (which I'll henceforth refer to as MLR).

--

- .block[
Something to keep in mind: the overall conclusion may not change after using a better model for this data. In general, this should never stop you from exploring and reporting the results from better models; you should always be rigorous when doing analyses and be honest when reporting the results!
]

---

class: center, middle

# Multiple Linear Regression

---

## MLR

- Mathematically, MLR assumes the following distribution for a response variable `\(y_i\)` given multiple covariates/predictors `\(\boldsymbol{x}_i = (x_{i1}, x_{i2}, \ldots, x_{ip})\)`:

.block[
.small[
`$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_p x_{ip} + \epsilon_i; \ \ \epsilon_i \overset{iid}{\sim} N(0, \sigma^2).$$`
]
]

--

- We can also write the model as:

.block[
.small[
`$$y_i \overset{ind}{\sim} N(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_p x_{ip}, \sigma^2);$$`

`$$f(y_i | \boldsymbol{x}_i) = N(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_p x_{ip}, \sigma^2).$$`
]
]

--

- MLR assumes that the conditional average or expected value of the response variable is a linear function of the potential predictors; the linearity is in terms of the "unknown" parameters (intercept and slopes).

--

- Just like SLR, MLR also assumes that values of the response variable follow a normal curve within any combination of predictor values.

---

## MLR

- Just as we had under SLR, here each `\(\beta_j\)` represents the true "unknown" value of the parameter, while `\(\hat{\beta}_j\)` represents the estimate of `\(\beta_j\)`.

--

- Similarly, `\(y_i\)` represents the true value of the response variable, while `\(\hat{y}_i\)` represents the predicted value. That is,

.block[
.small[
`$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \hat{\beta}_2 x_{i2} + \ldots + \hat{\beta}_p x_{ip}.$$`
]
]

--

- Also, the residuals `\(e_i\)` are our estimates of the true "unobserved" errors `\(\epsilon_i\)`.
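To make the least-squares idea concrete, here is a small numerical sketch (not from the original slides, and assuming the `wages` data from earlier): we minimize the sum of squared errors above with .hlight[optim], supplying the analytic gradient, and compare against what .hlight[lm] gives for the wage model fit later in the deck.

```r
# Sum of squared errors from the formula above, and its gradient in beta
sse  <- function(beta, X, y) sum((y - X %*% beta)^2)
grad <- function(beta, X, y) drop(-2 * t(X) %*% (y - X %*% beta))

X <- model.matrix(~ sex + senior + age + educ + exper, data = wages)
y <- wages$bsal
fit <- optim(par = rep(0, ncol(X)), fn = sse, gr = grad, X = X, y = y,
             method = "BFGS", control = list(maxit = 500, reltol = 1e-12))
round(fit$par, 4)  # should essentially match coef(lm(bsal ~ sex + senior + age + educ + exper, data = wages))
```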
Thus,

.block[
.small[
`$$e_i = y_i - \left[\hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \hat{\beta}_2 x_{i2} + \ldots + \hat{\beta}_p x_{ip}\right] = y_i - \hat{y}_i.$$`
]
]

--

- Since the `\(e_i\)`'s estimate the `\(\epsilon_i\)`'s, we expect them to also be independent, centered at zero, and have constant variance. We will get into this more under model assessment.

---

## MLR: math

- The estimated coefficients are found by taking partial derivatives (with respect to each `\(\beta_j\)`) of

.small[
`$$\sum^n_{i=1} \left(y_i - \left[\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_p x_{ip} \right] \right)^2,$$`
]

setting them to zero, and solving (see the numerical sketch at the end of this slide).

--

- The resulting formulas are too messy to write down term by term, although there is a very nice matrix algebra representation, which we will see shortly.

--

- The estimated regression variance is

.block[
.small[
`$$\hat{\sigma}^2 = s_e^2 = \sum^n_{i=1} \dfrac{\left(y_i - \hat{y}_i \right)^2}{n-(p+1)} = \sum^n_{i=1} \dfrac{e_i^2}{n-(p+1)}.$$`
]
]

--

- Most software packages will estimate `\(s_e^2\)` automatically.

--

- R-squared has the same interpretation under both SLR and MLR.
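A quick numerical sketch (again assuming the `wages` data): build `\(\boldsymbol{H}\)` for the wage model and check two of its defining properties.

```r
# Hat matrix for the wage model: H %*% y returns the fitted values
# (same as fitted() on the lm fit later in the deck), and H is
# idempotent (H %*% H = H) since it is a projection.
X <- model.matrix(~ sex + senior + age + educ + exper, data = wages)
H <- X %*% solve(t(X) %*% X) %*% t(X)
all.equal(H %*% H, H)    # TRUE, up to numerical error
head(H %*% wages$bsal)   # first few fitted values
```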
--

- We will see some important features of the elements of `\(\boldsymbol{H}\)` in the next class.

---

## MLR: matrix representation

- In matrix form,

.block[
.small[
`$$s_e^2 = \sum^n_{i=1} \dfrac{\left(y_i - \hat{y}_i \right)^2}{n-(p+1)} = \dfrac{(\boldsymbol{y}-\boldsymbol{X}\hat{\boldsymbol{\beta}})^T(\boldsymbol{y}-\boldsymbol{X}\hat{\boldsymbol{\beta}})}{n-(p+1)} = \dfrac{\boldsymbol{e}^T\boldsymbol{e}}{n-(p+1)}.$$`
]
]

--

- The variance of the OLS estimates of all `\((p+1)\)` coefficients (intercept plus `\(p\)` slopes) is

.block[
.small[
`$$\mathbb{V}ar\left[ \hat{\boldsymbol{\beta}} \right] = \sigma^2 \left(\boldsymbol{X}^T \boldsymbol{X}\right)^{-1}.$$`
]
]

--

- Notice that this is a covariance matrix; the square roots of the diagonal elements give us the standard errors for each `\(\hat{\beta}_j\)`, which we can use for hypothesis testing and interval estimation.

--

<div class="question">
What are the off-diagonal elements?
</div>

--

- When estimating `\(\mathbb{V}ar[\hat{\boldsymbol{\beta}}]\)`, plug in `\(s_e^2\)` as an estimate of `\(\sigma^2\)` (see the sketch below).
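In R, this estimated covariance matrix is available directly from a fitted model via .hlight[vcov]; a short sketch, using the .hlight[regwage] fit from the next few slides:

```r
# Estimated covariance matrix s_e^2 (X^T X)^{-1} of the coefficient
# estimates; its off-diagonal entries are the covariances between
# pairs of estimates.
V <- vcov(regwage)
sqrt(diag(V))   # standard errors, matching the Std. Error column of summary(regwage)
```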
---

## Back to our example

Let's fit the following MLR model to our example using R:

.small[
`$$\textrm{bsal}_i = \beta_0 + \beta_1 \textrm{sex}_i + \beta_2 \textrm{senior}_i + \beta_3 \textrm{age}_i + \beta_4 \textrm{educ}_i + \beta_5 \textrm{exper}_i + \epsilon_i.$$`
]

--

We can compute `\(\hat{\boldsymbol{\beta}}\)` in R directly as follows:

```r
X <- model.matrix(~ sex + senior + age + educ + exper, data = wages)
y <- as.matrix(wages$bsal)
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y; beta_hat
```

```
##                     [,1]
## (Intercept) 6277.8933861
## sexFemale   -767.9126888
## senior       -22.5823029
## age            0.6309603
## educ          92.3060229
## exper          0.5006397
```

```r
sigmasquared_hat <- t(y - X %*% beta_hat) %*% (y - X %*% beta_hat) / (nrow(X) - ncol(X))
SE_beta_hat <- sqrt(diag(c(sigmasquared_hat) * solve(t(X) %*% X))); SE_beta_hat
```

```
## (Intercept)   sexFemale      senior         age        educ       exper 
## 652.2713190 128.9700022   5.2957316   0.7206541  24.8635404   1.0552624
```

---

## Back to our example

Let's fit the same MLR model using the .hlight[lm] command in R.

```r
regwage <- lm(bsal ~ sex + senior + age + educ + exper, data = wages)
summary(regwage)
```

```
## 
## Call:
## lm(formula = bsal ~ sex + senior + age + educ + exper, data = wages)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1217.36  -342.83   -55.61   297.10  1575.53 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6277.8934   652.2713   9.625 2.36e-15 ***
## sexFemale   -767.9127   128.9700  -5.954 5.39e-08 ***
## senior       -22.5823     5.2957  -4.264 5.08e-05 ***
## age            0.6310     0.7207   0.876 0.383692    
## educ          92.3060    24.8635   3.713 0.000361 ***
## exper          0.5006     1.0553   0.474 0.636388    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 508.1 on 87 degrees of freedom
## Multiple R-squared:  0.5152,	Adjusted R-squared:  0.4873 
## F-statistic: 18.49 on 5 and 87 DF,  p-value: 1.811e-12
```

---

## Interpretation of coefficients

- Each estimated coefficient is the amount by which `\(y\)` is expected to increase when its corresponding predictor increases by one unit, _holding the values of the other predictors constant_.

--

- For example, the estimated coefficient of .hlight[educ] is approximately 92.

.block[
_Interpretation_: for each additional year of education an employee has, we expect starting salary to increase by about $92, holding all other variables constant.
]

--

- The interpretation is a bit different when dealing with a binary variable (more generally, categorical/factor variables).

--

- For example, the estimated coefficient of .hlight[sex] is approximately -768.

.block[
_Interpretation_: for employees who started at the same time, had the same education and experience, and were the same age, women earned $768 less on average than men.
]

---

## Which variable is the strongest predictor of the outcome?

- The predictor with the strongest linear association with the outcome variable is the one whose coefficient has the largest absolute value of `\(T\)` (referred to as the `\(t\)` value in the R output), the test statistic, which equals the estimated coefficient divided by its corresponding SE.

--

- `\(T\)` is NOT the size of the coefficient. The size of a coefficient is sensitive to the scales of the predictors, but `\(T\)` is not, since it is a standardized measure.

--

- Example: in our regression, seniority is a better predictor than education because it has a larger `\(T\)` in absolute value.

---

## Hypothesis tests for coefficients

- The reported t-values and p-values are used to test whether a particular coefficient equals 0, GIVEN that all other variables are in the model.

--

- Examples:

  - The test for whether the coefficient of education equals zero has p-value `\(= .0004\)`. Hence, reject the null hypothesis; it appears that education is a useful predictor of baseline salary when all the other predictors are in the model.

--

  - The test for whether the coefficient of experience equals zero has p-value `\(= .6364\)`. Hence, we cannot reject the null hypothesis; it appears that experience is not a particularly useful predictor of baseline salary when all other predictors are in the model.

---

## Hypothesis tests for coefficients

- Fortunately, R (and pretty much all statistical software) computes both the t-values and p-values for us automatically.

--

<div class="question">
How do we calculate the t-values and p-values manually?
</div>

--

- The t-values (test statistics) have the usual form:

.block[
.small[
`$$T = \dfrac{\textrm{Observed} - \textrm{Expected}}{SE} = \dfrac{\textrm{Point Estimate} - \textrm{Null Value}}{SE} = \dfrac{\hat{\beta}_j - 0}{\sqrt{\left[s^2_e \left(\boldsymbol{X}^T \boldsymbol{X}\right)^{-1}\right]_{jj}}}.$$`
]
]

--

- For the p-value, use the area under a t-distribution with `\(n-(p+1)\)` degrees of freedom, where `\(p\)` is the number of predictors (excluding the intercept) in the model.

--

- In this problem, the degrees of freedom equal `\(93 - 6 = 87\)`.

--

.block[
You should know how to compute the p-values directly using the .hlight[pt] function in R (from the summer review materials); see the sketch two slides ahead.
]

---

## CIs for regression coefficients

- A 95% CI for each coefficient is obtained in the usual way. Recall the general form for two-sided CIs from the online review material:

.block[
.small[
`$$CI = pe \pm SE \times C_{\alpha},$$`
]
]

where `\(pe\)` is the point estimate, and `\(C_{\alpha}\)` is a multiplier (critical value) that depends on the confidence level.

--

- For MLR, we have

.block[
.small[
`$$CI = \hat{\beta}_j \pm SE \times C_{\alpha} = \hat{\beta}_j \pm C_{\alpha} \times \sqrt{\left[s^2_e \left(\boldsymbol{X}^T \boldsymbol{X}\right)^{-1}\right]_{jj}} \ ,$$`
]
]

and the multiplier is obtained from the t-distribution with `\(n-(p+1)\)` degrees of freedom.

--

- Example: a 95% "two-sided" CI for the population regression coefficient of age equals `\((0.63 - 1.988 \times 0.72, \ 0.63 + 1.988 \times 0.72) = (-0.80, 2.06)\)`.

--

.block[
Find the multiplier (1.988) in R by using the command .hlight[qt(0.975,df=87)].
]
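Putting the last two slides together, here is a minimal sketch of the "manual" computations for the coefficient of age, plugging in the estimate and SE from the regression output above:

```r
# Manual t-value, two-sided p-value, and 95% CI for the age coefficient
b_age  <- 0.6310   # point estimate from summary(regwage)
se_age <- 0.7207   # standard error from summary(regwage)
t_age  <- (b_age - 0) / se_age                   # about 0.876
2 * pt(abs(t_age), df = 87, lower.tail = FALSE)  # about 0.38, as in the summary
b_age + c(-1, 1) * qt(0.975, df = 87) * se_age   # about (-0.80, 2.06)
```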
---

## CIs for regression coefficients

- We can compute "two-sided" confidence intervals very easily in R.

```r
confint(regwage, level = 0.95)
```

```
##                     2.5 %      97.5 %
## (Intercept)  4981.4335106 7574.353262
## sexFemale   -1024.2545333 -511.570844
## senior        -33.1081429  -12.056463
## age            -0.8014178    2.063338
## educ           42.8870441  141.725002
## exper          -1.5968086    2.598088
```

--

- For employees with the same age, seniority, education, and experience, we expect the average starting salary for female employees to be between about 512 and 1024 dollars less than the average starting salary for male employees.

--

- More succinctly, for employees with the same age, seniority, education, and experience, we expect female employees' average starting salary to be around $768 less than male employees' average salary, with 95% CI in dollars = (-1024, -512).

---

## Notes about tests and CIs

- When the sample size is large enough, you will probably reject the null hypothesis `\(H_0: \beta_j = 0\)`.

  - This is because as `\(n\)` increases, the SE will decrease, most likely inflating the test statistic `\(T\)`.

  - Thus, you should consider practical significance, not just statistical significance.

--

- When the sample size is small, there simply may not be enough evidence to reject the null hypothesis `\(H_0: \beta_j = 0\)`.

  - When you fail to reject the null hypothesis, don't be too hasty to say that the predictor has no linear association with the outcome.

  - There may be an association, just not one strong enough to detect with this sample (or perhaps a nonlinear one).

  - It may also be that the association is not significant because you are already controlling for other characteristics.

---

## Predictions

- Making predictions using the fitted model is straightforward.

--

- For example, suppose we want to predict the baseline salary of a 25-year-old woman (age in months: `\(25 \times 12 = 300\)`) with 12 years of education, 10 months of seniority, and two years (24 months) of experience. We can simply plug these values into the estimated model:

.block[
.small[
`$$\hat{y}_i = 6277.9 - 767.9(1) - 22.6(10) + 0.63(300) + 92.3(12) + 0.50(24) = 6592.6.$$`
]
]

--

- It is easier to do this in R using the .hlight[predict] command. We can also get confidence and prediction intervals using the same command.

```r
newdata <- data.frame(sex = "Female", senior = 10, age = 25*12,
                      educ = 12, exper = 2*12)
preds <- predict(regwage, newdata, interval = "confidence"); preds
```

```
##        fit      lwr      upr
## 1 6593.133 5775.546 7410.721
```

```r
preds <- predict(regwage, newdata, interval = "prediction"); preds
```

```
##        fit      lwr      upr
## 1 6593.133 5293.781 7892.486
```

---

## Final notes

- We can't say for sure that our model has not violated any of the assumptions. We must do model assessment just as with SLR.

--

- Also, how sure are we that this is actually a good model for this data?

--

- We will address these issues and more in the next few classes.

--

- .block[
Be very wary of extrapolation! Because there are several predictors, you can fall into the extrapolation trap in many ways.
]

--

<div class="question">
What do we mean by extrapolation?
</div>

--

- .block[
Note: multiple regression shows association. It does NOT prove causality. Only a carefully designed observational study or randomized experiment can show causality.
]