Lab 1: multiple linear regression

Beer Consumption in Sao Paulo

Sept 6, 2019

Due: 6:00pm, Sept 6, 2019

Introduction

The purpose of this lab is to give you additional practice working with multiple linear regression. You will also write your own code for doing \(k\)-fold cross validation. The lab is based on the Kaggle beer consumption dataset found here: https://www.kaggle.com/dongeorge/beer-consumption-sao-paulo/. Read more about the problem and dataset under the description section of the link.

Kaggle is a great online community of data scientists. To learn more about Kaggle, follow this link: https://www.kaggle.com/getting-started/44916.

In this lab you will demonstrate the impacts of variables on beer consumption in a given region and the consumption forecast for certain scenarios. The data (sample) were collected in Sao Paulo, Brazil, in a university area, where there are some parties with groups of students from 18 to 28 years of age (average).

Housekeeping

R/RStudio

You all should have R and RStudio installed on your computers by now. If you do not, first install the latest version of R here: https://cran.rstudio.com (remember to select the right installer for your operating system). Next, install the latest version of RStudio here: https://www.rstudio.com/products/rstudio/download/. Scroll down to the “Installers for Supported Platforms” section and find the right installer for your operating system.

R Markdown

You are required to use R Markdown to type up this lab report. If you do not already know how to use R markdown, here is a very (very!) basic R Markdown template: https://akandelanre.github.io/IDS702_F19/labs/resources/LabReport.Rmd. Refer to the resources tab of the course website (here: https://akandelanre.github.io/IDS702_F19/resources/) for links to help you learn how to use R markdown.

Gradescope

You MUST submit both your .Rmd and .pdf files to the course site on Gradescope here: https://www.gradescope.com/courses/57701/assignments. Make sure to knit to pdf and not html; ask the TA about knitting to pdf if you cannot figure it out. Be sure to submit under the right assignment entry.

The Data

Download the data (named consumo_cerveja.csv) from Sakai and save it locally to the same directory as your R markdown file. To find the data file on Sakai, go to Resources \(\rightarrow\) Datasets \(\rightarrow\) Lab Datasets \(\rightarrow\) Lab 1. Once you have downloaded the data file into the SAME folder as your R markdown file, load and clean the data by using the following R code.

It is always a good idea to take a look at the first few rows of the raw file to see what the data looks like before loading the data. In this raw ‘consumo_cerveja’ file, you will notice that commas are actually used both as decimals and to separate the columns. Thus, you need to let R know by specifying the sep and dec options as in the code.

beer <- read.csv("consumo_cerveja.csv",stringsAsFactors = FALSE, sep = ",",dec=",")
# rename the variables
beer$date <- beer$Data
beer$temp_median_c <- beer$Temperatura.Media..C.
beer$temp_min_c <- beer$Temperatura.Minima..C.
beer$temp_max_c <- beer$Temperatura.Maxima..C.
beer$precip_mm <- beer$Precipitacao..mm.
beer$weekend <- factor(beer$Final.de.Semana)
beer$beer_cons_liters <- as.numeric(beer$Consumo.de.cerveja..litros.)
beer <- beer[ , 8:ncol(beer)]

After renaming the variables using the code above, your data will be saved in the object beer, and the relevant variables plus their meanings are given in the table below:

Variable Description
date Date the data for each observation was recorded.
temp_median_c Median temperature in \(^0C\).
temp_min_c Minimum temperature in \(^0C\).
temp_max_c Maximum temperature in \(^0C\).
precip_mm Precipitation in \(mm\).
weekend Indicator variable for weekend: 1 = weekend, 0 = weekday.
beer_cons_liters Beer consumption in liters.

Exercises

Treat the variable beer_cons_liters as your response variable and the other variables as potential predictors.

  1. Make a histogram of beer_cons_liters. Describe the distribution. Is the normality assumption a plausible one here? If you think the histogram does not look normal enough, make a histogram of log(beer_cons_liters). Does that look more “normal” than beer_cons_liters?

    If no, use beer_cons_liters as your response variable for the remaining questions. If yes, use log(beer_cons_liters) as your response variable instead for the remaining questions.

  2. Make exploratory plots of beer_cons_liters (or log(beer_cons_liters)) versus each potential predictor. Are all the relationships linear? If any one of them is nonlinear, describe the relationship.

  3. Does it make sense to include all three of temp_median_c, temp_min_c and temp_max_c as predictors in a MLR model for predicting beer_cons_liters (or log(beer_cons_liters))? Justify your response in one or two sentences.

  4. Fit a linear model for beer_cons_liters (or log(beer_cons_liters)) using weekend, precip_mm, and temp_median_c as your predictors. Interpret all the parameters of the fitted regression model in context of the data. What percent of the variability in beer_cons_liters (or log(beer_cons_liters)) is explained by your model?

  5. Which of the variables appears to be the best covariate for explaining or predicting beer consumption? Why?

  6. Are there any potential limitations of the model you have fit? If yes, what are two potential limitations?

  7. Compute the in-sample root mean squared error (RMSE) for the regression model in question 4. Refer back to the class notes for details on how to compute in-sample (or within-sample) RMSE.

  8. Write a code for doing \(k\)-fold cross validation. Refer back to the class notes for details on \(k\)-fold cross validation. Let \(k=10\) and use average RMSE as the metric for quantifying predictive error. What is the average RMSE for the model in question 4 above?

    Hint: if you are not sure how to begin writing your code for doing the cross validation, you should consider writing a “for loop”. A “for loop” is actually not the most efficient way to get this done but it will work just fine here. If you don’t know how to write a “for loop”, use the skeleton code below as a guide for writing your own “for loop” for this question.

  9. Extend the model in question 4 to include interaction terms between weekend and the other two predictors. Are the interaction terms significant?

    To include an interaction term between a continuous/numerical variable x1 and a factor variable x2 in your linear model, use

  10. Use your code for the \(k\)-fold cross validation from question 8 to compute the average RMSE for the new model in question 9. Is the new RMSE model lower or higher? What can you infer from that?

Grading

10 points: 1 point for each question

Acknowledgement

This lab is based on ideas proposed by Sam Voisin.