Due: 6:00pm, Sept 6, 2019
The purpose of this lab is to give you additional practice working with multiple linear regression. You will also write your own code for doing \(k\)-fold cross validation. The lab is based on the Kaggle beer consumption dataset found here: https://www.kaggle.com/dongeorge/beer-consumption-sao-paulo/. Read more about the problem and dataset under the description section of the link.
Kaggle is a great online community of data scientists. To learn more about Kaggle, follow this link: https://www.kaggle.com/getting-started/44916.
In this lab you will demonstrate the impacts of variables on beer consumption in a given region and the consumption forecast for certain scenarios. The data (sample) were collected in Sao Paulo, Brazil, in a university area, where there are some parties with groups of students from 18 to 28 years of age (average).
You all should have R and RStudio installed on your computers by now. If you do not, first install the latest version of R here: https://cran.rstudio.com (remember to select the right installer for your operating system). Next, install the latest version of RStudio here: https://www.rstudio.com/products/rstudio/download/. Scroll down to the “Installers for Supported Platforms” section and find the right installer for your operating system.
You are required to use R Markdown to type up this lab report. If you do not already know how to use R markdown, here is a very (very!) basic R Markdown template: https://akandelanre.github.io/IDS702_F19/labs/resources/LabReport.Rmd. Refer to the resources tab of the course website (here: https://akandelanre.github.io/IDS702_F19/resources/) for links to help you learn how to use R markdown.
You MUST submit both your .Rmd and .pdf files to the course site on Gradescope here: https://www.gradescope.com/courses/57701/assignments. Make sure to knit to pdf and not html; ask the TA about knitting to pdf if you cannot figure it out. Be sure to submit under the right assignment entry.
Download the data (named consumo_cerveja.csv
) from Sakai and save it locally to the same directory as your R markdown file. To find the data file on Sakai, go to Resources \(\rightarrow\) Datasets \(\rightarrow\) Lab Datasets \(\rightarrow\) Lab 1. Once you have downloaded the data file into the SAME folder as your R markdown file, load and clean the data by using the following R code.
It is always a good idea to take a look at the first few rows of the raw file to see what the data looks like before loading the data. In this raw ‘consumo_cerveja’ file, you will notice that commas are actually used both as decimals and to separate the columns. Thus, you need to let R know by specifying the sep and dec options as in the code.
beer <- read.csv("consumo_cerveja.csv",stringsAsFactors = FALSE, sep = ",",dec=",")
# rename the variables
beer$date <- beer$Data
beer$temp_median_c <- beer$Temperatura.Media..C.
beer$temp_min_c <- beer$Temperatura.Minima..C.
beer$temp_max_c <- beer$Temperatura.Maxima..C.
beer$precip_mm <- beer$Precipitacao..mm.
beer$weekend <- factor(beer$Final.de.Semana)
beer$beer_cons_liters <- as.numeric(beer$Consumo.de.cerveja..litros.)
beer <- beer[ , 8:ncol(beer)]
After renaming the variables using the code above, your data will be saved in the object beer
, and the relevant variables plus their meanings are given in the table below:
Variable | Description |
---|---|
date | Date the data for each observation was recorded. |
temp_median_c | Median temperature in \(^0C\). |
temp_min_c | Minimum temperature in \(^0C\). |
temp_max_c | Maximum temperature in \(^0C\). |
precip_mm | Precipitation in \(mm\). |
weekend | Indicator variable for weekend: 1 = weekend, 0 = weekday. |
beer_cons_liters | Beer consumption in liters. |
Treat the variable beer_cons_liters
as your response variable and the other variables as potential predictors.
Make a histogram of beer_cons_liters
. Describe the distribution. Is the normality assumption a plausible one here? If you think the histogram does not look normal enough, make a histogram of log(beer_cons_liters)
. Does that look more “normal” than beer_cons_liters
?
If no, use beer_cons_liters
as your response variable for the remaining questions. If yes, use log(beer_cons_liters)
as your response variable instead for the remaining questions.
Make exploratory plots of beer_cons_liters
(or log(beer_cons_liters)
) versus each potential predictor. Are all the relationships linear? If any one of them is nonlinear, describe the relationship.
Does it make sense to include all three of temp_median_c
, temp_min_c
and temp_max_c
as predictors in a MLR model for predicting beer_cons_liters
(or log(beer_cons_liters)
)? Justify your response in one or two sentences.
Fit a linear model for beer_cons_liters
(or log(beer_cons_liters)
) using weekend
, precip_mm
, and temp_median_c
as your predictors. Interpret all the parameters of the fitted regression model in context of the data. What percent of the variability in beer_cons_liters
(or log(beer_cons_liters)
) is explained by your model?
Which of the variables appears to be the best covariate for explaining or predicting beer consumption? Why?
Are there any potential limitations of the model you have fit? If yes, what are two potential limitations?
Compute the in-sample root mean squared error (RMSE) for the regression model in question 4. Refer back to the class notes for details on how to compute in-sample (or within-sample) RMSE.
Write a code for doing \(k\)-fold cross validation. Refer back to the class notes for details on \(k\)-fold cross validation. Let \(k=10\) and use average RMSE as the metric for quantifying predictive error. What is the average RMSE for the model in question 4 above?
Hint: if you are not sure how to begin writing your code for doing the cross validation, you should consider writing a “for loop”. A “for loop” is actually not the most efficient way to get this done but it will work just fine here. If you don’t know how to write a “for loop”, use the skeleton code below as a guide for writing your own “for loop” for this question.
# Suppose your data is stored in the object "Data"
# First set a seed to ensure your results are reproducible
set.seed(...) # use whatever number you want
# Now randomly re-shuffle the data
Data <- Data[sample(nrow(Data)),]
# Define the number of folds you want
K <- ...
# Define a matrix to save your results into
RSME <- matrix(0,nrow=K,ncol=1)
# Split the row indexes into k equal parts
kth_fold <- cut(seq(1,nrow(Data)),breaks=K,labels=FALSE)
# Now write the for loop for the k-fold cross validation
for(k in 1:K){
# Split your data into the training and test datasets
test_index <- which(kth_fold==k)
train <- Data[-test_index,]
test <- Data[test_index,]
# Now that you've split the data,
RSME[k,] <- ... # write your code for computing RMSE for each k here
# You should consider using your code for question 7 above
}
... #Calculate the average of all values in the RSME matrix here.
Extend the model in question 4 to include interaction terms between weekend
and the other two predictors. Are the interaction terms significant?
To include an interaction term between a continuous/numerical variable x1 and a factor variable x2 in your linear model, use
Use your code for the \(k\)-fold cross validation from question 8 to compute the average RMSE for the new model in question 9. Is the new RMSE model lower or higher? What can you infer from that?
10 points: 1 point for each question
This lab is based on ideas proposed by Sam Voisin.