Due: 11:59pm, Sept 23, 2019
The purpose of this lab is to give you additional practice working with logistic regression. The lab is based on the NBA Team Game Stats dataset found here: https://www.kaggle.com/ionaskel/nba-games-stats-from-2014-to-2018/. Read more about the problem and dataset under the description section of the link.
Kaggle is a great online community of data scientists. To learn more about Kaggle, follow this link: https://www.kaggle.com/getting-started/44916.
In this lab you will work in pairs. Pick one other student you would like to work with. If you are not familiar with the NBA or the sport of basketball, try to pair up with someone who is. If you cannot find such a teammate, no problem, just follow the instructions. Each pair of students will pick one NBA team, fit a logistic regression using some of the variables as predictors, perform some inference to suggest a coaching strategy, and then do some prediction to see how their model performs on a test set of games.
The names of the two students on each team MUST be on the team’s lab report. Each team should submit only one report for this lab. Gradescope will let you select your teammate when submitting, so make sure to do so. Only one person needs to submit the report.
You all should have R and RStudio installed on your computers by now. If you do not, first install the latest version of R here: https://cran.rstudio.com (remember to select the right installer for your operating system). Next, install the latest version of RStudio here: https://www.rstudio.com/products/rstudio/download/. Scroll down to the “Installers for Supported Platforms” section and find the right installer for your operating system.
You are required to use R Markdown to type up this lab report. If you do not already know how to use R Markdown, here is a very (very!) basic R Markdown template: https://akandelanre.github.io/IDS702_F19/labs/resources/LabReport.Rmd. Refer to the resources tab of the course website (here: https://akandelanre.github.io/IDS702_F19/resources/) for links to help you learn how to use R Markdown.
You MUST submit both your .Rmd and .pdf files (again, just one copy to be submitted by only one of you) to the course site on Gradescope here: https://www.gradescope.com/courses/57701/assignments. Make sure to knit to pdf and not html; ask the TA about knitting to pdf if you cannot figure it out. Be sure to submit under the right assignment entry.
Download the data (named `nba_games_stats.csv`) from Sakai and save it locally to the same directory as your R Markdown file. To find the data file on Sakai, go to Resources \(\rightarrow\) Datasets \(\rightarrow\) Lab Datasets \(\rightarrow\) Lab 2. Once you have downloaded the data file into the SAME folder as your R Markdown file, load the data by using the following R code.
You are expected to select only one NBA team in the data. Also, set aside data for the 2017/2018 season as test data.
```r
nba <- read.csv("nba_games_stats.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE)
# Set factor variables
nba$Home <- factor(nba$Home)
nba$Team <- factor(nba$Team)
nba$WINorLOSS <- factor(nba$WINorLOSS)
# Convert date to the right format
nba$Date <- as.Date(nba$Date, "%Y-%m-%d")
# Also create a binary variable from WINorLOSS.
# This is not always necessary but can be useful for R functions that prefer
# numeric binary variables to the original factor variables
nba$Win <- rep(0, nrow(nba))
nba$Win[nba$WINorLOSS == "W"] <- 1
# I picked the Charlotte Hornets (CHO) as an example; you should pick any team you want
nba_reduced <- nba[nba$Team == "CHO", ]
# Set aside the 2017/2018 season as your test data
nba_reduced_train <- nba_reduced[nba_reduced$Date < as.Date("2017-10-01"), ]
nba_reduced_test <- nba_reduced[nba_reduced$Date >= as.Date("2017-10-01"), ]
```
You will use the `nba_reduced_train` and `nba_reduced_test` data frames for your analyses.
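Before modeling, it can help to confirm that the date split behaves as intended. The sketch below uses a tiny made-up data frame (`toy`, not the real NBA data) with the same cutoff date, and checks that every game lands in exactly one of the two sets.

```r
# Toy stand-in for nba_reduced: the dates and outcomes here are made up
toy <- data.frame(
  Date = as.Date(c("2016-11-02", "2017-04-10", "2017-10-18", "2018-03-01")),
  Win  = c(1, 0, 1, 0)
)

cutoff    <- as.Date("2017-10-01")
toy_train <- toy[toy$Date <  cutoff, ]
toy_test  <- toy[toy$Date >= cutoff, ]

# Every game should land in exactly one of the two sets
stopifnot(nrow(toy_train) + nrow(toy_test) == nrow(toy))

# The two date ranges should not overlap
range(toy_train$Date)
range(toy_test$Date)

# How many wins and losses are available for fitting?
table(toy_train$Win)
```

The same three checks can be run on your own training and test sets after the split.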
Variable | Description |
---|---|
Team | Abbreviation for the name of the team |
Game | Game index for the season. Each team plays 82 games per season |
Date | Date of the game |
Home | Home or away game? |
Opponent | Abbreviation for the name of the opposing team |
WINorLOSS | Did the team win? W = win, L = loss |
Win | Binary re-coding of WINorLOSS. 1 = win, 0 = loss |
TeamPoints | Number of total points scored in the game |
OpponentPoints | Number of total points scored by the opposing team in the game |
FieldGoals | Number of field goals made in the game (also includes 3 point shots but not free throws) |
FieldGoalsAttempted | Number of field goals attempted in the game (also includes 3 point shots but not free throws) |
FieldGoals. | FieldGoals/FieldGoalsAttempted |
X3PointShots | Number of 3 point shots made in the game |
X3PointShotsAttempted | Number of 3 point shots attempted in the game |
X3PointShots. | X3PointShots/X3PointShotsAttempted |
FreeThrows | Number of free throws made in the game |
FreeThrowsAttempted | Number of free throws attempted in the game |
FreeThrows. | FreeThrows/FreeThrowsAttempted |
OffRebounds | Number of offensive rebounds grabbed in the game |
TotalRebounds | Total number of rebounds grabbed in the game (includes OffRebounds) |
Assists | Total number of assists (passes leading to a made field goal) in the game |
Steals | Total number of steals (balls stolen from the opposing team while the opposing team has possession) in the game |
Blocks | Total number of blocks (direct prevention of a made field goal after the ball has been shot by an opposing player) in the game |
Turnovers | Total number of times the ball was lost back to the opposing team while the team had possession. |
TotalFouls | Total number of fouls committed on players on the opposing team |
Opp.FieldGoals | Number of field goals made by the opposing team in the game (also includes 3 point shots but not free throws) |
Opp.FieldGoalsAttempted | Number of field goals attempted by the opposing team in the game (also includes 3 point shots but not free throws) |
Opp.FieldGoals. | Opp.FieldGoals/Opp.FieldGoalsAttempted |
Opp.X3PointShots | Number of 3 point shots made by the opposing team in the game |
Opp.X3PointShotsAttempted | Number of 3 point shots attempted by the opposing team in the game |
Opp.X3PointShots. | Opp.X3PointShots/Opp.X3PointShotsAttempted |
Opp.FreeThrows | Number of free throws made by the opposing team in the game |
Opp.FreeThrowsAttempted | Number of free throws attempted by the opposing team in the game |
Opp.FreeThrows. | Opp.FreeThrows/Opp.FreeThrowsAttempted |
Opp.OffRebounds | Number of offensive rebounds grabbed by the opposing team in the game |
Opp.TotalRebounds | Total number of rebounds grabbed by the opposing team in the game (includes Opp.OffRebounds) |
Opp.Assists | Total number of assists (passes leading to a made field goal) by the opposing team in the game |
Opp.Steals | Total number of steals (balls stolen from the team while the team has possession) by the opposing team in the game |
Opp.Blocks | Total number of blocks (direct prevention of a made field goal after the ball has been shot by a player on the team) by the opposing team in the game |
Opp.Turnovers | Total number of times the ball was won back from the opposing team while the opposing team had possession. |
Opp.TotalFouls | Total number of fouls committed by players on the opposing team |
Abbreviation/Acronym | Franchise |
---|---|
ATL | Atlanta Hawks |
BOS | Boston Celtics |
BRK | Brooklyn Nets |
CHO | Charlotte Hornets |
CHI | Chicago Bulls |
CLE | Cleveland Cavaliers |
DAL | Dallas Mavericks |
DEN | Denver Nuggets |
DET | Detroit Pistons |
GSW | Golden State Warriors |
HOU | Houston Rockets |
IND | Indiana Pacers |
LAC | Los Angeles Clippers |
LAL | Los Angeles Lakers |
MEM | Memphis Grizzlies |
MIA | Miami Heat |
MIL | Milwaukee Bucks |
MIN | Minnesota Timberwolves |
NOP | New Orleans Pelicans |
NYK | New York Knicks |
OKC | Oklahoma City Thunder |
ORL | Orlando Magic |
PHI | Philadelphia 76ers |
PHO | Phoenix Suns |
POR | Portland Trail Blazers |
SAC | Sacramento Kings |
SAS | San Antonio Spurs |
TOR | Toronto Raptors |
UTA | Utah Jazz |
WAS | Washington Wizards |
Treat the variable `Win` (or `WINorLOSS`) as your response variable and the other variables as potential predictors.
Make exploratory plots to explore the relationships between `Win` and the following variables: `Home`, `TeamPoints`, `FieldGoals.`, `Assists`, `Steals`, `Blocks` and `Turnovers`. Don’t include any of the plots, just briefly describe the relationships.
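One possible way to explore a binary response against numeric and categorical predictors is sketched below. It runs on simulated data (`toy` is a made-up stand-in that borrows two column names from the lab data), so the numbers it prints say nothing about any real team.

```r
# Simulated stand-in data; the coefficients used to generate Win are arbitrary
set.seed(1)
n <- 150
toy <- data.frame(
  Home       = factor(sample(c("Away", "Home"), n, replace = TRUE)),
  TeamPoints = round(rnorm(n, mean = 102, sd = 11))
)
toy$Win <- rbinom(n, 1, plogis(-10.2 + 0.1 * toy$TeamPoints))

# Numeric predictor vs. binary response: side-by-side boxplots
boxplot(TeamPoints ~ Win, data = toy, xlab = "Win", ylab = "TeamPoints")

# The same comparison as a number: mean of the predictor within each outcome
pts_by_win <- tapply(toy$TeamPoints, toy$Win, mean)
pts_by_win

# Factor predictor vs. binary response: proportion of wins within each level
home_tab <- prop.table(table(Home = toy$Home, Win = toy$Win), margin = 1)
home_tab
```

Boxplots suit the numeric predictors, while a table of conditional proportions suits a factor such as `Home`.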
There are several combinations of variables we should not include as predictors in the logistic model. Identify at least two pairs and explain, in at most two sentences, why we should not include them in the model at the same time.
Fit a logistic regression model for `Win` (or `WINorLOSS`) using `Home`, `TeamPoints`, `FieldGoals.`, `Assists`, `Steals`, `Blocks` and `Turnovers` as your predictors. Using the `vif` function, are there any concerns regarding multicollinearity in this model?
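A minimal sketch of fitting a logistic regression and checking multicollinearity follows, again on simulated data with only three of the predictors. The `vif` function is assumed here to be the one from the `car` package; since that package may not be installed, the sketch also computes a plain linear-regression VIF, 1 / (1 - R^2), in base R. The two can differ somewhat for a GLM, so treat the base-R values as an approximation.

```r
# Simulated stand-in data; TeamPoints is built to correlate with Assists
set.seed(2)
n <- 300
toy <- data.frame(
  Home    = factor(sample(c("Away", "Home"), n, replace = TRUE)),
  Assists = round(rnorm(n, mean = 22, sd = 5))
)
toy$TeamPoints <- round(60 + 1.9 * toy$Assists + rnorm(n, sd = 8))
toy$Win <- rbinom(n, 1, plogis(-10 + 0.1 * toy$TeamPoints + 0.4 * (toy$Home == "Home")))

# Logistic regression: binomial family with the default logit link
fit <- glm(Win ~ Home + TeamPoints + Assists, data = toy, family = binomial)

# Base-R VIF: regress each design column on the others and use 1 / (1 - R^2)
X <- model.matrix(fit)[, -1]  # drop the intercept column
vif_manual <- sapply(colnames(X), function(j) {
  r2 <- summary(lm(X[, j] ~ X[, colnames(X) != j]))$r.squared
  1 / (1 - r2)
})
vif_manual

# If the car package happens to be installed, compare with its vif()
if (requireNamespace("car", quietly = TRUE)) print(car::vif(fit))
```

A VIF close to 1 indicates little linear overlap with the other predictors, and larger values indicate more. Any cutoff for "too large" is a rule of thumb rather than a formal test.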
Present the output of the fitted model and interpret the significant coefficients in terms of the odds of your team winning an NBA game.
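Logistic regression coefficients are on the log-odds scale, so exponentiating them gives multiplicative effects on the odds. The sketch below, on simulated data, builds a table of odds ratios with Wald confidence intervals from `confint.default`. The variable names mirror the lab data, but the values are made up.

```r
# Simulated stand-in data; the generating coefficients are arbitrary
set.seed(5)
n <- 300
toy <- data.frame(
  Home       = factor(sample(c("Away", "Home"), n, replace = TRUE)),
  TeamPoints = round(rnorm(n, mean = 102, sd = 11)),
  Turnovers  = round(rnorm(n, mean = 13, sd = 4))
)
toy$Win <- rbinom(n, 1, plogis(-8.9 + 0.1 * toy$TeamPoints -
                                 0.1 * toy$Turnovers + 0.4 * (toy$Home == "Home")))

fit <- glm(Win ~ Home + TeamPoints + Turnovers, data = toy, family = binomial)

# Coefficient table on the log-odds scale, including p-values
summary(fit)$coefficients

# Odds ratios with 95% Wald intervals: exponentiate estimates and interval limits
or_table <- cbind(OddsRatio = exp(coef(fit)), exp(confint.default(fit)))
round(or_table, 3)
```

As a purely hypothetical reading, an odds ratio of 1.10 for `TeamPoints` would mean that each additional point scored multiplies the odds of winning by 1.10 (a 10% increase), holding the other predictors constant.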
Using 0.5 as your cutoff for predicting wins or losses (1 vs 0) from the predicted probabilities, what is the accuracy of this model? Plot the ROC curve for the fitted model. What is the AUC value?
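The accuracy, ROC curve, and AUC can all be computed from the fitted probabilities. The sketch below does this in base R on simulated data so that it runs without extra packages. A dedicated package such as `pROC` is another option if you have it installed, though nothing here depends on it.

```r
# Simulated stand-in data; the generating coefficients are arbitrary
set.seed(3)
n <- 300
toy <- data.frame(
  Home       = factor(sample(c("Away", "Home"), n, replace = TRUE)),
  TeamPoints = round(rnorm(n, mean = 102, sd = 11))
)
toy$Win <- rbinom(n, 1, plogis(-10.4 + 0.1 * toy$TeamPoints + 0.4 * (toy$Home == "Home")))
fit <- glm(Win ~ Home + TeamPoints, data = toy, family = binomial)

# In-sample predicted probabilities, classified with a 0.5 cutoff
p_hat    <- fitted(fit)
pred     <- as.integer(p_hat >= 0.5)
conf_mat <- table(Predicted = pred, Observed = toy$Win)
accuracy <- mean(pred == toy$Win)

# ROC curve: true and false positive rates as the cutoff sweeps from high to low
thr <- c(Inf, sort(unique(p_hat), decreasing = TRUE), -Inf)
tpr <- sapply(thr, function(k) mean(p_hat[toy$Win == 1] >= k))
fpr <- sapply(thr, function(k) mean(p_hat[toy$Win == 0] >= k))
plot(fpr, tpr, type = "l", xlab = "False positive rate", ylab = "True positive rate")
abline(0, 1, lty = 2)  # reference line for a classifier with no skill

# AUC via the rank (Mann-Whitney) identity
n1  <- sum(toy$Win == 1)
n0  <- sum(toy$Win == 0)
auc <- (sum(rank(p_hat)[toy$Win == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

conf_mat
c(accuracy = accuracy, auc = auc)
```

Accuracy depends on the chosen cutoff, whereas the AUC summarizes performance across all cutoffs, so the two can rank models differently.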
Now add `Opp.FieldGoals.` as a predictor to the previous model. Is the coefficient significant? If yes, interpret the coefficient in the context of the question.
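One convenient way to add a predictor to an existing fit is `update()`. The sketch below uses simulated data in which `Opp.FieldGoals.` is a made-up proportion between 0 and 1. Because a one-unit change in a proportion spans the whole range from 0 to 1, the example also converts the coefficient into an odds ratio per 0.01 (one percentage point), which is usually easier to interpret.

```r
# Simulated stand-in data; the generating coefficients are arbitrary
set.seed(6)
n <- 300
toy <- data.frame(
  Home            = factor(sample(c("Away", "Home"), n, replace = TRUE)),
  TeamPoints      = round(rnorm(n, mean = 102, sd = 11)),
  Opp.FieldGoals. = round(runif(n, min = 0.36, max = 0.54), 3)
)
toy$Win <- rbinom(n, 1, plogis(-3 + 0.1 * toy$TeamPoints - 16 * toy$Opp.FieldGoals.))

fit1 <- glm(Win ~ Home + TeamPoints, data = toy, family = binomial)

# Keep the response and existing terms, and add one more predictor
fit2 <- update(fit1, . ~ . + Opp.FieldGoals.)

# Estimate, standard error, z value and p-value for the new term
summary(fit2)$coefficients["Opp.FieldGoals.", ]

# Odds ratio for a 0.01 increase in the opponent's field goal percentage
or_per_point <- exp(0.01 * coef(fit2)["Opp.FieldGoals."])
or_per_point
```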
What is the accuracy of this new model? Plot the ROC curve for the fitted model. What is the new AUC value? Which model predicts the odds of winning better?
Using the results of the model with the better predictive ability, what suggestions do you have for your team’s coach to improve the odds of the team winning a regular season game?
Use this model to predict out-of-sample probabilities for the `nba_reduced_test` data. Using 0.5 as your cutoff for predicting wins or losses (1 vs 0) from the out-of-sample predicted probabilities, what is the out-of-sample accuracy? How well does your model do in predicting data for the 2017/2018 season?
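Out-of-sample prediction for a `glm` goes through `predict()` with `newdata`. The sketch below splits simulated data into hypothetical `train` and `test` pieces (stand-ins for the real training and test sets) and compares in-sample and out-of-sample accuracy. Note that `type = "response"` is needed to get probabilities, since the default returns values on the log-odds scale.

```r
# Simulated stand-in data; the generating coefficients are arbitrary
set.seed(4)
n <- 400
toy <- data.frame(
  Home       = factor(sample(c("Away", "Home"), n, replace = TRUE)),
  TeamPoints = round(rnorm(n, mean = 102, sd = 11))
)
toy$Win <- rbinom(n, 1, plogis(-10.4 + 0.1 * toy$TeamPoints + 0.4 * (toy$Home == "Home")))

# Hypothetical split: the first 320 games for fitting, the last 80 held out
train <- toy[1:320, ]
test  <- toy[321:400, ]

fit <- glm(Win ~ Home + TeamPoints, data = train, family = binomial)

# Predicted win probabilities for games the model has not seen
p_test    <- predict(fit, newdata = test, type = "response")
pred_test <- as.integer(p_test >= 0.5)

acc_train <- mean(as.integer(fitted(fit) >= 0.5) == train$Win)
acc_test  <- mean(pred_test == test$Win)

table(Predicted = pred_test, Observed = test$Win)
c(in_sample = acc_train, out_of_sample = acc_test)
```

A large drop from in-sample to out-of-sample accuracy would suggest overfitting, or that the held-out season differs from the earlier ones.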
Using the change in deviance test, test whether including `Opp.Assists` and `Opp.Blocks` in the model at the same time would improve the model. Is there any other variable in this dataset which we did not consider that you think might improve our model? Which one and why?
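The change in deviance test compares two nested models: the drop in residual deviance is referred to a chi-squared distribution with degrees of freedom equal to the number of added parameters. A minimal sketch on simulated data follows, using `anova()` and then reproducing the same p-value by hand. The two added columns are made up and were generated with no real effect on `Win`.

```r
# Simulated stand-in data; Opp.Assists and Opp.Blocks do not affect Win here
set.seed(7)
n <- 300
toy <- data.frame(
  Home        = factor(sample(c("Away", "Home"), n, replace = TRUE)),
  TeamPoints  = round(rnorm(n, mean = 102, sd = 11)),
  Opp.Assists = round(rnorm(n, mean = 22, sd = 5)),
  Opp.Blocks  = rpois(n, lambda = 5)
)
toy$Win <- rbinom(n, 1, plogis(-10.4 + 0.1 * toy$TeamPoints + 0.4 * (toy$Home == "Home")))

fit_small <- glm(Win ~ Home + TeamPoints, data = toy, family = binomial)
fit_big   <- update(fit_small, . ~ . + Opp.Assists + Opp.Blocks)

# Change in deviance (likelihood ratio) test for the two added terms jointly
dev_test <- anova(fit_small, fit_big, test = "Chisq")
dev_test

# The same test by hand
stat  <- deviance(fit_small) - deviance(fit_big)
df    <- df.residual(fit_small) - df.residual(fit_big)
p_val <- pchisq(stat, df = df, lower.tail = FALSE)
c(statistic = stat, df = df, p_value = p_val)
```

A small p-value would suggest that at least one of the added predictors improves the fit, while a large one gives no evidence that the larger model is needed.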
10 points: 1 point for each question
This lab is based on ideas proposed by Sam Voisin.