class: center, middle, inverse, title-slide

# Bootstrap and tree-based methods
### Dr. Olanrewaju Michael Akande
### Nov 12, 2019

---

<!-- ## Announcements -->
<!-- - Reminder: please remember to respond to survey II. -->
<!-- - No more HW5; more time to work on individual projects. -->
<!-- - Final lab next Friday; will cover causal inference. -->

## Outline

- Bootstrap

- Classification and regression trees

- Bagging

- Random forests

- Boosting

---

class: center, middle

# Bootstrap

---

## Introduction

- When building statistical models, we often need to quantify the uncertainty around the estimated parameters we are interested in.

--

- So far in this class, we have been doing so using standard errors and confidence intervals.

--

- Computing standard errors is often straightforward when we have closed forms.

--

- For example, the standard error for `\(\bar{X}\)` is `\(\sigma/\sqrt{n}\)`. When `\(\sigma\)` is unknown, replace it with `\(s = \hat{\sigma}\)`.

--

- What should we do when we do not have closed forms?

---

## Introduction

- Setting confidence intervals and conducting hypothesis tests often requires us to know the distribution of the parameter of interest.

--

- A key tool for doing this is the central limit theorem (CLT).

--

- Recall that according to the CLT, for large samples, averages and sums are approximately normally distributed.

--

- With some work, the CLT allows confidence intervals and hypothesis tests on means, proportions, sums, intercepts, slopes, and so on.

--

- But...what if we want to set confidence intervals on a correlation, a standard deviation, or a ratio?

---

## Introduction

- One neat solution is to approximate whatever distribution you have in mind by re-sampling from the true population.

--

- For example, suppose I would like to estimate the average income of Durham residents and quantify the uncertainty around my estimate.

--

- First I need a sample (of course!).

--

- Suppose I sample 1000 residents and record their incomes as `\(X_1, \ldots, X_{1000}\)`. Then, my estimate of average income is `\(\bar{X}\)`.

--

- Next, I should quantify my uncertainty around that number. I can do so using the standard error `\(\sigma/\sqrt{n}\)` mentioned earlier, which relies on the CLT.

---

## Introduction

- Alternatively, I could approximate the entire distribution of average income myself as follows:

--

  1. Generate `\(B=100\)` different samples of 1000 Durham residents.

--

  2. For each sample `\(b = 1, \ldots, B\)` of 1000 residents, compute `\(\bar{X}^b\)`.

--

  3. Make a histogram of all the `\(\bar{X}^1, \ldots, \bar{X}^{100}\)` values. This approximates the distribution of average income of Durham residents.

--

- The point estimate of average income is then the mean of `\(\bar{X}^1, \ldots, \bar{X}^{100}\)`.

--

- To quantify uncertainty, we can use the standard deviation of `\(\bar{X}^1, \ldots, \bar{X}^{100}\)`.

--

- For confidence intervals, use the quantiles of the histogram.

--

- In practice, however, the procedure above cannot be applied, because we usually cannot generate many samples from the original population.

--

- What to do then? .hlight[Bootstrap]!

---

## Bootstrap

- The .hlight[bootstrap] is a very powerful statistical tool.

--

- It can be used to "approximate" the distribution of almost any parameter of interest.

--

- .block[Bootstrap allows us to mimic the process of obtaining new sample sets by repeatedly sampling observations from the original data set.]

--

- That is, replace step 1 of the previously outlined approach with:

  1. Generate `\(B=100\)` different samples of 1000 Durham residents by re-sampling from the original observed sample with replacement.

--

- We can then follow the remaining steps to approximate the distribution of the parameter of interest.

--

- Ideally, the sample you start with should be representative of the entire population. The bootstrap relies on the original sample!

---

## Bootstrap

Here's a figure from the [ISL](http://faculty.marshall.usc.edu/gareth-james/ISL/) book illustrating the approach.

<img src="img/bootstrap.png" height="480px" style="display: block; margin: auto;" />
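---

## Bootstrap in R

Here is a minimal sketch (not part of the original course code) of the procedure above. The `income` vector is simulated only so the code runs on its own; in practice it would be your observed sample.

```r
set.seed(42)
income <- rexp(1000, rate = 1 / 50000)     # hypothetical observed sample of 1000 incomes

B <- 1000                                  # number of bootstrap samples
boot_means <- replicate(B, {
  # resample the original data with replacement, same size as the original sample
  resample <- sample(income, size = length(income), replace = TRUE)
  mean(resample)                           # statistic of interest
})

mean(boot_means)                           # bootstrap point estimate
sd(boot_means)                             # bootstrap standard error
quantile(boot_means, c(0.025, 0.975))      # 95% percentile interval
```

The same recipe works for statistics without a convenient closed-form standard error: replace `mean(resample)` with, say, a correlation, a standard deviation, or a ratio.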
---

class: center, middle

# Tree-based methods

---

## Tree-based methods

- The regression approaches we have covered so far in this course are all .hlight[parametric].

--

- .hlight[Parametric] means that we need to assume an underlying probability distribution to explain the randomness.

--

- For example, for linear regression,
.block[
.small[
`$$y_i = \beta_0 + \beta_1 x_{i1} + \epsilon_i; \ \ \epsilon_i \overset{iid}{\sim} N(0, \sigma^2),$$`
]
]

  we assume a normal distribution.

--

- For logistic regression,
.block[
.small[
$$
`\begin{split}
y_i | x_i \sim \textrm{Bernoulli}(\pi_i); \ \ \ & \textrm{log}\left(\dfrac{\pi_i}{1-\pi_i}\right) = \beta_0 + \beta_1 x_i,
\end{split}`
$$
]
]

  we assume a Bernoulli distribution.

---

## Tree-based methods

- All the models we have covered require specifying a function for the mean or odds, and a distribution for the randomness.

--

- We may not want to run the risk of mis-specifying those.

--

- As an alternative, one can turn to .hlight[nonparametric models] that optimize certain criteria rather than specify a model:
  + Classification and regression trees (CART)
  + Random forests
  + Boosting
  + Other machine learning methods

--

  + You will learn more about machine learning methods in the machine learning course next semester.

--

  + Today's class is simply a brief (very!) introduction to a few of those methods.

---

## CART

- Goal: predict an outcome variable from several predictors.

--

- Can be used for categorical outcomes (classification trees) or continuous outcomes (regression trees).

--

- Let `\(Y\)` represent the outcome and `\(X\)` represent the predictors.

--

- CART recursively partitions the predictor space in a way that can be effectively represented by a tree structure, with leaves corresponding to subsets of units.

---

## CART for categorical outcomes

- Partition the `\(X\)` space so that the subsets of individuals formed by the partitions have relatively homogeneous `\(Y\)`.

--

- Partitions come from recursive binary splits of `\(X\)`.

--

- Grow the tree until it reaches a pre-determined maximum size (minimum number of points in leaves).

--

- There are various ways to prune the tree based on cross validation.

--

- Making predictions:

--

  + For any new `\(X\)`, trace down the tree until you reach the appropriate leaf.

--

  + Use the value of `\(Y\)` that occurs most frequently in the leaf as the prediction.

---

## CART

<img src="img/cart.png" height="480px" style="display: block; margin: auto;" />

---

## CART for categorical outcomes

- To illustrate, the figure on the previous slide displays a fictional classification tree for
  + an outcome variable, and
  + two predictors: gender (male or female) and race/ethnicity (African-American, Caucasian, or Hispanic).

--

- To approximate the conditional distribution of `\(Y\)` for a particular gender and race/ethnicity combination, one uses the values in the corresponding leaf.

--

- For example, to predict a `\(Y\)` value for female Caucasians, one uses the `\(Y\)` value that occurs most frequently in leaf `\(L3\)`.
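---

## CART in R

To make the previous slides concrete, here is a minimal sketch (not from the original course code) that grows and plots a classification tree with `rpart`. The data are simulated, with variable names chosen to mirror the fictional example above.

```r
library(rpart)

set.seed(123)
# hypothetical data: two categorical predictors and a binary outcome
n <- 500
toy <- data.frame(
  gender = factor(sample(c("male", "female"), n, replace = TRUE)),
  race   = factor(sample(c("African-American", "Caucasian", "Hispanic"),
                         n, replace = TRUE))
)
p_yes <- ifelse(toy$gender == "female" & toy$race == "Caucasian", 0.8, 0.3)
toy$y  <- factor(ifelse(runif(n) < p_yes, "yes", "no"))

# grow a classification tree; method = "class" is for categorical outcomes
tree_fit <- rpart(y ~ gender + race, data = toy, method = "class",
                  control = rpart.control(minbucket = 20, cp = 0.001))

plot(tree_fit, margin = 0.1)
text(tree_fit, use.n = TRUE)        # leaves labeled with the most frequent class

predict(tree_fit, newdata = toy[1:5, ], type = "class")  # trace new X down the tree
printcp(tree_fit)                   # cross-validation table, useful for pruning
```

`prune(tree_fit, cp = ...)` can then cut the tree back to the complexity suggested by the cross-validation table.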
---

## CART for continuous outcomes

- Same idea as for categorical outcomes: grow the tree by recursive partitions on `\(X\)`.

--

- Use the variance of the `\(Y\)` values as the splitting criterion: choose the split that makes the sum of the variances of the `\(Y\)` values in the leaves as small as possible.

--

- When making predictions for a new `\(X\)`, use the average value of `\(Y\)` in the leaf for that `\(X\)`.

---

## Model diagnostics

- Can look at residuals, but...

--

  + There is no parametric model, so for continuous outcomes we can't check for linearity, non-constant variance, normality, etc.

--

  + Big residuals identify `\(X\)` values for which the predictions are not close to the actual `\(Y\)` values. But...what should we do with them?

--

  + Could use binned residuals, as for logistic regression, but they only tell you where the model does not give good predictions.

--

- Transforming the `\(X\)` values is irrelevant for trees (as long as the transformation is monotonic, like logs).

--

- Can still do model validation, that is, compute and compare RMSEs, AUC, accuracy, and so on.

---

## CART vs. parametric regression: benefits

- No parametric assumptions.

--

- Automatic model selection.

--

- Multi-collinearity is not problematic.

--

- Useful exploratory tool for finding important interactions.

--

- In R, use `tree` or `rpart`.

---

## CART vs. parametric regression: limitations

- Regression predictions are forced to the range of observed `\(Y\)` values. May or may not be a limitation depending on the context.

--

- Bins continuous predictors, so fine-grained relationships are lost.

--

- Finds one tree, making it hard to quantify the chance error for that tree.

--

- No obvious ways to assess variable importance.

--

- Harder to interpret effects of individual predictors.

--

- Also, one big tree is limiting, but we need different datasets or variables to grow more than one tree...

---

## Bagging

- Instead of using one big tree, .hlight[bagging] constructs `\(B\)` classification or regression trees using `\(B\)` bootstrapped datasets.

--

- Each tree is grown deep, and has high variance but low bias.

--

- Averaging all `\(B\)` trees reduces the variance.

--

- Accuracy improves by combining hundreds or even thousands of trees.

--

- To predict

  + a continuous outcome, drop the new `\(X\)` down each tree until getting to a terminal leaf. The predicted value of `\(Y\)` is the average of the `\(B\)` predictions across all the trees.

--

  + a categorical outcome, select the most commonly occurring level (majority vote) among the `\(B\)` predictions.

---

## Random forests

- The trees in bagging are correlated, since they are all based on the same data (sort of!).

--

- .hlight[Random forests] attempt to de-correlate the trees.

--

- Random forests also construct `\(B\)` classification or regression trees using `\(B\)` bootstrapped datasets, but each tree only uses a sample of the predictors.

--

- Doing so prevents the same variables from dominating the splitting process across all trees.

--

- Both bagging and random forests will not overfit for large `\(B\)`.

---

## Random forests

- .hlight[Random forest algorithm]:

--

  For `\(b = 1, \ldots, B\)`,

  1. Take a bootstrap sample of the original data.
      + Alternatively, one can take a sub-sample of the original data of size `\(m < n\)`, where `\(n\)` is the sample size of the collected data.

--

  2. Take a sample of `\(q < p\)` predictors, where `\(p\)` is the total number of predictors in the dataset.

--

  3. Using only the data in the bootstrapped sample or sub-sample, grow a tree using only the `\(q\)` sampled predictors. Save the tree.

--

- For predictions, do the same thing as in bagging.

--

- Variable importance measures are based on how often a variable is used in the splits of the trees.

---

## Random forests vs. parametric regression: benefits

- No parametric assumptions.

--

- Automatic model selection.

--

- Multi-collinearity is not problematic.

--

- Can handle big data files, since the trees are small.

--

- In R, use the `randomForest` package (see the sketch on the next slide).
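---

## Random forests in R

A minimal sketch (not from the original course materials) of bagging and a random forest with the `randomForest` package, on a small simulated dataset; the variable names and settings are illustrative only. Note that the package samples the `q` predictors at each split rather than once per tree.

```r
library(randomForest)

set.seed(123)
# hypothetical data: three continuous predictors and a binary outcome
n <- 500
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- factor(ifelse(toy$x1 + 0.5 * toy$x2 + rnorm(n) > 0, "yes", "no"))

# random forest: B = 500 trees, q = 2 of the p = 3 predictors tried at each split
rf_fit <- randomForest(y ~ x1 + x2 + x3, data = toy,
                       ntree = 500, mtry = 2, importance = TRUE)

# bagging is the special case where all p predictors are available at each split
bag_fit <- randomForest(y ~ x1 + x2 + x3, data = toy, ntree = 500, mtry = 3)

predict(rf_fit, newdata = toy[1:5, ])   # majority vote across the B trees
importance(rf_fit)                      # variable importance measures
varImpPlot(rf_fit)
```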
---

## Random forests vs. parametric regression: limitations

- Regression prediction limitations like those for CART.

--

- Hard to assess chance error.

--

- Little control over the few parameters to tweak if the model does not fit the data well.

---

## Boosting

- .hlight[Boosting] works like bagging, except that the trees are grown sequentially.

--

- Specifically, each tree is grown using information from the previously grown trees.

--

- After the first tree, the remaining trees are built using the residuals as outcomes.

--

- The idea is that boosting slowly improves the model in areas where it does not perform well.

--

- Boosting does not involve bootstrap sampling, since each tree is fit on a modified version of the original data set.

--

- It can overfit if the number of trees is too large.

--

- There are many boosting methods! This is just one of them.

---

## Boosting

- Goal: construct a function `\(\hat{f}(y|x)\)` to estimate the true `\(f(y|x)\)`.

--

- .hlight[Boosting algorithm]:

--

  1. Fit a decision tree `\(\hat{f}^1\)` with `\(d\)` splits to the data using `\(Y\)` as the outcome. Compute the residuals.

--

  2. For `\(b = 2, \ldots, B\)`,
      + Fit a decision tree `\(\hat{f}^b\)` with `\(d\)` splits to the data using the residuals as the outcome.
      + Add this new decision tree into the fitted function: `\(\hat{f} = \hat{f} + \lambda \hat{f}^b\)`.
      + Compute the updated residuals.

--

  3. Output the boosted model: `\(\hat{f} = \sum_{b=1}^B \lambda \hat{f}^b\)`.

--

- The shrinkage parameter `\(\lambda\)` (often small, e.g., 0.01) controls the rate at which boosting learns.

---

## General advice about tree methods vs. parametric regressions

- When the goal is prediction and sample sizes are large, tree methods can be effective engines for prediction.

--

- When the goal is interpretation of predictors, or when sample sizes are modest, use parametric models with careful model diagnostics.

--

- Either way, always remember the data:
  + What population, if any, are they representative of?
  + Are the definitions of the variables what you wanted?
  + Are there missing values or data errors to correct?

---

## In class analysis

- Recall the study measuring the concentrations of arsenic in wells in Bangladesh.

--

- We already fit a logistic regression to the data.

--

- We will use the same data to compare these models.

--

- Research question: predicting whether people switch from unsafe wells to safe wells.

--

- The data is in the file `arsenic.csv` on Sakai.

---

## In class analysis

Variable | Description
:------------- | :------------
Switch | 1 = if respondent switched to a safe well <br /> 0 = if still using own unsafe well
Arsenic | amount of arsenic in well at respondent's home (100s of micrograms per liter)
Dist | distance in meters to the nearest known safe well
Assoc | 1 = if any members of household are active in community organizations <br /> 0 = otherwise
Educ | years of schooling of the head of household

- Treat Switch as the response variable and the others as predictors.

- Move to the R script [here](https://ids-702-f19.github.io/Course-Website/slides/lec-slides/Arsenic_II.R).
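---

## In class analysis: a boosting sketch

As a hedged illustration of the boosting algorithm above (not necessarily what the linked `Arsenic_II.R` script does), the `gbm` package fits this kind of sequential tree model. The column names below are assumed to match the codebook (`switch`, `arsenic`, `dist`, `assoc`, `educ`); adjust them to whatever the actual file uses.

```r
library(gbm)

arsenic <- read.csv("arsenic.csv")   # assumes the Sakai file is in the working directory

set.seed(1)
boost_fit <- gbm(switch ~ arsenic + dist + assoc + educ,
                 data = arsenic,
                 distribution = "bernoulli",   # 0/1 outcome
                 n.trees = 1000,               # B
                 interaction.depth = 2,        # d splits per tree
                 shrinkage = 0.01,             # lambda
                 cv.folds = 5)

best_B <- gbm.perf(boost_fit, method = "cv")   # choose B by cross-validation
summary(boost_fit, n.trees = best_B)           # relative influence of the predictors

# predicted switching probabilities, comparable to the logistic regression fit
pred_probs <- predict(boost_fit, newdata = arsenic,
                      n.trees = best_B, type = "response")
```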