class: center, middle, inverse, title-slide

# Bootstrap and tree-based methods
### Dr. Olanrewaju Michael Akande
### Nov 12, 2019

---

<!-- ## Announcements -->
<!-- - Reminder: please remember to respond to survey II. -->
<!-- - No more HW5; more time to work on individual projects. -->
<!-- - Final lab next Friday; will cover causal inference. -->

## Outline

- Bootstrap

- Classification and regression trees

- Bagging

- Random forests

- Boosting

---

class: center, middle

# Bootstrap

---

## Introduction

- When building statistical models, we often need to quantify the uncertainty around the estimated parameters we are interested in.

--

- So far in this class, we have been doing so using standard errors and confidence intervals.

--

- Computing standard errors is often straightforward when we have closed forms.

--

- For example, the standard error for `\(\bar{X}\)` is `\(\sigma/\sqrt{n}\)`. When `\(\sigma\)` is unknown, replace it with `\(s = \hat{\sigma}\)`.

--

- What should we do when we do not have closed forms?

---

## Introduction

- Setting confidence intervals and conducting hypothesis tests often requires us to know the distribution of the parameter of interest.

--

- A key tool for doing this is the central limit theorem (CLT).

--

- Recall that according to the CLT, for large samples, averages and sums are approximately normally distributed.

--

- With some work, the CLT allows confidence intervals and hypothesis tests on means, proportions, sums, intercepts, slopes, and so on.

--

- But...what if we want to set confidence intervals on a correlation, a standard deviation, or a ratio?

---

## Introduction

- One neat solution is to approximate whatever distribution you have in mind by re-sampling from the true population.

--

- For example, suppose I would like to estimate the average income of Durham residents and quantify the uncertainty around my estimate.

--

- First I need a sample (of course!).

--

- Suppose I sample 1000 residents and record their incomes as `\(X_1, \ldots, X_{1000}\)`. Then, my estimate of average income is `\(\bar{X}\)`.

--

- Next, I should quantify my uncertainty around that number. I can do so using the standard error `\(\sigma/\sqrt{n}\)` mentioned earlier, which relies on the CLT.

---

## Introduction

- Alternatively, I could approximate the entire distribution of average income myself as follows:

--

  1. Generate `\(B=100\)` different samples of 1000 Durham residents.

--

  2. For each sample `\(b = 1, \ldots, B\)` of 1000 residents, compute `\(\bar{X}^b\)`.

--

  3. Make a histogram of all the `\(\bar{X}^1, \ldots, \bar{X}^{100}\)` values. This approximates the distribution of average income of Durham residents.

--

- The point estimate of average income is then the mean of `\(\bar{X}^1, \ldots, \bar{X}^{100}\)`.

--

- To quantify uncertainty, we can use the standard deviation of `\(\bar{X}^1, \ldots, \bar{X}^{100}\)`.

--

- For confidence intervals, use the quantiles of the histogram.

--

- In practice, however, the procedure above cannot be applied, because we usually cannot generate many samples from the original population.

--

- What to do then? .hlight[Bootstrap]!

---

## Bootstrap

- The .hlight[bootstrap] is a very powerful statistical tool.

--

- It can be used to "approximate" the distribution of almost any parameter of interest.

--

- .block[Bootstrap allows us to mimic the process of obtaining new sample sets by repeatedly sampling observations from the original data set.]

--

- That is, replace step 1 of the previously outlined approach with:

  1. Generate `\(B=100\)` different samples of 1000 Durham residents by re-sampling from the original observed sample with replacement.

--

- We can then follow the remaining steps to approximate the distribution of the parameter of interest.

--

- Ideally, the sample you start with should be representative of the entire population. The bootstrap relies on the original sample!

---

## Bootstrap

Here's a figure from the [ISL](http://faculty.marshall.usc.edu/gareth-james/ISL/) book illustrating the approach.

<img src="img/bootstrap.png" height="480px" style="display: block; margin: auto;" />
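---

## Bootstrap in R

Here is a minimal sketch (not part of the original course code) of the procedure above. The `income` vector is simulated only so the code runs on its own; in practice it would be your observed sample.

```r
set.seed(42)
income <- rexp(1000, rate = 1 / 50000)     # hypothetical observed sample of 1000 incomes

B <- 1000                                  # number of bootstrap samples
boot_means <- replicate(B, {
  # resample the original data with replacement, same size as the original sample
  resample <- sample(income, size = length(income), replace = TRUE)
  mean(resample)                           # statistic of interest
})

mean(boot_means)                           # bootstrap point estimate
sd(boot_means)                             # bootstrap standard error
quantile(boot_means, c(0.025, 0.975))      # 95% percentile interval
```

The same recipe works for statistics without a convenient closed-form standard error: replace `mean(resample)` with, say, a correlation, a standard deviation, or a ratio.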
---

class: center, middle

# Tree-based methods

---

## Tree-based methods

- The regression approaches we have covered so far in this course are all .hlight[parametric].

--

- .hlight[Parametric] means that we need to assume an underlying probability distribution to explain the randomness.

--

- For example, for linear regression,
.block[
.small[
`$$y_i = \beta_0 + \beta_1 x_{i1} + \epsilon_i; \ \ \epsilon_i \overset{iid}{\sim} N(0, \sigma^2),$$`
]
]

  we assume a normal distribution.

--

- For logistic regression,
.block[
.small[
$$
`\begin{split}
y_i | x_i \sim \textrm{Bernoulli}(\pi_i); \ \ \ & \textrm{log}\left(\dfrac{\pi_i}{1-\pi_i}\right) = \beta_0 + \beta_1 x_i,
\end{split}`
$$
]
]

  we assume a Bernoulli distribution.

---

## Tree-based methods

- All the models we have covered require specifying a function for the mean or odds, and a distribution for the randomness.

--

- We may not want to run the risk of mis-specifying those.

--

- As an alternative, one can turn to .hlight[nonparametric models] that optimize certain criteria rather than specify a model:
  + Classification and regression trees (CART)
  + Random forests
  + Boosting
  + Other machine learning methods

--

  + You will learn more about machine learning methods in the machine learning course next semester.

--

  + Today's class is simply a brief (very!) introduction to a few of those methods.

---

## CART

- Goal: predict an outcome variable from several predictors.

--

- Can be used for categorical outcomes (classification trees) or continuous outcomes (regression trees).

--

- Let `\(Y\)` represent the outcome and `\(X\)` represent the predictors.

--

- CART recursively partitions the predictor space in a way that can be effectively represented by a tree structure, with leaves corresponding to subsets of units.

---

## CART for categorical outcomes

- Partition the `\(X\)` space so that the subsets of individuals formed by the partitions have relatively homogeneous `\(Y\)`.

--

- Partitions come from recursive binary splits of `\(X\)`.

--

- Grow the tree until it reaches a pre-determined maximum size (minimum number of points in leaves).

--

- There are various ways to prune the tree based on cross validation.

--

- Making predictions:

--

  + For any new `\(X\)`, trace down the tree until you reach the appropriate leaf.

--

  + Use the value of `\(Y\)` that occurs most frequently in the leaf as the prediction.

---

## CART

<img src="img/cart.png" height="480px" style="display: block; margin: auto;" />

---

## CART for categorical outcomes

- To illustrate, the figure on the previous slide displays a fictional classification tree for
  + an outcome variable, and
  + two predictors: gender (male or female) and race/ethnicity (African-American, Caucasian, or Hispanic).

--

- To approximate the conditional distribution of `\(Y\)` for a particular gender and race/ethnicity combination, one uses the values in the corresponding leaf.

--

- For example, to predict a `\(Y\)` value for female Caucasians, one uses the `\(Y\)` value that occurs most frequently in leaf `\(L3\)`.
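---

## CART in R

To make the previous slides concrete, here is a minimal sketch (not from the original course code) that grows and plots a classification tree with `rpart`. The data are simulated, with variable names chosen to mirror the fictional example above.

```r
library(rpart)

set.seed(123)
# hypothetical data: two categorical predictors and a binary outcome
n <- 500
toy <- data.frame(
  gender = factor(sample(c("male", "female"), n, replace = TRUE)),
  race   = factor(sample(c("African-American", "Caucasian", "Hispanic"),
                         n, replace = TRUE))
)
p_yes <- ifelse(toy$gender == "female" & toy$race == "Caucasian", 0.8, 0.3)
toy$y  <- factor(ifelse(runif(n) < p_yes, "yes", "no"))

# grow a classification tree; method = "class" is for categorical outcomes
tree_fit <- rpart(y ~ gender + race, data = toy, method = "class",
                  control = rpart.control(minbucket = 20, cp = 0.001))

plot(tree_fit, margin = 0.1)
text(tree_fit, use.n = TRUE)        # leaves labeled with the most frequent class

predict(tree_fit, newdata = toy[1:5, ], type = "class")  # trace new X down the tree
printcp(tree_fit)                   # cross-validation table, useful for pruning
```

`prune(tree_fit, cp = ...)` can then cut the tree back to the complexity suggested by the cross-validation table.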
---

## CART for continuous outcomes

- Same idea as for categorical outcomes: grow the tree by recursive partitions on `\(X\)`.

--

- Use the variance of the `\(Y\)` values as the splitting criterion: choose the split that makes the sum of the variances of the `\(Y\)` values in the leaves as small as possible.

--

- When making predictions for a new `\(X\)`, use the average value of `\(Y\)` in the leaf for that `\(X\)`.

---

## Model diagnostics

- Can look at residuals, but...

--

  + There is no parametric model, so for continuous outcomes we can't check for linearity, non-constant variance, normality, etc.

--

  + Big residuals identify `\(X\)` values for which the predictions are not close to the actual `\(Y\)` values. But...what should we do with them?

--

  + Could use binned residuals, as for logistic regression, but they only tell you where the model does not give good predictions.

--

- Transforming the `\(X\)` values is irrelevant for trees (as long as the transformation is monotonic, like logs).

--

- Can still do model validation, that is, compute and compare RMSEs, AUC, accuracy, and so on.

---

## CART vs. parametric regression: benefits

- No parametric assumptions.

--

- Automatic model selection.

--

- Multi-collinearity is not problematic.

--

- Useful exploratory tool for finding important interactions.

--

- In R, use `tree` or `rpart`.

---

## CART vs. parametric regression: limitations

- Regression predictions are forced to the range of observed `\(Y\)` values. May or may not be a limitation depending on the context.

--

- Bins continuous predictors, so fine-grained relationships are lost.

--

- Finds one tree, making it hard to quantify the chance error for that tree.

--

- No obvious ways to assess variable importance.

--

- Harder to interpret effects of individual predictors.

--

- Also, one big tree is limiting, but we need different datasets or variables to grow more than one tree...

---

## Bagging

- Instead of using one big tree, .hlight[bagging] constructs `\(B\)` classification or regression trees using `\(B\)` bootstrapped datasets.

--

- Each tree is grown deep, and has high variance but low bias.

--

- Averaging all `\(B\)` trees reduces the variance.

--

- Accuracy improves by combining hundreds or even thousands of trees.

--

- To predict

  + a continuous outcome, drop the new `\(X\)` down each tree until getting to a terminal leaf. The predicted value of `\(Y\)` is the average of the `\(B\)` predictions across all the trees.

--

  + a categorical outcome, select the most commonly occurring level (majority vote) among the `\(B\)` predictions.

---

## Random forests

- The trees in bagging are correlated, since they are all based on the same data (sort of!).

--

- .hlight[Random forests] attempt to de-correlate the trees.

--

- Random forests also construct `\(B\)` classification or regression trees using `\(B\)` bootstrapped datasets, but each tree only uses a sample of the predictors.

--

- Doing so prevents the same variables from dominating the splitting process across all trees.

--

- Both bagging and random forests will not overfit for large `\(B\)`.

---

## Random forests

- .hlight[Random forest algorithm]:

--

  For `\(b = 1, \ldots, B\)`,

  1. Take a bootstrap sample of the original data.
      + Alternatively, one can take a sub-sample of the original data of size `\(m < n\)`, where `\(n\)` is the sample size of the collected data.

--

  2. Take a sample of `\(q < p\)` predictors, where `\(p\)` is the total number of predictors in the dataset.

--

  3. Using only the data in the bootstrapped sample or sub-sample, grow a tree using only the `\(q\)` sampled predictors. Save the tree.

--

- For predictions, do the same thing as in bagging.

--

- Variable importance measures are based on how often a variable is used in the splits of the trees.

---

## Random forests vs. parametric regression: benefits

- No parametric assumptions.

--

- Automatic model selection.

--

- Multi-collinearity is not problematic.

--

- Can handle big data files, since the trees are small.

--

- In R, use the `randomForest` package (see the sketch on the next slide).
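---

## Random forests in R

A minimal sketch (not from the original course materials) of bagging and a random forest with the `randomForest` package, on a small simulated dataset; the variable names and settings are illustrative only. Note that the package samples the `q` predictors at each split rather than once per tree.

```r
library(randomForest)

set.seed(123)
# hypothetical data: three continuous predictors and a binary outcome
n <- 500
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- factor(ifelse(toy$x1 + 0.5 * toy$x2 + rnorm(n) > 0, "yes", "no"))

# random forest: B = 500 trees, q = 2 of the p = 3 predictors tried at each split
rf_fit <- randomForest(y ~ x1 + x2 + x3, data = toy,
                       ntree = 500, mtry = 2, importance = TRUE)

# bagging is the special case where all p predictors are available at each split
bag_fit <- randomForest(y ~ x1 + x2 + x3, data = toy, ntree = 500, mtry = 3)

predict(rf_fit, newdata = toy[1:5, ])   # majority vote across the B trees
importance(rf_fit)                      # variable importance measures
varImpPlot(rf_fit)
```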
---

## Random forests vs. parametric regression: limitations

- Regression prediction limitations like those for CART.

--

- Hard to assess chance error.

--

- Little control over the few parameters to tweak if the model does not fit the data well.

---

## Boosting

- .hlight[Boosting] works like bagging, except that the trees are grown sequentially.

--

- Specifically, each tree is grown using information from the previously grown trees.

--

- After the first tree, the remaining trees are built using the residuals as outcomes.

--

- The idea is that boosting slowly improves the model in areas where it does not perform well.

--

- Boosting does not involve bootstrap sampling, since each tree is fit on a modified version of the original data set.

--

- It can overfit if the number of trees is too large.

--

- There are many boosting methods! This is just one of them.

---

## Boosting

- Goal: construct a function `\(\hat{f}(y|x)\)` to estimate the true `\(f(y|x)\)`.

--

- .hlight[Boosting algorithm]:

--

  1. Fit a decision tree `\(\hat{f}^1\)` with `\(d\)` splits to the data using `\(Y\)` as the outcome. Compute the residuals.

--

  2. For `\(b = 2, \ldots, B\)`,
      + Fit a decision tree `\(\hat{f}^b\)` with `\(d\)` splits to the data using the residuals as the outcome.
      + Add this new decision tree into the fitted function: `\(\hat{f} = \hat{f} + \lambda \hat{f}^b\)`.
      + Compute the updated residuals.

--

  3. Output the boosted model: `\(\hat{f} = \sum_{b=1}^B \lambda \hat{f}^b\)`.

--

- The shrinkage parameter `\(\lambda\)` (often small, e.g., 0.01) controls the rate at which boosting learns.

---

## General advice about tree methods vs. parametric regressions

- When the goal is prediction and sample sizes are large, tree methods can be effective engines for prediction.

--

- When the goal is interpretation of predictors, or when sample sizes are modest, use parametric models with careful model diagnostics.

--

- Either way, always remember the data:
  + What population, if any, are they representative of?
  + Are the definitions of the variables what you wanted?
  + Are there missing values or data errors to correct?

---

## In class analysis

- Recall the study measuring the concentrations of arsenic in wells in Bangladesh.

--

- We already fit a logistic regression to the data.

--

- We will use the same data to compare these models.

--

- Research question: predicting whether people switch from unsafe wells to safe wells.

--

- The data is in the file `arsenic.csv` on Sakai.

---

## In class analysis

Variable | Description
:------------- | :------------
Switch | 1 = if respondent switched to a safe well <br /> 0 = if still using own unsafe well
Arsenic | amount of arsenic in well at respondent's home (100s of micrograms per liter)
Dist | distance in meters to the nearest known safe well
Assoc | 1 = if any members of household are active in community organizations <br /> 0 = otherwise
Educ | years of schooling of the head of household

- Treat Switch as the response variable and the others as predictors.

- Move to the R script [here](https://ids-702-f19.github.io/Course-Website/slides/lec-slides/Arsenic_II.R).
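---

## In class analysis: a boosting sketch

As a hedged illustration of the boosting algorithm above (not necessarily what the linked `Arsenic_II.R` script does), the `gbm` package fits this kind of sequential tree model. The column names below are assumed to match the codebook (`switch`, `arsenic`, `dist`, `assoc`, `educ`); adjust them to whatever the actual file uses.

```r
library(gbm)

arsenic <- read.csv("arsenic.csv")   # assumes the Sakai file is in the working directory

set.seed(1)
boost_fit <- gbm(switch ~ arsenic + dist + assoc + educ,
                 data = arsenic,
                 distribution = "bernoulli",   # 0/1 outcome
                 n.trees = 1000,               # B
                 interaction.depth = 2,        # d splits per tree
                 shrinkage = 0.01,             # lambda
                 cv.folds = 5)

best_B <- gbm.perf(boost_fit, method = "cv")   # choose B by cross-validation
summary(boost_fit, n.trees = best_B)           # relative influence of the predictors

# predicted switching probabilities, comparable to the logistic regression fit
pred_probs <- predict(boost_fit, newdata = arsenic,
                      n.trees = best_B, type = "response")
```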