class: center, middle, inverse, title-slide

# Missing data methods II
### Dr. Olanrewaju Michael Akande
### Oct 29, 2019

---

## Announcements

- Instructions for project proposals are on the course website.

## Outline

- Multiple imputation
  + Joint modeling
  + Sequential regression models
  + General advice
  + Simple illustration

---

class: center, middle

# Multiple imputation

---

## MI recap

- Fill in the dataset `\(m\)` times with imputations.

--

- Analyze the repeated datasets separately, then combine the estimates from each one.

--

- Imputations are drawn from probability models for the missing data.

--

<img src="img/MultipleImp.png" height="370px" style="display: block; margin: auto;" />

---

## MI recap

Rubin (1987)

- Population estimand: `\(Q\)`

- Sample estimate: `\(q\)`

- Variance of `\(q\)`: `\(u\)`

- In each imputed dataset `\(d_j\)`, where `\(j = 1,\ldots,m\)`, calculate

  `$$q_j = q(d_j)$$`
  `$$u_j = u(d_j).$$`

---

## MI recap

- MI estimate of `\(Q\)`:
.block[
`$$\bar{q}_m = \sum\limits_{j=1}^m \dfrac{q_j}{m}.$$`
]

--

- MI estimate of variance:
.block[
`$$T_m = (1+1/m)b_m + \bar{u}_m,$$`
]
where
.block[
`$$b_m = \sum\limits_{j=1}^m \dfrac{(q_j - \bar{q}_m)^2}{m-1}; \ \ \ \ \bar{u}_m = \sum\limits_{j=1}^m \dfrac{u_j}{m}.$$`
]

--

- Use t-distribution inference for `\(Q\)`:
.block[
`$$\bar{q}_m \pm t_{\nu_m, 1-\alpha/2} \sqrt{T_m},$$`
]
where the degrees of freedom are `\(\nu_m = (m-1)\left(1 + \dfrac{\bar{u}_m}{(1+1/m)b_m}\right)^2\)` (Rubin 1987).

---

## Explanation of MI variance

- MI estimate of variance:
.block[
`$$T_m = (1+1/m)b_m + \bar{u}_m.$$`
]

- By the law of total variance, for two random variables `\(X\)` and `\(Y\)`,
.block[
`$$\mathbb{Var}[Y] = \mathbb{E}[\mathbb{Var}[Y|X]] + \mathbb{Var}[\mathbb{E}[Y|X]],$$`
]
which, conditioning throughout on a third random variable `\(Z\)`, implies
.block[
`$$\mathbb{Var}[Y|Z] = \mathbb{E}[\mathbb{Var}[Y|Z,X]] + \mathbb{Var}[\mathbb{E}[Y|Z,X]].$$`
]

- Letting `\(m = \infty\)`, where `\(m\)` is the number of completed datasets, Rubin (1987) shows that
.block[
`$$\mathbb{Var}[Q | \boldsymbol{Y}_{obs}] = \mathbb{E}[\mathbb{Var}[Q|\boldsymbol{Y}_{obs},\boldsymbol{Y}_{mis}]] + \mathbb{Var}[\mathbb{E}[Q|\boldsymbol{Y}_{obs},\boldsymbol{Y}_{mis}]] = b_\infty + \bar{u}_\infty;$$`
]
one can also condition on `\(\boldsymbol{X}\)` in the regression context where predictors are available.

---

## MI: pros and cons

- Advantages

--

  + Straightforward estimation of uncertainty

--

  + Flexible modeling of missing data

--

- Disadvantages (??)

--

  + Extra data sets to manage

--

  + Explicitly model-based

---

## Resources for learning more

- Little and Rubin (2002), *Statistical Analysis with Missing Data*, Wiley.

- Schafer (1997), *Analysis of Incomplete Multivariate Data*, CRC Press.

- Reiter and Raghunathan (2007), "The multiple adaptations of multiple imputation," *JASA*.

---

class: center, middle

# Where should the imputations come from?

---

## MI: where should the imputations come from?

So where should we get reasonable replacements for the missing values? There are two general approaches:

--

- Sequential modeling

--

  + Estimate a sequence of conditional models (think separate regressions for each variable!);

--

  + Impute from each model.

--

- Joint modeling

--

  + Choose a multivariate model for all the data (we will not cover joint multivariate models in this class; consider taking STA602!);

--

  + Estimate the model;

--

  + Impute from the joint model.
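---

## Rubin's combining rules in R

Before turning to sequential regression in detail, here is a minimal sketch of the combining rules from the recap. The vectors `q` and `u` below are made-up estimates and variances from `\(m = 5\)` hypothetical completed datasets.

```r
q <- c(2.1, 2.4, 1.9, 2.2, 2.0)       # hypothetical point estimates
u <- c(0.10, 0.12, 0.09, 0.11, 0.10)  # their estimated variances
m <- length(q)

q_bar <- mean(q)                  # MI point estimate
b_m   <- var(q)                   # between-imputation variance
u_bar <- mean(u)                  # within-imputation variance
T_m   <- (1 + 1/m) * b_m + u_bar  # total MI variance

nu <- (m - 1) * (1 + u_bar / ((1 + 1/m) * b_m))^2  # degrees of freedom
q_bar + c(-1, 1) * qt(0.975, nu) * sqrt(T_m)       # 95% interval for Q
```

The .hlight[pool] function in the .hlight[mice] package (used later) carries out these computations for us.

---

## MI: sequential regression models

Suppose the data include three variables `\(Y_1\)`, `\(Y_2\)`, `\(Y_3\)`.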
--

+ Step 1: initially fill in the missing values by simulating from regressions based on complete cases;

--

+ Step 2: regress `\(Y_1|Y_2,Y_3\)` using the completed data;

--

+ Step 3: impute new values of `\(Y_1\)` from this model;

--

+ Step 4: repeat for `\(Y_2|Y_1,Y_3\)` and `\(Y_3|Y_1,Y_2\)` (repeat for all variables with missing data);

--

+ Step 5: cycle through Steps 2 to 4 many times.

  - Five cycles is a common default, but there is no theory underpinning this choice.

--

The final dataset is one imputed dataset. Repeat the entire process `\(m\)` times to get `\(m\)` multiply-imputed datasets. (A code sketch of this algorithm appears just before the MI in R section.)

---

## Existing software for sequential modeling

Free software packages:

+ MICE for R and Stata (many conditional models to pick from, for example, predictive mean matching, random forest, linear regression, logistic regression, and so on);

+ statsmodels MICE in Python (only uses predictive mean matching);

+ MI for R;

+ IVEware for SAS.

In sequential modeling, one can specify many types of conditional models and include constraints on values.

---

## Existing software for joint modeling

- Multivariate normal data:
  + R: NORM, Amelia II;
  + SAS: proc MI;
  + Stata: MI command.

- Mixtures of multivariate normal distributions:
  + R: EditImpCont (also does editing of faulty values).

- Multinomial data:
  + R: CAT (log-linear model), NPBayesImpute (latent class model).

---

## Existing software for joint modeling

- Nested multinomial data:
  + R: NestedCategBayesImpute (also generates synthetic data). *Update coming soon to allow for editing of faulty values.*

- Mixed data:
  + R: MIX (general location model).

- Many other joint models exist, but often without open-source software.

---

## Comparing sequential to joint modeling

Advantages of the sequential approach:

--

- Often easier to specify reasonable conditionals than a joint model.

--

- Complex MCMC not needed.

--

- Can use machine learning methods for the conditionals.

--

Disadvantages:

--

- Labor intensive to specify the models.

--

- Incoherent conditionals can cause odd behaviors (e.g., order matters).

--

- Theoretical properties are difficult to assess.

---

## What if imputation and analysis model do not match?

- Imputation model more general than the analysis model: .hlight[conservative inferences].

--

- Imputation model less general than the analysis model: .hlight[invalid inferences].

--

- For sequential modeling, include all variables related to the outcome and to the missingness (Schafer 1997).

--

- Include design information in the models (Reiter *et al.* 2006, *Survey Methodology*).

---

## Evaluating the fit of imputation models

- Plots of imputed and observed values (Abayomi *et al.*, 2008, *JRSS-C*):

  + Imputed values that don't look like the observed values could *maybe* imply poor imputation models;

--

  + Useful as a sanity check.

- Model-specific diagnostics (Gelman *et al.*, 2005, *Biometrics*):

  + Look at residual plots with the observed and imputed values marked;

--

  + Look for obvious abnormalities.

---

## Remarks

- Ignoring missing data is risky.

--

- Single imputation procedures at best underestimate uncertainty and at worst fail to capture multivariate relationships.

--

- Multiple imputation is recommended (or other model-based methods).

--

- We discussed MI for MAR data. When data are NMAR, life is much harder; get experts in missing data on your team.

---

## Remarks

- Incorporate all sources of uncertainty in the imputations, including uncertainty in parameter estimates.

--

- Want models that accurately describe the distribution of the missing values.

--

- Important to keep in mind that the imputation model is used only for cases with missing data.

--

  + Suppose you have 30% missing values;

--

  + Also, suppose your imputation model is "80% good" ("20% bad");

--

  + Then, the completed data are only "6% bad" (30% `\(\times\)` 20%)!
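---

## Sequential algorithm: a sketch

Before turning to R, here is a minimal sketch of the sequential (chained equations) algorithm described earlier, for a data frame of numeric variables. This is illustrative only, not the .hlight[mice] implementation; a proper version would also draw the regression parameters from their posterior, to incorporate all sources of uncertainty.

```r
# one pass of the sequential algorithm; returns one completed dataset
# (illustrative sketch; assumes all columns of df are numeric)
impute_once <- function(df, n_cycles = 5) {
  miss <- lapply(df, is.na)                   # remember what was missing
  for (v in names(df))                        # Step 1: initial fill-in
    df[[v]][miss[[v]]] <- sample(df[[v]][!miss[[v]]],
                                 sum(miss[[v]]), replace = TRUE)
  for (cycle in 1:n_cycles) {                 # Step 5: repeat the cycle
    for (v in names(df)) {                    # Steps 2-4: each variable
      if (!any(miss[[v]])) next
      fit  <- lm(reformulate(setdiff(names(df), v), v), data = df)
      pred <- predict(fit, newdata = df[miss[[v]], ])
      df[[v]][miss[[v]]] <- rnorm(sum(miss[[v]]), pred, sigma(fit))
    }
  }
  df
}
```

Calling `impute_once` `\(m\)` times (with different random seeds) yields `\(m\)` multiply-imputed datasets.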
---

class: center, middle

# MI in R

---

## Illustration

- Simple example using data that come with the .hlight[mice] package in R.

--

- The dataset, from NHANES, includes 25 cases measured on 4 variables.

--

- Only 13 cases have complete data.

--

- We will use multiple imputation to make completed datasets and do analyses.

--

- The four variables are
  1. age (age group: 20-39, 40-59, 60-99)
  2. bmi (body mass index, in `\(kg/m^2\)`)
  3. hyp (hypertension status: no, yes)
  4. chl (total cholesterol, in `\(mg/dL\)`)

---

## Illustration

```r
library(mice)
data(nhanes2)
dim(nhanes2)
```

```
## [1] 25 4
```

```r
summary(nhanes2)
```

```
##     age          bmi            hyp          chl       
##  20-39:12   Min.   :20.40   no  :13   Min.   :113.0  
##  40-59: 7   1st Qu.:22.65   yes : 4   1st Qu.:185.0  
##  60-99: 6   Median :26.75   NA's: 8   Median :187.0  
##             Mean   :26.56             Mean   :191.4  
##             3rd Qu.:28.93             3rd Qu.:212.0  
##             Max.   :35.30             Max.   :284.0  
##             NA's   :9                 NA's   :10
```

```r
str(nhanes2)
```

```
## 'data.frame': 25 obs. of 4 variables:
##  $ age: Factor w/ 3 levels "20-39","40-59",..: 1 2 1 3 1 3 1 1 2 2 ...
##  $ bmi: num NA 22.7 NA NA 20.4 NA 22.5 30.1 22 NA ...
##  $ hyp: Factor w/ 2 levels "no","yes": NA 1 1 NA 1 NA 1 1 1 NA ...
##  $ chl: num NA 187 187 NA 113 184 118 187 238 NA ...
```

---

## Patterns of missing data

```r
md.pattern(nhanes2)
```

<img src="img/miss-data-pattern.png" height="400px" style="display: block; margin: auto;" />

--

5 patterns observed, out of `\(2^3=8\)` possible patterns.

---

## Patterns of missing data

<img src="img/miss-data-pattern.png" height="350px" style="display: block; margin: auto;" />

--

- The first column gives the number of rows with each pattern, e.g., 3 cases where only cholesterol is missing.

--

- The last column gives the number of variables missing in each pattern.

--

- The last row gives the total number of missing values by variable.

---

## Visualizing patterns of missing data

```r
library(VIM)
library(lattice)
aggr(nhanes2, col = c("lightblue3", "darkred"), numbers = TRUE, sortVars = TRUE,
     labels = names(nhanes2), cex.axis = .7, gap = 3,
     ylab = c("Proportion missing", "Missingness pattern"))
```

```
## 
##  Variables sorted by number of missings: 
##  Variable Count
##       chl  0.40
##       bmi  0.36
##       hyp  0.32
##       age  0.00
```

<img src="17-missing-data-II_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />

---

## Visualizing patterns of missing data

The .hlight[marginplot] function can be used to understand how missingness affects the distribution of values of other variables.

--

- Blue boxplots summarize the distribution of observed data given that the other variable is observed.

--

- Red boxplots summarize the distribution of observed data given that the other variable is missing.

--

- If data are MCAR, you expect the boxplots to be the same (hard to evaluate in this small sample).

--

Let's look at the margin plot for the two continuous variables `bmi` and `chl`.

---

## Visualizing patterns of missing data

```r
marginplot(nhanes2[, c("chl", "bmi")], col = c("lightblue3", "darkred"),
           cex.numbers = 1.2, pch = 19)
```

<img src="17-missing-data-II_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />
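---

## Visualizing patterns of missing data

Before interpreting the numbers printed in red in the margins of the plot, note that we can compute the corresponding counts directly from the data, as a quick check:

```r
sum(is.na(nhanes2$bmi))                       # cases with bmi missing
sum(is.na(nhanes2$chl))                       # cases with chl missing
sum(is.na(nhanes2$bmi) & is.na(nhanes2$chl))  # cases with both missing
```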
---

## Visualizing patterns of missing data

- Interpretation of the numbers in red:

--

  + 9 = number of observations with missingness in `bmi`;
  + 10 = number of observations with missingness in `chl`;
  + 7 = number of observations with missingness in both `bmi` and `chl`.

--

- The scatterplot of blue points displays the relationship between `bmi` and `chl` when both are observed (13 cases).

--

- The red points indicate the amount of data used to generate the red boxplots.

---

## Mice in R

We will use the .hlight[mice] function to generate 10 imputed datasets. By default, .hlight[mice] uses

- .hlight[pmm]: predictive mean matching for numeric data;

- .hlight[logreg]: logistic regression for factor data with 2 levels;

- .hlight[polyreg]: multinomial logistic regression for factor data with > 2 levels;

- .hlight[polr]: proportional odds model for factor data with > 2 ordered levels.

---

## Mice in R

Other commonly used methods are

- .hlight[norm]: Bayesian linear regression;

- .hlight[sample]: random sample from the observed values;

- .hlight[cart]: classification and regression trees;

- .hlight[rf]: random forest.

Personally, I prefer to use .hlight[norm] instead of .hlight[pmm] for imputing numeric/continuous variables. For the illustration,

```r
nhanes2_imp <- mice(nhanes2, m = 10,
                    defaultMethod = c("norm", "logreg", "polyreg", "polr"),
                    print = F)
```

---

## Mice in R

```r
methods(mice)
```

```
## Warning in .S3methods(generic.function, class, envir): function 'mice'
## appears not to be S3 generic; found functions that look like S3 methods
```

```
##  [1] mice.impute.2l.bin       mice.impute.2l.lmer
##  [3] mice.impute.2l.norm      mice.impute.2l.pan
##  [5] mice.impute.2lonly.mean  mice.impute.2lonly.norm
##  [7] mice.impute.2lonly.pmm   mice.impute.cart
##  [9] mice.impute.jomoImpute   mice.impute.lda
## [11] mice.impute.logreg       mice.impute.logreg.boot
## [13] mice.impute.mean         mice.impute.midastouch
## [15] mice.impute.norm         mice.impute.norm.boot
## [17] mice.impute.norm.nob     mice.impute.norm.predict
## [19] mice.impute.panImpute    mice.impute.passive
## [21] mice.impute.pmm          mice.impute.polr
## [23] mice.impute.polyreg      mice.impute.quadratic
## [25] mice.impute.rf           mice.impute.ri
## [27] mice.impute.sample       mice.mids
## [29] mice.theme
## see '?methods' for accessing help and source code
```

---

## Predictive mean matching (pmm)

- Suppose `\(y\)` is subject to missing values while `\(x\)` is completely observed. The basic idea of pmm is:

--

  + Using the complete cases, regress `\(y\)` on `\(x\)`, obtaining `\(\hat{\beta} = (\hat{\beta}_0,\hat{\beta}_1)\)`;

--

  + Draw a new `\(\beta^\star\)` from the posterior distribution of `\(\beta\)` (e.g., a multivariate normal centered at `\(\hat{\beta}\)`);

--

  + Using `\(\beta^\star\)`, generate predicted values of `\(y\)` for all cases;

--

  + For each case with a missing `\(y\)`, identify a set of donors with no missing values whose predicted `\(y\)` values are close to that of the case with missing data;

--

  + From among these donors, randomly select one and assign its observed value of `\(y\)` as the imputed value;

--

  + Repeat for all observations and imputed datasets.

--

- PMM preserves the distribution of the original observed variable, since imputed values are taken from the real data (see the sketch on the next slide).
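---

## Predictive mean matching (pmm)

To make the steps concrete, here is a minimal sketch for a single numeric `y` with one fully observed predictor `x`. The function name `pmm_sketch` and the donor pool size `k` are illustrative; mice's actual .hlight[pmm] implementation is more refined.

```r
library(MASS)  # for mvrnorm

pmm_sketch <- function(y, x, k = 5) {
  obs <- !is.na(y)
  fit <- lm(y[obs] ~ x[obs])                     # regression on complete cases
  beta_star <- mvrnorm(1, coef(fit), vcov(fit))  # approximate posterior draw
  pred <- cbind(1, x) %*% beta_star              # predictions for all cases
  for (i in which(!obs)) {
    d <- abs(pred[obs] - pred[i])                # distance to observed cases
    donors <- order(d)[1:k]                      # k closest donors
    y[i] <- sample(y[obs][donors], 1)            # impute an observed value
  }
  y
}
```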
---

## Mice in R

Back to the `nhanes2_imp` object. First, look at the original data:

```r
nhanes2
```

```
##      age  bmi  hyp chl
## 1  20-39   NA <NA>  NA
## 2  40-59 22.7   no 187
## 3  20-39   NA   no 187
## 4  60-99   NA <NA>  NA
## 5  20-39 20.4   no 113
## 6  60-99   NA <NA> 184
## 7  20-39 22.5   no 118
## 8  20-39 30.1   no 187
## 9  40-59 22.0   no 238
## 10 40-59   NA <NA>  NA
## 11 20-39   NA <NA>  NA
## 12 40-59   NA <NA>  NA
## 13 60-99 21.7   no 206
## 14 40-59 28.7  yes 204
## 15 20-39 29.6   no  NA
## 16 20-39   NA <NA>  NA
## 17 60-99 27.2  yes 284
## 18 40-59 26.3  yes 199
## 19 20-39 35.3   no 218
## 20 60-99 25.5  yes  NA
## 21 20-39   NA <NA>  NA
## 22 20-39 33.2   no 229
## 23 20-39 27.5   no 131
## 24 60-99 24.9   no  NA
## 25 40-59 27.4   no 186
```

---

## Mice in R

Look at the first imputed dataset:

```r
d1 <- complete(nhanes2_imp, 1); d1
```

```
##      age      bmi hyp      chl
## 1  20-39 30.34098  no 224.2285
## 2  40-59 22.70000  no 187.0000
## 3  20-39 27.29756  no 187.0000
## 4  60-99 35.61975 yes 297.0469
## 5  20-39 20.40000  no 113.0000
## 6  60-99 18.16587 yes 184.0000
## 7  20-39 22.50000  no 118.0000
## 8  20-39 30.10000  no 187.0000
## 9  40-59 22.00000  no 238.0000
## 10 40-59 27.84743  no 210.5014
## 11 20-39 27.24996  no 146.9218
## 12 40-59 28.43579  no 226.0825
## 13 60-99 21.70000  no 206.0000
## 14 40-59 28.70000 yes 204.0000
## 15 20-39 29.60000  no 208.7224
## 16 20-39 36.73795  no 230.7358
## 17 60-99 27.20000 yes 284.0000
## 18 40-59 26.30000 yes 199.0000
## 19 20-39 35.30000  no 218.0000
## 20 60-99 25.50000 yes 257.7126
## 21 20-39 22.42268  no 127.4948
## 22 20-39 33.20000  no 229.0000
## 23 20-39 27.50000  no 131.0000
## 24 60-99 24.90000  no 283.3828
## 25 40-59 27.40000  no 186.0000
```

---

## Mice in R

Look at the last imputed dataset:

```r
d10 <- complete(nhanes2_imp, 10); d10
```

```
##      age      bmi hyp       chl
## 1  20-39 26.59731  no 115.34762
## 2  40-59 22.70000  no 187.00000
## 3  20-39 22.42548  no 187.00000
## 4  60-99 25.20745 yes 268.22285
## 5  20-39 20.40000  no 113.00000
## 6  60-99 22.29470  no 184.00000
## 7  20-39 22.50000  no 118.00000
## 8  20-39 30.10000  no 187.00000
## 9  40-59 22.00000  no 238.00000
## 10 40-59 25.63055 yes 198.06595
## 11 20-39 17.93695  no  90.22238
## 12 40-59 21.62430 yes 209.92840
## 13 60-99 21.70000  no 206.00000
## 14 40-59 28.70000 yes 204.00000
## 15 20-39 29.60000  no 161.64022
## 16 20-39 18.67313  no  84.17346
## 17 60-99 27.20000 yes 284.00000
## 18 40-59 26.30000 yes 199.00000
## 19 20-39 35.30000  no 218.00000
## 20 60-99 25.50000 yes 265.20195
## 21 20-39 25.22149  no 144.88429
## 22 20-39 33.20000  no 229.00000
## 23 20-39 27.50000  no 131.00000
## 24 60-99 24.90000  no 248.27329
## 25 40-59 27.40000  no 186.00000
```

---

## Illustration

Let's plot the imputed and observed values for the continuous variables.

```r
stripplot(nhanes2_imp, col = c("grey", "darkred"), pch = c(1, 20))
```

<img src="17-missing-data-II_files/figure-html/unnamed-chunk-14-1.png" style="display: block; margin: auto;" />

Grey dots are observed values and red dots are imputed values.

---

## Illustration

Let's see how this changes when we use `pmm` instead of `norm`.

```r
nhanes2_imp2 <- mice(nhanes2, m = 10,
                     defaultMethod = c("pmm", "logreg", "polyreg", "polr"),
                     print = F)
stripplot(nhanes2_imp2, col = c("grey", "darkred"), pch = c(1, 20))
```

<img src="17-missing-data-II_files/figure-html/unnamed-chunk-15-1.png" style="display: block; margin: auto;" />

Easy to see that the distribution of the original observed data is preserved.
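---

## Illustration

We can also compare the two sets of imputations numerically: a .hlight[mids] object stores the imputed values in its `imp` component, with one column per imputed dataset. Below we look at just the first of the 10 sets of `bmi` imputations under each method.

```r
summary(nhanes2_imp$imp$bmi[[1]])   # norm: draws can fall outside the observed range
summary(nhanes2_imp2$imp$bmi[[1]])  # pmm: imputations are always observed values
range(nhanes2$bmi, na.rm = TRUE)    # observed range, for reference
```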
---

## Illustration

We can also plot by the values of a categorical variable, say `bmi` by `age`.

Let's look at the imputations using `norm`:

```r
stripplot(nhanes2_imp, bmi ~ .imp | age, col = c("grey", "darkred"), pch = c(1, 20))
```

<img src="17-missing-data-II_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" />

---

## Illustration

Using `pmm` instead of `norm`, we have:

```r
stripplot(nhanes2_imp2, bmi ~ .imp | age, col = c("grey", "darkred"), pch = c(1, 20))
```

<img src="17-missing-data-II_files/figure-html/unnamed-chunk-17-1.png" style="display: block; margin: auto;" />

Going forward, let's focus only on the imputations using `norm`.

---

## Illustration

Scatterplot of `chl` and `bmi` for each imputed dataset. Here we can see why we should not use single imputations: the strength of the relationship can vary across the imputed datasets.

```r
xyplot(nhanes2_imp, bmi ~ chl | .imp, pch = c(1, 20), cex = 1.4, col = c("grey", "darkred"))
```

<img src="17-missing-data-II_files/figure-html/unnamed-chunk-18-1.png" style="display: block; margin: auto;" />

---

## Illustration

To detect interesting differences in distribution between the observed and imputed data, use the .hlight[densityplot] function.

```r
densityplot(nhanes2_imp)
```

<img src="17-missing-data-II_files/figure-html/unnamed-chunk-19-1.png" style="display: block; margin: auto;" />

---

## Illustration: using a single dataset

For model specification, e.g., choosing transformations, either look at the complete cases or use one of the completed datasets. For example, to use the first completed dataset in a regression of `bmi` on `age`, `hyp` and `chl`, use

```r
bmiregd1 <- lm(bmi ~ age + hyp + chl, data = d1)
summary(bmiregd1)
```

```
## 
## Call:
## lm(formula = bmi ~ age + hyp + chl, data = d1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.3936 -1.9368 -0.5314  2.0384  5.0614 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.94077    2.74697   4.347 0.000313 ***
## age40-59    -5.91646    1.50677  -3.927 0.000835 ***
## age60-99   -11.73873    2.13243  -5.505 2.18e-05 ***
## hypyes       2.43724    1.71224   1.423 0.170027    
## chl          0.09399    0.01483   6.338 3.48e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.845 on 20 degrees of freedom
## Multiple R-squared: 0.7062, Adjusted R-squared: 0.6474
## F-statistic: 12.02 on 4 and 20 DF, p-value: 3.866e-05
```

---

## Illustration: using a single dataset

- To check residuals, you can examine the fit of the model in one or more completed datasets.

- Any transformations will have to apply to all the datasets, so don't be too dataset-specific in your checks.

```r
plot(bmiregd1$residual, x = d1$chl, xlab = "Cholesterol", ylab = "Residual"); abline(0, 0)
```

<img src="17-missing-data-II_files/figure-html/unnamed-chunk-21-1.png" style="display: block; margin: auto;" />

Looks good!

---

## Illustration: using a single dataset

```r
boxplot(bmiregd1$residual ~ d1$age, xlab = "Age", ylab = "Residual")
```

<img src="17-missing-data-II_files/figure-html/unnamed-chunk-22-1.png" style="display: block; margin: auto;" />

Pretty reasonable, especially given the size of the dataset.

---

## Illustration: using a single dataset

```r
boxplot(bmiregd1$residual ~ d1$hyp, xlab = "Hypertension", ylab = "Residual")
```

<img src="17-missing-data-II_files/figure-html/unnamed-chunk-23-1.png" style="display: block; margin: auto;" />

- Good idea to repeat this for more than one completed dataset (see the sketch on the next slide).

- If you decide transformations are needed, you might reconsider the imputation models too and fit them with the transformed values.
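---

## Illustration: using a single dataset

A minimal sketch of repeating the residual check across a few more completed datasets; the choice of datasets 2 through 4 here is arbitrary.

```r
for (i in 2:4) {
  di  <- complete(nhanes2_imp, i)                 # i-th completed dataset
  fit <- lm(bmi ~ age + hyp + chl, data = di)
  plot(fit$residuals, x = di$chl, xlab = "Cholesterol",
       ylab = paste0("Residual (dataset ", i, ")"))
  abline(0, 0)
}
```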
---

## Illustration: using all `\(m\)` datasets

```r
bmireg_imp <- with(data = nhanes2_imp, lm(bmi ~ age + hyp + chl))

#results for the second dataset
bmireg_imp[[4]][[2]]
```

```
## 
## Call:
## lm(formula = bmi ~ age + hyp + chl)
## 
## Coefficients:
## (Intercept)     age40-59     age60-99       hypyes          chl
##    18.44372     -6.53144    -10.51157      1.41484      0.06152
```

```r
#results for the fifth dataset
bmireg_imp[[4]][[5]]
```

```
## 
## Call:
## lm(formula = bmi ~ age + hyp + chl)
## 
## Coefficients:
## (Intercept)     age40-59     age60-99       hypyes          chl
##    18.17017     -7.50997    -11.66474      2.31496      0.07016
```

---

## Illustration: using all `\(m\)` datasets

Now, to get the multiple imputation inferences based on the Rubin (1987) combining rules:

```r
bmireg <- pool(bmireg_imp)
summary(bmireg)
```

```
##                estimate  std.error  statistic        df     p.value
## (Intercept) 16.03269801 4.37392771  3.6655151  6.317986 0.009594146
## age40-59    -6.57207247 1.97111257 -3.3341944 11.653096 0.006179856
## age60-99   -10.83536538 2.90742365 -3.7267928  7.366405 0.006740526
## hypyes       2.04370148 2.35795119  0.8667276  8.976479 0.408661074
## chl          0.07394739 0.02428787  3.0446227  6.597552 0.020136806
```

---

## Illustration: using all `\(m\)` datasets

- You can still do a nested F test (well, technically a test that is asymptotically equivalent to a nested F test) for the multiply-imputed datasets, using the .hlight[pool.compare] function.

- For example, suppose we want to see if age is a useful predictor; then

```r
bmireg_imp <- with(data = nhanes2_imp, lm(bmi ~ hyp + chl + age))
bmireg_impnoage <- with(data = nhanes2_imp, lm(bmi ~ hyp + chl))

#type "pool.compare(bmireg_imp, bmireg_impnoage)" to see full results
pool.compare(bmireg_imp, bmireg_impnoage)[c(9:12, 18)]
```

```
## $qbar1
## (Intercept)      hypyes         chl    age40-59    age60-99
## 16.03269801  2.04370148  0.07394739 -6.57207247 -10.83536538
## 
## $qbar0
## (Intercept)      hypyes         chl
## 20.78380117 -0.88512898  0.03078513
## 
## $ubar1
## [1] 8.4661183053 3.2824280668 0.0002711616 2.7734500763 4.2665578296
## 
## $ubar0
## [1] 1.738379e+01 6.590985e+00 4.824345e-04
## 
## $pvalue
##              [,1]
## [1,] 1.654266e-05
```

---

## Illustration: using all `\(m\)` datasets

You can also fit logistic regressions. For example, to predict hypertension from all the other variables, do

```r
hyplogreg_imp <- with(data = nhanes2_imp, glm(hyp ~ bmi + chl + age, family = binomial))
```

```
## Warning: glm.fit: algorithm did not converge
```

```
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
```

```
## Warning: glm.fit: algorithm did not converge
```

```
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
```

```
## Warning: glm.fit: algorithm did not converge
```

```
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
```

This turns out to be problematic here because some of the logistic regressions have perfect predictions, that is, complete separation. The next slide shows one way to see where the separation comes from.
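---

## Illustration: using all `\(m\)` datasets

A quick way to diagnose the problem is to cross-tabulate `hyp` against `age` in one or more of the completed datasets; if an age group contains only "no" (or only "yes") values of `hyp`, the corresponding coefficient is not identified and the MLE diverges.

```r
d1 <- complete(nhanes2_imp, 1)  # d1 was defined earlier; recreated here
table(d1$age, d1$hyp)           # look for age groups that are all "no" or all "yes"
```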
---

## Illustration: using all `\(m\)` datasets

```r
hyplogreg <- pool(hyplogreg_imp)
summary(hyplogreg)
```

```
##                 estimate   std.error     statistic       df   p.value
## (Intercept) -5042.073708  788097.633 -6.397778e-03 18.25726 0.9949647
## bmi            90.415352   13854.295  6.526161e-03 18.25698 0.9948637
## chl             8.402631    1359.481  6.180766e-03 18.25763 0.9951355
## age40-59      996.882034  169469.486  5.882369e-03 18.25787 0.9953704
## age60-99      503.458547 8664479.986  5.810603e-05 18.25904 0.9999543
```

We do not have enough data to do a meaningful logistic regression here, unless we drop age as a predictor, but the command structure is fine!

---

## Modifying predictors in the imputation models

Going back to the imputed datasets: which variables does `mice()` use as predictors for the imputation of each incomplete variable?

```r
nhanes2_imp$predictorMatrix
```

```
##     age bmi hyp chl
## age   0   1   1   1
## bmi   1   0   1   1
## hyp   1   1   0   1
## chl   1   1   1   0
```

---

## Modifying predictors in the imputation models

We can choose to exclude variables from any of the imputation models. For example, suppose we think that `hyp` should not predict `bmi`. Then,

```r
pred <- nhanes2_imp$predictorMatrix
pred["bmi", "hyp"] <- 0
pred
```

```
##     age bmi hyp chl
## age   0   1   1   1
## bmi   1   0   0   1
## hyp   1   1   0   1
## chl   1   1   1   0
```

```r
mice(nhanes2, m = 10,
     defaultMethod = c("norm", "logreg", "polyreg", "polr"),
     predictorMatrix = pred)
```
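---

## Modifying predictors in the imputation models

In practice, assign the result so you can work with the new imputations, and confirm that the modified matrix was used. The object name `nhanes2_imp3` below is arbitrary.

```r
nhanes2_imp3 <- mice(nhanes2, m = 10,
                     defaultMethod = c("norm", "logreg", "polyreg", "polr"),
                     predictorMatrix = pred, print = F)
nhanes2_imp3$predictorMatrix  # the bmi row should now show a 0 under hyp
```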