This assignment involves logistic regression with multiple predictors. Please type your solutions using R Markdown, LaTeX, or any other word processor, but be sure to knit or convert the final output file to “.pdf”. Submissions should be made on Gradescope: go to Assignments \(\rightarrow\) Methods and Data Analysis 3.
MATERNAL SMOKING AND PRE-TERM BIRTH. This question is a continuation of the question on maternal smoking and birth weights from the last homework. Remember that the data file contained an indicator variable called Premature (gestational age < 270 days), which is just a recoding of gestational age. For this homework, use that as your outcome variable.
Our questions of interest are very similar to the last homework as well, except using the new outcome variable. The questions are as follows:
Analyze the data and investigate these questions using a logistic regression model. Also, do the following.
First build your model, then do model assessment and validation. You should only proceed to answer the questions when you are satisfied with your final model; you should answer all the questions using that final model.
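The workflow described above (fit, assess, validate, then interpret) might look like the following minimal sketch. It runs on simulated stand-in data rather than the real smoking.csv; the choice of predictors, the simulated coefficients, and the 0.5 classification cutoff are all assumptions made for illustration, not a recommended final model.

```r
# Simulated stand-in data; the real analysis would read smoking.csv instead.
set.seed(1)
n <- 500
dat <- data.frame(
  smoke   = rbinom(n, 1, 0.4),         # 1 = smoker, 0 = non-smoker
  mage    = round(rnorm(n, 27, 5)),    # mother's age in years
  mpregwt = round(rnorm(n, 128, 20))   # pre-pregnancy weight in pounds
)
# Assumed data-generating model, for illustration only
eta <- -1.5 + 0.4 * dat$smoke - 0.01 * (dat$mpregwt - 128)
dat$Premature <- rbinom(n, 1, plogis(eta))

# 1. Fit a candidate logistic regression
fit <- glm(Premature ~ smoke + mage + mpregwt, data = dat, family = binomial)

# 2. Assessment: change-in-deviance test against a model without smoke
fit0 <- glm(Premature ~ mage + mpregwt, data = dat, family = binomial)
dev_test <- anova(fit0, fit, test = "Chisq")

# 3. Validation: confusion matrix at an assumed 0.5 cutoff
p_hat <- fitted(fit)
pred  <- factor(as.integer(p_hat > 0.5), levels = c(0, 1))
obs   <- factor(dat$Premature, levels = c(0, 1))
conf_mat <- table(predicted = pred, observed = obs)
accuracy <- sum(diag(conf_mat)) / n

# 4. Interpretation: odds ratio and 95% Wald interval for smoking
or_smoke <- exp(cbind(OR = coef(fit), confint.default(fit)))["smoke", ]
```

With an outcome as unbalanced as pre-term birth, a 0.5 cutoff may classify almost every case as not premature, so a different cutoff or an ROC-based summary may be more informative.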
DO NOT INCLUDE R CODE OR OUTPUT IN YOUR REPORT! All R code must be included in the Appendix, and R outputs should be converted to nicely formatted tables. Feel free to use R tools such as kable (from the knitr package) and xtable. Also, you will be penalized should your report exceed 5 pages.
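As a rough sketch of turning raw model output into a formatted table, the example below builds a coefficient table from a fitted model and passes it to knitr's kable. The toy data, column labels, and caption are assumptions for illustration; the call is guarded so the snippet still runs if knitr is not installed.

```r
# Toy data for illustration; not the assignment data.
set.seed(2)
toy <- data.frame(x = rnorm(200))
toy$y <- rbinom(200, 1, plogis(-0.5 + 0.8 * toy$x))
fit <- glm(y ~ x, data = toy, family = binomial)

# Build a tidy coefficient table from the model summary
tab <- as.data.frame(coef(summary(fit)))
names(tab) <- c("Estimate", "Std. Error", "z value", "p-value")
tab$`Odds Ratio` <- exp(tab$Estimate)

# Render as a formatted table if knitr is available
if (requireNamespace("knitr", quietly = TRUE)) {
  print(knitr::kable(tab, digits = 3,
                     caption = "Logistic regression estimates"))
}
```

In an R Markdown report, setting the chunk option echo = FALSE keeps the code itself out of the knitted output while still showing the table.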
A FEW THINGS TO KEEP IN MIND, AS IN THE LAST HOMEWORK:
There are some complexities in this dataset to be aware of since this is the same dataset as the last homework. Some variables again have missing values. In particular, you will see from the .csv file that the height and weight of the father are missing quite frequently. This is typical in data on births: it is often difficult to get data about the fathers. I recommend that you not consider father’s height and weight when modeling. Some of the other variables have a few missing cases here and there. For this analysis, you can drop them from the modeling. This is not the ideal way to handle missing data in an analysis (we will learn better methods later in the course), but for now it will move the analysis forward. I strongly recommend that you make a data file that has complete observations on every single case for all the variables you are thinking about including in the model, and run the regression using that file. For example, I posted such a file on the Sakai site that excludes all of the variables on the fathers. You are welcome to use this file, or make your own if you want to use fathers’ data. The modified data can be found in the file “smoking.csv” on Sakai.
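For anyone building their own complete-case file, a minimal sketch is shown here on a tiny made-up data frame. It assumes missing values appear as the numeric codes listed in the code book (for example 99 = unknown); the raw file may instead use blanks, so check before recoding.

```r
# Tiny made-up data frame using the code book's missing-value codes
raw <- data.frame(
  gestation = c(284, 265, 290, 270, 301),
  parity    = c(1, 99, 0, 2, 3),       # 99 = unknown
  mrace     = c(0, 7, 99, 8, 1),       # 99 = unknown
  med       = c(2, 9, 4, 5, 1),        # 9 = unknown
  income    = c(3, 98, 5, 7, 2),       # 98 = unknown, 99 = not asked
  dht       = c(NA, 70, NA, 68, NA),   # father's height, often missing
  dwt       = c(NA, 180, NA, NA, 165)  # father's weight, often missing
)

# Recode the documented "unknown" codes to NA
raw$parity[raw$parity == 99] <- NA
raw$mrace[raw$mrace == 99]   <- NA
raw$med[raw$med == 9]        <- NA
raw$income[raw$income %in% c(98, 99)] <- NA

# Drop the fathers' variables, then keep only complete cases
keep <- setdiff(names(raw), c("dht", "dwt"))
complete <- raw[complete.cases(raw[, keep]), keep]
```

Dropping the fathers' columns before calling complete.cases matters: otherwise every row with a missing father's height or weight would be discarded too.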
The file contains an indicator variable for Premature (gestational age < 270 days), which is just a recoding of gestational age; we use that as our outcome variable. The data files also contain two other outcome variables: gestational age and birth weight. Both of these could be affected by smoking, so both are outcomes rather than predictors. It does not make sense scientifically to include one as a predictor of the other; the two variables happen simultaneously and hence are a bivariate outcome. For this analysis, we exclude birth weight from the modeling. Of course, one could do a separate regression for birth weight to see if smoking has an effect on birth weight, as in the last homework. Even better, one could treat birth weight and gestational age as a bivariate outcome and fit a regression model that predicts the bivariate outcome. This is a model we won’t have time to learn about in our course, but come find the instructor if you want to learn more.
The main file also includes information on the number of cigarettes smoked and about timing for mothers who quit smoking. For this analysis you do not have to use those variables, as we just compare smokers and non-smokers. Also, for this analysis, you can ignore the birth date variable, you can collapse education categories from 6-7 into one category for education = trade school, and you can also collapse race categories from 0 - 5 into one category for race = white.
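One possible way to do the collapsing described above is sketched below on made-up values. The level labels come from the code book; merging codes 6 and 7 into a separate "trade school" level (rather than also folding them into code 3) is an assumption about what the instruction intends.

```r
# Made-up values for illustration, using the code book's coding
d <- data.frame(
  mrace = c(0, 3, 5, 6, 7, 8, 9),
  med   = c(0, 2, 3, 6, 7, 5, 4)
)

# Race: collapse codes 0-5 into a single "white" level
race_code <- ifelse(d$mrace <= 5, 0, d$mrace)
d$mrace_f <- factor(race_code, levels = c(0, 6, 7, 8, 9),
                    labels = c("white", "mexican", "black", "asian", "mix"))

# Education: collapse codes 6 and 7 into a single "trade school" level
med_code <- ifelse(d$med %in% c(6, 7), 6, d$med)
d$med_f <- factor(med_code, levels = 0:6,
                  labels = c("less than 8th grade", "8th to 12th grade",
                             "high school graduate", "high school + trade school",
                             "high school + some college", "college graduate",
                             "trade school"))
```

Storing the results as factors means glm will create the indicator variables automatically, with the first level ("white", "less than 8th grade") as the baseline.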
Finally, regarding the fathers’ data, you might pay attention to correlation among the mothers’ and fathers’ values. For example, the mothers’ and fathers’ races might tend to be similar (use a “table” command to see the contingency table of the two races), in which case you have to be concerned about effects of multicollinearity if you want to include both mother’s and father’s races in the model.
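The contingency table mentioned above can be produced as in this small sketch. The race values are made up for illustration; with the real data you would pass the mrace and drace columns (after collapsing) instead.

```r
# Made-up, already-collapsed race values for illustration
lv <- c("white", "mexican", "black", "asian", "mix")
mrace <- factor(c("white", "white", "black", "asian",
                  "white", "black", "mexican", "white"), levels = lv)
drace <- factor(c("white", "white", "black", "asian",
                  "white", "black", "mexican", "black"), levels = lv)

# Cross-tabulate mothers' and fathers' races
race_tab <- table(mother = mrace, father = drace)
print(race_tab)

# Share of couples on the diagonal (same race for both parents)
agree <- sum(diag(race_tab)) / sum(race_tab)
```

If nearly all counts fall on the diagonal, the two variables carry almost the same information, and including both in the model would make their individual coefficients unstable.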
Code Book
Variable | Description |
---|---|
Id | id number |
birth | birth date where 1096 = January 1, 1961 |
gestation | length of gestation in days |
bwt | birth weight in ounces (999 = unknown) |
parity | total number of previous pregnancies, including fetal deaths and still births. (99=unknown) |
mrace | mother’s race or ethnicity 0-5=white 6=mexican 7=black 8=asian 9=mix 99=unknown |
mage | mother’s age in years at termination of pregnancy |
med | mother’s education 0 = less than 8th grade 1 = 8th to 12th grade. did not graduate high school 2 = high school graduate, no other schooling 3 = high school graduate + trade school 4 = high school graduate + some college 5 = college graduate 6,7 = trade school but unclear if graduated from high school 9 = unknown |
mht | mother’s height in inches |
mpregwt | mother’s pre-pregnancy weight in pounds |
drace | father’s race or ethnicity 0-5 = white 6 = mexican 7 = black 8 = asian 9 = mix |
dage | father’s age in years at termination of pregnancy |
ded | father’s education 0 = less than 8th grade 1 = 8th to 12th grade. did not graduate high school 2 = high school graduate, no other schooling 3 = high school graduate + trade school 4 = high school graduate + some college 5 = college graduate 6,7 = trade school but unclear if graduated from high school 9 = unknown |
dht | father’s height |
dwt | father’s pre-pregnancy weight in pounds |
marital | marital status of mother 1 = married 2 = legally separated 3 = divorced 4 = widowed 5 = never married |
income | family yearly income in 2500 increments. 0 = under 2500, 1 = 2500-4999, …, 9 = 15000+. 98=unknown, 99=not asked |
smoke | does mother smoke? 0 = never 1 = smokes now 2 = until preg 3 = once did, not now |
time | If mother quit, how long ago did she quit? 0 = never smoked, 1 = still smokes, 2 = quit during pregnancy, 3 = up to 1 yr ago, 4 = up to 2 yr ago, 5 = up to 3 yr ago, 6 = up to 4 yr ago, 7 = 5 to 9yr ago, 8 = 10+yr ago, 9 = quit and don’t know, 98 = unknown |
number | number of cigs smoked a day for past and current smokers 0 = never smoked 1 = 1-4 2 = 5-9 3 = 10-14 4 = 15-19 5 = 20-29 6 = 30-39 7 = 40-60 8 = 60+, 9 = smoke but don’t know |
Premature | 1 = baby born before gestational age of 270 days, and 0 = otherwise. This is a dichotomized function of the gestational age. We use it as the outcome variable. |
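As a quick check on the Premature definition in the code book, the recoding can be reproduced from gestation directly. The gestation values below are made up for illustration.

```r
# Made-up gestation lengths in days
gestation <- c(284, 269, 270, 255, 301)

# 1 if born before 270 days, 0 otherwise (270 itself is not premature)
Premature <- as.integer(gestation < 270)
print(Premature)
```

Because Premature is a deterministic function of gestation, gestation itself should not appear as a predictor in the model.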
20 points.