Due dates

  • In-class presentations: 10:00 - 11:20am, Friday, November 1.
  • Final reports: 11:59pm, Monday, November 4.

General instructions

Team work

You MUST work within your MIDS-assigned teams.

  • Each team member must work on understanding the data, exploring the data, building the models and providing answers to the questions of interest.
  • You are free to divide all other work any way you choose but be sure to divide the responsibilities fairly equally within teams. For example
    • One student could focus on cleaning the data,
    • Another student could focus on writing code,
    • Another student could focus on writing the report,
    • Another student could focus on the presentation slides, and so on.

GitHub

Each team MUST collaborate using GitHub. A blank repository has been created for this project. Follow this link: https://classroom.github.com/g/rfttmYjL to gain access. You should already know how to clone the repository locally once you gain access. The first student to accept the invitation within each team will be responsible for creating the team name. All other members of the team will be able to join the team once the first person has added the team name. You should only join your MIDS-assigned team. Each individual’s contributions will be assessed via the commits history of the team’s repository. Feel free to create other folders within the repository as needed but you must push your final reports and presentation slides to the corresponding folders already created for you.

Presentations

Each team will give a brief (6 minute) presentation of their findings in class.

  • Teams 1, 3, 5, 7, and 9 will only present findings and results for Part I, while Teams 2, 4, 6, 8, and 10 will only present findings and results for Part II.
  • The presentation for each team should only cover a brief introduction as well as your most interesting and important findings. At least one EDA plot must be included.
  • Create your presentation slides with the time limit in mind. You should consider making between 5 and 7 slides (excluding the title slide), so that you have approximately one minute to get through each one.
  • You are free to create your presentation slides using PowerPoint, LaTeX or any other application you choose.
  • The order of the presentations will be randomized for each part of the project; each team is expected to come to class fully prepared to present first.
  • Each team’s presentation must be pushed to the team’s repository by 9:00am on Friday, November 1.

Reports

Each team MUST turn in only one report with team members’ names at the top of the report.

  • Please limit your combined PDF document for both parts of the project to 10 pages in total. You will be penalized should your combined report exceed 10 pages!
  • Please type your reports using R Markdown, LaTeX or any other word processor but be sure to knit or convert the final output file to .pdf.
  • DO NOT INCLUDE R CODE OR OUTPUT IN YOUR REPORT! R outputs should be converted to nicely formatted tables. Feel free to use R packages such as kable and xtable.
  • All R-code must be included in the appendix. Feel free to also include any supplemental material that is important for your analysis, such as diagnostic checks or exploratory plots that you feel justify the conclusions in your report.
  • In addition to pushing the reports to GitHub, they should also be submitted on gradescope: go to Assignments \(\rightarrow\) Team Project 2. Gradescope will let you select your team mates when submitting, so make sure to do so. Only one person needs to submit the report.

Analyses

Your analyses MUST address the two sets of questions directly. The report should be written so that there is a section which clearly answers the first set of questions for Part I and another section which clearly answers the second set of questions about Part II. If you prefer, you can write two 5-paged reports, one to address each set of questions. Be sure to also include the following in your report:

  • the final models you ultimately decided to use,
  • clear model building, that is, justification for the models (e.g., why you chose certain transformations and why you decided the final models are reasonable),
  • model assessment for the final models,
  • the relevant regression output (includes: a table with coefficients and SEs, and confidence intervals (or credible intervals)),
  • your interpretation of the results in the context of the questions of interest, including clear and direct answers to the questions posed, and
  • any potential limitations of the analyses.

Part I: Estrogen Bioassay

This data is from: https://stat.duke.edu/resources/datasets/estrogen-bioassay

Introduction

Estrogens are a group of hormones produced in both the female ovaries and male testes, with larger amounts made in females than in males. They are particularly influential during puberty, menstruation, and pregnancy, but they also help regulate the growth of bones, skin, and other organs and tissues. In general, they have a strong effect of endocrine function by disrupting these functions.

Over the past 10 years, many synthetic compounds and plant products present in the environment have been found to affect hormonal functions in various ways. Those that have estrogenic activity have been labeled as environmental estrogens. There is increasing concern that chemicals in the environment referred to as environmental estrogens may be causing adverse effects through endocrine disruption. Hence, there is a need for new approaches for screening chemicals for endocrine disrupting effects.

The rat uterotrophic bioassay provides one approach for identifying agonists or antagonists of estrogen. An estrogen antagonist is a compound that blocks the binding of estrogen and so blocks the action of estrogen. An estrogen agonist is a compound that enhances the action of estrogen. Rats in this study are either immature or have their ovaries removed and therefore do not produce estrogen.

The point of the study is to use the rats as an assay to test the effect of estrogen agonists and antagonists on a particular hormonal response, the weight of the uterus. This is done by varying the amount of the agonist or antagonist give to the rat. The response is the weight of the uterus, with uterus weight expected to exhibit an increasing dose response trend for chemicals acting as estrogen agonists and with estrogen antagonists acting to block such estrogen effects. It is expected that the uterus gets heavier with the increase of estrogen agonist dose.

The basic design randomizes female rats to treatment groups, with groups consisting of a control group and several groups having increasing doses of the test agent. An international multi-laboratory study was conducted to compare the results of the rat uterotrophic bioassay using a known estrogen agonist (EE) and a known estrogen antagonist (ZM). The main goal of the study was to assess whether the results were consistent across the laboratories.

Data

The data for this part of the project can be found in the file bioassay.txt on Sakai.

Questions

Use a multi-level model to answer the following questions of interest.

  • Is the uterotrophic bioassay successful at identifying estrogenic effects of EE and anti-estrogenic effects of ZM. That is, after controlling for predictors and random effects, does uterus weight exhibit an increasing dose response trend for EE and a decreasing dose response trend for ZM?
  • Does the dose response vary across labs? If so, are there certain labs that appear to be outliers?
  • Do the protocols differ in their sensitivity to detecting EE and ZM effects? If so, is there one protocol that can be recommended?

Code Book

Variable Description
protocol: A = immature female rats dosed by oral gavage (3 days)
B = immature female rats dosed by injection (3 days)
C = adult ovariectomized female rats dosed by injection (3 days)
D = adult ovariectomized female rats dosed by injection (7 days)
uterus Uterus weight (mg)
weight Body weight of rat (g)
EE Dose of estrogen agonist, EE in mg/kg/day
ZM Dose of estrogen antagonist, ZM in mg/kg/day
lab Laboratory at which assay was conducted
group Lab replicate group (6 rats were used per group)

Part II: Voting in NC (2016 General Elections)

Introduction

The North Carolina State Board of Elections (NCSBE) is the agency charged with the administration of the elections process and campaign finance disclosure and compliance. Among other things, they provide voter registration and turnout data online (https://www.ncsbe.gov/index.html, https://www.ncsbe.gov/data-stats/other-election-related-data). Using the NC voter files for the general elections in November 2016, you will attempt to identify/estimate groups that voted in 2016 out of those who registered. Here’s an interesting read on turnout rates for NC in 2016: https://democracync.org/wp-content/uploads/2017/05/WhoVoted2016.pdf.

Data

The data for this part of the project can be found on Sakai. The file voter_stats_20161108.txt contains information about the aggregate number of registered voters by the demographic variables; the data dictionary can be found in the file DataDictionaryForVoterStats.txt. The file history_stats_20161108.txt contains information about the aggregate number of voters who actually voted by the demographic variables.

There are a few million rows in both datasets but you will only work with a subset of those. Take a random sample of 20 counties out of all the counties in both datasets. You should indicate the counties you sampled in your final report. You will need to merge the two files voter_stats_20161108.txt and history_stats_20161108.txt by the common variables for the counties you care about. Take a look at the set of join functions in the dplyr package in R (https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/join) or the merge function in base R. I recommend the functions in dplyr. You may choose to merge the datasets before or after selecting the samples you want, but be careful if you decide to do the latter.

Unfortunately, the data dictionary from the NCSBE does not provide the exact difference between the variables party_cd and voted_party_cd in the history_stats_20161108.txt file (if you are able to find documentation on the difference, do let me know). However, I suspect that the voted party code encodes the information about people who changed their party affiliation as at the registration deadline, whereas the first party code is everyone’s original affiliation. Voters are allowed to change their party affiliation in NC so that lines up. The two variables are thus very similar and only about 0.8% of the rows in the history_stats_20161108.txt file have different values for the two variables. I would suggest using the voted party code for the history_stats_20161108.txt dataset.

Questions

Use a multi-level model to answer the following questions of interest.

  • How did demographic subgroups vote in 2016? For example, how did the turnout for males compare to the turnout for females after controlling for other potential predictors?
  • Did the overall probability or odds of voting differ by county in 2016?
  • How did the turnout rates differ between females and males for the different party affiliations?

Grading

40 points: 20 points for each part.