This assignment is due at the of class on Wednesday, September 20th.
General notes on homework assignments (also see syllabus for policies and suggestions):
Steps to proceed:
hw02.Rmd from the Homework page on the course website.
As a reminder, the first 3 steps in this process are documented in the “Intro to R” lab, and there are screenshots in the html version of that lab on the course website. As always, let me know if you run into trouble.
These problems may be helpful for a midterm or final review.
SDM4 1.1, 1.3, 1.5, 1.15, 1.17, 1.21, 1.23, 1.29
SDM4 2.5, 2.19, 2.21, 2.39, 2.43
SDM4 3.17, 3.19, 3.33, 3.51, 3.55 (and other odd numbered problems)
SDM4 4.23, 4.27, 4.33, 4.47, 4.49, 4.51
People who get lost in the desert, mountains, or woods often seem to wander in circles rather than walk in straight lines. To see whether people naturally walk in circles in the absence of visual clues, researcher Andrea Axtell tested 32 people on a football field. One at a time, they stood at the center of one goal line, were blindfolded, and then tried to walk to the other goal line. She recorded each individual’s sex, height, handedness, the number of yards each was able to walk before going out of bounds, and whether each wandered off course to the left or the right. No one made it all the way to the far end of the field without crossing one of the sidelines. (Source: STATS No. 39, Winter 2004).
For this study, identify the W’s, name the variables, specify for each variable whether its use indicates that it should be treated as categorical or quantitative, and for any quantitative variable, identify the units in which it was measured (or note that the units were not provided).
SOLUTION:
Here’s a table that classifies movies by genre and MPAA rating (note: you do not need to edit the R code below, and you will not be responsible for understanding the code used to create the genre_rating_tbl object below; it turns out that making a table with this format is unreasonably difficult to do in R):
movies <- read_csv("https://mhc-stat140-2017.github.io/data/sdm4/Movie_budgets.csv") %>%
  mutate(
    Rating = factor(Rating),
    Genre = factor(Genre)
  )
levels(movies$Rating) <- c("G", "PG", "PG13", "R")
genre_rating_tbl <- table(movies$Genre, movies$Rating) %>%
  sweep(c(1, 2), sum(.), "/") %>%
  addmargins(c(1, 2), list(Total = sum, Total = sum), quiet = TRUE)
print(xtable(genre_rating_tbl), type = "html")
Genre | G | PG | PG13 | R | Total |
---|---|---|---|---|---|
Action | 0.00 | 0.00 | 0.10 | 0.07 | 0.17 |
Adventure | 0.03 | 0.04 | 0.04 | 0.01 | 0.12 |
Comedy | 0.02 | 0.10 | 0.17 | 0.03 | 0.32 |
Drama | 0.00 | 0.02 | 0.07 | 0.14 | 0.23 |
Horror | 0.00 | 0.00 | 0.07 | 0.04 | 0.11 |
Thriller | 0.00 | 0.00 | 0.02 | 0.02 | 0.05 |
Total | 0.05 | 0.17 | 0.47 | 0.32 | 1.00 |
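For anyone curious about what the table above represents, here is a minimal base-R sketch on a small made-up data set. The `toy` data frame and its values are hypothetical and are not the movie data. The sketch divides each cell count by the grand total to get joint proportions and then appends row and column totals, which is the same idea as the sweep() and addmargins() steps above. You are still not responsible for this code.

```r
# Hypothetical toy data: six made-up movies, not the course data set.
toy <- data.frame(
  genre  = c("Comedy", "Comedy", "Drama", "Drama", "Drama", "Action"),
  rating = c("PG", "PG13", "R", "R", "PG13", "R")
)

# Cell counts for every genre/rating combination.
counts <- table(toy$genre, toy$rating)

# Divide each count by the grand total to get joint proportions.
joint <- prop.table(counts)

# Append a "Sum" row and column holding the marginal proportions.
joint_with_totals <- addmargins(joint)

print(round(joint_with_totals, 2))
```

The entry in the bottom-right corner is always 1, because the joint proportions across all cells add up to the whole data set.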
SOLUTION:
SOLUTION:
SOLUTION:
SOLUTION:
In 2000, the Journal of the American Medical Association (JAMA) published a study that examined pregnancies that resulted in the birth of twins. Births were classified as preterm with intervention (induced labor or cesarean), preterm without procedures, or term/post-term. Researchers also classified the pregnancies by the level of prenatal medical care the mother received (inadequate, adequate, or intensive). The data, from the years 1995 – 1997, are summarized in the table below. Figures are in thousands of births (Source: JAMA 284 [2000]:335-341). (As before, you do not need to edit the R code below):
twin_births <- read_csv("https://mhc-stat140-2017.github.io/data/sdm4/Twin_births.csv") %>%
  rename(
    prenatal_care = `Level of Prenatal Care`,
    birth_type = `Birth Type`) %>%
  mutate(
    prenatal_care = factor(prenatal_care,
      levels = c("Intensive", "Adequate", "Inadequate")),
    birth_type = factor(birth_type,
      levels = c("Preterm (induced or cesarean)", "Preterm without procedures", "Term or post-term"))
  )
births_tbl <- table(twin_births$prenatal_care, twin_births$birth_type) %>%
  addmargins(c(1, 2), list(Total = sum, Total = sum))
Margins computed over dimensions in the following order: 1: 2:
print(xtable(births_tbl, digits = 0), type = "html")
Prenatal care | Preterm (induced or cesarean) | Preterm without procedures | Term or post-term | Total |
---|---|---|---|---|
Intensive | 18 | 15 | 28 | 61 |
Adequate | 46 | 43 | 65 | 154 |
Inadequate | 12 | 13 | 38 | 63 |
Total | 76 | 71 | 131 | 278 |
SOLUTION:
SOLUTION:
SOLUTION: Insert your R code in the chunk below:
# Your code goes here.
SOLUTION:
Can you design a Simpson’s Paradox?
Two companies are vying for a city’s “Best Local Employer” award, to be given to the company most committed to hiring local residents. Although both employers hired 300 new people in the past year, Company A brags that it deserves the award because 70% of its new jobs went to local residents, compared to only 60% for Company B. Company B concedes that those percentages are correct, but points out that most of its new jobs were full-time, while most of Company A’s were part-time. Not only that, says Company B, but a higher percentage of its full-time jobs went to local residents than did Company A’s, and the same was true for part-time jobs. Thus, Company B argues, it’s a better local employer than Company A.
Show how it’s possible for Company B to fill a higher percentage of both full-time and part-time jobs with local residents, even though Company A hired more local residents overall.
You will get credit for this problem if you write down enough that you convince the grader that you’ve made a serious attempt. Getting a full solution will be helpful for your understanding, but please don’t spend more than an hour on this problem. In thinking about this problem, look back at the example we did in class about murder cases in Indiana.
SOLUTION:
A nutrition researcher collected data about common breakfast cereals, including the brand name and the sugar content (as a percentage of weight). The measurements of sugar content are loaded in the R chunk below.
cereals <- read_csv("https://mhc-stat140-2017.github.io/data/sdm4/Sugar_in_cereals.csv") %>%
  rename(
    brand = BRAND,
    sugar = `Sugar %`)
## Parsed with column specification:
## cols(
## BRAND = col_character(),
## `Sugar %` = col_double()
## )
SOLUTION:
# Your code goes here.
SOLUTION:
SOLUTION:
# Your code goes here.
SOLUTION:
A researcher collected information about vineyards in the Finger Lakes region of New York, including the number of acres of land held by each vineyard. The measurements of vineyard size are loaded in the R chunk below.
vineyards <- read_csv("https://mhc-stat140-2017.github.io/data/sdm4/Vineyards.csv")
## Parsed with column specification:
## cols(
## Winery = col_character(),
## Acres = col_integer()
## )
SOLUTION:
# Your code goes here.
SOLUTION:
SOLUTION:
# Your code goes here.
SOLUTION:
A university teacher saved every e-mail received from students in a large Introductory Statistics class during an entire term. He then counted, for each student who had sent him at least one e-mail, how many e-mails that student had sent. The R code below reads in the data:
emails <- read_csv("https://mhc-stat140-2017.github.io/data/sdm4/E-mails.csv") %>%
  rename(num_emails = `Number_of_E-mails`)
## Parsed with column specification:
## cols(
## `Number_of_E-mails` = col_integer()
## )
SOLUTION:
# Your code goes here.
SOLUTION:
SOLUTION:
# Your code goes here.
SOLUTION:
SOLUTION:
Ozone levels (in parts per billion, ppb) were recorded at sites in New Jersey monthly between 1926 and 1971. The R code below reads these data in.
ozone <- read_csv("https://mhc-stat140-2017.github.io/data/sdm4/Ozone.csv")
## Parsed with column specification:
## cols(
## Ozone = col_character(),
## Year = col_integer(),
## Month = col_integer()
## )
SOLUTION:
# Your code goes here.
SOLUTION:
SOLUTION:
SOLUTION:
SOLUTION:
SOLUTION:
In an experiment to determine whether seeding clouds with silver iodide increases rainfall, 52 clouds were randomly assigned to be seeded or not. The amount of rain they generated was then measured (in acre-feet). The R code below loads the data:
cloud_seeding <- read_csv("https://mhc-stat140-2017.github.io/data/sdm4/Cloud_seeding.csv") %>%
  gather("treatment", "rainfall_amount", Unseeded_Clouds, Seeded_Clouds) %>%
  mutate(
    treatment = factor(treatment)
  )
## Parsed with column specification:
## cols(
## Unseeded_Clouds = col_double(),
## Seeded_Clouds = col_double()
## )
levels(cloud_seeding$treatment) <- c("seeded", "unseeded")
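The gather() step above reshapes the data from one column per treatment into one row per cloud. As a rough illustration of what that reshaping does, here is a small sketch on made-up numbers. The `wide` data frame, its column names, and its values are hypothetical and are not the cloud seeding data.

```r
library(tidyr)

# Hypothetical wide data: one column per treatment group (made-up values).
wide <- data.frame(
  Unseeded = c(1.0, 2.0, 3.0),
  Seeded   = c(4.0, 5.0, 6.0)
)

# gather() stacks the two columns into key/value pairs:
# "treatment" records which column a value came from,
# "rainfall" holds the value itself.
long <- gather(wide, "treatment", "rainfall", Unseeded, Seeded)

print(long)
```

The result has one row per measurement and a single categorical column identifying the group, which is the layout that grouping and plotting functions generally expect. Newer versions of tidyr recommend pivot_longer() for this kind of reshaping, but gather() still works.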
SOLUTION:
# Your code goes here.
SOLUTION:
treatment variable. Hint: The best way to do this is by using group_by() to group by the treatment, and then using the summarize() function to calculate the chosen statistics.
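If the group_by() and summarize() pattern in the hint is unfamiliar, the sketch below shows its general shape on a made-up data set. The `toy` data frame, its column names, and the particular statistics are hypothetical choices for illustration only; which statistics suit the rainfall data is for you to decide.

```r
library(dplyr)

# Hypothetical toy data (made-up values, not the cloud seeding data).
toy <- data.frame(
  group = c("a", "a", "a", "b", "b", "b"),
  value = c(1, 2, 3, 10, 20, 60)
)

# Group by the categorical variable, then compute one row of
# summary statistics per group.
group_summaries <- toy %>%
  group_by(group) %>%
  summarize(
    mean_value   = mean(value),
    median_value = median(value),
    sd_value     = sd(value)
  )

print(group_summaries)
```

The output has one row per group and one column per statistic named inside summarize().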
SOLUTION:
# Your code goes here.
SOLUTION: