This assignment is due in class on Wednesday, September 20th.

General notes on homework assignments (also see syllabus for policies and suggestions):

  1. Don’t forget to replace the YOUR NAME GOES HERE at the top.
  2. For computer problems, include the output, followed by your interpretation, after the SOLUTION: text. Then knit the document to generate an html file, and print a hard copy to turn in.
  3. For handwritten problems, write your answers on a separate page. Write them in the order assigned, and staple them to the back of the computer printout.
  4. I strongly encourage you to read the questions as soon as you get the assignment. This will help you start thinking about how to solve them!
  5. If you have questions or get stuck, please don’t hesitate to come to office hours or email me! (However, note that if you email me within 24 hours of the due date, I may not have a chance to get back to you before the assignment is due.)

Steps to proceed:

  1. Download the file hw02.Rmd from the Homework page on the course website.
  2. Create a folder for this assignment on the RStudio server.
  3. Upload the Rmd file to the RStudio server.
  4. Enter your solutions.
  5. Knit the document to generate an html file.
  6. Print the html file and bring it to class to turn in.

As a reminder, the first 3 steps in this process are documented in the “Intro to R” lab, and there are screenshots in the html version of that lab on the course website. As always, let me know if you run into trouble.

PRACTICE PROBLEMS (not to be submitted)

These problems may be helpful for a midterm or final review.

SDM4 1.1, 1.3, 1.5, 1.15, 1.17, 1.21, 1.23, 1.29

SDM4 2.5, 2.19, 2.21, 2.39, 2.43

SDM4 3.17, 3.19, 3.33, 3.51, 3.55 (and other odd numbered problems)

SDM4 4.23, 4.27, 4.33, 4.47, 4.49, 4.51

PROBLEMS TO TURN IN:

Problem 1: SDM4 1.32 (Walking in circles)

People who get lost in the desert, mountains, or woods often seem to wander in circles rather than walk in straight lines. To see whether people naturally walk in circles in the absence of visual clues, researcher Andrea Axtell tested 32 people on a football field. One at a time, they stood at the center of one goal line, were blindfolded, and then tried to walk to the other goal line. She recorded each individual’s sex, height, handedness, the number of yards each was able to walk before going out of bounds, and whether each wandered off course to the left or the right. No one made it all the way to the far end of the field without crossing one of the sidelines. (Source: STATS No. 39, Winter 2004).

For this study, identify the W’s, name the variables, specify for each variable whether its use indicates that it should be treated as categorical or quantitative, and for any quantitative variable, identify the units in which it was measured (or note that the units were not provided).

SOLUTION:

Problem 2: (Movies by genre and rating – based on SDM4 2.26)

Here’s a table that classifies movies by genre and MPAA rating (note – you do not need to edit the R code below, and you will not be responsible for understanding the code used to create the genre_rating_tbl object below; it turns out that making a table with this format is unreasonably difficult to do in R):

movies <- read_csv("https://mhc-stat140-2017.github.io/data/sdm4/Movie_budgets.csv") %>%
  mutate(
    Rating = factor(Rating),
    Genre = factor(Genre)
  )
levels(movies$Rating) <- c("G", "PG", "PG13", "R")

genre_rating_tbl <- table(movies$Genre, movies$Rating) %>%
  sweep(c(1, 2), sum(.), "/") %>%
  addmargins(c(1, 2), list(Total = sum, Total = sum), quiet = TRUE)

print(xtable(genre_rating_tbl), type = "html")
           G     PG    PG13  R     Total
Action     0.00  0.00  0.10  0.07  0.17
Adventure  0.03  0.04  0.04  0.01  0.12
Comedy     0.02  0.10  0.17  0.03  0.32
Drama      0.00  0.02  0.07  0.14  0.23
Horror     0.00  0.00  0.07  0.04  0.11
Thriller   0.00  0.00  0.02  0.02  0.05
Total      0.05  0.17  0.47  0.32  1.00
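For anyone curious about how tables of proportions are built, here is a minimal sketch using base R's prop.table() and addmargins() on a small made-up data set. The toy data frame and its color and size variables are hypothetical and have nothing to do with the movies data; this is an illustration, not the code used to produce the table above.

```r
# Hypothetical toy data: 8 objects classified by color and size.
toy <- data.frame(
  color = c("red", "red", "red", "blue", "blue", "blue", "blue", "red"),
  size  = c("small", "large", "small", "small", "large", "large", "large", "small")
)

# Two-way table of counts.
counts <- table(toy$color, toy$size)

# Divide each cell by the grand total.
overall_props <- prop.table(counts)

# Divide each cell by its row total.
row_props <- prop.table(counts, margin = 1)

# Divide each cell by its column total.
col_props <- prop.table(counts, margin = 2)

# Append row and column totals to the table of overall proportions.
print(addmargins(overall_props))
print(row_props)
print(col_props)
```

Comparing the three printed tables side by side is a useful way to see how the choice of denominator changes the numbers.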

a) Based on looking at just the table in the knit html document, and not the R code, how can you tell that this table holds overall percentages (rather than row or column percentages)? State your answer in 1 sentence.

SOLUTION:

b) What was the most common genre/rating combination in this data set?

SOLUTION:

c) There were 120 movies in this data set. How many movies were G-rated? (I’m looking for an integer count, not a proportion. Calculate this based on the table in the knit html document, not by looking at the data file or writing any R code.)

SOLUTION:

d) What proportion of movies in the data set have a rating of PG13 or R?

SOLUTION:

Problem 3: (Twin births – based on SDM4 2.36)

In 2000, the Journal of the American Medical Association (JAMA) published a study that examined pregnancies that resulted in the birth of twins. Births were classified as preterm with intervention (induced labor or cesarean), preterm without procedures, or term/post-term. Researchers also classified the pregnancies by the level of prenatal medical care the mother received (inadequate, adequate, or intensive). The data, from the years 1995 – 1997, are summarized in the table below. Figures are in thousands of births (Source: JAMA 284 [2000]:335-341). (As before, you do not need to edit the R code below):

twin_births <- read_csv("https://mhc-stat140-2017.github.io/data/sdm4/Twin_births.csv") %>%
  rename(
    prenatal_care = `Level of Prenatal Care`,
    birth_type = `Birth Type`) %>%
  mutate(
    prenatal_care = factor(prenatal_care,
      levels = c("Intensive", "Adequate", "Inadequate")),
    birth_type = factor(birth_type,
      levels = c("Preterm (induced or cesarean)", "Preterm without procedures", "Term or post-term"))
  )

births_tbl <- table(twin_births$prenatal_care, twin_births$birth_type) %>%
  addmargins(c(1, 2), list(Total = sum, Total = sum))

Margins computed over dimensions in the following order: 1: 2:

print(xtable(births_tbl, digits = 0), type = "html")
            Preterm (induced or cesarean)  Preterm without procedures  Term or post-term  Total
Intensive   18                             15                          28                 61
Adequate    46                             43                          65                 154
Inadequate  12                             13                          38                 63
Total       76                             71                          131                278

a) Among the mothers in this study, what was the marginal distribution of the level of care they received during their pregnancies?

SOLUTION:

b) Among the mothers in this study, what was the conditional distribution of the birth type, given that the mother received inadequate medical care?

SOLUTION:

c) Create an appropriate graph comparing the outcomes of these pregnancies by the level of medical care the mother received. Your plot code should be based on the twin_births data frame, not on the births_tbl.

SOLUTION: Insert your R code in the chunk below:

# Your code goes here.

d) Write one or two sentences describing the association between these two variables.

SOLUTION:

Problem 4: SDM4 2.46 (Simpson’s Paradox)

Can you design a Simpson’s Paradox?

Two companies are vying for a city’s “Best Local Employer” award, to be given to the company most committed to hiring local residents. Although both employers hired 300 new people in the past year, Company A brags that it deserves the award because 70% of its new jobs went to local residents, compared to only 60% for Company B. Company B concedes that those percentages are correct, but points out that most of its new jobs were full-time, while most of Company A’s were part-time. Not only that, says Company B, but a higher percentage of its full-time jobs went to local residents than did Company A’s, and the same was true for part-time jobs. Thus, Company B argues, it’s a better local employer than Company A.

Show how it’s possible for Company B to fill a higher percentage of both full-time and part-time jobs with local residents, even though Company A hired more local residents overall.

You will get credit for this problem if you write down enough that you convince the grader that you’ve made a serious attempt. Getting a full solution will be helpful for your understanding, but please don’t spend more than an hour on this problem. In thinking about this problem, look back at the example we did in class about murder cases in Indiana.
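If it helps to see the arithmetic behind a Simpson’s Paradox in a different setting, here is a minimal sketch in base R. The two hospitals, the case types, and all of the counts are made up for illustration; they are not from the textbook or from the in-class example.

```r
# Hypothetical counts: successful outcomes at two hospitals, split by case type.
cases <- data.frame(
  hospital  = c("X", "X", "Y", "Y"),
  case_type = c("routine", "complex", "routine", "complex"),
  successes = c(170, 10, 45, 60),
  total     = c(200, 50, 50, 200)
)

# Success rate within each hospital and case type.
cases$rate <- cases$successes / cases$total

# Success rate for each hospital after combining the two case types.
overall <- aggregate(cbind(successes, total) ~ hospital, data = cases, FUN = sum)
overall$rate <- overall$successes / overall$total

print(cases)
print(overall)
```

In this made-up example the two hospitals handle very different mixes of routine and complex cases, which is what allows the within-group comparison and the combined comparison to point in opposite directions.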

SOLUTION:

Problem 5: Sugar Content of Cereal

A nutrition researcher collected data about common breakfast cereals, including the brand name and the sugar content (as a percentage of weight). The measurements of sugar content are loaded in the R chunk below.

cereals <- read_csv("https://mhc-stat140-2017.github.io/data/sdm4/Sugar_in_cereals.csv") %>%
  rename(
    brand = BRAND,
    sugar = `Sugar %`)
## Parsed with column specification:
## cols(
##   BRAND = col_character(),
##   `Sugar %` = col_double()
## )

a) Make an appropriate plot of the cereal sugar content.

SOLUTION:

# Your code goes here.

b) Based on the plot you created in part a), would it be more appropriate to summarize the center and spread of the distribution with the mean and standard deviation, or with the median and inter-quartile range?

SOLUTION:

c) Use the appropriate functions to calculate the summary statistics you chose for part b).

SOLUTION:

# Your code goes here.

d) Describe the sample size plus center, spread, shape, and any unusual features of this distribution, providing only a single measure of center and a single measure of spread. Be sure to provide an interpretation in the context of the problem (and don’t forget to specify units).

SOLUTION:

Problem 6: Vineyards in New York

A researcher collected information about vineyards in the Finger Lakes region of New York, including the number of acres of land held by each vineyard. The measurements of vineyard size are loaded in the R chunk below.

vineyards <- read_csv("https://mhc-stat140-2017.github.io/data/sdm4/Vineyards.csv")
## Parsed with column specification:
## cols(
##   Winery = col_character(),
##   Acres = col_integer()
## )

a) Make an appropriate plot of the vineyard acreage.

SOLUTION:

# Your code goes here.

b) Based on the plot you created in part a), would it be more appropriate to summarize the center and spread of the distribution with the mean and standard deviation, or with the median and inter-quartile range?

SOLUTION:

c) Use the appropriate functions to calculate the summary statistics you chose for part b).

SOLUTION:

# Your code goes here.

d) Describe the sample size plus center, spread, shape, and any unusual features of this distribution, providing only a single measure of center and a single measure of spread. Be sure to provide an interpretation in the context of the problem (and don’t forget to specify units).

SOLUTION:

Problem 7: (Emails – based on SDM4 3.20)

A university teacher saved every email received from students in a large Introductory Statistics class during an entire term. He then counted, for each student who had sent him at least one email, how many emails that student had sent. The R code below reads in the data:

emails <- read_csv("https://mhc-stat140-2017.github.io/data/sdm4/E-mails.csv") %>%
  rename(num_emails = `Number_of_E-mails`)
## Parsed with column specification:
## cols(
##   `Number_of_E-mails` = col_integer()
## )

a) Make an appropriate plot of the number of emails received from each student.

SOLUTION:

# Your code goes here.

b) Without doing any calculations, would you expect the mean or median to be larger? Explain why.

SOLUTION:

c) Write some R code to verify that your answer to part b) is correct by calculating the mean and the median.

SOLUTION:

# Your code goes here.

d) Would the mean or median be more appropriate for describing this distribution? Why?

SOLUTION:

e) Describe the sample size plus center, spread, shape, and any unusual features of this distribution.

SOLUTION:

Problem 8: (Ozone, based on SDM4 4.30)

Ozone levels (in parts per billion, ppb) were recorded at sites in New Jersey monthly between 1926 and 1971. The R code below reads these data in.

ozone <- read_csv("https://mhc-stat140-2017.github.io/data/sdm4/Ozone.csv")
## Parsed with column specification:
## cols(
##   Ozone = col_character(),
##   Year = col_integer(),
##   Month = col_integer()
## )

a) Make a plot showing side-by-side box plots of the data for each month, with the ozone measurement on the vertical axis and the month on the horizontal axis.
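As a reminder of the general pattern for side-by-side box plots, here is a minimal sketch with ggplot2 on simulated data. The toy data frame and its month and value columns are hypothetical; the key assumption is that the grouping variable is stored as a number, so it is wrapped in factor() to get one box per group.

```r
library(ggplot2)

# Hypothetical simulated data: 10 measurements in each of 3 numbered months.
set.seed(1)
toy <- data.frame(
  month = rep(1:3, each = 10),
  value = rnorm(30, mean = rep(c(10, 15, 20), each = 10))
)

# factor(month) makes ggplot2 treat the numeric month as a grouping variable.
p <- ggplot(toy, aes(x = factor(month), y = value)) +
  geom_boxplot() +
  labs(x = "Month", y = "Value")

print(p)
```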

SOLUTION:

# Your code goes here.

b) In what month was the highest ozone level recorded?

SOLUTION:

c) Which month has the largest IQR?

SOLUTION:

d) Which month has the smallest range?

SOLUTION:

e) Write a brief comparison (2 or 3 sentences) of the ozone levels in January and June.

SOLUTION:

f) Write a brief report (2 or 3 sentences) on the annual patterns you see in ozone levels.

SOLUTION:

Problem 9: SDM4 4.40 (Cloud seeding)

In an experiment to determine whether seeding clouds with silver iodide increases rainfall, 52 clouds were randomly assigned to be seeded or not. The amount of rain they generated was then measured (in acre-feet). The R code below loads the data:

cloud_seeding <- read_csv("https://mhc-stat140-2017.github.io/data/sdm4/Cloud_seeding.csv") %>%
  gather("treatment", "rainfall_amount", Unseeded_Clouds, Seeded_Clouds) %>%
  mutate(
    treatment = factor(treatment)
  )
## Parsed with column specification:
## cols(
##   Unseeded_Clouds = col_double(),
##   Seeded_Clouds = col_double()
## )
levels(cloud_seeding$treatment) <- c("seeded", "unseeded")

a) Make a density plot to compare the distribution of rainfall amounts for seeded clouds and unseeded clouds. You can use different colors or faceting to distinguish the treatment types.
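The two options mentioned above (colors or facets) can be sketched as follows with ggplot2 on simulated data. The toy data frame and its group and value columns are hypothetical and are not the cloud seeding data.

```r
library(ggplot2)

# Hypothetical simulated data: 50 values in each of two groups.
set.seed(42)
toy <- data.frame(
  group = rep(c("control", "treated"), each = 50),
  value = c(rexp(50, rate = 1), rexp(50, rate = 0.5))
)

# Option 1: overlay the two density curves, distinguished by color.
p_color <- ggplot(toy, aes(x = value, color = group)) +
  geom_density()

# Option 2: one panel per group, stacked vertically so the x axes line up.
p_facet <- ggplot(toy, aes(x = value)) +
  geom_density() +
  facet_wrap(~ group, ncol = 1)

print(p_color)
print(p_facet)
```

Stacking the facets in a single column keeps the horizontal axes aligned, which tends to make the two distributions easier to compare.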

SOLUTION:

# Your code goes here.

b) Based on the plot you created in part a), would it be more appropriate to compare the distributions of rainfall under the two treatments with the mean and standard deviation, or with the median, quartiles, and inter-quartile range?

SOLUTION:

c) Calculate the statistics you chose in part b) separately for each level of the treatment variable.

Hint: The best way to do this is by using group_by() to group by the treatment, and then using the summarize() function to calculate the chosen statistics.
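The pattern described in the hint might look like the following minimal sketch, which uses dplyr on a small made-up data frame. The toy data and the column names (group, value, and the names of the summary columns) are hypothetical; it computes several statistics at once only to show the syntax.

```r
library(dplyr)

# Hypothetical toy data: three values in each of two groups.
toy <- data.frame(
  group = c("a", "a", "a", "b", "b", "b"),
  value = c(1, 2, 3, 10, 20, 60)
)

# group_by() splits the rows by group; summarize() then returns one row per group.
toy_summary <- toy %>%
  group_by(group) %>%
  summarize(
    mean_value   = mean(value),
    sd_value     = sd(value),
    median_value = median(value),
    iqr_value    = IQR(value)
  )

print(toy_summary)
```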

SOLUTION:

# Your code goes here.

d) Do you see any evidence that seeding clouds may be effective?

SOLUTION: