STAT111 HW#2: SDM4 Chapters 1 – 4

General notes on homework assignments (also see syllabus for policies and suggestions):

As a reminder, the first 3 steps in this process are documented in the “Intro to R” lab, and there are screenshots in the html version of that lab on the course website. As always, let me know if you run into trouble.

PRACTICE PROBLEMS (not to be submitted)

These problems may be helpful for a midterm or final review.

SDM4 1.1, 1.3, 1.5, 1.15, 1.17, 1.21, 1.23, 1.29

SDM4 2.5, 2.19, 2.21, 2.39, 2.43

SDM4 3.17, 3.19, 3.33, 3.51, 3.55 (and other odd numbered problems)

SDM4 4.23, 4.27, 4.33, 4.47, 4.49, 4.51

PROBLEMS TO TURN IN:

Problem 1: SDM4 1.32 (Walking in circles)

People who get lost in the desert, mountains, or woods often seem to wander in circles rather than walk in straight lines. To see whether people naturally walk in circles in the absence of visual clues, researcher Andrea Axtell tested 32 people on a football field. One at a time, they stood at the center of one goal line, were blindfolded, and then tried to walk to the other goal line. She recorded each individual’s sex, height, handedness, the number of yards each was able to walk before going out of bounds, and whether each wandered off course to the left or the right. No one made it all the way to the far end of the field without crossing one of the sidelines. (Source: STATS No. 39, Winter 2004).

For this study, identify the W’s, name the variables, specify for each variable whether its use indicates that it should be treated as categorical or quantitative, and for any quantitative variable, identify the units in which it was measured (or note that the units were not provided).

SOLUTION:

Problem 2: (Movies by genre and rating – based on SDM4 2.26)

Here’s a table that classifies movies by genre and MPAA rating (note: you do not need to edit the R code below, and you will not be responsible for understanding the code used to create the genre_rating_tbl object; it turns out that making a table with this format is unreasonably difficult to do in R):

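# Packages used by the code chunks in this assignment
library(readr)   # read_csv()
library(dplyr)   # %>%, mutate(), rename()
library(tidyr)   # gather()
library(xtable)  # xtable()

movies <- read_csv("https://mhc-stat140-2017.github.io/data/sdm4/Movie_budgets.csv") %>%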
  mutate(
    Rating = factor(Rating),
    Genre = factor(Genre)
  )
levels(movies$Rating) <- c("G", "PG", "PG13", "R")

genre_rating_tbl <- table(movies$Genre, movies$Rating) %>%
  sweep(c(1, 2), sum(.), "/") %>%
  addmargins(c(1, 2), list(Total = sum, Total = sum), quiet = TRUE)

print(xtable(genre_rating_tbl), type = "html")

	G	PG	PG13	R	Total
Action	0.00	0.00	0.10	0.07	0.17
Adventure	0.03	0.04	0.04	0.01	0.12
Comedy	0.02	0.10	0.17	0.03	0.32
Drama	0.00	0.02	0.07	0.14	0.23
Horror	0.00	0.00	0.07	0.04	0.11
Thriller	0.00	0.00	0.02	0.02	0.05
Total	0.05	0.17	0.47	0.32	1.00

a) Looking at just the table in the knit html document, and not the R code, how can you tell that this table holds overall percentages (rather than row or column percentages)? State your answer in 1 sentence.

SOLUTION:

b) What was the most common genre/rating combination in this data set?

SOLUTION:

c) There were 120 movies in this data set. How many movies were G-rated? (I’m looking for an integer count, not a proportion. Calculate this based on the table in the knit html document, not by looking at the data file or writing any R code.)

SOLUTION:

d) What proportion of movies in the data set have a rating of PG13 or R?

SOLUTION:

Problem 3: (Twin births – based on SDM4 2.36)

In 2000, the Journal of the American Medical Association (JAMA) published a study that examined pregnancies that resulted in the birth of twins. Births were classified as preterm with intervention (induced labor or cesarean), preterm without procedures, or term/post-term. Researchers also classified the pregnancies by the level of prenatal medical care the mother received (inadequate, adequate, or intensive). The data, from the years 1995 – 1997, are summarized in the table below. Figures are in thousands of births (Source: JAMA 284 [2000]:335-341). (As before, you do not need to edit the R code below):

twin_births <- read_csv("https://mhc-stat140-2017.github.io/data/sdm4/Twin_births.csv") %>%
  rename(
    prenatal_care = `Level of Prenatal Care`,
    birth_type = `Birth Type`) %>%
  mutate(
    prenatal_care = factor(prenatal_care,
      levels = c("Intensive", "Adequate", "Inadequate")),
    birth_type = factor(birth_type,
      levels = c("Preterm (induced or cesarean)", "Preterm without procedures", "Term or post-term"))
  )

births_tbl <- table(twin_births$prenatal_care, twin_births$birth_type) %>%
  addmargins(c(1, 2), list(Total = sum, Total = sum))

Margins computed over dimensions in the following order: 1: 2:

print(xtable(births_tbl, digits = 0), type = "html")

	Preterm (induced or cesarean)	Preterm without procedures	Term or post-term	Total
Intensive	18	15	28	61
Adequate	46	43	65	154
Inadequate	12	13	38	63
Total	76	71	131	278

a) Among the mothers in this study, what was the marginal distribution of the level of care they received during their pregnancies?

SOLUTION:

b) Among the mothers in this study, what was the conditional distribution of the birth type, given that the mother received inadequate medical care?
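
Hint: As a reminder of the two definitions used in parts a) and b), here is a minimal sketch on a made-up table of counts. The object name toy, its labels, and its numbers are all hypothetical and have nothing to do with the twin births data. A marginal distribution divides the row (or column) totals by the grand total; a conditional distribution divides one row (or column) by its own total.

```r
# Hypothetical counts (NOT the twin births data):
# rows are groups, columns are outcomes.
toy <- matrix(
  c(10, 30,
    20, 40),
  nrow = 2, byrow = TRUE,
  dimnames = list(group = c("A", "B"), outcome = c("yes", "no"))
)

# Marginal distribution of group: row totals divided by the grand total.
marginal_group <- rowSums(toy) / sum(toy)

# Conditional distribution of outcome given group "A":
# that row divided by its own total.
cond_outcome_given_A <- toy["A", ] / sum(toy["A", ])

print(marginal_group)
print(cond_outcome_given_A)
```

Each of the two results sums to 1, which is a quick check that you have a distribution rather than a set of raw counts.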

SOLUTION:

c) Create an appropriate graph comparing the outcomes of these pregnancies by the level of medical care the mother received. Your plot code should be based on the twin_births data frame, not on the births_tbl.

SOLUTION: Insert your R code in the chunk below:

# Your code goes here.

d) Write one or two sentences describing the association between these two variables.

SOLUTION:

Problem 4: SDM4 2.46 (Simpson’s Paradox)

Can you design a Simpson’s Paradox?

Two companies are vying for a city’s “Best Local Employer” award, to be given to the company most committed to hiring local residents. Although both employers hired 300 new people in the past year, Company A brags that it deserves the award because 70% of its new jobs went to local residents, compared to only 60% for Company B. Company B concedes that those percentages are correct, but points out that most of its new jobs were full-time, while most of Company A’s were part-time. Not only that, says Company B, but a higher percentage of its full-time jobs went to local residents than did Company A’s, and the same was true for part-time jobs. Thus, Company B argues, it’s a better local employer than Company A.

Show how it’s possible for Company B to fill a higher percentage of both full-time and part-time jobs with local residents, even though Company A hired more local residents overall.

You will get credit for this problem if you write down enough that you convince the grader that you’ve made a serious attempt. Getting a full solution will be helpful for your understanding, but please don’t spend more than an hour on this problem. In thinking about this problem, look back at the example we did in class about murder cases in Indiana.
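
If it helps to see the general mechanism in a different setting, here is a small sketch with made-up numbers for two hypothetical clinics, X and Y. The data frame toy and everything in it are invented for illustration; this is not the class example and not a solution to this problem. The code computes the success rate within each case type and then the pooled rate for each clinic.

```r
# Hypothetical success counts for two clinics (made-up numbers).
toy <- data.frame(
  clinic    = c("X", "X", "Y", "Y"),
  case_type = c("easy", "hard", "easy", "hard"),
  successes = c(180, 10, 48, 60),
  patients  = c(200, 50, 50, 200)
)

# Success rate within each clinic / case-type combination.
toy$rate <- toy$successes / toy$patients

# Pooled success rate per clinic, adding up both case types.
totals <- aggregate(cbind(successes, patients) ~ clinic, data = toy, FUN = sum)
totals$rate <- totals$successes / totals$patients

print(toy)
print(totals)
```

In this toy example clinic Y has the higher rate for easy cases and for hard cases, yet clinic X has the higher pooled rate, because most of X’s patients are easy cases while most of Y’s are hard ones. The reversal comes entirely from how unevenly the two groups are split.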

SOLUTION:

Problem 5: Sugar Content of Cereal

A nutrition researcher collected data about common breakfast cereals, including the brand name and the sugar content (as a percentage of weight). The measurements of sugar content are loaded in the R chunk below.

cereals <- read_csv("https://mhc-stat140-2017.github.io/data/sdm4/Sugar_in_cereals.csv") %>%
  rename(
    brand = BRAND,
    sugar = `Sugar %`)

## Parsed with column specification:
## cols(
##   BRAND = col_character(),
##   `Sugar %` = col_double()
## )

a) Make an appropriate plot of the cereal sugar content.

SOLUTION:

# Your code goes here.

b) Based on the plot you created in part a), would it be more appropriate to summarize the center and spread of the distribution with the mean and standard deviation, or with the median and inter-quartile range?

SOLUTION:

c) Use the appropriate functions to calculate the summary statistics you chose for part b).
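
Hint: The base R functions for these four summaries are mean(), sd(), median(), and IQR(). The sketch below applies all four to a made-up numeric vector x (hypothetical values, not the cereal data), so that you can see how each is called; which pair to report is still your decision from part b).

```r
# Hypothetical measurements (NOT the cereal data), including one large value.
x <- c(2, 3, 3, 4, 5, 6, 40)

# Mean and standard deviation.
mean(x)
sd(x)

# Median and inter-quartile range.
median(x)
IQR(x)
```

For a column in a data frame, pass the column itself, for example a call of the form mean(my_data$my_column), where my_data and my_column are placeholder names.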

SOLUTION:

# Your code goes here.

d) Describe the sample size plus center, spread, shape, and any unusual features of this distribution, providing only a single measure of center and a single measure of spread. Be sure to provide an interpretation in the context of the problem (and don’t forget to specify units).

SOLUTION:

Problem 6: Vineyards in New York

A researcher collected information about vineyards in the Finger Lakes region of New York, including the number of acres of land held by each vineyard. The measurements of vineyard size are loaded in the R chunk below.

vineyards <- read_csv("https://mhc-stat140-2017.github.io/data/sdm4/Vineyards.csv")

## Parsed with column specification:
## cols(
##   Winery = col_character(),
##   Acres = col_integer()
## )

a) Make an appropriate plot of the vineyard acreage.

SOLUTION:

# Your code goes here.

b) Based on the plot you created in part a), would it be more appropriate to summarize the center and spread of the distribution with the mean and standard deviation, or with the median and inter-quartile range?

SOLUTION:

c) Use the appropriate functions to calculate the summary statistics you chose for part b).

SOLUTION:

# Your code goes here.

d) Describe the sample size plus center, spread, shape, and any unusual features of this distribution, providing only a single measure of center and a single measure of spread. Be sure to provide an interpretation in the context of the problem (and don’t forget to specify units).

SOLUTION:

Problem 7: (Emails – based on SDM4 3.20)

A university teacher saved every e-mail received from students in a large Introductory Statistics class during an entire term. He then counted, for each student who had sent him at least one e-mail, how many e-mails that student had sent. The R code below reads in the data:

emails <- read_csv("https://mhc-stat140-2017.github.io/data/sdm4/E-mails.csv") %>%
  rename(num_emails = `Number_of_E-mails`)

## Parsed with column specification:
## cols(
##   `Number_of_E-mails` = col_integer()
## )

a) Make an appropriate plot of the number of emails received from each student.

SOLUTION:

# Your code goes here.

b) Without doing any calculations, would you expect the mean or median to be larger? Explain why.

SOLUTION:

c) Write some R code to verify that your answer to part b) is correct by calculating the mean and the median.

SOLUTION:

# Your code goes here.

d) Would the mean or median be more appropriate for describing this distribution? Why?

SOLUTION:

e) Describe the sample size plus center, spread, shape, and any unusual features of this distribution.

SOLUTION:

Problem 8: (Ozone, based on SDM4 4.30)

Ozone levels (in parts per billion, ppb) were recorded at sites in New Jersey monthly between 1926 and 1971. The R code below reads these data in.

ozone <- read_csv("https://mhc-stat140-2017.github.io/data/sdm4/Ozone.csv")

## Parsed with column specification:
## cols(
##   Ozone = col_character(),
##   Year = col_integer(),
##   Month = col_integer()
## )

a) Make a plot showing side-by-side box plots of the data for each month, with the ozone measurement on the vertical axis and the month on the horizontal axis.
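
Hint: Note that Month is read in as an integer, so it needs to be treated as a grouping variable to get one box per month. The sketch below shows the idea on a made-up data frame, using base R’s boxplot() rather than the plotting approach from the labs; that substitution, and the names toy, group, and value, are assumptions made purely for illustration.

```r
# Hypothetical measurements for three groups coded as integers
# (NOT the ozone data).
toy <- data.frame(
  group = rep(c(1, 2, 3), each = 5),
  value = c(4, 5, 6, 7, 8,  10, 12, 14, 16, 18,  1, 2, 3, 4, 30)
)

# Treat the integer code as a factor so each code gets its own box.
bp <- boxplot(value ~ factor(group), data = toy,
              xlab = "Group", ylab = "Value")

# The five-number summaries behind the boxes, one column per group,
# and any points drawn individually as outliers.
bp$stats
bp$out
```

Reading the plot and the summaries together is a good way to check that you are interpreting the boxes, whiskers, and outlying points correctly.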

SOLUTION:

# Your code goes here.

b) In what month was the highest ozone level recorded?

SOLUTION:

c) Which month has the largest IQR?

SOLUTION:

d) Which month has the smallest range?

SOLUTION:

e) Write a brief comparison (2 or 3 sentences) of the ozone levels in January and June.

SOLUTION:

f) Write a brief report (2 or 3 sentences) on the annual patterns you see in ozone levels.

SOLUTION:

Problem 9: SDM4 4.40 (Cloud seeding)

In an experiment to determine whether seeding clouds with silver iodide increases rainfall, 52 clouds were randomly assigned to be seeded or not. The amount of rain they generated was then measured (in acre-feet). The R code below loads the data:

cloud_seeding <- read_csv("https://mhc-stat140-2017.github.io/data/sdm4/Cloud_seeding.csv") %>%
  gather("treatment", "rainfall_amount", Unseeded_Clouds, Seeded_Clouds) %>%
  mutate(
    treatment = factor(treatment)
  )

## Parsed with column specification:
## cols(
##   Unseeded_Clouds = col_double(),
##   Seeded_Clouds = col_double()
## )

levels(cloud_seeding$treatment) <- c("seeded", "unseeded")

a) Make a density plot to compare the distribution of rainfall amounts for seeded clouds and unseeded clouds. You can use different colors or faceting to distinguish the treatment types.

SOLUTION:

# Your code goes here.

b) Based on the plot you created in part a), would it be more appropriate to compare the distributions of rainfall under the two treatments with the mean and standard deviation, or with the median, quartiles, and inter-quartile range?

SOLUTION:

c) Calculate the statistics you chose in part b) separately for each level of the `treatment` variable.

Hint: The best way to do this is by using group_by() to group by the treatment, and then using the summarize() function to calculate the chosen statistics.
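
As a minimal sketch of that pattern, assuming the dplyr package is installed, the code below groups a made-up data frame by a categorical variable and computes two statistics per group. The names toy, group, value, and toy_summary are hypothetical, and the statistics shown are only examples; substitute the ones you chose in part b).

```r
library(dplyr)

# Hypothetical data frame (NOT the cloud seeding data).
toy <- data.frame(
  group = c("a", "a", "a", "b", "b", "b"),
  value = c(1, 2, 9, 10, 20, 60)
)

# One row of summary statistics per level of `group`.
toy_summary <- toy %>%
  group_by(group) %>%
  summarize(
    mean_value = mean(value),
    median_value = median(value)
  )

print(toy_summary)
```

Each named argument to summarize() becomes one column of the result, so you can add as many statistics as you need in a single call.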

SOLUTION:

# Your code goes here.

d) Do you see any evidence that seeding clouds may be effective?

SOLUTION:

STAT111 HW#2: SDM4 Chapters 1 – 4

YOUR NAME HERE

September 20, 2016
