This homework is due at the start of class on Friday, September 29th.

PRACTICE PROBLEMS (not to be turned in; may be helpful for exam review):

SDM4 5.25, 5.35, 5.37, 5.43, 5.47, 5.51

SDM4 6.1, 6.5, 6.13, 6.15, 6.19, 6.21, 6.25, 6.27, 6.29, 6.31, 6.33, 6.35, 6.39

PROBLEMS TO TURN IN:

Problem 1: Adapted from SDM4 5.20

The Hopkins Forest is a research forest in northwestern Massachusetts, Vermont, and New York. The Williams College Center for Environmental Studies studies the forest and records weather measurements on an ongoing basis. The box plots below show the average temperature (in degrees Celsius) recorded each day in 2011, broken down by month. (That is, the data frame has 365 rows, one for each day in 2011, and for each day we have a measurement of the average temperature that day.)

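library(readr)    # read_csv
library(dplyr)    # %>%, mutate, filter
library(ggplot2)  # ggplot, geom_boxplot
hopkins <- read_csv("https://mhc-stat140-2017.github.io/data/sdm4/Hopkins_Forest_2011.csv") %>%
  mutate(
    Month = factor(Month, ordered = TRUE))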
## Parsed with column specification:
## cols(
##   Season = col_character(),
##   `Avg Wind Speed mph)` = col_double(),
##   Month = col_integer(),
##   Day = col_integer(),
##   `Day of Year` = col_integer(),
##   `Avg Temp(deg C)` = col_double(),
##   `Avg Temp(deg F)` = col_double(),
##   `Max Wind Speed(mph)` = col_double(),
##   `Avg Barom(mb)` = col_double(),
##   `Precip(in)` = col_double()
## )
levels(hopkins$Month) <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
names(hopkins)[6] <- "ave_daily_temp_C"

ggplot() +
  geom_boxplot(mapping = aes(x = Month, y = ave_daily_temp_C), data = hopkins)

Notice that there are relatively large outliers in January and July. Let’s investigate those two outliers (there are also outliers in other months, but two is enough for the sake of this homework problem). In the R chunk below, I create two data sets, one with just the observations from January and one with just the observations from July.

hopkins_jan <- filter(hopkins, Month == "Jan")
hopkins_jul <- filter(hopkins, Month == "Jul")

a) Make a density plot of the ave_daily_temp_C variable for the data from January, and another for the data from July. In your judgment, is it reasonably appropriate to calculate means and standard deviations with these data? Why or why not?

SOLUTION:

# Your code goes here

b) Regardless of your answer to part a), go ahead and calculate the mean and standard deviation for average daily temperatures in January, and for average daily temperatures in July.

SOLUTION:

# Your code goes here

c) For each of these months, what was the average temperature on the day with the highest average temperature? Use the max function.

SOLUTION:

# Your code goes here

d) Using your results from parts b) and c), calculate the \(z\)-score for the maximum daily temperature observed in January relative to the distribution of values observed in that month. Also perform a similar calculation for the maximum daily temperature observed in July. You can use R or type up the math by hand.
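As a refresher on the mechanics only, here is a minimal sketch of a z-score calculation, z = (x - mean) / sd, on made-up numbers. The vector `toy_values` and the other `toy_` names are hypothetical and are not the Hopkins Forest data.

```r
# Hypothetical toy data (NOT the Hopkins Forest temperatures)
toy_values <- c(2, 4, 4, 4, 5, 5, 7, 9)

toy_mean <- mean(toy_values)  # sample mean
toy_sd <- sd(toy_values)      # sample standard deviation (n - 1 denominator)
toy_max <- max(toy_values)    # largest observation

# z-score of the maximum relative to the toy data
toy_z <- (toy_max - toy_mean) / toy_sd
toy_z
```

The same three ingredients (a mean, a standard deviation, and the value of interest) are what parts b) and c) provide for each month.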

SOLUTION:

e) Interpret the \(z\)-scores you calculated in part d). (What do the \(z\)-scores measure?) Which of these daily temperatures is more surprising, considering the distribution of daily temperatures in the month in which it was observed?

SOLUTION:

Problem 2: SDM4 5.24

Two companies market new batteries targeted at owners of personal music players (OK, the book is old). DuraTunes claims a mean battery life of 11 hours, while RockReady advertises 12 hours.

a) Explain why you would also like to know the standard deviations of the battery lifespans before deciding which brand to buy.

SOLUTION:

b) Suppose those standard deviations are 2 hours for DuraTunes and 1.5 hours for RockReady. You are headed for 8 hours at the beach. Which battery is more likely to last for at least 8 hours? It’s not stated in the textbook, but for this problem you may assume that a normal model is appropriate.

Here’s some code that is relevant:

1 - pnorm(q = 8, mean = 11, sd = 2)
## [1] 0.9331928

Explain why the above line of code is relevant. What does it calculate, and how does that relate to the question?

SOLUTION:

Now, add another line to the R code chunk above to calculate the other number you’ll need. Then, below, answer the question of which battery is more likely to last for at least 8 hours:

SOLUTION:

c) Which battery is more likely to last for 16 hours?

In the R chunk below, use commands similar to the ones you used for part b) above, updated to do the calculation for 16 hours. Then answer the question.

SOLUTION:

# Your code goes here

Problem 3: I made it up.

A 1997 study of the movement patterns of house cats found that the average area of the region explored by suburban house cats at night time was 2.54 hectares, with a standard deviation of 1.08 hectares (Barratt, David G. “Home range size, habitat utilisation and movement patterns of suburban and farm cats Felis catus.” Ecography 20.3 (1997): 271-280.) Let’s assume that the area explored by suburban cats at night time follows a normal distribution (though this is almost certainly not the case in reality).

a) Using the 68-95-99.7 rule, find the 97.5th percentile of the area explored by suburban cats at night.

SOLUTION:

b) Using R, find the 95th percentile of the area explored by suburban cats at night.
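For reference, R's `qnorm` function returns the value below which a given proportion of a normal distribution falls. Here is a minimal sketch on a standard normal distribution (mean 0 and standard deviation 1, chosen purely for illustration; these are not the cat data, and the variable names are hypothetical).

```r
# Illustration on a standard normal (mean 0, sd 1), NOT the cat data
median_std <- qnorm(p = 0.5, mean = 0, sd = 1)   # 50th percentile
p975_std <- qnorm(p = 0.975, mean = 0, sd = 1)   # 97.5th percentile

median_std
p975_std
```

Changing the `p`, `mean`, and `sd` arguments adapts the call to other percentiles and other normal models.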

SOLUTION:

# Your code goes here.

c) Using the 68-95-99.7 rule, find the proportion of suburban cat night-time adventures that cover between 1.46 hectares and 4.7 hectares.

SOLUTION:

d) My cat is not very adventurous, and I think she only explores about 0.1 hectares on the nights she gets out. Use R to find the proportion of suburban house cats who explore more land than my cat.

SOLUTION:

# Your code goes here.

Problem 4: SDM4 6.28 (Association V)

A researcher investigating the association between two variables collected some data and was surprised when he calculated the correlation. He had expected to find a fairly strong association, yet the correlation was near 0. Discouraged, he didn’t bother making a scatter plot. Explain to him how the scatter plot could still reveal the strong association he anticipated.

SOLUTION:

Problem 5: SDM4 6.36 (Correlation conclusions II)

The correlation between Fuel Efficiency (as measured by miles per gallon) and Price of 140 cars at a large dealership is \(r = -0.34\). Explain whether or not each of these possible conclusions is justified:

a) The more you pay, the lower the fuel efficiency of your car will be.

SOLUTION:

b) The form of the relationship between fuel efficiency and price is moderately straight.

SOLUTION:

c) There are several outliers that explain the low correlation.

SOLUTION:

d) If we measure fuel efficiency in kilometers per liter instead of miles per gallon, the correlation will increase.

SOLUTION:

Problem 6: SDM4 6.50 (Vehicle weights)

The Minnesota Department of Transportation hoped that they could measure the weights of big trucks without actually stopping the vehicles by using a newly developed “weight-in-motion” scale. To see if the new device was accurate, they conducted a calibration test. They weighed several stopped trucks (static Weight) and assumed that this weight was correct. Then they weighed the trucks again while they were moving to see how well the new scale could estimate the actual weight. We read in these data below:

truck_weights <- read_csv("https://mhc-stat140-2017.github.io/data/sdm4/Vehicle_weights.csv")
## Parsed with column specification:
## cols(
##   WeightinMotion = col_double(),
##   StaticWeight = col_double()
## )

a) Make a scatterplot for these data.

SOLUTION:

# Your code goes here.

b) Describe the direction, form, and strength of the plot.

SOLUTION:

c) Write a few sentences telling what the plot says about the data.

SOLUTION:

d) Find (and interpret!) the correlation between the WeightinMotion and StaticWeight variables. Use the cor function.
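As a reminder of the syntax, `cor` takes two numeric vectors of equal length and returns their correlation. Here is a minimal sketch on a made-up data frame; `toy` and its columns `x` and `y` are hypothetical and are not the truck data.

```r
# Hypothetical toy data frame (NOT the truck weights)
toy <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(2.1, 3.9, 6.2, 7.8, 10.1)
)

# Pearson correlation between the two columns
toy_r <- cor(toy$x, toy$y)
toy_r
```

For the homework, the two arguments would be the corresponding columns of `truck_weights`.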

SOLUTION:

# add call to cor() here...

e) If the trucks were weighed in kilograms rather than pounds, would this change the correlation?

SOLUTION:

f) Do any points deviate from the overall pattern? What does the plot say about a possible recalibration of the weight-in-motion scale?

SOLUTION: