This assignment is due by the start of class on Monday, December 11th.

PRACTICE PROBLEMS (not to be turned in):

SDM4 25.1, 25.3, 25.5, 25.7, 25.9, 25.11, 25.13, 25.15, 25.17, 25.19, 25.21, 25.23, 25.25, 25.31, 25.35, 25.37, 25.39

PROBLEMS TO TURN IN:

SDM4 25.59 (Education and Mortality, modified)

The following R code reads in data recording the mortality rate (age-adjusted deaths per 100,000 people) and the education level (average number of years in school) for 58 U.S. cities.

library(tidyverse)  # provides read_csv(), mutate(), filter(), %>%, and ggplot()
cities <- read_csv("https://mhc-stat140-2017.github.io/data/sdm4/Education_and_mortality.csv")
## Parsed with column specification:
## cols(
##   Mortality = col_double(),
##   Education = col_double()
## )

(a) Make an argument for why it’s reasonable to think of one of these variables as an explanatory variable and the other as a response variable. Which is which?

SOLUTION:

(b) Make an appropriate plot of the data.

SOLUTION:

(c) Would it be reasonable to use a linear model to describe the relationship between these variables? Check all relevant conditions that you can check without fitting the model.

SOLUTION:

(d) Fit the linear regression model.

SOLUTION:

(e) Check any conditions that you weren’t able to check before fitting the model. Is it OK to go ahead with using this model?

SOLUTION:

(f) What does the regression model say about the relationship between education and mortality rates?

As part of your discussion, address the following points: i. the interpretation of the intercept; ii. the interpretation of the slope; and iii. what the model has to say about whether higher education levels cause lower mortality rates.

SOLUTION:

(g) What is a reasonable interpretation of the “population” in this example?

SOLUTION:

(h) Is there statistically significant evidence that in the population there is a relationship between the level of Education in a city and the Mortality rate?

Conduct a hypothesis test by using the pt() function to calculate a p-value. You will need some output from calling the summary() function on your linear model fit object. Verify that your p-value matches the p-value in the R output from the summary (mine matched the R output very closely).
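
As a rough sketch of the mechanics only, here is how pt() turns a t statistic into a two-sided p-value. The estimate and standard error below are hypothetical placeholders, not values from the cities data; you would substitute the numbers from your own summary() output.

```r
# Hypothetical values for illustration only; replace them with the
# estimate and standard error from summary() of your own model fit.
b1 <- -1.5      # hypothetical slope estimate
se_b1 <- 0.5    # hypothetical standard error of the slope
n <- 58         # number of cities in this data set

t_stat <- (b1 - 0) / se_b1   # test statistic under H0: slope = 0
df <- n - 2                  # degrees of freedom for simple linear regression

# Two-sided p-value: area in both tails beyond |t_stat|
p_value <- 2 * pt(-abs(t_stat), df = df)
p_value
```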

SOLUTION:

(i) Find and interpret a 90% confidence interval for the slope coefficient.

You should do this using the critical value from the qt() function as well as the output from calling the summary() function on your linear model fit object. Verify that your confidence interval is similar to the output from R’s confint() function (mine matched up to 1 decimal place).
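
A minimal sketch of the calculation, again with hypothetical placeholder values for the estimate and standard error (use the numbers from your own summary() output instead):

```r
# Hypothetical values for illustration only.
b1 <- -1.5      # hypothetical slope estimate
se_b1 <- 0.5    # hypothetical standard error of the slope
n <- 58         # number of cities in this data set

# A 90% interval leaves 5% in each tail, so we need the 0.95 quantile
t_star <- qt(0.95, df = n - 2)

# Interval: estimate plus or minus critical value times standard error
ci <- c(lower = b1 - t_star * se_b1, upper = b1 + t_star * se_b1)
ci
```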

SOLUTION:

(j) Interpret the residual standard deviation in context using the “95” part of the 68-95-99.7 rule.
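
As a reminder of where the "95" comes from, this quick check computes the proportion of a normal model that falls within 2 standard deviations of its mean. It assumes an exactly normal model, which is only an approximation for real residuals.

```r
# Proportion of a standard normal distribution within 2 SDs of the mean
within_2_sd <- pnorm(2) - pnorm(-2)
within_2_sd
```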

SOLUTION:

(k) Interpret the standard error for \(b_1\) in context using the “95” part of the 68-95-99.7 rule.

SOLUTION:

(l) Interpret the \(R^2\) for this model fit in context.

SOLUTION:

SDM4 25.45 (Streams, modified)

This question is sort of a throwback to chapter 9 of the 4th edition of the book (chapter 10 of the 3rd edition of the book), but updated to the ideas we’ve been talking about more recently.

Biologists studying the effects of acid rain on wildlife collected data from 172 sites on streams in the Adirondack Mountains. Importantly, some of the sites are on the same stream. The researchers recorded the pH (acidity) of the water and the BCI, a measure of biological diversity. Here’s a look at the first 10 rows of the data and a scatterplot of BCI against pH for the 163 sites for which we have these data (we didn’t have measurements for all 172 sites), along with results from a linear model fit to the data.

streams <- read_csv("https://mhc-stat140-2017.github.io/data/sdm4/Streams.csv")
## Parsed with column specification:
## cols(
##   Stream = col_character(),
##   SUB = col_character(),
##   pH = col_double(),
##   Temo = col_double(),
##   BCI = col_character(),
##   Hardness = col_double(),
##   Alk = col_double(),
##   Phos = col_double()
## )
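# BCI was read in as character; convert it to numeric (non-numeric entries become NA),
# then keep only the sites with both pH and BCI recorded, sorted by stream name.
streams <- mutate(streams, BCI = as.numeric(BCI)) %>%
  filter(!is.na(pH) & !is.na(BCI)) %>%
  arrange(Stream)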
## Warning in evalq(as.numeric(BCI), <environment>): NAs introduced by
## coercion
head(streams, 10)
## # A tibble: 10 x 8
##          Stream   SUB    pH  Temo   BCI Hardness   Alk  Phos
##           <chr> <chr> <dbl> <dbl> <dbl>    <dbl> <dbl> <dbl>
##  1       ABIJAH     S  6.75     8  1413     34.2  34.2  0.12
##  2       ABIJAH     S  7.00     1  1366     51.3  51.3  0.20
##  3    ABIJAH178     S  7.00     2  1382     85.5  51.3  0.07
##  4   ABIJAHBULL     S  7.00     2  1492     68.4  51.3  0.05
##  5 ABIJAHNICHOL     S  7.00     3  1462     85.5  51.3  0.06
##  6     BARNSCOR     S  7.00     2  1357     34.2  51.3  0.10
##  7         BEAR     S  7.00     2  1389     51.3   0.0  0.05
##  8         BEAR     L  7.50     9  1365     68.4  85.5  0.05
##  9         BEAR     S  7.20     0  1289     51.3  51.3  0.83
## 10         BEAR     M  7.00     1  1301     51.3  51.3  0.21
ggplot() +
  geom_point(mapping = aes(x = pH, y = BCI), data = streams)

lm_fit <- lm(BCI ~ pH, data = streams)
streams <- mutate(streams, residual = residuals(lm_fit))
ggplot() +
  geom_density(mapping = aes(x = residual), data = streams)

ggplot() +
  geom_point(mapping = aes(x = pH, y = residual), data = streams)

(a) Which assumptions for the linear model do you think are violated? Explain. For each of those assumptions that are violated, describe something you might do to resolve the problem.

There are multiple valid answers to this question. Simple strategies could be based on ideas that we discussed in chapter 9 (chapter 10 of the 3rd edition of the book) and in the Frogs example on Monday, Nov 27 (think about the first example, where we had multiple observations for each frog). Dealing with assumption violations like this is one of the big ideas of later statistics classes.

SOLUTION:

(b) Here is a plot with the fitted regression line and output you could use to conduct hypothesis tests. Note that I have not corrected any of the assumption violations you discussed above (although I would definitely do that in real life).

ggplot() +
  geom_point(mapping = aes(x = pH, y = BCI), data = streams) +
  geom_smooth(mapping = aes(x = pH, y = BCI),
    data = streams,
    method = "lm",
    se = FALSE)

summary(lm_fit)
## 
## Call:
## lm(formula = BCI ~ pH, data = streams)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -502.5  -59.9   12.0   87.3  387.3 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2733.4      187.9   14.55  < 2e-16 ***
## pH            -197.7       25.6   -7.73  1.1e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 140 on 161 degrees of freedom
## Multiple R-squared:  0.271,  Adjusted R-squared:  0.266 
## F-statistic: 59.8 on 1 and 161 DF,  p-value: 1.09e-12

i. Describe the form and direction of the relationship between acidity and biological diversity as measured by BCI.

SOLUTION:

ii. Interpret the slope of the regression line in context. Obtain a 95% confidence interval for the slope of the regression line in the population of all stream sites in this region.

SOLUTION:

iii. Here is the standard interpretation of what we mean by “95% confident” for the confidence interval from part ii: “If we were to take many different samples of sites on streams in the Adirondack Mountains, and calculate a different 95% confidence interval based on the data from each of those samples, about 95% of those confidence intervals would contain the slope of the regression line describing the relationship between pH and BCI in the population of all stream sites in this region.” Do you think that your confidence interval above works exactly as advertised?
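
To make the "as advertised" behavior concrete, here is a small simulation sketch. It uses a made-up population in which every linear model condition holds (the intercept, slope, sample size, and error SD below are all hypothetical and unrelated to the streams data), draws many samples, and records how often the 95% confidence interval captures the true slope.

```r
set.seed(140)

true_slope <- -2   # hypothetical population slope
n_sims <- 1000     # number of simulated samples

# For each simulated sample, fit the model and check whether the
# 95% confidence interval for the slope contains the true slope.
covered <- replicate(n_sims, {
  x <- runif(50, min = 0, max = 10)
  y <- 5 + true_slope * x + rnorm(50, sd = 3)  # independent, normal errors
  ci <- confint(lm(y ~ x), "x", level = 0.95)
  ci[1] <= true_slope && true_slope <= ci[2]
})

coverage <- mean(covered)  # proportion of intervals that captured the slope
coverage
```

When all conditions hold, the coverage should land close to 0.95. The simulation says nothing by itself about what happens when a condition fails.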

SOLUTION:

iv. Do these data convince you that there is a relationship between acidity level and biological diversity?

SOLUTION:

SDM4 21.12 (More errors)

For each of the following situations, state whether a Type I, a Type II, or neither error has been made.

  1. A test of \(H_0: \mu = 25\) vs. \(H_A: \mu > 25\) rejects the null hypothesis. Later it is discovered that \(\mu = 24.9\).

SOLUTION:

  2. A test of \(H_0: p = 0.8\) vs. \(H_A: p < 0.8\) fails to reject the null hypothesis. Later it is discovered that \(p = 0.9\).

SOLUTION:

  3. A test of \(H_0: p = 0.5\) vs. \(H_A: p \neq 0.5\) rejects the null hypothesis. Later it is discovered that \(p = 0.65\).

SOLUTION:

  4. A test of \(H_0: p = 0.7\) vs. \(H_A: p < 0.7\) fails to reject the null hypothesis. Later it is discovered that \(p = 0.6\).

SOLUTION:

SDM4 21.19 (Significant?)

Public health officials believe that 98% of children have been vaccinated against measles. A random survey of medical records at many schools across the country found that among more than 13,000 children, only 97.4% had been vaccinated. A hypothesis test would reject the null hypothesis of 98% with a p-value of less than 0.0001.
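
For a rough sense of where such a small p-value comes from, here is a sketch of a one-proportion z-test. It assumes a sample size of exactly 13,000 (the problem says only "more than 13,000") and a one-sided alternative \(H_A: p < 0.98\); both are assumptions made for illustration.

```r
p0 <- 0.98       # hypothesized vaccination rate
p_hat <- 0.974   # observed sample proportion
n <- 13000       # ASSUMPTION: the problem only says "more than 13,000"

se <- sqrt(p0 * (1 - p0) / n)   # standard deviation of p_hat if H0 is true
z <- (p_hat - p0) / se          # z statistic

# ASSUMPTION: one-sided alternative (p < 0.98), so use the lower tail
p_value <- pnorm(z)
p_value
```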

  1. Explain what the p-value means in this context.

SOLUTION:

  2. The result is statistically significant, but does it matter? Comment.

SOLUTION: