---
title: "Linear Regression -- Part 2"
author: "Evan L. Ray"
date: "September 29, 2017"
output: ioslides_presentation
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
require(ggplot2)
require(dplyr)
require(tidyr)
require(readr)
```
## Wrap Up for Wednesday's Lab
```{r, message=FALSE, echo = FALSE}
wildfires <- read_csv("https://mhc-stat140-2017.github.io/data/sdm4/Wildfires_2012.csv")
names(wildfires) <- c("num_fires", "years_since_1985", "ave_acres_burned")
wildfires <- mutate(wildfires, year = years_since_1985 + 1985)
wildfires <- arrange(wildfires, year)
```
```{r, echo = TRUE}
linear_fit <- lm(ave_acres_burned ~ years_since_1985,
data = wildfires)
coef(linear_fit)
```
* Predicted average number of acres burned in 1985 is 19.6
```{r, echo = TRUE}
linear_fit_year <- lm(ave_acres_burned ~ year,
data = wildfires)
coef(linear_fit_year)
```
* Predicted average number of acres burned in year 0 is -4390
## What's going on?
```{r, echo = FALSE, fig.width=3.5, fig.height=4}
ggplot() +
geom_point(aes(x = years_since_1985, y = ave_acres_burned), data = wildfires) +
geom_abline(intercept = coef(linear_fit)[1], slope = coef(linear_fit)[2]) +
coord_cartesian(xlim = range(wildfires$years_since_1985), ylim = c(coef(linear_fit)[1], 110))
```
```{r, echo = FALSE, fig.width=3.5, fig.height=4}
ggplot() +
geom_point(aes(x = year, y = ave_acres_burned), data = wildfires) +
geom_abline(intercept = coef(linear_fit_year)[1], slope = coef(linear_fit_year)[2]) +
coord_cartesian(xlim = c(0, 2017), ylim = c(coef(linear_fit_year)[1], 110))
```
## Never Extrapolate Beyond the Data
[An Important Message](https://www.causeweb.org/cause/sites/default/files/resources/fun/videos/How%20Far%20He%E2%80%99ll%20Go%20-%20Final.mp4)
## Back to the Mortality Example
```{r, echo=FALSE, message=FALSE, fig.height = 3.5, fig.width=8}
mortality_water <- read_csv("https://mhc-stat140-2017.github.io/data/sdm4/Hard_water_Derby.csv")
linear_fit <- lm(Mortality ~ Calcium, data = mortality_water)
ggplot() +
geom_point(mapping = aes(x = Calcium, y = Mortality), data = mortality_water) +
geom_smooth(mapping = aes(x = Calcium, y = Mortality), data = mortality_water, method = "lm", se = FALSE)
```
* **Explanatory:** Concentration of calcium in drinking water
* **Response:** Annual mortality rate per 100,000 population
* **Equation:** Predicted Mortality = 1676.36 - 3.23 $*$ Calcium
## Residuals
* **Residual** = **Observed** - **Predicted**
* $\definecolor{residual}{RGB}{230,159,0}\color{residual}e_i$ = $\definecolor{observed}{RGB}{0,158,115}\color{observed}y_i$ - $\definecolor{predicted}{RGB}{86,180,233}\color{predicted}\widehat{y}_i$
```{r, echo=FALSE, message=FALSE, fig.height = 3, fig.width=6}
mortality_water <- mutate(mortality_water,
Mortality = as.numeric(Mortality),
predicted = predict(linear_fit),
residual = residuals(linear_fit))
mortality_water_augmented <- mortality_water
mortality_water_augmented$min_val <- as.numeric(apply(mortality_water_augmented, 1, function(rowvals) {min(rowvals[c("predicted", "Mortality")])}))
mortality_water_augmented$max_val <- as.numeric(apply(mortality_water_augmented, 1, function(rowvals) {max(rowvals[c("predicted", "Mortality")])}))
ggplot() +
geom_linerange(mapping = aes(x = Calcium, ymin = min_val, ymax = max_val), color = "#e69f00", size = 2, data = mortality_water_augmented) +
geom_point(mapping = aes(x = Calcium, y = Mortality), color = "#009E73", size = 3, data = mortality_water_augmented) +
geom_smooth(mapping = aes(x = Calcium, y = Mortality), data = mortality_water_augmented, method = "lm", se = FALSE) +
geom_point(mapping = aes(x = Calcium, y = predicted), color = "#56B4E9", size = 3, data = mortality_water_augmented)
```
* How much information does the model give us about Mortality rate?
* How big do the residuals tend to be?
## Residual Distribution
```{r, echo=FALSE, message=FALSE, fig.height = 2.5, fig.width=6}
mortality_water <- mutate(mortality_water,
Mortality = as.numeric(Mortality),
predicted = predict(linear_fit),
residual = residuals(linear_fit))
mortality_water_augmented <- mortality_water
mortality_water_augmented$min_val <- as.numeric(apply(mortality_water_augmented, 1, function(rowvals) {min(rowvals[c("predicted", "Mortality")])}))
mortality_water_augmented$max_val <- as.numeric(apply(mortality_water_augmented, 1, function(rowvals) {max(rowvals[c("predicted", "Mortality")])}))
ggplot() +
geom_linerange(mapping = aes(x = Calcium, ymin = min_val, ymax = max_val), color = "#e69f00", size = 2, data = mortality_water_augmented) +
geom_point(mapping = aes(x = Calcium, y = Mortality), color = "#009E73", size = 3, data = mortality_water_augmented) +
geom_smooth(mapping = aes(x = Calcium, y = Mortality), data = mortality_water_augmented, method = "lm", se = FALSE) +
geom_point(mapping = aes(x = Calcium, y = predicted), color = "#56B4E9", size = 3, data = mortality_water_augmented)
```
```{r, echo=FALSE, message = FALSE, fig.height=2.5, fig.width=2.5}
ggplot() +
geom_histogram(mapping = aes(x = residual), bins = 12, data = mortality_water)
```
```{r, echo=FALSE, fig.height=2.5, fig.width=2.5}
ggplot() +
geom_density(mapping = aes(x = residual), data = mortality_water) +
stat_function(mapping = aes(x = residual),
fun = dnorm,
colour = "red",
args = list(mean = 0, sd = summary(linear_fit)$sigma),
data = mortality_water)
```
## Residual Standard Deviation
* How big do the residuals tend to be?
* One way to measure this: The residual standard deviation
$$s_e = \sqrt{\frac{\sum_{i = 1}^n e_i^2}{n - 2}} = \sqrt{\frac{\sum_{i = 1}^n (y_i - \hat{y}_i)^2}{n - 2}}$$
```{r, echo = FALSE, fig.height=3, fig.width=3.5}
ggplot() +
geom_density(mapping = aes(x = residual), data = mortality_water) +
stat_function(mapping = aes(x = residual),
fun = dnorm,
colour = "red",
args = list(mean = 0, sd = summary(linear_fit)$sigma),
data = mortality_water)
```
* In our example, $s_e = 143$
* For about 95% of our sample, the observed value was within $\pm 2 s_e$ of the model's prediction ($\pm 286$ people per 100,000 population)
## What if we didn't know about calcium?
* A reasonable guess for mortality in this town: the average mortality rate in our sample, $\bar{y} = 1524$
```{r, echo=FALSE, message=FALSE, fig.height = 2, fig.width=6}
mortality_water_augmented$mean_mortality <- mean(mortality_water_augmented$Mortality)
mortality_water_augmented$min_val_rel_mean <- as.numeric(apply(mortality_water_augmented, 1, function(rowvals) {min(rowvals[c("mean_mortality", "Mortality")])}))
mortality_water_augmented$max_val_rel_mean <- as.numeric(apply(mortality_water_augmented, 1, function(rowvals) {max(rowvals[c("mean_mortality", "Mortality")])}))
ggplot() +
geom_linerange(mapping = aes(x = Calcium, ymin = min_val_rel_mean, ymax = max_val_rel_mean), color = "#e69f00", size = 2, data = mortality_water_augmented) +
geom_point(mapping = aes(x = Calcium, y = Mortality), color = "#009E73", size = 3, data = mortality_water_augmented) +
geom_hline(yintercept = mean(mortality_water_augmented$Mortality), color = "blue") +
geom_point(mapping = aes(x = Calcium, y = mean(mortality_water_augmented$Mortality)), color = "#56B4E9", size = 3, data = mortality_water_augmented)
```
```{r, echo = FALSE, fig.height=2, fig.width=3.5}
ggplot() +
geom_density(mapping = aes(x = Mortality), data = mortality_water) +
stat_function(mapping = aes(x = Mortality),
fun = dnorm,
colour = "red",
args = list(mean = mean(mortality_water$Mortality), sd = sd(mortality_water$Mortality)),
data = mortality_water)
```
* $s_y = 188$: For about 95% of our sample, the observed value was within $\pm 2 s_y$ of our prediction of 1524 ($\pm 376$ people per 100,000 population)
## Comparing the two:
* If we know about Calcium: Guess $\hat{y}_i = 1676.36 - 3.23 x_i$
* For about 95% of sample, $y_i$ is within $\pm 286$ of $\hat{y}_i$
```{r, echo=FALSE, fig.height=1.5, fig.width=3.5}
ggplot() +
geom_linerange(mapping = aes(x = Calcium, ymin = min_val, ymax = max_val), color = "#e69f00", size = 2, data = mortality_water_augmented) +
geom_point(mapping = aes(x = Calcium, y = Mortality), color = "#009E73", size = 3, data = mortality_water_augmented) +
geom_smooth(mapping = aes(x = Calcium, y = Mortality), data = mortality_water_augmented, method = "lm", se = FALSE) +
geom_point(mapping = aes(x = Calcium, y = predicted), color = "#56B4E9", size = 3, data = mortality_water_augmented)
```
```{r, echo=FALSE, fig.height=1.5, fig.width=3.5}
ggplot() +
geom_density(mapping = aes(x = residual), data = mortality_water) +
stat_function(mapping = aes(x = residual),
fun = dnorm,
colour = "red",
args = list(mean = 0, sd = summary(linear_fit)$sigma),
data = mortality_water) +
coord_cartesian(
xlim = c(-3, 3) * sd(mortality_water_augmented$Mortality),
ylim = c(0, 0.003))
```
* If we **don't** know about Calcium: Guess $\bar{y} = 1524$
* For about 95% of sample, $y_i$ is within $\pm 376$ of $\bar{y}$
```{r, echo=FALSE, message=FALSE, fig.height = 1.5, fig.width=3.5}
ggplot() +
geom_linerange(mapping = aes(x = Calcium, ymin = min_val_rel_mean, ymax = max_val_rel_mean), color = "#e69f00", size = 2, data = mortality_water_augmented) +
geom_point(mapping = aes(x = Calcium, y = Mortality), color = "#009E73", size = 3, data = mortality_water_augmented) +
geom_hline(yintercept = mean(mortality_water_augmented$Mortality), color = "blue") +
geom_point(mapping = aes(x = Calcium, y = mean(mortality_water_augmented$Mortality)), color = "#56B4E9", size = 3, data = mortality_water_augmented)
```
```{r, echo = FALSE, fig.height=1.5, fig.width=3.5}
ggplot() +
geom_density(mapping = aes(x = Mortality), data = mortality_water) +
stat_function(mapping = aes(x = Mortality),
fun = dnorm,
colour = "red",
args = list(mean = mean(mortality_water$Mortality), sd = sd(mortality_water$Mortality)),
data = mortality_water) +
coord_cartesian(
xlim = mean(mortality_water_augmented$Mortality) + c(-3, 3) * sd(mortality_water_augmented$Mortality),
ylim = c(0, 0.003))
```
## How much does knowing Calcium help?
* Without knowing calcium:
* Guess $\bar{y} = 1524$ people per 100,000 population
* About 95% of our sample fell within $\pm 2 s_y$ ($\pm 376$) of this guess
* If we know Calcium:
* Guess $\hat{y} = 1676.36 - 3.23 * Calcium$
* About 95% of our sample fell within $\pm 2 s_e$ ($\pm 286$) of this guess
* Knowing calcium narrowed down range of values of Mortality a fair amount!
## How much does knowing Calcium help?
* Let's compare the spreads of these distributions of deviations from the predictions with and without knowing calcium.
* Let's compute $\frac{\text{spread of residuals if we DO know calcium}}{\text{spread of deviations if we DON'T know about calcium}}$
```{r, echo=FALSE, fig.height=1.5, fig.width=3.5}
ggplot() +
geom_density(mapping = aes(x = residual), data = mortality_water) +
stat_function(mapping = aes(x = residual),
fun = dnorm,
colour = "red",
args = list(mean = 0, sd = summary(linear_fit)$sigma),
data = mortality_water) +
coord_cartesian(xlim = c(-3, 3) * sd(mortality_water_augmented$Mortality))
```
```{r, echo = FALSE, fig.height=1.5, fig.width=3.5}
ggplot() +
geom_density(mapping = aes(x = Mortality), data = mortality_water) +
stat_function(mapping = aes(x = Mortality),
fun = dnorm,
colour = "red",
args = list(mean = mean(mortality_water$Mortality), sd = sd(mortality_water$Mortality)),
data = mortality_water) +
coord_cartesian(xlim = mean(mortality_water_augmented$Mortality) + c(-3, 3) * sd(mortality_water_augmented$Mortality))
```
* If knowing Calcium helps a lot, this will be close to 0
* If knowing Calcium doesn't help much, this will be close to 1
## A Brief Aside: variance = (std. dev.)$^2$
* Sample variance of Mortality:
$$s_y^2 = \left[\sqrt{\frac{\sum_{i = 1}^n (y_i - \bar{y})^2}{n - 1}}\right]^2 = \frac{\sum_{i = 1}^n (y_i - \bar{y})^2}{n - 1} = \frac{SST}{n - 1}$$
* $SST = \sum_{i = 1}^n (y_i - \bar{y})^2$. $SST$ stands for **S**um of **S**quares **T**otal (used in measuring the total variability in the response)
* Residual Variance:
$$s^2_e = \left[\sqrt{\frac{\sum_{i = 1}^n e_i^2}{n - 2}}\right]^2 = \frac{\sum_{i = 1}^n e_i^2}{n - 2} = \frac{\sum_{i = 1}^n (y_i - \hat{y}_i)^2}{n - 2} = \frac{SSE}{n - 2}$$
* $SSE = \sum_{i = 1}^n (y_i - \hat{y}_i)^2$. $SSE$ stands for **S**um of **S**quared **E**rrors (used in measuring the amount of variability in the errors, or residuals, from the model fit)
## "Multiple" $R^2$
* Let's summarize spread of residuals by $SSE = \sum_{i = 1}^n (y_i - \hat{y}_i)^2$
* ... and summarize spread of Mortality by $SST = \sum_{i = 1}^n (y_i - \bar{y})^2$
* So $\frac{\text{spread of residuals if we DO know calcium}}{\text{spread of deviations if we DON'T know about calcium}} = \frac{SSE}{SST}$
* Two extreme cases:
* If the error goes to zero, we'd have a "perfect fit". $\frac{SSE}{SST} = 0$
* If $SSE = SST$, $x$ has told us nothing about $y$. $\frac{SSE}{SST} = 1$
* Interpret $SSE / SST$ as the fraction of the total variation in $y$ that is still in the residuals
* $R^2 = 1 - \frac{SSE}{SST}$ is the fraction of the total variation in $y$ "accounted for" by the model in $x$.
## "Adjusted" $R^2$
* Let's summarize spread of residuals by the residual variance $s_e^2 = \frac{SSE}{n - 2}$
* ... and summarize spread of Mortality by its sample variance $s_y^2 = \frac{SST}{n - 1}$
* So $\frac{\text{spread of residuals if we DO know calcium}}{\text{spread of deviations if we DON'T know about calcium}} = \frac{s_e^2}{s_y^2}$
* Two extreme cases:
* If the error goes to zero, we'd have a "perfect fit". $\frac{s_e^2}{s_y^2} = 0$
* If $s_e^2 = s_y^2$, $x$ has told us nothing about $y$. $\frac{s_e^2}{s_y^2} = 1$
* $R^2_{adj} = 1 - \frac{s_e^2}{s_y^2}$ is the fraction of the total variation in $y$ "accounted for" by the model in $x$.
## Residual Std. Dev. and $R^2$ in R
```{r, echo = TRUE}
summary(linear_fit)
```
## An even briefer aside:
* Sample variance of Mortality:
$$s_y^2 = \frac{SST}{n - 1}$$
* Residual Variance:
$$s^2_e = \frac{SSE}{n - 2}$$
* The $n - 1$ and $n - 2$ have to do with the number of parameters used to make our prediction:
* Normal model: 1 parameter ($\mu$), use $n - 1$
* Linear model: 2 parameters ($b_0$, $b_1$), use $n - 2$
* We'll talk about **why** in a month or so