--- title: "Linear Regression -- Part 2" author: "Evan L. Ray" date: "September 29, 2017" output: ioslides_presentation --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = FALSE) require(ggplot2) require(dplyr) require(tidyr) require(readr) ``` ## Wrap Up for Wednesday's Lab ```{r, message=FALSE, echo = FALSE} wildfires <- read_csv("https://mhc-stat140-2017.github.io/data/sdm4/Wildfires_2012.csv") names(wildfires) <- c("num_fires", "years_since_1985", "ave_acres_burned") wildfires <- mutate(wildfires, year = years_since_1985 + 1985) wildfires <- arrange(wildfires, year) ``` ```{r, echo = TRUE} linear_fit <- lm(ave_acres_burned ~ years_since_1985, data = wildfires) coef(linear_fit) ``` * Predicted average number of acres burned in 1985 is 19.6 ```{r, echo = TRUE} linear_fit_year <- lm(ave_acres_burned ~ year, data = wildfires) coef(linear_fit_year) ``` * Predicted average number of acres burned in year 0 is -4390 ## What's going on?

```{r, echo = FALSE, fig.width=3.5, fig.height=4} ggplot() + geom_point(aes(x = years_since_1985, y = ave_acres_burned), data = wildfires) + geom_abline(intercept = coef(linear_fit)[1], slope = coef(linear_fit)[2]) + coord_cartesian(xlim = range(wildfires$years_since_1985), ylim = c(coef(linear_fit)[1], 110)) ``` ```{r, echo = FALSE, fig.width=3.5, fig.height=4} ggplot() + geom_point(aes(x = year, y = ave_acres_burned), data = wildfires) + geom_abline(intercept = coef(linear_fit_year)[1], slope = coef(linear_fit_year)[2]) + coord_cartesian(xlim = c(0, 2017), ylim = c(coef(linear_fit_year)[1], 110)) ```

## Never Extrapolate Beyond the Data [An Important Message](https://www.causeweb.org/cause/sites/default/files/resources/fun/videos/How%20Far%20He%E2%80%99ll%20Go%20-%20Final.mp4) ## Back to the Mortality Example ```{r, echo=FALSE, message=FALSE, fig.height = 3.5, fig.width=8} mortality_water <- read_csv("https://mhc-stat140-2017.github.io/data/sdm4/Hard_water_Derby.csv") linear_fit <- lm(Mortality ~ Calcium, data = mortality_water) ggplot() + geom_point(mapping = aes(x = Calcium, y = Mortality), data = mortality_water) + geom_smooth(mapping = aes(x = Calcium, y = Mortality), data = mortality_water, method = "lm", se = FALSE) ``` * **Explanatory:** Concentration of calcium in drinking water * **Response:** Annual mortality rate per 100,000 population * **Equation:** Predicted Mortality = 1676.36 - 3.23 $*$ Calcium ## Residuals * **Residual** = **Observed** - **Predicted** * $\definecolor{residual}{RGB}{230,159,0}\color{residual}e_i$ = $\definecolor{observed}{RGB}{0,158,115}\color{observed}y_i$ - $\definecolor{predicted}{RGB}{86,180,233}\color{predicted}\widehat{y}_i$ ```{r, echo=FALSE, message=FALSE, fig.height = 3, fig.width=6} mortality_water <- mutate(mortality_water, Mortality = as.numeric(Mortality), predicted = predict(linear_fit), residual = residuals(linear_fit)) mortality_water_augmented <- mortality_water mortality_water_augmented$min_val <- as.numeric(apply(mortality_water_augmented, 1, function(rowvals) {min(rowvals[c("predicted", "Mortality")])})) mortality_water_augmented$max_val <- as.numeric(apply(mortality_water_augmented, 1, function(rowvals) {max(rowvals[c("predicted", "Mortality")])})) ggplot() + geom_linerange(mapping = aes(x = Calcium, ymin = min_val, ymax = max_val), color = "#e69f00", size = 2, data = mortality_water_augmented) + geom_point(mapping = aes(x = Calcium, y = Mortality), color = "#009E73", size = 3, data = mortality_water_augmented) + geom_smooth(mapping = aes(x = Calcium, y = Mortality), data = mortality_water_augmented, method = "lm", se = FALSE) + geom_point(mapping = aes(x = Calcium, y = predicted), color = "#56B4E9", size = 3, data = mortality_water_augmented) ``` * How much information does the model give us about Mortality rate? * How big do the residuals tend to be? ## Residual Distribution ```{r, echo=FALSE, message=FALSE, fig.height = 2.5, fig.width=6} mortality_water <- mutate(mortality_water, Mortality = as.numeric(Mortality), predicted = predict(linear_fit), residual = residuals(linear_fit)) mortality_water_augmented <- mortality_water mortality_water_augmented$min_val <- as.numeric(apply(mortality_water_augmented, 1, function(rowvals) {min(rowvals[c("predicted", "Mortality")])})) mortality_water_augmented$max_val <- as.numeric(apply(mortality_water_augmented, 1, function(rowvals) {max(rowvals[c("predicted", "Mortality")])})) ggplot() + geom_linerange(mapping = aes(x = Calcium, ymin = min_val, ymax = max_val), color = "#e69f00", size = 2, data = mortality_water_augmented) + geom_point(mapping = aes(x = Calcium, y = Mortality), color = "#009E73", size = 3, data = mortality_water_augmented) + geom_smooth(mapping = aes(x = Calcium, y = Mortality), data = mortality_water_augmented, method = "lm", se = FALSE) + geom_point(mapping = aes(x = Calcium, y = predicted), color = "#56B4E9", size = 3, data = mortality_water_augmented) ```

```{r, echo=FALSE, message = FALSE, fig.height=2.5, fig.width=2.5} ggplot() + geom_histogram(mapping = aes(x = residual), bins = 12, data = mortality_water) ``` ```{r, echo=FALSE, fig.height=2.5, fig.width=2.5} ggplot() + geom_density(mapping = aes(x = residual), data = mortality_water) + stat_function(mapping = aes(x = residual), fun = dnorm, colour = "red", args = list(mean = 0, sd = summary(linear_fit)$sigma), data = mortality_water) ```

## Residual Standard Deviation * How big do the residuals tend to be? * One way to measure this: The residual standard deviation $$s_e = \sqrt{\frac{\sum_{i = 1}^n e_i^2}{n - 2}} = \sqrt{\frac{\sum_{i = 1}^n (y_i - \hat{y}_i)^2}{n - 2}}$$

```{r, echo = FALSE, fig.height=3, fig.width=3.5} ggplot() + geom_density(mapping = aes(x = residual), data = mortality_water) + stat_function(mapping = aes(x = residual), fun = dnorm, colour = "red", args = list(mean = 0, sd = summary(linear_fit)$sigma), data = mortality_water) ``` * In our example, $s_e = 143$ * For about 95% of our sample, the observed value was within $\pm 2 s_e$ of the model's prediction ($\pm 286$ people per 100,000 population)

## What if we didn't know about calcium? * A reasonable guess for mortality in this town: the average mortality rate in our sample, $\bar{y} = 1524$ ```{r, echo=FALSE, message=FALSE, fig.height = 2, fig.width=6} mortality_water_augmented$mean_mortality <- mean(mortality_water_augmented$Mortality) mortality_water_augmented$min_val_rel_mean <- as.numeric(apply(mortality_water_augmented, 1, function(rowvals) {min(rowvals[c("mean_mortality", "Mortality")])})) mortality_water_augmented$max_val_rel_mean <- as.numeric(apply(mortality_water_augmented, 1, function(rowvals) {max(rowvals[c("mean_mortality", "Mortality")])})) ggplot() + geom_linerange(mapping = aes(x = Calcium, ymin = min_val_rel_mean, ymax = max_val_rel_mean), color = "#e69f00", size = 2, data = mortality_water_augmented) + geom_point(mapping = aes(x = Calcium, y = Mortality), color = "#009E73", size = 3, data = mortality_water_augmented) + geom_hline(yintercept = mean(mortality_water_augmented$Mortality), color = "blue") + geom_point(mapping = aes(x = Calcium, y = mean(mortality_water_augmented$Mortality)), color = "#56B4E9", size = 3, data = mortality_water_augmented) ```

```{r, echo = FALSE, fig.height=2, fig.width=3.5} ggplot() + geom_density(mapping = aes(x = Mortality), data = mortality_water) + stat_function(mapping = aes(x = Mortality), fun = dnorm, colour = "red", args = list(mean = mean(mortality_water$Mortality), sd = sd(mortality_water$Mortality)), data = mortality_water) ``` * $s_y = 188$: For about 95% of our sample, the observed value was within $\pm 2 s_y$ of our prediction of 1524 ($\pm 376$ people per 100,000 population)

## Comparing the two: * If we know about Calcium: Guess $\hat{y}_i = 1676.36 - 3.23 x_i$ * For about 95% of sample, $y_i$ is within $\pm 286$ of $\hat{y}_i$

```{r, echo=FALSE, fig.height=1.5, fig.width=3.5} ggplot() + geom_linerange(mapping = aes(x = Calcium, ymin = min_val, ymax = max_val), color = "#e69f00", size = 2, data = mortality_water_augmented) + geom_point(mapping = aes(x = Calcium, y = Mortality), color = "#009E73", size = 3, data = mortality_water_augmented) + geom_smooth(mapping = aes(x = Calcium, y = Mortality), data = mortality_water_augmented, method = "lm", se = FALSE) + geom_point(mapping = aes(x = Calcium, y = predicted), color = "#56B4E9", size = 3, data = mortality_water_augmented) ``` ```{r, echo=FALSE, fig.height=1.5, fig.width=3.5} ggplot() + geom_density(mapping = aes(x = residual), data = mortality_water) + stat_function(mapping = aes(x = residual), fun = dnorm, colour = "red", args = list(mean = 0, sd = summary(linear_fit)$sigma), data = mortality_water) + coord_cartesian( xlim = c(-3, 3) * sd(mortality_water_augmented$Mortality), ylim = c(0, 0.003)) ```

* If we **don't** know about Calcium: Guess $\bar{y} = 1524$ * For about 95% of sample, $y_i$ is within $\pm 376$ of $\bar{y}$

```{r, echo=FALSE, message=FALSE, fig.height = 1.5, fig.width=3.5} ggplot() + geom_linerange(mapping = aes(x = Calcium, ymin = min_val_rel_mean, ymax = max_val_rel_mean), color = "#e69f00", size = 2, data = mortality_water_augmented) + geom_point(mapping = aes(x = Calcium, y = Mortality), color = "#009E73", size = 3, data = mortality_water_augmented) + geom_hline(yintercept = mean(mortality_water_augmented$Mortality), color = "blue") + geom_point(mapping = aes(x = Calcium, y = mean(mortality_water_augmented$Mortality)), color = "#56B4E9", size = 3, data = mortality_water_augmented) ``` ```{r, echo = FALSE, fig.height=1.5, fig.width=3.5} ggplot() + geom_density(mapping = aes(x = Mortality), data = mortality_water) + stat_function(mapping = aes(x = Mortality), fun = dnorm, colour = "red", args = list(mean = mean(mortality_water$Mortality), sd = sd(mortality_water$Mortality)), data = mortality_water) + coord_cartesian( xlim = mean(mortality_water_augmented$Mortality) + c(-3, 3) * sd(mortality_water_augmented$Mortality), ylim = c(0, 0.003)) ```

## How much does knowing Calcium help? * Without knowing calcium: * Guess $\bar{y} = 1524$ people per 100,000 population * About 95% of our sample fell within $\pm 2 s_y$ ($\pm 376$) of this guess * If we know Calcium: * Guess $\hat{y} = 1676.36 - 3.23 * Calcium$ * About 95% of our sample fell within $\pm 2 s_e$ ($\pm 286$) of this guess * Knowing calcium narrowed down range of values of Mortality a fair amount! ## How much does knowing Calcium help? * Let's compare the spreads of these distributions of deviations from the predictions with and without knowing calcium. * Let's compute $\frac{\text{spread of residuals if we DO know calcium}}{\text{spread of deviations if we DON'T know about calcium}}$

```{r, echo=FALSE, fig.height=1.5, fig.width=3.5} ggplot() + geom_density(mapping = aes(x = residual), data = mortality_water) + stat_function(mapping = aes(x = residual), fun = dnorm, colour = "red", args = list(mean = 0, sd = summary(linear_fit)$sigma), data = mortality_water) + coord_cartesian(xlim = c(-3, 3) * sd(mortality_water_augmented$Mortality)) ``` ```{r, echo = FALSE, fig.height=1.5, fig.width=3.5} ggplot() + geom_density(mapping = aes(x = Mortality), data = mortality_water) + stat_function(mapping = aes(x = Mortality), fun = dnorm, colour = "red", args = list(mean = mean(mortality_water$Mortality), sd = sd(mortality_water$Mortality)), data = mortality_water) + coord_cartesian(xlim = mean(mortality_water_augmented$Mortality) + c(-3, 3) * sd(mortality_water_augmented$Mortality)) ``` * If knowing Calcium helps a lot, this will be close to 0 * If knowing Calcium doesn't help much, this will be close to 1

## A Brief Aside: variance = (std. dev.)$^2$ * Sample variance of Mortality: $$s_y^2 = \left[\sqrt{\frac{\sum_{i = 1}^n (y_i - \bar{y})^2}{n - 1}}\right]^2 = \frac{\sum_{i = 1}^n (y_i - \bar{y})^2}{n - 1} = \frac{SST}{n - 1}$$ * $SST = \sum_{i = 1}^n (y_i - \bar{y})^2$. $SST$ stands for **S**um of **S**quares **T**otal (used in measuring the total variability in the response) * Residual Variance: $$s^2_e = \left[\sqrt{\frac{\sum_{i = 1}^n e_i^2}{n - 2}}\right]^2 = \frac{\sum_{i = 1}^n e_i^2}{n - 2} = \frac{\sum_{i = 1}^n (y_i - \hat{y}_i)^2}{n - 2} = \frac{SSE}{n - 2}$$ * $SSE = \sum_{i = 1}^n (y_i - \hat{y}_i)^2$. $SSE$ stands for **S**um of **S**quared **E**rrors (used in measuring the amount of variability in the errors, or residuals, from the model fit) ## "Multiple" $R^2$ * Let's summarize spread of residuals by $SSE = \sum_{i = 1}^n (y_i - \hat{y}_i)^2$ * ... and summarize spread of Mortality by $SST = \sum_{i = 1}^n (y_i - \bar{y})^2$ * So $\frac{\text{spread of residuals if we DO know calcium}}{\text{spread of deviations if we DON'T know about calcium}} = \frac{SSE}{SST}$ * Two extreme cases: * If the error goes to zero, we'd have a "perfect fit". $\frac{SSE}{SST} = 0$ * If $SSE = SST$, $x$ has told us nothing about $y$. $\frac{SSE}{SST} = 1$ * Interpret $SSE / SST$ as the fraction of the total variation in $y$ that is still in the residuals * $R^2 = 1 - \frac{SSE}{SST}$ is the fraction of the total variation in $y$ "accounted for" by the model in $x$. ## "Adjusted" $R^2$ * Let's summarize spread of residuals by the residual variance $s_e^2 = \frac{SSE}{n - 2}$ * ... and summarize spread of Mortality by its sample variance $s_y^2 = \frac{SST}{n - 1}$ * So $\frac{\text{spread of residuals if we DO know calcium}}{\text{spread of deviations if we DON'T know about calcium}} = \frac{s_e^2}{s_y^2}$ * Two extreme cases: * If the error goes to zero, we'd have a "perfect fit". $\frac{s_e^2}{s_y^2} = 0$ * If $s_e^2 = s_y^2$, $x$ has told us nothing about $y$. $\frac{s_e^2}{s_y^2} = 1$ * $R^2_{adj} = 1 - \frac{s_e^2}{s_y^2}$ is the fraction of the total variation in $y$ "accounted for" by the model in $x$. ## Residual Std. Dev. and $R^2$ in R ```{r, echo = TRUE} summary(linear_fit) ``` ## An even briefer aside: * Sample variance of Mortality: $$s_y^2 = \frac{SST}{n - 1}$$ * Residual Variance: $$s^2_e = \frac{SSE}{n - 2}$$ * The $n - 1$ and $n - 2$ have to do with the number of parameters used to make our prediction: * Normal model: 1 parameter ($\mu$), use $n - 1$ * Linear model: 2 parameters ($b_0$, $b_1$), use $n - 2$ * We'll talk about **why** in a month or so