---
title: "Linear Regression"
author: "Evan L. Ray"
date: "September 27, 2017"
output: ioslides_presentation
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
require(ggplot2)
require(dplyr)
require(tidyr)
require(readr)
require(mosaicData)
```

## Organization for These Slides

1. Linear Models -- the Statistical Ideas
2. Linear Models in R

# Linear Models -- the Statistical Ideas

## Today's Data Set

* Scientists believe that hard water (water with high concentrations of calcium and magnesium) is beneficial for health (Sengupta, P. (2013). IJPM, 4(8), 866-875.)
* We have recordings of the mortality rate (deaths per 100,000 population) and the concentration of calcium in drinking water (parts per million) in 61 large towns in England and Wales

```{r, echo=FALSE, message=FALSE, fig.height = 2.2, fig.width=6}
mortality_water <- read_csv("https://mhc-stat140-2017.github.io/data/sdm4/Hard_water_Derby.csv")
ggplot() +
  geom_point(mapping = aes(x = Calcium, y = Mortality), data = mortality_water)
```

* The **explanatory** (~~independent~~) variable goes on the x axis; the **response** (~~dependent~~) variable goes on the y axis

## Does Fitting a Line Make Sense?

* Always look at a scatter plot to verify these conditions:
    1. **Linear** relationship (straight enough)
    2.
**No Outliers**

```{r, echo=FALSE, message=FALSE, fig.height = 4, fig.width=6}
ggplot() +
  geom_point(mapping = aes(x = Calcium, y = Mortality), data = mortality_water)
```

## Line of Best Fit

```{r, echo=FALSE, message=FALSE}
mortality_water <- read_csv("https://mhc-stat140-2017.github.io/data/sdm4/Hard_water_Derby.csv")
```

```{r, echo=FALSE, message=FALSE, fig.height = 2.5, fig.width=8}
ggplot() +
  geom_point(mapping = aes(x = Calcium, y = Mortality), data = mortality_water) +
  geom_smooth(mapping = aes(x = Calcium, y = Mortality), data = mortality_water,
    method = "lm", se = FALSE)
```

```{r, echo=FALSE}
linear_fit <- lm(Mortality ~ Calcium, data = mortality_water)
```

* **Equation:** Predicted Mortality = 1676.36 - 3.23 $*$ Calcium
* **General Form:** Predicted Response = intercept + slope $*$ Explanatory
* **General Form (math notation):** $\widehat{y}_i = b_0 + b_1 * x_i$
* $y_i$: response for observation $i$; $\widehat{y}_i$: predicted response; $x_i$: explanatory variable

## Interpretation of the Coefficients

```{r, echo=FALSE, message=FALSE, fig.height = 2.5, fig.width=6}
ggplot() +
  geom_point(mapping = aes(x = Calcium, y = Mortality), data = mortality_water) +
  geom_smooth(mapping = aes(x = Calcium, y = Mortality), data = mortality_water,
    method = "lm", se = FALSE)
```

* **Equation:** Predicted Mortality = 1676.36 - 3.23 $*$ Calcium
* **Interpretation of $b_0$**: If a town's drinking water contained no calcium, the predicted mortality rate would be 1676.36 deaths per 100,000 people, annually.
* **Interpretation of $b_1$**: For each one part per million increase in Calcium, the predicted mortality rate goes down by 3.23 deaths per 100,000 people, annually.
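## Fitting a Line in R

Because the chunks above run with `echo = FALSE`, the code behind the fitted equation never appears on the slides. Here is a minimal standalone sketch of the same workflow; the data frame `toy` and the names `fit`, `b0`, `b1`, and `y_hat` are made up for illustration (the real analysis reads the CSV from the URL above).

```r
# Made-up stand-in for the mortality/calcium data (NOT the real measurements)
toy <- data.frame(Calcium   = c(10, 30, 50, 70, 90),
                  Mortality = c(1650, 1580, 1510, 1450, 1380))

# Fit the least-squares line: Mortality ~ Calcium
fit <- lm(Mortality ~ Calcium, data = toy)
b0 <- coef(fit)[["(Intercept)"]]  # intercept b0
b1 <- coef(fit)[["Calcium"]]      # slope b1

# Predicted response at Calcium = 40, computed from the equation by hand:
# y-hat = b0 + b1 * x
y_hat <- b0 + b1 * 40

# predict() applies the same equation
predict(fit, newdata = data.frame(Calcium = 40))
```

For these toy numbers the fitted line is Predicted Mortality = 1681.5 - 3.35 $*$ Calcium, so both the hand calculation and `predict()` return 1547.5.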
## Interpretation Continued

```{r, echo=FALSE, message=FALSE, fig.height = 2.5, fig.width=6}
ggplot() +
  geom_point(mapping = aes(x = Calcium, y = Mortality), data = mortality_water) +
  geom_smooth(mapping = aes(x = Calcium, y = Mortality), data = mortality_water,
    method = "lm", se = FALSE)
```

* **Interpretation of $b_1$**: For each one part per million increase in Calcium, the predicted mortality rate goes down by 3.23 deaths per 100,000 people, annually.
* **Does this mean a unit increase in Calcium *causes* a reduction in the mortality rate of 3.23 deaths per 100,000?**

## Lurking Variables

* **Lurking variable**: a variable that's not part of the model, but affects the way the variables in the model appear to be related.
* Are there any possible lurking variables in our example?

```{r, echo=FALSE, message=FALSE, fig.height = 2.2, fig.width=6}
ggplot() +
  geom_point(mapping = aes(x = Calcium, y = Mortality), data = mortality_water) +
  geom_smooth(mapping = aes(x = Calcium, y = Mortality), data = mortality_water,
    method = "lm", se = FALSE)
```

* **Never conclude that a causal relationship exists** by fitting a regression model to observational data
* To establish causation, we need to run experiments!

## Model Predictions Are Not Exactly Right!

```{r, echo=FALSE, message=FALSE, fig.height = 2.5, fig.width=6}
ggplot() +
  geom_point(mapping = aes(x = Calcium, y = Mortality), data = mortality_water) +
  geom_smooth(mapping = aes(x = Calcium, y = Mortality), data = mortality_water,
    method = "lm", se = FALSE)
```

* **Equation:** Predicted Mortality = 1676.36 - 3.23 $*$ Calcium
* The word **Predicted** is essential! The model's predictions don't exactly equal the observed values.
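## Observed vs. Predicted in R

The gap between observed and predicted values can be seen directly in R. A sketch with made-up stand-in numbers (the names `toy`, `fit`, and `comparison` are invented for this example; the real data load from the URL in the hidden chunks):

```r
# Made-up stand-in data; the observed values deliberately don't sit on a line
toy <- data.frame(Calcium   = c(10, 30, 50, 70, 90),
                  Mortality = c(1655, 1570, 1520, 1445, 1385))
fit <- lm(Mortality ~ Calcium, data = toy)

# fitted() returns the model's prediction y-hat for each observation
comparison <- data.frame(observed  = toy$Mortality,
                         predicted = fitted(fit))

# Observed minus predicted is nonzero for every "town" here, even though
# least-squares guarantees these differences sum to zero overall
comparison$difference <- comparison$observed - comparison$predicted
```

Printing `comparison` shows a nonzero `difference` in every row: the line summarizes the trend, but no prediction matches an observation exactly.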
## Residuals

* **Residual** = **Observed** - **Predicted**
* $\definecolor{residual}{RGB}{230,159,0}\color{residual}e_i$ = $\definecolor{observed}{RGB}{0,158,115}\color{observed}y_i$ - $\definecolor{predicted}{RGB}{86,180,233}\color{predicted}\widehat{y}_i$

```{r, echo=FALSE, message=FALSE, fig.height = 4, fig.width=6}
x <- 71
y_obs <- 1713
y_hat <- predict(linear_fit, newdata = data.frame(Calcium = 71))
ex_df <- data.frame(
  x = x,
  y_obs = y_obs,
  y_hat = y_hat,
  resid = y_obs - y_hat
)
offset <- 8
offset2 <- 5
y_mid <- mean(c(y_obs, y_hat))
resid_df <- data.frame(
  x = c(x, x + offset, x + offset, x + offset + offset2, x + offset, x + offset, x),
  y = c(y_obs, y_obs, y_mid, y_mid, y_mid, y_hat, y_hat)
)
ggplot() +
  geom_point(mapping = aes(x = Calcium, y = Mortality), data = mortality_water) +
  geom_smooth(mapping = aes(x = Calcium, y = Mortality), data = mortality_water,
    method = "lm", se = FALSE) +
  geom_path(mapping = aes(x = x, y = y), color = "#e69f00", size = 2, data = resid_df) +
  geom_point(mapping = aes(x = x, y = y_obs), color = "#009E73", size = 4, data = ex_df) +
  geom_point(mapping = aes(x = x, y = y_hat), color = "#56B4E9", size = 4, data = ex_df) +
  geom_label(mapping = aes(label = "y[i]", x = x, y = y_obs), size = 8,
    nudge_x = -6.5, nudge_y = 40, color = "#009E73", parse = TRUE, data = ex_df) +
  geom_label(mapping = aes(label = "hat(y)[i]", x = x, y = y_hat), size = 8,
    nudge_x = -6.5, nudge_y = 55, color = "#56B4E9", parse = TRUE, data = ex_df) +
  geom_label(mapping = aes(label = "e[i]", x = x + offset + offset2, y = y_mid), size = 8,
    nudge_x = 5, color = "#e69f00", parse = TRUE, data = ex_df)
```

## Finding the Line of Best Fit

```{r, echo=FALSE, message=FALSE, fig.height = 3, fig.width=6}
mortality_water <- mutate(mortality_water,
  Mortality = as.numeric(Mortality),
  predicted = predict(linear_fit),
  residual = residuals(linear_fit))
mortality_water_augmented <- mortality_water
# pmin/pmax take elementwise minima/maxima of the two numeric columns
# (apply over rows would coerce every value to character first)
mortality_water_augmented$min_val <- pmin(mortality_water_augmented$predicted,
  mortality_water_augmented$Mortality)
mortality_water_augmented$max_val <- pmax(mortality_water_augmented$predicted,
  mortality_water_augmented$Mortality)
ggplot() +
  geom_linerange(mapping = aes(x = Calcium, ymin = min_val, ymax = max_val),
    color = "#e69f00", size = 2, data = mortality_water_augmented) +
  geom_point(mapping = aes(x = Calcium, y = Mortality), color = "#009E73", size = 3,
    data = mortality_water_augmented) +
  geom_smooth(mapping = aes(x = Calcium, y = Mortality), data = mortality_water_augmented,
    method = "lm", se = FALSE) +
  geom_point(mapping = aes(x = Calcium, y = predicted), color = "#56B4E9", size = 3,
    data = mortality_water_augmented)
```

* The "best" line has the smallest residuals
* Pick $b_0, b_1$ to minimize the sum of squared errors/residuals: $SSE = \sum_{i = 1}^n e_i^2$ (see Chapter 26...)

$$b_1 = r \frac{s_y}{s_x} \qquad \qquad b_0 = \bar{y} - b_1 \bar{x}$$

## The Distribution of the Residuals

```{r, echo = FALSE, fig.height=3}
mortality_water <- mutate(mortality_water,
  predicted = predict(linear_fit),
  residual = residuals(linear_fit))
ggplot() +
  geom_density(mapping = aes(x = residual), data = mortality_water)
```

* Estimating $b_0$ and $b_1$ by minimizing $SSE = \sum_{i = 1}^n e_i^2$ is only justified if:
    * The distribution of the residuals is **nearly normal** (unimodal, symmetric)
    * The standard deviation of the residuals is about the same across all values of the explanatory variable

## Checking Equal Residual Std. Dev.

* "Does the plot thicken?"
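A common way to check this is a residuals-versus-predicted plot: if the vertical spread of the points stays roughly constant from left to right (no "thickening"), the equal-spread condition is reasonable. A sketch, again using made-up stand-in data and assuming ggplot2 is installed (it's loaded in the setup chunk); `toy`, `fit`, and `diag_df` are invented names:

```r
library(ggplot2)

# Made-up stand-in data (the hidden chunks above use the real CSV)
toy <- data.frame(Calcium   = c(10, 30, 50, 70, 90),
                  Mortality = c(1655, 1570, 1520, 1445, 1385))
fit <- lm(Mortality ~ Calcium, data = toy)

# One row per observation: predicted value and residual
diag_df <- data.frame(predicted = fitted(fit),
                      residual  = residuals(fit))

# Look for an even horizontal band of points around the dashed zero line;
# a fan or funnel shape means the residual spread is not constant
ggplot(diag_df) +
  geom_point(mapping = aes(x = predicted, y = residual)) +
  geom_hline(yintercept = 0, linetype = "dashed")
```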