---
title: "Linear Regression"
author: "Evan L. Ray"
date: "September 27, 2017"
output: ioslides_presentation
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
require(ggplot2)
require(dplyr)
require(tidyr)
require(readr)
require(mosaicData)
```

## Organization for These Slides

1. Linear Models -- the Statistical Ideas
2. Linear Models in R

# Linear Models -- the Statistical Ideas

## Today's Data Set

* Scientists believe that hard water (water with high concentrations of calcium and magnesium) is beneficial for health (Sengupta, P. (2013). IJPM, 4(8), 866-875.)
* We have recordings of the mortality rate (deaths per 100,000 population) and the concentration of calcium in drinking water (parts per million) in 61 large towns in England and Wales

```{r, echo=FALSE, message=FALSE, fig.height = 2.2, fig.width=6}
mortality_water <- read_csv("https://mhc-stat140-2017.github.io/data/sdm4/Hard_water_Derby.csv")
ggplot() +
  geom_point(mapping = aes(x = Calcium, y = Mortality), data = mortality_water)
```

* The **Explanatory** (~~independent~~) variable goes on the x axis; the **Response** (~~dependent~~) variable goes on the y axis

## Does Fitting a Line Make Sense?

* Always look at a scatter plot to verify these conditions:
    1. **Linear** relationship (straight enough)
    2. **No Outliers**

```{r, echo=FALSE, message=FALSE, fig.height = 4, fig.width=6}
ggplot() +
  geom_point(mapping = aes(x = Calcium, y = Mortality), data = mortality_water)
```

## Line of Best Fit

```{r, echo=FALSE, message=FALSE}
mortality_water <- read_csv("https://mhc-stat140-2017.github.io/data/sdm4/Hard_water_Derby.csv")
```

```{r, echo=FALSE, message=FALSE, fig.height = 2.5, fig.width=8}
ggplot() +
  geom_point(mapping = aes(x = Calcium, y = Mortality), data = mortality_water) +
  geom_smooth(mapping = aes(x = Calcium, y = Mortality), data = mortality_water, method = "lm", se = FALSE)
```

```{r, echo=FALSE}
linear_fit <- lm(Mortality ~ Calcium, data = mortality_water)
```

* **Equation:** Predicted Mortality = 1676.36 - 3.23 $*$ Calcium
* **General Form:** Predicted Response = intercept + slope $*$ Explanatory
* **General Form (math notation):** $\widehat{y}_i = b_0 + b_1 * x_i$
    * $y_i$: response for observation $i$, $\widehat{y}_i$: predicted response, $x_i$: explanatory variable
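## Using the Equation to Make a Prediction

* A small sketch of how the fitted equation turns a calcium concentration into a predicted mortality rate; the value of 100 ppm below is hypothetical, chosen only for illustration.

```{r, echo = TRUE}
# estimated intercept (b0) and slope (b1) from the fitted model
b0 <- coef(linear_fit)[1]
b1 <- coef(linear_fit)[2]

# predicted mortality rate for a hypothetical town with Calcium = 100 ppm
b0 + b1 * 100
```

* This is the same calculation the model's predictions use for every town in the data set.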
## Interpretation of the Coefficients:

```{r, echo=FALSE, message=FALSE, fig.height = 2.5, fig.width=6}
ggplot() +
  geom_point(mapping = aes(x = Calcium, y = Mortality), data = mortality_water) +
  geom_smooth(mapping = aes(x = Calcium, y = Mortality), data = mortality_water, method = "lm", se = FALSE)
```

* **Equation:** Predicted Mortality = 1676.36 - 3.23 $*$ Calcium
* **Interpretation of $b_0$**: If a town's drinking water contained no calcium, the predicted mortality rate would be 1676.36 people per 100,000, annually.
* **Interpretation of $b_1$**: For each unit increase in Calcium, the predicted mortality rate goes down by 3.23 people per 100,000, annually.

## Interpretation Continued

```{r, echo=FALSE, message=FALSE, fig.height = 2.5, fig.width=6}
ggplot() +
  geom_point(mapping = aes(x = Calcium, y = Mortality), data = mortality_water) +
  geom_smooth(mapping = aes(x = Calcium, y = Mortality), data = mortality_water, method = "lm", se = FALSE)
```

* **Interpretation of $b_1$**: For each unit increase in Calcium, the predicted mortality rate goes down by 3.23 people per 100,000, annually.
* **Does this mean a unit increase in Calcium *causes* a reduction in mortality rate of 3.23 people per 100,000?**

## Lurking Variables

* **Lurking Variable**: A variable that's not part of the model, but affects the way the variables in the model appear to be related.
* Are there any possible lurking variables in our example?

```{r, echo=FALSE, message=FALSE, fig.height = 2.2, fig.width=6}
ggplot() +
  geom_point(mapping = aes(x = Calcium, y = Mortality), data = mortality_water) +
  geom_smooth(mapping = aes(x = Calcium, y = Mortality), data = mortality_water, method = "lm", se = FALSE)
```

* **Never conclude that a causal relationship exists** by fitting a regression model to observational data
* Need to run experiments!

## Model Predictions Are Not Exactly Right!

```{r, echo=FALSE, message=FALSE, fig.height = 2.5, fig.width=6}
ggplot() +
  geom_point(mapping = aes(x = Calcium, y = Mortality), data = mortality_water) +
  geom_smooth(mapping = aes(x = Calcium, y = Mortality), data = mortality_water, method = "lm", se = FALSE)
```

* **Equation:** Predicted Mortality = 1676.36 - 3.23 $*$ Calcium
* The word **Predicted** is essential! The model's predictions don't exactly equal the observed values.

## Residuals

* **Residual** = **Observed** - **Predicted**
* $\definecolor{residual}{RGB}{230,159,0}\color{residual}e_i$ = $\definecolor{observed}{RGB}{0,158,115}\color{observed}y_i$ - $\definecolor{predicted}{RGB}{86,180,233}\color{predicted}\widehat{y}_i$

```{r, echo=FALSE, message=FALSE, fig.height = 4, fig.width=6}
x <- 71
y_obs <- 1713
y_hat <- predict(linear_fit, newdata = data.frame(Calcium = 71))
ex_df <- data.frame(
  x = x,
  y_obs = y_obs,
  y_hat = y_hat,
  resid = y_obs - y_hat
)
offset <- 8
offset2 <- 5
y_mid <- mean(c(y_obs, y_hat))
resid_df <- data.frame(
  x = c(x, x + offset, x + offset, x + offset + offset2, x + offset, x + offset, x),
  y = c(y_obs, y_obs, y_mid, y_mid, y_mid, y_hat, y_hat)
)
ggplot() +
  geom_point(mapping = aes(x = Calcium, y = Mortality), data = mortality_water) +
  geom_smooth(mapping = aes(x = Calcium, y = Mortality), data = mortality_water, method = "lm", se = FALSE) +
  geom_path(mapping = aes(x = x, y = y), color = "#e69f00", size = 2, data = resid_df) +
  geom_point(mapping = aes(x = x, y = y_obs), color = "#009E73", size = 4, data = ex_df) +
  geom_point(mapping = aes(x = x, y = y_hat), color = "#56B4E9", size = 4, data = ex_df) +
  geom_label(mapping = aes(label = "y[i]", x = x, y = y_obs), size = 8, nudge_x = -6.5, nudge_y = 40, color = "#009E73", parse = TRUE, data = ex_df) +
  geom_label(mapping = aes(label = "hat(y)[i]", x = x, y = y_hat), size = 8, nudge_x = -6.5, nudge_y = 55, color = "#56B4E9", parse = TRUE, data = ex_df) +
  geom_label(mapping = aes(label = "e[i]", x = x + offset + offset2, y = y_mid), size = 8, nudge_x = 5, color = "#e69f00", parse = TRUE, data = ex_df)
```
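## Checking the Residual Definition

* A quick numeric sketch of the definition above: observed minus predicted for one town (the first row, chosen arbitrarily) should match the residual that R reports.

```{r, echo = TRUE}
# observed mortality minus predicted mortality for the first town in the data set
mortality_water$Mortality[1] - predict(linear_fit)[1]

# the residual R reports for the same observation
residuals(linear_fit)[1]
```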
## Finding the Line of Best Fit

```{r, echo=FALSE, message=FALSE, fig.height = 3, fig.width=6}
mortality_water <- mutate(mortality_water,
  Mortality = as.numeric(Mortality),
  predicted = predict(linear_fit),
  residual = residuals(linear_fit))
mortality_water_augmented <- mortality_water
mortality_water_augmented$min_val <- as.numeric(apply(mortality_water_augmented, 1,
  function(rowvals) {min(rowvals[c("predicted", "Mortality")])}))
mortality_water_augmented$max_val <- as.numeric(apply(mortality_water_augmented, 1,
  function(rowvals) {max(rowvals[c("predicted", "Mortality")])}))
ggplot() +
  geom_linerange(mapping = aes(x = Calcium, ymin = min_val, ymax = max_val), color = "#e69f00", size = 2, data = mortality_water_augmented) +
  geom_point(mapping = aes(x = Calcium, y = Mortality), color = "#009E73", size = 3, data = mortality_water_augmented) +
  geom_smooth(mapping = aes(x = Calcium, y = Mortality), data = mortality_water_augmented, method = "lm", se = FALSE) +
  geom_point(mapping = aes(x = Calcium, y = predicted), color = "#56B4E9", size = 3, data = mortality_water_augmented)
```

* The "best" line has the smallest residuals
* Pick $b_0, b_1$ to minimize the sum of squared errors/residuals: $SSE = \sum_{i = 1}^n e_i^2$ (see Chapter 26...)

$$b_1 = r \frac{s_y}{s_x} \qquad \qquad b_0 = \bar{y} - b_1 \bar{x}$$
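## Checking the Formulas for $b_0$ and $b_1$

* A sketch that plugs the correlation, standard deviations, and means into the formulas above and compares the results to the coefficients `lm()` found.

```{r, echo = TRUE}
# slope: b1 = r * s_y / s_x, with Mortality as the response and Calcium as the explanatory variable
r <- cor(mortality_water$Calcium, mortality_water$Mortality)
b1 <- r * sd(mortality_water$Mortality) / sd(mortality_water$Calcium)

# intercept: b0 = ybar - b1 * xbar
b0 <- mean(mortality_water$Mortality) - b1 * mean(mortality_water$Calcium)
c(b0 = b0, b1 = b1)

# should match the coefficients estimated by lm()
coef(linear_fit)
```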
## The Distribution of the Residuals

```{r, echo = FALSE, fig.height=3}
mortality_water <- mutate(mortality_water,
  predicted = predict(linear_fit),
  residual = residuals(linear_fit))
ggplot() +
  geom_density(mapping = aes(x = residual), data = mortality_water)
```

* Estimating $b_0$ and $b_1$ by minimizing $SSE = \sum_{i = 1}^n e_i^2$ is only justified if:
    * The distribution of residuals is **nearly normal** (unimodal, symmetric)
    * The standard deviation of the residuals is about the same across all values of the explanatory variable

## Checking Equal Residual Std. Dev.

* "Does the plot thicken?"
* Mortality Example:

```{r, echo=FALSE, message=FALSE, fig.height = 4.2, fig.width=3.5}
ggplot() +
  geom_point(mapping = aes(x = Calcium, y = Mortality), data = mortality_water) +
  geom_smooth(mapping = aes(x = Calcium, y = Mortality), data = mortality_water, method = "lm", se = FALSE)
```

* Median family income vs. housing cost index in the 50 states

```{r, echo=FALSE, message=FALSE, fig.height = 3.5, fig.width=3.5}
income_housing <- read_csv("https://mhc-stat140-2017.github.io/data/sdm4/Income_and_Housing.csv")
names(income_housing) <- c("state", "median_income", "housing_cost")
ggplot() +
  geom_point(mapping = aes(x = median_income, y = housing_cost), data = income_housing) +
  geom_smooth(mapping = aes(x = median_income, y = housing_cost), data = income_housing, method = "lm", se = FALSE)
```
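## Plotting Residuals to Check Equal Spread

* Another way to look for "thickening" (a sketch; it assumes the residuals have already been added to `mortality_water` as a column named `residual`, which happens behind the scenes in the previous slides): plot the residuals against the explanatory variable and check whether their vertical spread stays roughly constant.

```{r, echo = TRUE, fig.height = 3, fig.width = 6}
# residuals vs. the explanatory variable; we want roughly equal spread from left to right
ggplot() +
  geom_point(mapping = aes(x = Calcium, y = residual), data = mortality_water) +
  geom_hline(yintercept = 0)
```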
## Summary of Conditions to Check

* Linear Models are only appropriate for **quantitative variables** (at least, in this Chapter)
* Four conditions to check:
    1. **Linear** relationship (straight enough)
        * scatterplot of `response_variable` vs. `explanatory_variable`
    2. No **Outliers**
        * scatterplot of `response_variable` vs. `explanatory_variable`
    3. **Normally Distributed Residuals**
        * density plot of residuals
    4. **Equal Spread** (does the plot thicken?)
        * scatterplot of `response_variable` vs. `explanatory_variable`

# Linear Models in R

## Fitting Linear Models in R

* General Template:

```{r, echo = TRUE, eval = FALSE}
linear_fit <- lm(response_variable ~ explanatory_variable, data = data_frame)
```

* Our Example:

```{r, echo = TRUE}
linear_fit <- lm(Mortality ~ Calcium, data = mortality_water)
```

* "Fit a linear model for `response_variable` as a function of `explanatory_variable`, with the variables taken from `data_frame`, and save the result in an object called `linear_fit`"

## Getting the Estimated Coefficients in R

* Use the `coef()` function to view the estimated coefficients:

```{r, echo = TRUE}
coef(linear_fit)
```

* $b_0$ is labeled `(Intercept)`
* $b_1$ is labeled with the explanatory variable name

## Getting Predicted Values in R

* Use `predict()` to get the predicted responses for all observations in the data set

```{r, echo = TRUE}
predict(linear_fit)
```
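## Predictions at New Values of the Explanatory Variable

* `predict()` can also give predictions at values that are not in the data set, via the `newdata` argument; the calcium concentrations of 20 and 100 ppm below are hypothetical, chosen for illustration. The column name in `newdata` must match the explanatory variable's name.

```{r, echo = TRUE}
# predicted mortality rates for two hypothetical calcium concentrations
predict(linear_fit, newdata = data.frame(Calcium = c(20, 100)))
```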
## Getting Residuals in R

* Use `residuals()` to get the residuals for all observations in the data set

```{r, echo = TRUE}
residuals(linear_fit)
```

## Saving the Predictions and Residuals

* For making plots, it's helpful to add the predictions and residuals to the original data frame:

```{r, echo = TRUE}
mortality_water <- mutate(mortality_water,
  predicted = predict(linear_fit),
  residual = residuals(linear_fit))
head(mortality_water)
```

## A Density Plot of Residuals

```{r, echo = TRUE, fig.height=4}
ggplot() +
  geom_density(mapping = aes(x = residual), data = mortality_water)
```

## Adding the line to a scatterplot

```{r, echo=TRUE, message=FALSE, fig.height = 3, fig.width=8}
ggplot() +
  geom_point(mapping = aes(x = Calcium, y = Mortality), data = mortality_water) +
  geom_smooth(mapping = aes(x = Calcium, y = Mortality), data = mortality_water, method = "lm", se = FALSE)
```

## Summary of R Functions:

| Function | What does it do? |
|----------------------|--------------|
| `linear_fit <- lm(response ~ explanatory, data = data_frame)` | Fit a linear model and save the fit in an object named `linear_fit` |
| `coef(linear_fit)` | Get the estimated coefficients $b_0$ and $b_1$ |
| `predict(linear_fit)` | Get the predicted response for each observation |
| `residuals(linear_fit)` | Get the residuals for each observation |
| `geom_smooth(mapping = aes(...), data = ..., method = "lm", se = FALSE)` | Add a line to a scatter plot |

* You probably want to save the `predict`ions and `residuals` in the original data frame by using `mutate`.