---
title: "Linear Regression -- Part 3"
author: "Evan L. Ray (adapted from Brianna Heggeseth)"
date: "October 2, 2017"
output: ioslides_presentation
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
require(ggplot2)
require(dplyr)
require(tidyr)
require(readr)
```

## 93 Car Models Sold in 1993

* Evaluate assumptions:
    * Quantitative variables
    * Linear Relationship
    * Outliers
    * Equal Spread

```{r, message = FALSE, echo = FALSE, fig.height=3, fig.width=8}
library(MASS)
ggplot() +
  geom_point(mapping = aes(x = Weight, y = MPG.city), data = Cars93)
```

## Residual Plots

```{r, echo = TRUE, eval = FALSE, message = FALSE, fig.height=3, fig.width = 4}
ggplot() +
  geom_point(mapping = aes(x = prediction, y = residual), data = Cars93) +
  geom_smooth(mapping = aes(x = prediction, y = residual), data = Cars93, se = FALSE)

ggplot() +
  geom_point(mapping = aes(x = Weight, y = residual), data = Cars93) +
  geom_smooth(mapping = aes(x = Weight, y = residual), data = Cars93, se = FALSE)
```
```{r, echo = FALSE, eval = TRUE, message = FALSE, fig.height=2.8, fig.width = 4}
linear_fit <- lm(MPG.city ~ Weight, data = Cars93)
Cars93 <- mutate(Cars93,
  residual = residuals(linear_fit),
  prediction = predict(linear_fit))

ggplot() +
  geom_point(mapping = aes(x = prediction, y = residual), data = Cars93) +
  geom_smooth(mapping = aes(x = prediction, y = residual), data = Cars93, se = FALSE)
```

```{r, echo = FALSE, eval = TRUE, message = FALSE, fig.height=2.8, fig.width = 4}
ggplot() +
  geom_point(mapping = aes(x = Weight, y = residual), data = Cars93) +
  geom_smooth(mapping = aes(x = Weight, y = residual), data = Cars93, se = FALSE)
```
## What to do if assumptions aren't met?

* If at all possible, **don't throw out outliers** (unless they are due to an irresolvable error in data collection)
* Outliers represent an important part of reality
* If we must discard outliers, do the analysis both with and without them and discuss both sets of results
```{r, echo = FALSE, eval = TRUE, message = FALSE, fig.height=3.5, fig.width = 4}
ggplot() +
  geom_point(mapping = aes(x = Weight, y = residual), data = Cars93) +
  geom_smooth(mapping = aes(x = Weight, y = residual), data = Cars93, se = FALSE)
```

```{r, echo = TRUE}
Cars93$Make[Cars93$residual > 10]
```
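One way to carry out that "with and without" comparison is to refit the line after setting aside the large-residual cars and see how much the coefficients move. A sketch (the 10 mpg cutoff is just the ad hoc threshold used above, and `fit_trimmed` is our name, not from the slides):

```{r, echo = TRUE, eval = FALSE}
# Fit with all cars, and again without the cars whose residuals exceed 10 mpg
fit_all <- lm(MPG.city ~ Weight, data = Cars93)
fit_trimmed <- lm(MPG.city ~ Weight, data = filter(Cars93, residual <= 10))

# Compare estimated intercepts and slopes; report both in the writeup
coef(fit_all)
coef(fit_trimmed)
```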
## What to do?

* So what do we do when
    * the scatterplot isn't linear (the residual plot has a pattern),
    * there is unequal variability (there is thickening), or
    * the residuals are not normally distributed?
* Basically, 2 options:
    1. Fit a more flexible model (e.g., fit a parabola)
    2. Transform the data

## Option 1: A more flexible model

* Let's fit a parabola (polynomial of degree 2)

```{r, echo = TRUE, eval = TRUE, message = FALSE, fig.height=2.3, fig.width = 6}
quadratic_fit <- lm(MPG.city ~ poly(Weight, degree = 2), data = Cars93)
ggplot() +
  geom_point(aes(x = Weight, y = MPG.city), data = Cars93) +
  geom_smooth(aes(x = Weight, y = MPG.city), data = Cars93,
    method = "lm", formula = y ~ poly(x, degree = 2), se = FALSE)
```

## Residuals from Quadratic Fit

```{r, echo = TRUE, eval = TRUE, message = FALSE, fig.height = 3, fig.width = 4}
Cars93 <- mutate(Cars93,
  residual_quad = residuals(quadratic_fit),
  prediction_quad = predict(quadratic_fit))

ggplot() +
  geom_point(mapping = aes(x = Weight, y = residual_quad), data = Cars93) +
  geom_smooth(mapping = aes(x = Weight, y = residual_quad), data = Cars93, se = FALSE)
```

```{r, echo = FALSE, eval = TRUE, message = FALSE, fig.height=2.8, fig.width = 4}
ggplot() +
  geom_point(mapping = aes(x = Weight, y = residual), data = Cars93) +
  geom_smooth(mapping = aes(x = Weight, y = residual), data = Cars93, se = FALSE)
```

## Option 2: Transformation

* We might be able to **transform** $y$ and/or $x$. Start with $y$.
* Imagine a "ladder of powers" of $y$ (or $x$): we start at $y$ and go up or down the ladder.
$$
\vdots\\
y^2\\
y\\
\sqrt{y}\\
y^{"0"} \text{ (we use $\log(y)$ here)} \\
-1/\sqrt{y} \text{ (the $-$ keeps the direction of association between $x$ and the response)}\\
-1/y\\
-1/y^2\\
\vdots
$$

## Tukey's Circle of Transformations

```{r, echo = TRUE, eval = FALSE, message = FALSE, fig.height=2.3, fig.width = 6}
ggplot() +
  geom_point(aes(x = Weight, y = MPG.city), data = Cars93) +
  geom_smooth(aes(x = Weight, y = MPG.city), data = Cars93, se = FALSE)
```
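One quick way to explore the rungs of the ladder all at once is to compute the correlation between `Weight` and each transformed response -- a straighter relationship gives a larger absolute correlation. A rough screening sketch (the `ladder` list is ours, not from the slides; it just mirrors the rungs above):

```{r, echo = TRUE, eval = FALSE}
# One function per rung of the ladder of powers for y
ladder <- list(
  "y^2"        = function(y) y^2,
  "y"          = function(y) y,
  "sqrt(y)"    = function(y) sqrt(y),
  "log(y)"     = function(y) log(y),
  "-1/sqrt(y)" = function(y) -1/sqrt(y),
  "-1/y"       = function(y) -1/y,
  "-1/y^2"     = function(y) -1/y^2
)

# Correlation of Weight with each transformed MPG.city;
# look for the rung where |correlation| is largest
sapply(ladder, function(f) cor(Cars93$Weight, f(Cars93$MPG.city)))
```

This is only a screen -- the residual plots on the following slides remain the real check.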
```{r, echo = FALSE, eval = TRUE, message = FALSE, fig.height=4, fig.width = 3.5}
ggplot() +
  geom_point(aes(x = Weight, y = MPG.city), data = Cars93) +
  geom_smooth(aes(x = Weight, y = MPG.city), data = Cars93, se = FALSE)
```
## Down the Ladder: $y = \sqrt{MPG.city}$

```{r, echo = TRUE}
sqrt_fit <- lm(sqrt(MPG.city) ~ Weight, data = Cars93)
```
```{r, echo = FALSE, eval = TRUE, message = FALSE, fig.height=4, fig.width = 3.5}
ggplot() +
  geom_point(aes(x = Weight, y = sqrt(MPG.city)), data = Cars93) +
  geom_smooth(aes(x = Weight, y = sqrt(MPG.city)), data = Cars93, se = FALSE)
```

```{r, echo = FALSE, eval = TRUE, message = FALSE, fig.height=4, fig.width = 3.5}
Cars93 <- mutate(Cars93, sqrt_residual = residuals(sqrt_fit))
ggplot() +
  geom_point(aes(x = Weight, y = sqrt_residual), data = Cars93) +
  geom_smooth(aes(x = Weight, y = sqrt_residual), data = Cars93, se = FALSE)
```
It's better, but we should go further.

## Down the Ladder: $y = \log(MPG.city)$

```{r, echo = TRUE}
log_fit <- lm(log(MPG.city) ~ Weight, data = Cars93)
```
```{r, echo = FALSE, eval = TRUE, message = FALSE, fig.height=4, fig.width = 3.5}
ggplot() +
  geom_point(aes(x = Weight, y = log(MPG.city)), data = Cars93) +
  geom_smooth(aes(x = Weight, y = log(MPG.city)), data = Cars93, se = FALSE)
```

```{r, echo = FALSE, eval = TRUE, message = FALSE, fig.height=4, fig.width = 3.5}
Cars93 <- mutate(Cars93, log_residual = residuals(log_fit))
ggplot() +
  geom_point(aes(x = Weight, y = log_residual), data = Cars93) +
  geom_smooth(aes(x = Weight, y = log_residual), data = Cars93, se = FALSE)
```
Keep going down the ladder until you've clearly gone too far -- that way you know the right transformation is bracketed.

## Down the Ladder: $y = \frac{-1}{MPG.city}$

```{r, echo = TRUE}
inverse_fit <- lm(I(-1/MPG.city) ~ Weight, data = Cars93)
```
```{r, echo = FALSE, eval = TRUE, message = FALSE, fig.height=4, fig.width = 3.5}
ggplot() +
  geom_point(aes(x = Weight, y = -1/MPG.city), data = Cars93) +
  geom_smooth(aes(x = Weight, y = -1/MPG.city), data = Cars93, se = FALSE)
```

```{r, echo = FALSE, eval = TRUE, message = FALSE, fig.height=4, fig.width = 3.5}
Cars93 <- mutate(Cars93, inverse_residual = residuals(inverse_fit))
ggplot() +
  geom_point(aes(x = Weight, y = inverse_residual), data = Cars93) +
  geom_smooth(aes(x = Weight, y = inverse_residual), data = Cars93, se = FALSE)
```
This looks pretty good -- let's try one more.

## Down the Ladder: $y = \frac{-1}{MPG.city^2}$

```{r, echo = TRUE}
inverse_sq_fit <- lm(I(-1/(MPG.city^2)) ~ Weight, data = Cars93)
```
```{r, echo = FALSE, eval = TRUE, message = FALSE, fig.height=4, fig.width = 3.5}
ggplot() +
  geom_point(aes(x = Weight, y = -1/(MPG.city^2)), data = Cars93) +
  geom_smooth(aes(x = Weight, y = -1/(MPG.city^2)), data = Cars93, se = FALSE)
```

```{r, echo = FALSE, eval = TRUE, message = FALSE, fig.height=4, fig.width = 3.5}
Cars93 <- mutate(Cars93, inverse_sq_residual = residuals(inverse_sq_fit))
ggplot() +
  geom_point(aes(x = Weight, y = inverse_sq_residual), data = Cars93) +
  geom_smooth(aes(x = Weight, y = inverse_sq_residual), data = Cars93, se = FALSE)
```
Looks like we went too far. Final answer: $y = \frac{-1}{MPG.city}$

## Transforming Back

Once we transform a response, how do we use the model?

```{r, echo = TRUE}
Hummer <- data.frame(Weight = 6280)
predict(linear_fit, newdata = Hummer) # Not a good prediction ...
predict(inverse_fit, newdata = Hummer) # ok, but what does it mean?
-1/predict(inverse_fit, newdata = Hummer) # Aha!
```

## Things to remember

* Transformations can help straighten curves and equalize spread
* Don't forget to look at residual plots after transforming
* And don't forget to transform the predictions back!
    * No one wants a prediction of $-1/\sqrt{revenue}$
* There are also automatic procedures for selecting the transformation -- see more advanced classes
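One such automatic procedure is the Box-Cox method, which profiles a likelihood over a grid of candidate powers $\lambda$; it is provided by `MASS` (already loaded above). A quick preview sketch:

```{r, echo = TRUE, eval = FALSE}
# Box-Cox: plot the profile log-likelihood over candidate powers lambda.
# The lambda where the curve peaks suggests a power transformation of y;
# a peak near lambda = -1 would support the -1/MPG.city choice above.
boxcox(lm(MPG.city ~ Weight, data = Cars93), lambda = seq(-2, 1, by = 0.1))
```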