---
title: "Linear Regression -- Part 3"
author: "Evan L. Ray (adapted from Brianna Heggeseth)"
date: "October 2, 2017"
output: ioslides_presentation
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
require(ggplot2)
require(dplyr)
require(tidyr)
require(readr)
```
## 93 Car Models Sold in 1993

* Evaluate assumptions:
    * Quantitative
    * Linear Relationship
    * Outliers
    * Equal Spread
```{r, message = FALSE, echo = FALSE, fig.height=3, fig.width=8}
library(MASS)
ggplot() +
  geom_point(mapping = aes(x = Weight, y = MPG.city), data = Cars93)
```
## Residuals Plots
```{r, echo = TRUE, eval = FALSE, message = FALSE, fig.height=3, fig.width = 4}
ggplot() +
  geom_point(mapping = aes(x = prediction, y = residual), data = Cars93) +
  geom_smooth(mapping = aes(x = prediction, y = residual), data = Cars93,
              se = FALSE)
ggplot() +
  geom_point(mapping = aes(x = Weight, y = residual), data = Cars93) +
  geom_smooth(mapping = aes(x = Weight, y = residual), data = Cars93,
              se = FALSE)
```
```{r, echo = FALSE, eval = TRUE, message = FALSE, fig.height=2.8, fig.width = 4}
linear_fit <- lm(MPG.city ~ Weight, data = Cars93)
Cars93 <- mutate(Cars93,
                 residual = residuals(linear_fit),
                 prediction = predict(linear_fit))
ggplot() +
  geom_point(mapping = aes(x = prediction, y = residual), data = Cars93) +
  geom_smooth(mapping = aes(x = prediction, y = residual), data = Cars93,
              se = FALSE)
```
```{r, echo = FALSE, eval = TRUE, message = FALSE, fig.height=2.8, fig.width = 4}
ggplot() +
  geom_point(mapping = aes(x = Weight, y = residual), data = Cars93) +
  geom_smooth(mapping = aes(x = Weight, y = residual), data = Cars93,
              se = FALSE)
```
## What to do if assumptions aren't met?
* If at all possible, **don't throw out outliers** (unless they are due to an irresolvable error in data collection)
    * Outliers represent an important part of reality
    * If we must discard outliers, do the analysis both with and without them and discuss both
```{r, echo = FALSE, eval = TRUE, message = FALSE, fig.height=3.5, fig.width = 4}
ggplot() +
  geom_point(mapping = aes(x = Weight, y = residual), data = Cars93) +
  geom_smooth(mapping = aes(x = Weight, y = residual), data = Cars93,
              se = FALSE)
```
```{r, echo = TRUE}
Cars93$Make[Cars93$residual > 10]
```
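As a sketch of that advice (the cutoff of 10 and this particular comparison are mine, not from the slides), we can refit without the flagged cars and compare the two slopes; both analyses should be reported:

```{r, echo = TRUE}
library(MASS)
library(dplyr)
linear_fit <- lm(MPG.city ~ Weight, data = Cars93)
Cars93 <- mutate(Cars93, residual = residuals(linear_fit))
# Refit without the cars flagged above (residual > 10) -- for comparison only
fit_trimmed <- lm(MPG.city ~ Weight, data = filter(Cars93, residual <= 10))
# Report both slopes (with and without the outliers) in any write-up
coef(linear_fit)["Weight"]
coef(fit_trimmed)["Weight"]
```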
## What to do?
* So what do we do when
    * the scatterplot isn't linear (the residual plot has a pattern),
    * there is unequal variability (the residual plot shows thickening), or
    * the residuals are not normally distributed?
* Basically, two options:
    1. Fit a more flexible model (e.g., fit a parabola)
    2. Transform the data
## Option 1: A more flexible model
* Let's fit a parabola (polynomial with degree 2)
```{r, echo = TRUE, eval = TRUE, message = FALSE, fig.height=2.3, fig.width = 6}
quadratic_fit <- lm(MPG.city ~ poly(Weight, degree = 2), data = Cars93)
ggplot() +
  geom_point(aes(x = Weight, y = MPG.city), data = Cars93) +
  geom_smooth(aes(x = Weight, y = MPG.city), data = Cars93,
              method = "lm",
              formula = y ~ poly(x, degree = 2),
              se = FALSE)
```
## Residuals from Quadratic Fit
```{r, echo = TRUE, eval = TRUE, message = FALSE, fig.height = 3, fig.width = 4}
Cars93 <- mutate(Cars93,
                 residual_quad = residuals(quadratic_fit),
                 prediction_quad = predict(quadratic_fit))
ggplot() +
  geom_point(mapping = aes(x = Weight, y = residual_quad), data = Cars93) +
  geom_smooth(mapping = aes(x = Weight, y = residual_quad), data = Cars93,
              se = FALSE)
```
```{r, echo = FALSE, eval = TRUE, message = FALSE, fig.height=2.8, fig.width = 4}
ggplot() +
  geom_point(mapping = aes(x = Weight, y = residual), data = Cars93) +
  geom_smooth(mapping = aes(x = Weight, y = residual), data = Cars93,
              se = FALSE)
```
## Option 2: Transformation
* We might be able to **transform** $y$ and/or $x$. Start with $y$.
* Imagine a "ladder of powers" of $y$ (or $x$): We start at $y$ and go up or down the ladder.
$$
\vdots\\
y^2\\
y\\
\sqrt{y}\\
y^{"0"} \text{ (we use $\log(y)$ here)} \\
-1/\sqrt{y} \text{ (the $-$ keeps direction of association between $x$ and response)}\\
-1/y\\
-1/y^2\\
\vdots
$$
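The rungs of the ladder can be computed directly with `transmute` (the column names here are mine, added for illustration):

```{r, echo = TRUE}
library(MASS)
library(dplyr)
ladder <- transmute(Cars93,
                    Weight,
                    y_squared = MPG.city^2,     # one rung up
                    y         = MPG.city,       # starting point
                    y_sqrt    = sqrt(MPG.city), # one rung down
                    y_log     = log(MPG.city),  # the "0" power
                    y_neg_inv = -1 / MPG.city)  # minus keeps the direction
head(ladder, 3)
```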
## Tukey's Circle of Transformations
```{r, echo = TRUE, eval = FALSE, message = FALSE, fig.height=2.3, fig.width = 6}
ggplot() +
  geom_point(aes(x = Weight, y = MPG.city), data = Cars93) +
  geom_smooth(aes(x = Weight, y = MPG.city), data = Cars93,
              se = FALSE)
```
```{r, echo = FALSE, eval = TRUE, message = FALSE, fig.height=4, fig.width = 3.5}
ggplot() +
  geom_point(aes(x = Weight, y = MPG.city), data = Cars93) +
  geom_smooth(aes(x = Weight, y = MPG.city), data = Cars93,
              se = FALSE)
```
## Down the Ladder: $y = \sqrt{MPG.city}$
```{r, echo = TRUE}
sqrt_fit <- lm(sqrt(MPG.city) ~ Weight, data = Cars93)
```
```{r, echo = FALSE, eval = TRUE, message = FALSE, fig.height=4, fig.width = 3.5}
ggplot() +
  geom_point(aes(x = Weight, y = sqrt(MPG.city)), data = Cars93) +
  geom_smooth(aes(x = Weight, y = sqrt(MPG.city)), data = Cars93,
              se = FALSE)
```
```{r, echo = FALSE, eval = TRUE, message = FALSE, fig.height=4, fig.width = 3.5}
Cars93 <- mutate(Cars93,
                 sqrt_residual = residuals(sqrt_fit))
ggplot() +
  geom_point(aes(x = Weight, y = sqrt_residual), data = Cars93) +
  geom_smooth(aes(x = Weight, y = sqrt_residual), data = Cars93,
              se = FALSE)
```
It's better, but we should go further.
## Down the Ladder: $y = \log(MPG.city)$
```{r, echo = TRUE}
log_fit <- lm(log(MPG.city) ~ Weight, data = Cars93)
```
```{r, echo = FALSE, eval = TRUE, message = FALSE, fig.height=4, fig.width = 3.5}
ggplot() +
  geom_point(aes(x = Weight, y = log(MPG.city)), data = Cars93) +
  geom_smooth(aes(x = Weight, y = log(MPG.city)), data = Cars93,
              se = FALSE)
```
```{r, echo = FALSE, eval = TRUE, message = FALSE, fig.height=4, fig.width = 3.5}
Cars93 <- mutate(Cars93,
                 log_residual = residuals(log_fit))
ggplot() +
  geom_point(aes(x = Weight, y = log_residual), data = Cars93) +
  geom_smooth(aes(x = Weight, y = log_residual), data = Cars93,
              se = FALSE)
```
Keep going down the ladder until you've gone too far -- that's how you know where to come back to.
## Down the Ladder: $y = \frac{-1}{MPG.city}$
```{r, echo = TRUE}
inverse_fit <- lm(I(-1/(MPG.city)) ~ Weight, data = Cars93)
```
```{r, echo = FALSE, eval = TRUE, message = FALSE, fig.height=4, fig.width = 3.5}
ggplot() +
  geom_point(aes(x = Weight, y = -1/(MPG.city)), data = Cars93) +
  geom_smooth(aes(x = Weight, y = -1/(MPG.city)), data = Cars93,
              se = FALSE)
```
```{r, echo = FALSE, eval = TRUE, message = FALSE, fig.height=4, fig.width = 3.5}
Cars93 <- mutate(Cars93,
                 inverse_residual = residuals(inverse_fit))
ggplot() +
  geom_point(aes(x = Weight, y = inverse_residual), data = Cars93) +
  geom_smooth(aes(x = Weight, y = inverse_residual), data = Cars93,
              se = FALSE)
```
This looks pretty good -- let's try one more.
## Down the Ladder: $y = \frac{-1}{MPG.city^2}$
```{r, echo = TRUE}
inverse_sq_fit <- lm(I(-1/(MPG.city^2)) ~ Weight, data = Cars93)
```
```{r, echo = FALSE, eval = TRUE, message = FALSE, fig.height=4, fig.width = 3.5}
ggplot() +
  geom_point(aes(x = Weight, y = -1/(MPG.city^2)), data = Cars93) +
  geom_smooth(aes(x = Weight, y = -1/(MPG.city^2)), data = Cars93,
              se = FALSE)
```
```{r, echo = FALSE, eval = TRUE, message = FALSE, fig.height=4, fig.width = 3.5}
Cars93 <- mutate(Cars93,
                 inverse_sq_residual = residuals(inverse_sq_fit))
ggplot() +
  geom_point(aes(x = Weight, y = inverse_sq_residual), data = Cars93) +
  geom_smooth(aes(x = Weight, y = inverse_sq_residual), data = Cars93,
              se = FALSE)
```
Looks like we went too far. Final answer: $y = \frac{-1}{MPG.city}$
## Transforming Back
Once we transform a response, how do we use the model?
```{r, echo = TRUE}
Hummer <- data.frame(Weight = 6280)
predict(linear_fit, newdata = Hummer) # Not a good prediction...
predict(inverse_fit, newdata = Hummer) # OK, but what does it mean?
-1 / predict(inverse_fit, newdata = Hummer) # Aha!
```
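The same idea applies on the other rungs: undo the transformation with its inverse. For the log fit from earlier, that means exponentiating the prediction (a sketch, extrapolating well beyond the observed weights, so take the number with a grain of salt):

```{r, echo = TRUE}
library(MASS)
log_fit <- lm(log(MPG.city) ~ Weight, data = Cars93)
Hummer <- data.frame(Weight = 6280)
# predict() returns log(MPG.city); exp() takes us back to the MPG scale
exp(predict(log_fit, newdata = Hummer))
```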
## Things to remember
* Transformations can help straighten curves and equalize spread
* Don't forget to look at residual plots after transforming
* And don't forget to transform the predictions back!
    * No one wants a prediction of $-1/\sqrt{revenue}$
* There are also automatic procedures for selecting a transformation -- see more advanced classes
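One such automatic procedure is the Box-Cox method, implemented as `boxcox()` in the MASS package (shown here only as a pointer; the details belong to those more advanced classes):

```{r, echo = TRUE}
library(MASS)
linear_fit <- lm(MPG.city ~ Weight, data = Cars93)
# boxcox() profiles the log-likelihood over power transformations of y;
# the peak suggests a rung of the ladder (lambda = 0 corresponds to log)
bc <- boxcox(linear_fit, lambda = seq(-2, 2, by = 0.1), plotit = FALSE)
bc$x[which.max(bc$y)] # lambda with the highest profile log-likelihood
```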