October 2, 2017

93 Car Models Sold in 1993

  • Evaluate assumptions:
    • Quantitative variables
    • Linear Relationship
    • Outliers
    • Equal Spread
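The residual plots below use `prediction` and `residual` columns of `Cars93` that were presumably created earlier in the lecture; a minimal sketch of setting them up, assuming the simple linear fit of `MPG.city` on `Weight` (named `linear_fit`, matching the name used on the prediction slide later):

```r
library(MASS)    # provides the Cars93 data set
library(dplyr)

# Simple linear regression of city MPG on weight
linear_fit <- lm(MPG.city ~ Weight, data = Cars93)

# Store fitted values and residuals as columns, matching the
# names used in the residual plots below
Cars93 <- mutate(Cars93,
  prediction = predict(linear_fit),
  residual   = residuals(linear_fit))
```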

Residuals Plots

ggplot() +
  geom_point(mapping = aes(x = prediction, y = residual), data = Cars93) +
  geom_smooth(mapping = aes(x = prediction, y = residual), data = Cars93,
    se = FALSE)
ggplot() +
  geom_point(mapping = aes(x = Weight, y = residual), data = Cars93) +
  geom_smooth(mapping = aes(x = Weight, y = residual), data = Cars93,
    se = FALSE)

What to do if assumptions aren't met?

  • If at all possible, don't throw out outliers (unless they are due to an irresolvable error in data collection)
    • Outliers represent an important part of reality
    • If we must discard outliers, do analysis with and without outliers and discuss both

Cars93$Make[Cars93$residual > 10]
## [1] Geo Metro   Honda Civic
## 93 Levels: Acura Integra Acura Legend Audi 100 Audi 90 ... Volvo 850
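A sketch of the "with and without" comparison for the two cars flagged above: drop them by name, refit, and report both slopes so the discussion can cover both analyses.

```r
library(MASS)    # Cars93
library(dplyr)

fit_all <- lm(MPG.city ~ Weight, data = Cars93)

# Refit without the two large-residual cars found above
trimmed  <- filter(Cars93, !(Make %in% c("Geo Metro", "Honda Civic")))
fit_trim <- lm(MPG.city ~ Weight, data = trimmed)

# Compare slopes from the two analyses
c(all = coef(fit_all)["Weight"], trimmed = coef(fit_trim)["Weight"])
```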

What to do?

  • So what do we do when any of the following holds?
    • the scatterplot isn't linear (the residual plot has a pattern),
    • there is unequal variability (the residual plot shows thickening), or
    • the residuals are not normally distributed
  • Basically, 2 options:
    1. Fit a more flexible model (e.g., fit a parabola)
    2. Transform the data

Option 1: A more flexible model

  • Let's fit a parabola (polynomial with degree 2)
quadratic_fit <- lm(MPG.city ~ poly(Weight, degree = 2), data = Cars93)

ggplot() +
  geom_point(aes(x = Weight, y = MPG.city), data = Cars93) +
  geom_smooth(aes(x = Weight, y = MPG.city), data = Cars93,
    method = "lm",
    formula = y ~ poly(x, degree = 2),
    se = FALSE)

Residuals from Quadratic Fit

Cars93 <- mutate(Cars93,
  residual_quad = residuals(quadratic_fit),
  prediction_quad = predict(quadratic_fit))

ggplot() +
  geom_point(mapping = aes(x = Weight, y = residual_quad), data = Cars93) +
  geom_smooth(mapping = aes(x = Weight, y = residual_quad), data = Cars93,
    se = FALSE)

Option 2: Transformation

  • We might be able to transform \(y\) and/or \(x\). Start with \(y\).

  • Imagine a "ladder of powers" of \(y\) (or \(x\)): We start at \(y\) and go up or down the ladder.

\[ \vdots\\ y^2\\ y\\ \sqrt{y}\\ y^{"0"} \text{ (we use $\log(y)$ here)} \\ -1/\sqrt{y} \text{ (the $-$ keeps direction of association between $x$ and response)}\\ -1/y\\ -1/y^2\\ \vdots \]
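The rungs can be tried mechanically; a sketch that fits one model per rung and compares the \(R^2\) values side by side (a quick screen only; residual plots, as below, remain the real check):

```r
library(MASS)   # Cars93

# One model per rung of the ladder (responses chosen so that the
# direction of association with Weight is preserved)
rungs <- list(
  sqrt    = lm(sqrt(MPG.city)   ~ Weight, data = Cars93),
  log     = lm(log(MPG.city)    ~ Weight, data = Cars93),
  neg_inv = lm(I(-1 / MPG.city) ~ Weight, data = Cars93))

sapply(rungs, function(fit) summary(fit)$r.squared)
```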

Tukey's Circle of Transformations

ggplot() +
  geom_point(aes(x = Weight, y = MPG.city), data = Cars93) +
  geom_smooth(aes(x = Weight, y = MPG.city), data = Cars93,
    se = FALSE)

Down the Ladder: \(y = \sqrt{MPG.city}\)

sqrt_fit <- lm(sqrt(MPG.city) ~ Weight, data = Cars93)

It's better, but we should go further.

Down the Ladder: \(y = \log(MPG.city)\)

log_fit <- lm(log(MPG.city) ~ Weight, data = Cars93)

Always go one rung further than looks necessary, so that you can see you have to come back.

Down the Ladder: \(y = \frac{-1}{MPG.city}\)

inverse_fit <- lm(I(-1/(MPG.city)) ~ Weight, data = Cars93)

This looks pretty good – let's try one more.

Down the Ladder: \(y = \frac{-1}{MPG.city^2}\)

inverse_sq_fit <- lm(I(-1/(MPG.city^2)) ~ Weight, data = Cars93)

Looks like we went too far. Final answer: \(y = \frac{-1}{MPG.city}\)

Transforming Back

Once we transform a response, how do we use the model?

Hummer <- data.frame(Weight = 6280)
predict(linear_fit, newdata = Hummer)  # not a good prediction ...
##         1 
## -3.395065
predict(inverse_fit, newdata = Hummer)  # OK, but what does it mean?
##           1 
## -0.09401586
-1/predict(inverse_fit, newdata = Hummer)  # Aha! Back on the MPG scale
##       1 
## 10.6365
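The same back-transformation step applies on any rung of the ladder; for the log fit from earlier, for example, exponentiate to return to the MPG scale:

```r
library(MASS)   # Cars93

log_fit <- lm(log(MPG.city) ~ Weight, data = Cars93)
Hummer  <- data.frame(Weight = 6280)

# predict() gives log(MPG); exp() undoes the transformation
exp(predict(log_fit, newdata = Hummer))
```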

Things to remember

  • Transformations can help straighten curves and equalize spread

  • Don't forget to look at residual plots after transforming

  • And don't forget to transform the predictions back!

  • No one wants a prediction of \(-1/\sqrt{revenue}\)

  • There are also automatic procedures for selecting the transformation (e.g., the Box-Cox method); see more advanced classes
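One such automatic procedure is the Box-Cox method, available as `MASS::boxcox`; a sketch of applying it here (a \(\hat\lambda\) near \(-1\) would agree with the \(-1/y\) choice above):

```r
library(MASS)   # boxcox() and Cars93

# Profile log-likelihood over the power parameter lambda
bc <- boxcox(MPG.city ~ Weight, data = Cars93, plotit = FALSE)

# The lambda with the highest likelihood suggests the transformation
bc$x[which.max(bc$y)]
```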