October 2, 2017

93 Car Models Sold in 1993

  • Evaluate assumptions:
    • Quantitative variables
    • Linear Relationship
    • Outliers
    • Equal Spread
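The residual plots below use `prediction` and `residual` columns of `Cars93` that were presumably created earlier in the lecture; a minimal sketch of setting them up, assuming the simple linear fit of `MPG.city` on `Weight` (named `linear_fit`, matching the name used on the prediction slide later):

```r
library(MASS)    # provides the Cars93 data set
library(dplyr)

# Simple linear regression of city MPG on weight
linear_fit <- lm(MPG.city ~ Weight, data = Cars93)

# Store fitted values and residuals as columns, matching the
# names used in the residual plots below
Cars93 <- mutate(Cars93,
  prediction = predict(linear_fit),
  residual   = residuals(linear_fit))
```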

Residuals Plots

ggplot() +
  geom_point(mapping = aes(x = prediction, y = residual), data = Cars93) +
  geom_smooth(mapping = aes(x = prediction, y = residual), data = Cars93,
    se = FALSE)
ggplot() +
  geom_point(mapping = aes(x = Weight, y = residual), data = Cars93) +
  geom_smooth(mapping = aes(x = Weight, y = residual), data = Cars93,
    se = FALSE)

What to do if assumptions aren't met?

  • If at all possible, don't throw out outliers (unless they are due to an irresolvable error in data collection)
    • Outliers represent an important part of reality
    • If we must discard outliers, do analysis with and without outliers and discuss both

Cars93$Make[Cars93$residual > 10]
## [1] Geo Metro   Honda Civic
## 93 Levels: Acura Integra Acura Legend Audi 100 Audi 90 ... Volvo 850
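A sketch of the "with and without" comparison for the two cars flagged above: drop them by name, refit, and report both slopes so the discussion can cover both analyses.

```r
library(MASS)    # Cars93
library(dplyr)

fit_all <- lm(MPG.city ~ Weight, data = Cars93)

# Refit without the two large-residual cars found above
trimmed  <- filter(Cars93, !(Make %in% c("Geo Metro", "Honda Civic")))
fit_trim <- lm(MPG.city ~ Weight, data = trimmed)

# Compare slopes from the two analyses
c(all = coef(fit_all)["Weight"], trimmed = coef(fit_trim)["Weight"])
```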

What to do?

  • So what do we do when any of the following holds?
    • the scatterplot isn't linear (the residual plot has a pattern),
    • there is unequal variability (the residual plot shows thickening), or
    • the residuals are not normally distributed
  • Basically, 2 options:
    1. Fit a more flexible model (e.g., fit a parabola)
    2. Transform the data

Option 1: A more flexible model

  • Let's fit a parabola (polynomial with degree 2)
quadratic_fit <- lm(MPG.city ~ poly(Weight, degree = 2), data = Cars93)

ggplot() +
  geom_point(aes(x = Weight, y = MPG.city), data = Cars93) +
  geom_smooth(aes(x = Weight, y = MPG.city), data = Cars93,
    method = "lm",
    formula = y ~ poly(x, degree = 2),
    se = FALSE)

Residuals from Quadratic Fit

Cars93 <- mutate(Cars93,
  residual_quad = residuals(quadratic_fit),
  prediction_quad = predict(quadratic_fit))

ggplot() +
  geom_point(mapping = aes(x = Weight, y = residual_quad), data = Cars93) +
  geom_smooth(mapping = aes(x = Weight, y = residual_quad), data = Cars93,
    se = FALSE)

Option 2: Transformation

  • We might be able to transform \(y\) and/or \(x\). Start with \(y\).

  • Imagine a "ladder of powers" of \(y\) (or \(x\)): We start at \(y\) and go up or down the ladder.

\[ \vdots\\ y^2\\ y\\ \sqrt{y}\\ y^{"0"} \text{ (we use $\log(y)$ here)} \\ -1/\sqrt{y} \text{ (the $-$ keeps direction of association between $x$ and response)}\\ -1/y\\ -1/y^2\\ \vdots \]
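The rungs can be tried mechanically; a sketch that fits one model per rung and compares the \(R^2\) values side by side (a quick screen only; residual plots, as below, remain the real check):

```r
library(MASS)   # Cars93

# One model per rung of the ladder (responses chosen so that the
# direction of association with Weight is preserved)
rungs <- list(
  sqrt    = lm(sqrt(MPG.city)   ~ Weight, data = Cars93),
  log     = lm(log(MPG.city)    ~ Weight, data = Cars93),
  neg_inv = lm(I(-1 / MPG.city) ~ Weight, data = Cars93))

sapply(rungs, function(fit) summary(fit)$r.squared)
```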

Tukey's Circle of Transformations

ggplot() +
  geom_point(aes(x = Weight, y = MPG.city), data = Cars93) +
  geom_smooth(aes(x = Weight, y = MPG.city), data = Cars93,
    se = FALSE)

Down the Ladder: \(y = \sqrt{MPG.city}\)

sqrt_fit <- lm(sqrt(MPG.city) ~ Weight, data = Cars93)

It's better, but we should go further.

Down the Ladder: \(y = \log(MPG.city)\)

log_fit <- lm(log(MPG.city) ~ Weight, data = Cars93)

Always go one rung further than looks necessary, so that you can see you have to come back.

Down the Ladder: \(y = \frac{-1}{MPG.city}\)

inverse_fit <- lm(I(-1/(MPG.city)) ~ Weight, data = Cars93)

This looks pretty good – let's try one more.

Down the Ladder: \(y = \frac{-1}{MPG.city^2}\)

inverse_sq_fit <- lm(I(-1/(MPG.city^2)) ~ Weight, data = Cars93)

Looks like we went too far. Final answer: \(y = \frac{-1}{MPG.city}\)

Transforming Back

Once we transform a response, how do we use the model?

Hummer <- data.frame(Weight = 6280)
predict(linear_fit, newdata = Hummer)  # not a good prediction ...
##         1 
## -3.395065
predict(inverse_fit, newdata = Hummer)  # OK, but what does it mean?
##           1 
## -0.09401586
-1/predict(inverse_fit, newdata = Hummer)  # Aha! Back on the MPG scale
##       1 
## 10.6365
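The same back-transformation step applies on any rung of the ladder; for the log fit from earlier, for example, exponentiate to return to the MPG scale:

```r
library(MASS)   # Cars93

log_fit <- lm(log(MPG.city) ~ Weight, data = Cars93)
Hummer  <- data.frame(Weight = 6280)

# predict() gives log(MPG); exp() undoes the transformation
exp(predict(log_fit, newdata = Hummer))
```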

Things to remember

  • Transformations can help straighten curves and equalize spread

  • Don't forget to look at residual plots after transforming

  • And don't forget to transform the predictions back!

  • No one wants a prediction of \(-1/\sqrt{revenue}\)

  • There are also automatic procedures for selecting the transformation (e.g., the Box-Cox method); see more advanced classes
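One such automatic procedure is the Box-Cox method, available as `MASS::boxcox`; a sketch of applying it here (a \(\hat\lambda\) near \(-1\) would agree with the \(-1/y\) choice above):

```r
library(MASS)   # boxcox() and Cars93

# Profile log-likelihood over the power parameter lambda
bc <- boxcox(MPG.city ~ Weight, data = Cars93, plotit = FALSE)

# The lambda with the highest likelihood suggests the transformation
bc$x[which.max(bc$y)]
```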