- Evaluate assumptions:
  - Quantitative
  - Linear Relationship
  - Outliers
  - Equal Spread
October 2, 2017
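The residual plots that follow use `residual` and `prediction` columns, and later a model called `linear_fit`, none of which are defined on these slides. A minimal sketch of how they might be created, assuming `Cars93` comes from the MASS package and that `linear_fit` is the straight-line fit of `MPG.city` on `Weight`:

```r
library(MASS)  # assumption: Cars93 is the data set shipped with MASS

# Assumed definition of the straight-line model used later as `linear_fit`
linear_fit <- lm(MPG.city ~ Weight, data = Cars93)

# Store residuals and fitted values as columns so they can be plotted
Cars93$residual   <- residuals(linear_fit)
Cars93$prediction <- predict(linear_fit)

head(Cars93[, c("Make", "Weight", "MPG.city", "prediction", "residual")])
```

Base R column assignment is used here to keep the sketch dependency-free; `mutate()` from dplyr would do the same job.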
ggplot() +
  geom_point(mapping = aes(x = prediction, y = residual), data = Cars93) +
  geom_smooth(mapping = aes(x = prediction, y = residual), data = Cars93, se = FALSE)
ggplot() +
  geom_point(mapping = aes(x = Weight, y = residual), data = Cars93) +
  geom_smooth(mapping = aes(x = Weight, y = residual), data = Cars93, se = FALSE)
Cars93$Make[Cars93$residual > 10]
## [1] Geo Metro   Honda Civic
## 93 Levels: Acura Integra Acura Legend Audi 100 Audi 90 ... Volvo 850
quadratic_fit <- lm(MPG.city ~ poly(Weight, degree = 2), data = Cars93)
ggplot() +
  geom_point(aes(x = Weight, y = MPG.city), data = Cars93) +
  geom_smooth(aes(x = Weight, y = MPG.city), data = Cars93, method = "lm", formula = y ~ poly(x, degree = 2), se = FALSE)
Cars93 <- mutate(Cars93, residual_quad = residuals(quadratic_fit), prediction_quad = predict(quadratic_fit))
ggplot() +
  geom_point(mapping = aes(x = Weight, y = residual_quad), data = Cars93) +
  geom_smooth(mapping = aes(x = Weight, y = residual_quad), data = Cars93, se = FALSE)
We might be able to transform \(y\) and/or \(x\). Start with \(y\).
Imagine a "ladder of powers" of \(y\) (or \(x\)): We start at \(y\) and go up or down the ladder.
\[ \vdots\\ y^2\\ y\\ \sqrt{y}\\ y^{"0"} \text{ (we use $\log(y)$ here)} \\ -1/\sqrt{y} \text{ (the $-$ keeps direction of association between $x$ and response)}\\ -1/y\\ -1/y^2\\ \vdots \]
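The ladder can be written as a single function of the power. The sketch below is only an illustration: `ladder()` is a hypothetical helper, not something from these slides or from a package, and it assumes a strictly positive response.

```r
# Hypothetical helper: apply one rung of the ladder of powers to y.
# power = 0 stands for log(y); negative powers get a minus sign so that
# larger y still maps to a larger transformed value.
ladder <- function(y, power) {
  stopifnot(all(y > 0))  # logs and negative powers need positive values
  if (power == 0) {
    log(y)
  } else if (power > 0) {
    y^power
  } else {
    -(y^power)
  }
}

y <- c(1, 4, 16)
ladder(y, 0.5)  # square-root rung: 1 2 4
ladder(y, -1)   # -1/y rung: -1 -0.25 -0.0625
```

Because every rung is increasing in `y`, moving up or down the ladder changes the shape of the relationship but not its direction.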
ggplot() + geom_point(aes(x = Weight, y = MPG.city), data = Cars93) + geom_smooth(aes(x = Weight, y = MPG.city), data = Cars93, se = FALSE)
sqrt_fit <- lm(sqrt(MPG.city) ~ Weight, data = Cars93)
It's better, but we should go further.
log_fit <- lm(log(MPG.city) ~ Weight, data = Cars93)
Always go one step too far, so that you know when to come back.
inverse_fit <- lm(I(-1/(MPG.city)) ~ Weight, data = Cars93)
This looks pretty good – let's try one more.
inverse_sq_fit <- lm(I(-1/(MPG.city^2)) ~ Weight, data = Cars93)
Looks like we went too far. Final answer: use the transformed response \(y = \frac{-1}{\text{MPG.city}}\)
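Each rung is judged by its residual plot. One way the last two rungs might be compared is sketched here, assuming `Cars93` comes from the MASS package; `residual_plot()` is a hypothetical helper introduced only for this illustration.

```r
library(MASS)     # assumption: Cars93 is the data set shipped with MASS
library(ggplot2)

# Hypothetical helper: residuals against Weight for a model fitted to Cars93
residual_plot <- function(fit, title) {
  plot_data <- data.frame(Weight = Cars93$Weight, residual = residuals(fit))
  ggplot(plot_data, aes(x = Weight, y = residual)) +
    geom_point() +
    geom_smooth(se = FALSE) +  # a flat smoother suggests the curve is gone
    ggtitle(title)
}

inverse_fit    <- lm(I(-1 / MPG.city) ~ Weight, data = Cars93)
inverse_sq_fit <- lm(I(-1 / MPG.city^2) ~ Weight, data = Cars93)

p_inverse    <- residual_plot(inverse_fit, "-1/MPG.city")
p_inverse_sq <- residual_plot(inverse_sq_fit, "-1/MPG.city^2")

p_inverse     # print each plot and compare the smoothers
p_inverse_sq
```

The rung to keep is the one whose smoother stays closest to a horizontal line with roughly even scatter around it.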
Once we transform a response, how do we use the model?
Hummer <- data.frame(Weight = 6280)
predict(linear_fit, newdata = Hummer) # Not a good prediction ...
##         1
## -3.395065
predict(inverse_fit, newdata = Hummer) # OK, but what does it mean?
##           1
## -0.09401586
-1 / predict(inverse_fit, newdata = Hummer) # Aha!
##       1
## 10.6365
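The last step works because the transformation can be undone algebraically. Writing \(y^*\) for the transformed response (notation introduced here only for this illustration), the Hummer prediction converts back to miles per gallon as:

\[ y^* = -\frac{1}{y} \quad\Longrightarrow\quad y = -\frac{1}{y^*}, \qquad \widehat{\text{MPG.city}} = \frac{-1}{-0.09401586} \approx 10.6365 \]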
Transformations can help straighten curves and equalize spread
Don't forget to look at residual plots after transforming
And don't forget to transform the predictions back!
No one wants the prediction of \(-1/\sqrt{revenue}\)
There are also automatic procedures for selecting the transformation – see more advanced classes
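One such procedure, named here as an example and not something covered on these slides, is the Box-Cox method. It is available as `boxcox()` in the MASS package. A rough sketch, assuming `Cars93` from MASS and the straight-line model of `MPG.city` on `Weight`:

```r
library(MASS)  # provides boxcox() and (assumed) the Cars93 data

linear_fit <- lm(MPG.city ~ Weight, data = Cars93)

# Profile log-likelihood over a grid of candidate powers (lambda)
bc <- boxcox(linear_fit, lambda = seq(-3, 2, by = 0.1), plotit = FALSE)

# The power with the highest log-likelihood; compare it with the ladder rung
# chosen by eye (lambda = 0 corresponds to log, lambda = -1 to -1/y)
best_lambda <- bc$x[which.max(bc$y)]
best_lambda
```

The suggested power is a starting point only; the residual plots still need to be checked after transforming.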