September 29, 2017

Wrap Up for Wednesday's Lab

linear_fit <- lm(ave_acres_burned ~ years_since_1985,
  data = wildfires)
coef(linear_fit)
##      (Intercept) years_since_1985 
##        19.616453         2.221771
  • Predicted average number of acres burned in 1985 is 19.6
linear_fit_year <- lm(ave_acres_burned ~ year,
  data = wildfires)
coef(linear_fit_year)
##  (Intercept)         year 
## -4390.598311     2.221771
  • Predicted average number of acres burned in year 0 is -4390
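The two fits describe the same line, just parameterized differently: the slope is identical, and the `year` fit's intercept is the prediction at year 0, far outside the data. A quick arithmetic check using the coefficients printed above shows the two intercepts are consistent:

```r
# Both fits share the same slope
slope <- 2.221771

# Intercept from the 'year' fit is the prediction at year = 0;
# plugging in year = 1985 recovers the first fit's intercept
intercept_year <- -4390.598311
intercept_year + slope * 1985  # approximately 19.6
```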

What's going on?

Never Extrapolate Beyond the Data

Back to the Mortality Example

  • Explanatory: Concentration of calcium in drinking water
  • Response: Annual mortality rate per 100,000 population
  • Equation: Predicted Mortality = 1676.36 - 3.23 \(\times\) Calcium

Residuals

  • Residual = Observed - Predicted

  • \(\definecolor{residual}{RGB}{230,159,0}\color{residual}e_i\) = \(\definecolor{observed}{RGB}{0,158,115}\color{observed}y_i\) - \(\definecolor{predicted}{RGB}{86,180,233}\color{predicted}\widehat{y}_i\)

  • How much information does the model give us about Mortality rate?
  • How big do the residuals tend to be?
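In R, residuals can be computed directly as observed minus predicted, or extracted with `residuals()`. A minimal sketch (using simulated data, since the course's `mortality_water` data set is not reproduced here):

```r
# Simulated stand-in for the mortality/calcium data
set.seed(1)
calcium <- runif(61, 5, 140)
mortality <- 1676 - 3.2 * calcium + rnorm(61, sd = 140)
fit <- lm(mortality ~ calcium)

# Residual = Observed - Predicted
e <- mortality - fitted(fit)
all.equal(unname(e), unname(residuals(fit)))  # TRUE
```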

Residual Distribution

Residual Standard Deviation

  • How big do the residuals tend to be?
  • One way to measure this: The residual standard deviation

\[s_e = \sqrt{\frac{\sum_{i = 1}^n e_i^2}{n - 2}} = \sqrt{\frac{\sum_{i = 1}^n (y_i - \hat{y}_i)^2}{n - 2}}\]

  • In our example, \(s_e = 143\)
  • For about 95% of our sample, the observed value was within \(\pm 2 s_e\) of the model's prediction (\(\pm 286\) people per 100,000 population)
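The formula for \(s_e\) can be verified in R: the "residual standard error" that `summary()` reports is exactly this quantity. A sketch with simulated data (the course data set is not reproduced here):

```r
set.seed(1)
x <- runif(61, 5, 140)
y <- 1676 - 3.2 * x + rnorm(61, sd = 140)
fit <- lm(y ~ x)

n <- length(y)
e <- residuals(fit)
s_e <- sqrt(sum(e^2) / (n - 2))
all.equal(s_e, summary(fit)$sigma)  # TRUE

# Fraction of the sample within +/- 2 * s_e of the model's predictions;
# typically close to 0.95 when residuals are roughly normal
mean(abs(e) <= 2 * s_e)
```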

What if we didn't know about calcium?

  • A reasonable guess for mortality in this town: the average mortality rate in our sample, \(\bar{y} = 1524\)

  • \(s_y = 188\): For about 95% of our sample, the observed value was within \(\pm 2 s_y\) of our prediction of 1524 (\(\pm 376\) people per 100,000 population)

Comparing the two:

  • If we know about Calcium: Guess \(\hat{y}_i = 1676.36 - 3.23 x_i\)
    • For about 95% of sample, \(y_i\) is within \(\pm 286\) of \(\hat{y}_i\)

  • If we don't know about Calcium: Guess \(\bar{y} = 1524\)
    • For about 95% of sample, \(y_i\) is within \(\pm 376\) of \(\bar{y}\)

How much does knowing Calcium help?

  • Without knowing calcium:
    • Guess \(\bar{y} = 1524\) people per 100,000 population
    • About 95% of our sample fell within \(\pm 2 s_y\) (\(\pm 376\)) of this guess
  • If we know Calcium:
    • Guess \(\hat{y} = 1676.36 - 3.23 \times \text{Calcium}\)
    • About 95% of our sample fell within \(\pm 2 s_e\) (\(\pm 286\)) of this guess
  • Knowing calcium narrowed down range of values of Mortality a fair amount!

How much does knowing Calcium help?

  • Let's compare the spreads of these distributions of deviations from the predictions with and without knowing calcium.

  • Let's compute \(\frac{\text{spread of residuals if we DO know calcium}}{\text{spread of deviations if we DON'T know about calcium}}\)

  • If knowing Calcium helps a lot, this will be close to 0
  • If knowing Calcium doesn't help much, this will be close to 1

A Brief Aside: variance = (std. dev.)\(^2\)

  • Sample variance of Mortality: \[s_y^2 = \left[\sqrt{\frac{\sum_{i = 1}^n (y_i - \bar{y})^2}{n - 1}}\right]^2 = \frac{\sum_{i = 1}^n (y_i - \bar{y})^2}{n - 1} = \frac{SST}{n - 1}\]
  • \(SST = \sum_{i = 1}^n (y_i - \bar{y})^2\). \(SST\) stands for Sum of Squares Total (used in measuring the total variability in the response)
  • Residual Variance: \[s^2_e = \left[\sqrt{\frac{\sum_{i = 1}^n e_i^2}{n - 2}}\right]^2 = \frac{\sum_{i = 1}^n e_i^2}{n - 2} = \frac{\sum_{i = 1}^n (y_i - \hat{y}_i)^2}{n - 2} = \frac{SSE}{n - 2}\]
  • \(SSE = \sum_{i = 1}^n (y_i - \hat{y}_i)^2\). \(SSE\) stands for Sum of Squared Errors (used in measuring the amount of variability in the errors, or residuals, from the model fit)
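Both sums of squares are one-liners in R, and the variance identities above can be checked numerically (sketch with simulated data, since the course data set is not reproduced here):

```r
set.seed(1)
x <- runif(61, 5, 140)
y <- 1676 - 3.2 * x + rnorm(61, sd = 140)
fit <- lm(y ~ x)
n <- length(y)

SST <- sum((y - mean(y))^2)   # Sum of Squares Total
SSE <- sum(residuals(fit)^2)  # Sum of Squared Errors

all.equal(var(y), SST / (n - 1))                # sample variance of y
all.equal(summary(fit)$sigma^2, SSE / (n - 2))  # residual variance
```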

"Multiple" \(R^2\)

  • Let's summarize spread of residuals by \(SSE = \sum_{i = 1}^n (y_i - \hat{y}_i)^2\)
  • … and summarize spread of Mortality by \(SST = \sum_{i = 1}^n (y_i - \bar{y})^2\)
  • So \(\frac{\text{spread of residuals if we DO know calcium}}{\text{spread of deviations if we DON'T know about calcium}} = \frac{SSE}{SST}\)
  • Two extreme cases:
    • If the error goes to zero, we'd have a "perfect fit". \(\frac{SSE}{SST} = 0\)
    • If \(SSE = SST\), \(x\) has told us nothing about \(y\). \(\frac{SSE}{SST} = 1\)
  • Interpret \(SSE / SST\) as the fraction of the total variation in \(y\) that is still in the residuals
  • \(R^2 = 1 - \frac{SSE}{SST}\) is the fraction of the total variation in \(y\) "accounted for" by the model in \(x\).
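Computing \(R^2\) from \(SSE\) and \(SST\) by hand matches the "Multiple R-squared" that `summary()` reports. A sketch, again with simulated data:

```r
set.seed(1)
x <- runif(61, 5, 140)
y <- 1676 - 3.2 * x + rnorm(61, sd = 140)
fit <- lm(y ~ x)

SSE <- sum(residuals(fit)^2)
SST <- sum((y - mean(y))^2)
R2 <- 1 - SSE / SST

all.equal(R2, summary(fit)$r.squared)  # TRUE
```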

"Adjusted" \(R^2\)

  • Let's summarize spread of residuals by the residual variance \(s_e^2 = \frac{SSE}{n - 2}\)
  • … and summarize spread of Mortality by its sample variance \(s_y^2 = \frac{SST}{n - 1}\)
  • So \(\frac{\text{spread of residuals if we DO know calcium}}{\text{spread of deviations if we DON'T know about calcium}} = \frac{s_e^2}{s_y^2}\)
  • Two extreme cases:
    • If the error goes to zero, we'd have a "perfect fit". \(\frac{s_e^2}{s_y^2} = 0\)
    • If \(s_e^2 = s_y^2\), \(x\) has told us nothing about \(y\). \(\frac{s_e^2}{s_y^2} = 1\)
  • \(R^2_{adj} = 1 - \frac{s_e^2}{s_y^2}\) is the fraction of the total variation in \(y\) "accounted for" by the model in \(x\).

Residual Std. Dev. and \(R^2\) in R

linear_fit <- lm(Mortality ~ Calcium, data = mortality_water)
summary(linear_fit)
## 
## Call:
## lm(formula = Mortality ~ Calcium, data = mortality_water)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -348.61 -114.52   -7.09  111.52  336.45 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1676.3556    29.2981  57.217  < 2e-16 ***
## Calcium       -3.2261     0.4847  -6.656 1.03e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 143 on 59 degrees of freedom
## Multiple R-squared:  0.4288, Adjusted R-squared:  0.4191 
## F-statistic:  44.3 on 1 and 59 DF,  p-value: 1.033e-08
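All three quantities can be pulled out of the summary object directly, and for a simple linear regression the adjusted \(R^2\) satisfies the variance-ratio formula \(1 - s_e^2 / s_y^2\). A sketch with simulated data:

```r
set.seed(1)
x <- runif(61, 5, 140)
y <- 1676 - 3.2 * x + rnorm(61, sd = 140)
fit <- lm(y ~ x)
s <- summary(fit)

s$sigma          # residual standard error, s_e
s$r.squared      # Multiple R-squared
s$adj.r.squared  # Adjusted R-squared

# Adjusted R^2 equals 1 - (residual variance) / (sample variance of y)
all.equal(s$adj.r.squared, 1 - s$sigma^2 / var(y))  # TRUE
```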

An even briefer aside:

  • Sample variance of Mortality: \[s_y^2 = \frac{SST}{n - 1}\]
  • Residual Variance: \[s^2_e = \frac{SSE}{n - 2}\]
  • The \(n - 1\) and \(n - 2\) have to do with the number of parameters used to make our prediction:
    • Normal model: 1 parameter (\(\mu\)), use \(n - 1\)
    • Linear model: 2 parameters (\(b_0\), \(b_1\)), use \(n - 2\)
  • We'll talk about why in a month or so