September 29, 2017

Wrap Up for Wednesday's Lab

linear_fit <- lm(ave_acres_burned ~ years_since_1985,
  data = wildfires)
coef(linear_fit)
##      (Intercept) years_since_1985 
##        19.616453         2.221771
  • Predicted average number of acres burned in 1985 is 19.6
linear_fit_year <- lm(ave_acres_burned ~ year,
  data = wildfires)
coef(linear_fit_year)
##  (Intercept)         year 
## -4390.598311     2.221771
  • Predicted average number of acres burned in year 0 is -4390
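The two fits describe the same line, just parameterized differently: the slope is identical, and the `year` fit's intercept is the prediction at year 0, far outside the data. A quick arithmetic check using the coefficients printed above shows the two intercepts are consistent:

```r
# Both fits share the same slope
slope <- 2.221771

# Intercept from the 'year' fit is the prediction at year = 0;
# plugging in year = 1985 recovers the first fit's intercept
intercept_year <- -4390.598311
intercept_year + slope * 1985  # approximately 19.6
```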

What's going on?

Never Extrapolate Beyond the Data

Back to the Mortality Example

  • Explanatory: Concentration of calcium in drinking water
  • Response: Annual mortality rate per 100,000 population
  • Equation: Predicted Mortality = 1676.36 - 3.23 \(\times\) Calcium

Residuals

  • Residual = Observed - Predicted

  • \(\definecolor{residual}{RGB}{230,159,0}\color{residual}e_i\) = \(\definecolor{observed}{RGB}{0,158,115}\color{observed}y_i\) - \(\definecolor{predicted}{RGB}{86,180,233}\color{predicted}\widehat{y}_i\)

  • How much information does the model give us about Mortality rate?
  • How big do the residuals tend to be?
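In R, residuals can be computed directly as observed minus predicted, or extracted with `residuals()`. A minimal sketch (using simulated data, since the course's `mortality_water` data set is not reproduced here):

```r
# Simulated stand-in for the mortality/calcium data
set.seed(1)
calcium <- runif(61, 5, 140)
mortality <- 1676 - 3.2 * calcium + rnorm(61, sd = 140)
fit <- lm(mortality ~ calcium)

# Residual = Observed - Predicted
e <- mortality - fitted(fit)
all.equal(unname(e), unname(residuals(fit)))  # TRUE
```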

Residual Distribution

Residual Standard Deviation

  • How big do the residuals tend to be?
  • One way to measure this: The residual standard deviation

\[s_e = \sqrt{\frac{\sum_{i = 1}^n e_i^2}{n - 2}} = \sqrt{\frac{\sum_{i = 1}^n (y_i - \hat{y}_i)^2}{n - 2}}\]

  • In our example, \(s_e = 143\)
  • For about 95% of our sample, the observed value was within \(\pm 2 s_e\) of the model's prediction (\(\pm 286\) people per 100,000 population)
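The formula for \(s_e\) can be verified in R: the "residual standard error" that `summary()` reports is exactly this quantity. A sketch with simulated data (the course data set is not reproduced here):

```r
set.seed(1)
x <- runif(61, 5, 140)
y <- 1676 - 3.2 * x + rnorm(61, sd = 140)
fit <- lm(y ~ x)

n <- length(y)
e <- residuals(fit)
s_e <- sqrt(sum(e^2) / (n - 2))
all.equal(s_e, summary(fit)$sigma)  # TRUE

# Fraction of the sample within +/- 2 * s_e of the model's predictions;
# typically close to 0.95 when residuals are roughly normal
mean(abs(e) <= 2 * s_e)
```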

What if we didn't know about calcium?

  • A reasonable guess for mortality in this town: the average mortality rate in our sample, \(\bar{y} = 1524\)

  • \(s_y = 188\): For about 95% of our sample, the observed value was within \(\pm 2 s_y\) of our prediction of 1524 (\(\pm 376\) people per 100,000 population)

Comparing the two:

  • If we know about Calcium: Guess \(\hat{y}_i = 1676.36 - 3.23 x_i\)
    • For about 95% of sample, \(y_i\) is within \(\pm 286\) of \(\hat{y}_i\)

  • If we don't know about Calcium: Guess \(\bar{y} = 1524\)
    • For about 95% of sample, \(y_i\) is within \(\pm 376\) of \(\bar{y}\)

How much does knowing Calcium help?

  • Without knowing calcium:
    • Guess \(\bar{y} = 1524\) people per 100,000 population
    • About 95% of our sample fell within \(\pm 2 s_y\) (\(\pm 376\)) of this guess
  • If we know Calcium:
    • Guess \(\hat{y} = 1676.36 - 3.23 \times \text{Calcium}\)
    • About 95% of our sample fell within \(\pm 2 s_e\) (\(\pm 286\)) of this guess
  • Knowing calcium narrowed down range of values of Mortality a fair amount!

How much does knowing Calcium help?

  • Let's compare the spreads of these distributions of deviations from the predictions with and without knowing calcium.

  • Let's compute \(\frac{\text{spread of residuals if we DO know calcium}}{\text{spread of deviations if we DON'T know about calcium}}\)

  • If knowing Calcium helps a lot, this will be close to 0
  • If knowing Calcium doesn't help much, this will be close to 1

A Brief Aside: variance = (std. dev.)\(^2\)

  • Sample variance of Mortality: \[s_y^2 = \left[\sqrt{\frac{\sum_{i = 1}^n (y_i - \bar{y})^2}{n - 1}}\right]^2 = \frac{\sum_{i = 1}^n (y_i - \bar{y})^2}{n - 1} = \frac{SST}{n - 1}\]
  • \(SST = \sum_{i = 1}^n (y_i - \bar{y})^2\). \(SST\) stands for Sum of Squares Total (used in measuring the total variability in the response)
  • Residual Variance: \[s^2_e = \left[\sqrt{\frac{\sum_{i = 1}^n e_i^2}{n - 2}}\right]^2 = \frac{\sum_{i = 1}^n e_i^2}{n - 2} = \frac{\sum_{i = 1}^n (y_i - \hat{y}_i)^2}{n - 2} = \frac{SSE}{n - 2}\]
  • \(SSE = \sum_{i = 1}^n (y_i - \hat{y}_i)^2\). \(SSE\) stands for Sum of Squared Errors (used in measuring the amount of variability in the errors, or residuals, from the model fit)
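Both sums of squares are one-liners in R, and the variance identities above can be checked numerically (sketch with simulated data, since the course data set is not reproduced here):

```r
set.seed(1)
x <- runif(61, 5, 140)
y <- 1676 - 3.2 * x + rnorm(61, sd = 140)
fit <- lm(y ~ x)
n <- length(y)

SST <- sum((y - mean(y))^2)   # Sum of Squares Total
SSE <- sum(residuals(fit)^2)  # Sum of Squared Errors

all.equal(var(y), SST / (n - 1))                # sample variance of y
all.equal(summary(fit)$sigma^2, SSE / (n - 2))  # residual variance
```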

"Multiple" \(R^2\)

  • Let's summarize spread of residuals by \(SSE = \sum_{i = 1}^n (y_i - \hat{y}_i)^2\)
  • … and summarize spread of Mortality by \(SST = \sum_{i = 1}^n (y_i - \bar{y})^2\)
  • So \(\frac{\text{spread of residuals if we DO know calcium}}{\text{spread of deviations if we DON'T know about calcium}} = \frac{SSE}{SST}\)
  • Two extreme cases:
    • If the error goes to zero, we'd have a "perfect fit". \(\frac{SSE}{SST} = 0\)
    • If \(SSE = SST\), \(x\) has told us nothing about \(y\). \(\frac{SSE}{SST} = 1\)
  • Interpret \(SSE / SST\) as the fraction of the total variation in \(y\) that is still in the residuals
  • \(R^2 = 1 - \frac{SSE}{SST}\) is the fraction of the total variation in \(y\) "accounted for" by the model in \(x\).
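Computing \(R^2\) from \(SSE\) and \(SST\) by hand matches the "Multiple R-squared" that `summary()` reports. A sketch, again with simulated data:

```r
set.seed(1)
x <- runif(61, 5, 140)
y <- 1676 - 3.2 * x + rnorm(61, sd = 140)
fit <- lm(y ~ x)

SSE <- sum(residuals(fit)^2)
SST <- sum((y - mean(y))^2)
R2 <- 1 - SSE / SST

all.equal(R2, summary(fit)$r.squared)  # TRUE
```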

"Adjusted" \(R^2\)

  • Let's summarize spread of residuals by the residual variance \(s_e^2 = \frac{SSE}{n - 2}\)
  • … and summarize spread of Mortality by its sample variance \(s_y^2 = \frac{SST}{n - 1}\)
  • So \(\frac{\text{spread of residuals if we DO know calcium}}{\text{spread of deviations if we DON'T know about calcium}} = \frac{s_e^2}{s_y^2}\)
  • Two extreme cases:
    • If the error goes to zero, we'd have a "perfect fit". \(\frac{s_e^2}{s_y^2} = 0\)
    • If \(s_e^2 = s_y^2\), \(x\) has told us nothing about \(y\). \(\frac{s_e^2}{s_y^2} = 1\)
  • \(R^2_{adj} = 1 - \frac{s_e^2}{s_y^2}\) is the fraction of the total variation in \(y\) "accounted for" by the model in \(x\).

Residual Std. Dev. and \(R^2\) in R

linear_fit <- lm(Mortality ~ Calcium, data = mortality_water)
summary(linear_fit)
## 
## Call:
## lm(formula = Mortality ~ Calcium, data = mortality_water)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -348.61 -114.52   -7.09  111.52  336.45 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1676.3556    29.2981  57.217  < 2e-16 ***
## Calcium       -3.2261     0.4847  -6.656 1.03e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 143 on 59 degrees of freedom
## Multiple R-squared:  0.4288, Adjusted R-squared:  0.4191 
## F-statistic:  44.3 on 1 and 59 DF,  p-value: 1.033e-08
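All three quantities can be pulled out of the summary object directly, and for a simple linear regression the adjusted \(R^2\) satisfies the variance-ratio formula \(1 - s_e^2 / s_y^2\). A sketch with simulated data:

```r
set.seed(1)
x <- runif(61, 5, 140)
y <- 1676 - 3.2 * x + rnorm(61, sd = 140)
fit <- lm(y ~ x)
s <- summary(fit)

s$sigma          # residual standard error, s_e
s$r.squared      # Multiple R-squared
s$adj.r.squared  # Adjusted R-squared

# Adjusted R^2 equals 1 - (residual variance) / (sample variance of y)
all.equal(s$adj.r.squared, 1 - s$sigma^2 / var(y))  # TRUE
```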

An even briefer aside:

  • Sample variance of Mortality: \[s_y^2 = \frac{SST}{n - 1}\]
  • Residual Variance: \[s^2_e = \frac{SSE}{n - 2}\]
  • The \(n - 1\) and \(n - 2\) have to do with the number of parameters used to make our prediction:
    • Normal model: 1 parameter (\(\mu\)), use \(n - 1\)
    • Linear model: 2 parameters (\(b_0\), \(b_1\)), use \(n - 2\)
  • We'll talk about why in a month or so