September 27, 2017

Organization for These Slides

  1. Linear Models – the Statistical Ideas
  2. Linear Models in R

Linear Models – the Statistical Ideas

Today's Data Set

  • Scientists believe that hard water (water with high concentrations of calcium and magnesium) is beneficial for health (Sengupta, P. (2013). IJPM, 4(8), 866-875.)

  • We have recordings of the mortality rate (deaths per 100,000 population) and concentration of calcium in drinking water (parts per million) in 61 large towns in England and Wales

  • Explanatory (independent) variable goes on the x axis, Response (dependent) variable goes on the y axis

Does Fitting a Line Make Sense?

  • Always look at a scatter plot to verify these conditions:
    1. Linear relationship (straight enough)
    2. No Outliers

Line of Best Fit

  • Equation: Predicted Mortality = 1676.36 - 3.23 \(*\) Calcium
  • General Form: Predicted Response = intercept + slope \(*\) Explanatory
  • General Form (math notation): \(\widehat{y}_i = b_0 + b_1 * x_i\)
  • \(y_i\): response for observation \(i\), \(\widehat{y}_i\): predicted response, \(x_i\) explanatory variable

Interpretation of the Coefficients:

  • Equation: Predicted Mortality = 1676.36 - 3.23 Calcium

  • Interpretation of \(b_0\): If a town's drinking water contained no calcium, the predicted mortality rate would be 1676.36 people per 100,000, annually.

  • Interpretation of \(b_1\): For each unit increase in Calcium, the predicted mortality rate goes down by 3.23 people per 100,000, annually.

Interpretation Continued

  • Interpretation of \(b_1\): For each unit increase in Calcium, the predicted mortality rate goes down by 3.23 people per 100,000, annually.

  • Does this mean a unit increase in Calcium causes a reduction in mortality rate of 3.23 people per 100,000?

Lurking Variables

  • Lurking Variable: A variable that's not part of the model, but affects the way the variables in the model appear to be related.
  • Are there any possible lurking variables in our example?

  • Never conclude that a causal relationship exists by fitting a regression model to observational data
  • Need to run experiments!

Model Predictions Are Not Exactly Right!

  • Equation: Predicted Mortality = 1676.36 - 3.23 Calcium

  • The word Predicted is essential! The model's predictions don't exactly equal the observed values.

Residuals

  • Residual = Observed - Predicted

  • \(\definecolor{residual}{RGB}{230,159,0}\color{residual}e_i\) = \(\definecolor{observed}{RGB}{0,158,115}\color{observed}y_i\) - \(\definecolor{predicted}{RGB}{86,180,233}\color{predicted}\widehat{y}_i\)

Finding the Line of Best Fit

  • The "best" line has the smallest residuals
  • Pick \(b_0, b_1\) to minimize sum of squared errors/residuals: \(SSE = \sum_{i = 1}^n e_i^2\) (see Chapter 26…)

\[b_1 = r \frac{s_y}{s_x} \qquad \qquad b_0 = \bar{y} - b_1 \bar{x}\]

The Distribution of the Residuals

  • Estimating \(b_0\) and \(b_1\) by minimizing \(SSE = \sum_{i = 1}^n e_i^2\) is only justified if:
    • The distribution of residuals is nearly normal (unimodal, symmetric)
    • The standard deviation of the residuals is about the same across all values of the explanatory variable

Checking Equal Residual Std. Dev.

  • "Does the plot thicken?"
  • Mortality Example:

  • Median family income vs. housing cost index in the 50 states

Summary of Conditions to Check

  • Linear Models are only appropriate for quantitative variables (at least, in this Chapter)
  • Four conditions to check:
    1. Linear relationship (straight enough)
      • scatterplot of response_variable vs. explanatory_variable
    2. No Outliers
      • scatterplot of response_variable vs. explanatory_variable
    3. Normally Distributed Residuals
      • density plot of residuals
    4. Equal Spread (does the plot thicken?)
      • scatterplot of response_variable vs. explanatory_variable

Linear Models in R

Fitting Linear Models in R

  • General Template:
linear_fit <- lm(response_variable ~ explanatory_variable,
  data = data_frame)
  • Our Example:
linear_fit <- lm(Mortality ~ Calcium, data = mortality_water)
  • "Fit a linear model for response_variable as a function of explanatory_variable, with the variables taken from data_frame, and save the result in an object called linear_fit"

Getting the Estimated Coefficients in R

  • Use the coef() function to view the estimated coefficients:
coef(linear_fit)
## (Intercept)     Calcium 
## 1676.355601   -3.226092
  • \(b_0\) is labeled (Intercept)
  • \(b_1\) is labeled with the explanatory variable name

Getting Predicted Values in R

  • Use predict() to get the predicted responses for all observations in the data set
predict(linear_fit)
##        1        2        3        4        5        6        7        8 
## 1534.408 1486.016 1247.285 1589.251 1656.999 1331.164 1660.225 1386.007 
##        9       10       11       12       13       14       15       16 
## 1656.999 1350.520 1634.416 1631.190 1518.277 1650.547 1631.190 1456.981 
##       17       18       19       20       21       22       23       24 
## 1515.051 1434.399 1447.303 1634.416 1492.468 1447.303 1611.834 1482.790 
##       25       26       27       28       29       30       31       32 
## 1505.373 1282.772 1415.042 1447.303 1608.608 1631.190 1634.416 1337.616 
##       33       34       35       36       37       38       39       40 
## 1621.512 1660.225 1631.190 1618.286 1644.095 1627.964 1424.720 1644.095 
##       41       42       43       44       45       46       47       48 
## 1405.364 1440.851 1637.643 1424.720 1366.651 1611.834 1550.538 1550.538 
##       49       50       51       52       53       54       55       56 
## 1282.772 1608.608 1534.408 1373.103 1650.547 1647.321 1382.781 1231.155 
##       57       58       59       60       61 
## 1624.738 1556.990 1627.964 1650.547 1592.477

Getting Residuals in R

  • Use residuals() to get the residuals for all observations in the data set
residuals(linear_fit)
##           1           2           3           4           5           6 
##  167.592430 -177.016196   11.714583 -162.251127   67.000949 -156.163799 
##           7           8           9          10          11          12 
## -174.225143   69.992644   39.000949 -114.520348   76.583590 -187.190318 
##          13          14          15          16          17          18 
##   72.722888  336.453132 -136.190318  -87.981371 -258.051020  152.601270 
##          19          20          21          22          23          24 
##  265.696903  -77.416410  147.531621  261.696903   13.166231   44.209896 
##          25          26          27          28          29          30 
##  121.627254  203.227575   69.957819  -69.303097  -89.607677  -50.190318 
##          31          32          33          34          35          36 
##   -9.416410  -90.615982   46.487957 -194.225143  168.809682   -9.285952 
##          37          38          39          40          41          42 
##  -86.094685  179.035773 -125.720455   -7.094685  -46.363906  -48.850913 
##          43          44          45          46          47          48 
##  117.357499 -117.720455 -112.650807 -120.833769    4.461972 -122.538028 
##          49          50          51          52          53          54 
##   35.227575 -348.607677  188.592430    5.897010   91.453132  -73.320776 
##          55          56          57          58          59          60 
##  186.218735 -135.154959  -33.738135 -154.990211  144.035773  177.453132 
##          61 
##  111.522781

Saving the Predictions and Residuals

  • For making plots, it's helpful to add the predictions and residuals to the original data frame:
mortality_water <- mutate(mortality_water,
  predicted = predict(linear_fit),
  residual = residuals(linear_fit))
head(mortality_water)
## # A tibble: 6 x 5
##   Mortality Calcium Derby predicted   residual
##       <dbl>   <int> <chr>     <dbl>      <dbl>
## 1      1702      44 South  1534.408  167.59243
## 2      1309      59 South  1486.016 -177.01620
## 3      1259     133 South  1247.285   11.71458
## 4      1427      27 North  1589.251 -162.25113
## 5      1724       6 North  1656.999   67.00095
## 6      1175     107 South  1331.164 -156.16380

A Density Plot of Residuals

ggplot() +
  geom_density(mapping = aes(x = residual), data = mortality_water)

Adding the line to a scatterplot

ggplot() +
  geom_point(mapping = aes(x = Calcium, y = Mortality),
    data = mortality_water) +
  geom_smooth(mapping = aes(x = Calcium, y = Mortality),
    data = mortality_water,
    method = "lm",
    se = FALSE)

Summary of R Functions:

Function What does it do?
linear_fit <- lm(response ~ explanatory,
data = data_frame)
Fit a linear model and save the fit in an object named linear_fit
coef(linear_fit) Get the estimated coefficients \(b_0\) and \(b_1\)
predict(linear_fit) Get the predicted response for each observation
residuals(linear_fit) Get the residuals for each observation
geom_smooth(mapping = aes(...), data = ..., method = "lm", se = FALSE) Add a line to a scatter plot
  • You probably want to save the predictions and residuals in the original data frame by using mutate.