Linear Regression

September 27, 2017

Organization for These Slides

Linear Models – the Statistical Ideas
Linear Models in R

Linear Models – the Statistical Ideas

Today's Data Set

Scientists believe that hard water (water with high concentrations of calcium and magnesium) is beneficial for health (Sengupta, P. (2013). IJPM, 4(8), 866-875.)
We have recordings of the mortality rate (deaths per 100,000 population) and concentration of calcium in drinking water (parts per million) in 61 large towns in England and Wales

Explanatory (~~independent~~) variable goes on the x axis, Response (~~dependent~~) variable goes on the y axis

Does Fitting a Line Make Sense?

Always look at a scatter plot to verify these conditions:
1. Linear relationship (straight enough)
2. No Outliers

Line of Best Fit

Equation: Predicted Mortality = 1676.36 - 3.23 \(*\) Calcium
General Form: Predicted Response = intercept + slope \(*\) Explanatory
General Form (math notation): \(\widehat{y}_i = b_0 + b_1 * x_i\)
\(y_i\): response for observation \(i\), \(\widehat{y}_i\): predicted response, \(x_i\) explanatory variable

Interpretation of the Coefficients:

Equation: Predicted Mortality = 1676.36 - 3.23 Calcium
Interpretation of \(b_0\): If a town's drinking water contained no calcium, the predicted mortality rate would be 1676.36 people per 100,000, annually.
Interpretation of \(b_1\): For each unit increase in Calcium, the predicted mortality rate goes down by 3.23 people per 100,000, annually.

Interpretation Continued

Interpretation of \(b_1\): For each unit increase in Calcium, the predicted mortality rate goes down by 3.23 people per 100,000, annually.
Does this mean a unit increase in Calcium causes a reduction in mortality rate of 3.23 people per 100,000?

Lurking Variables

Lurking Variable: A variable that's not part of the model, but affects the way the variables in the model appear to be related.
Are there any possible lurking variables in our example?

Never conclude that a causal relationship exists by fitting a regression model to observational data
Need to run experiments!

Model Predictions Are Not Exactly Right!

Equation: Predicted Mortality = 1676.36 - 3.23 Calcium
The word Predicted is essential! The model's predictions don't exactly equal the observed values.

Residuals

Residual = Observed - Predicted
\(\definecolor{residual}{RGB}{230,159,0}\color{residual}e_i\) = \(\definecolor{observed}{RGB}{0,158,115}\color{observed}y_i\) - \(\definecolor{predicted}{RGB}{86,180,233}\color{predicted}\widehat{y}_i\)

Finding the Line of Best Fit

The "best" line has the smallest residuals
Pick \(b_0, b_1\) to minimize sum of squared errors/residuals: \(SSE = \sum_{i = 1}^n e_i^2\) (see Chapter 26…)

\[b_1 = r \frac{s_y}{s_x} \qquad \qquad b_0 = \bar{y} - b_1 \bar{x}\]

The Distribution of the Residuals

Estimating \(b_0\) and \(b_1\) by minimizing \(SSE = \sum_{i = 1}^n e_i^2\) is only justified if:
- The distribution of residuals is nearly normal (unimodal, symmetric)
- The standard deviation of the residuals is about the same across all values of the explanatory variable

Checking Equal Residual Std. Dev.

"Does the plot thicken?"

Mortality Example:
Median family income vs. housing cost index in the 50 states

Summary of Conditions to Check

Linear Models are only appropriate for quantitative variables (at least, in this Chapter)
Four conditions to check:
1. Linear relationship (straight enough)
  - scatterplot of response_variable vs. explanatory_variable
2. No Outliers
  - scatterplot of response_variable vs. explanatory_variable
3. Normally Distributed Residuals
  - density plot of residuals
4. Equal Spread (does the plot thicken?)
  - scatterplot of response_variable vs. explanatory_variable

Linear Models in R

Fitting Linear Models in R

General Template:

linear_fit <- lm(response_variable ~ explanatory_variable,
  data = data_frame)

Our Example:

linear_fit <- lm(Mortality ~ Calcium, data = mortality_water)

"Fit a linear model for response_variable as a function of explanatory_variable, with the variables taken from data_frame, and save the result in an object called linear_fit"

Getting the Estimated Coefficients in R

Use the coef() function to view the estimated coefficients:

coef(linear_fit)

## (Intercept)     Calcium 
## 1676.355601   -3.226092

\(b_0\) is labeled (Intercept)
\(b_1\) is labeled with the explanatory variable name

Getting Predicted Values in R

Use predict() to get the predicted responses for all observations in the data set

predict(linear_fit)

##        1        2        3        4        5        6        7        8 
## 1534.408 1486.016 1247.285 1589.251 1656.999 1331.164 1660.225 1386.007 
##        9       10       11       12       13       14       15       16 
## 1656.999 1350.520 1634.416 1631.190 1518.277 1650.547 1631.190 1456.981 
##       17       18       19       20       21       22       23       24 
## 1515.051 1434.399 1447.303 1634.416 1492.468 1447.303 1611.834 1482.790 
##       25       26       27       28       29       30       31       32 
## 1505.373 1282.772 1415.042 1447.303 1608.608 1631.190 1634.416 1337.616 
##       33       34       35       36       37       38       39       40 
## 1621.512 1660.225 1631.190 1618.286 1644.095 1627.964 1424.720 1644.095 
##       41       42       43       44       45       46       47       48 
## 1405.364 1440.851 1637.643 1424.720 1366.651 1611.834 1550.538 1550.538 
##       49       50       51       52       53       54       55       56 
## 1282.772 1608.608 1534.408 1373.103 1650.547 1647.321 1382.781 1231.155 
##       57       58       59       60       61 
## 1624.738 1556.990 1627.964 1650.547 1592.477

Getting Residuals in R

Use residuals() to get the residuals for all observations in the data set

residuals(linear_fit)

##           1           2           3           4           5           6 
##  167.592430 -177.016196   11.714583 -162.251127   67.000949 -156.163799 
##           7           8           9          10          11          12 
## -174.225143   69.992644   39.000949 -114.520348   76.583590 -187.190318 
##          13          14          15          16          17          18 
##   72.722888  336.453132 -136.190318  -87.981371 -258.051020  152.601270 
##          19          20          21          22          23          24 
##  265.696903  -77.416410  147.531621  261.696903   13.166231   44.209896 
##          25          26          27          28          29          30 
##  121.627254  203.227575   69.957819  -69.303097  -89.607677  -50.190318 
##          31          32          33          34          35          36 
##   -9.416410  -90.615982   46.487957 -194.225143  168.809682   -9.285952 
##          37          38          39          40          41          42 
##  -86.094685  179.035773 -125.720455   -7.094685  -46.363906  -48.850913 
##          43          44          45          46          47          48 
##  117.357499 -117.720455 -112.650807 -120.833769    4.461972 -122.538028 
##          49          50          51          52          53          54 
##   35.227575 -348.607677  188.592430    5.897010   91.453132  -73.320776 
##          55          56          57          58          59          60 
##  186.218735 -135.154959  -33.738135 -154.990211  144.035773  177.453132 
##          61 
##  111.522781

Saving the Predictions and Residuals

For making plots, it's helpful to add the predictions and residuals to the original data frame:

mortality_water <- mutate(mortality_water,
  predicted = predict(linear_fit),
  residual = residuals(linear_fit))
head(mortality_water)

## # A tibble: 6 x 5
##   Mortality Calcium Derby predicted   residual
##       <dbl>   <int> <chr>     <dbl>      <dbl>
## 1      1702      44 South  1534.408  167.59243
## 2      1309      59 South  1486.016 -177.01620
## 3      1259     133 South  1247.285   11.71458
## 4      1427      27 North  1589.251 -162.25113
## 5      1724       6 North  1656.999   67.00095
## 6      1175     107 South  1331.164 -156.16380

A Density Plot of Residuals

ggplot() +
  geom_density(mapping = aes(x = residual), data = mortality_water)

Adding the line to a scatterplot

ggplot() +
  geom_point(mapping = aes(x = Calcium, y = Mortality),
    data = mortality_water) +
  geom_smooth(mapping = aes(x = Calcium, y = Mortality),
    data = mortality_water,
    method = "lm",
    se = FALSE)

Summary of R Functions:

Function	What does it do?
`linear_fit <- lm(response ~ explanatory,` `data = data_frame)`	Fit a linear model and save the fit in an object named `linear_fit`
`coef(linear_fit)`	Get the estimated coefficients \(b_0\) and \(b_1\)
`predict(linear_fit)`	Get the predicted response for each observation
`residuals(linear_fit)`	Get the residuals for each observation
`geom_smooth(mapping = aes(...), data = ..., method = "lm", se = FALSE)`	Add a line to a scatter plot

You probably want to save the predictions and residuals in the original data frame by using mutate.