- Linear Models – the Statistical Ideas
- Linear Models in R
September 27, 2017
Scientists believe that hard water (water with high concentrations of calcium and magnesium) is beneficial for health (Sengupta, P. (2013). IJPM, 4(8), 866-875.)
We have recordings of the mortality rate (deaths per 100,000 population) and concentration of calcium in drinking water (parts per million) in 61 large towns in England and Wales
Equation: Predicted Mortality = 1676.36 - 3.23 Calcium
Interpretation of \(b_0\): If a town's drinking water contained no calcium, the predicted mortality rate would be 1676.36 people per 100,000, annually.
Interpretation of \(b_1\): For each unit increase in Calcium, the predicted mortality rate goes down by 3.23 people per 100,000, annually.
Interpretation of \(b_1\): For each unit increase in Calcium, the predicted mortality rate goes down by 3.23 people per 100,000, annually.
Does this mean a unit increase in Calcium causes a reduction in mortality rate of 3.23 people per 100,000?
Equation: Predicted Mortality = 1676.36 - 3.23 Calcium
The word Predicted is essential! The model's predictions don't exactly equal the observed values.
Residual = Observed - Predicted
\(\definecolor{residual}{RGB}{230,159,0}\color{residual}e_i\) = \(\definecolor{observed}{RGB}{0,158,115}\color{observed}y_i\) - \(\definecolor{predicted}{RGB}{86,180,233}\color{predicted}\widehat{y}_i\)
\[b_1 = r \frac{s_y}{s_x} \qquad \qquad b_0 = \bar{y} - b_1 \bar{x}\]
Mortality Example:
response_variable
vs. explanatory_variable
response_variable
vs. explanatory_variable
response_variable
vs. explanatory_variable
linear_fit <- lm(response_variable ~ explanatory_variable, data = data_frame)
linear_fit <- lm(Mortality ~ Calcium, data = mortality_water)
response_variable
as a function of explanatory_variable
, with the variables taken from data_frame
, and save the result in an object called linear_fit
"coef()
function to view the estimated coefficients:coef(linear_fit)
## (Intercept) Calcium ## 1676.355601 -3.226092
(Intercept)
predict()
to get the predicted responses for all observations in the data setpredict(linear_fit)
## 1 2 3 4 5 6 7 8 ## 1534.408 1486.016 1247.285 1589.251 1656.999 1331.164 1660.225 1386.007 ## 9 10 11 12 13 14 15 16 ## 1656.999 1350.520 1634.416 1631.190 1518.277 1650.547 1631.190 1456.981 ## 17 18 19 20 21 22 23 24 ## 1515.051 1434.399 1447.303 1634.416 1492.468 1447.303 1611.834 1482.790 ## 25 26 27 28 29 30 31 32 ## 1505.373 1282.772 1415.042 1447.303 1608.608 1631.190 1634.416 1337.616 ## 33 34 35 36 37 38 39 40 ## 1621.512 1660.225 1631.190 1618.286 1644.095 1627.964 1424.720 1644.095 ## 41 42 43 44 45 46 47 48 ## 1405.364 1440.851 1637.643 1424.720 1366.651 1611.834 1550.538 1550.538 ## 49 50 51 52 53 54 55 56 ## 1282.772 1608.608 1534.408 1373.103 1650.547 1647.321 1382.781 1231.155 ## 57 58 59 60 61 ## 1624.738 1556.990 1627.964 1650.547 1592.477
residuals()
to get the residuals for all observations in the data setresiduals(linear_fit)
## 1 2 3 4 5 6 ## 167.592430 -177.016196 11.714583 -162.251127 67.000949 -156.163799 ## 7 8 9 10 11 12 ## -174.225143 69.992644 39.000949 -114.520348 76.583590 -187.190318 ## 13 14 15 16 17 18 ## 72.722888 336.453132 -136.190318 -87.981371 -258.051020 152.601270 ## 19 20 21 22 23 24 ## 265.696903 -77.416410 147.531621 261.696903 13.166231 44.209896 ## 25 26 27 28 29 30 ## 121.627254 203.227575 69.957819 -69.303097 -89.607677 -50.190318 ## 31 32 33 34 35 36 ## -9.416410 -90.615982 46.487957 -194.225143 168.809682 -9.285952 ## 37 38 39 40 41 42 ## -86.094685 179.035773 -125.720455 -7.094685 -46.363906 -48.850913 ## 43 44 45 46 47 48 ## 117.357499 -117.720455 -112.650807 -120.833769 4.461972 -122.538028 ## 49 50 51 52 53 54 ## 35.227575 -348.607677 188.592430 5.897010 91.453132 -73.320776 ## 55 56 57 58 59 60 ## 186.218735 -135.154959 -33.738135 -154.990211 144.035773 177.453132 ## 61 ## 111.522781
mortality_water <- mutate(mortality_water, predicted = predict(linear_fit), residual = residuals(linear_fit)) head(mortality_water)
## # A tibble: 6 x 5 ## Mortality Calcium Derby predicted residual ## <dbl> <int> <chr> <dbl> <dbl> ## 1 1702 44 South 1534.408 167.59243 ## 2 1309 59 South 1486.016 -177.01620 ## 3 1259 133 South 1247.285 11.71458 ## 4 1427 27 North 1589.251 -162.25113 ## 5 1724 6 North 1656.999 67.00095 ## 6 1175 107 South 1331.164 -156.16380
ggplot() + geom_density(mapping = aes(x = residual), data = mortality_water)
ggplot() + geom_point(mapping = aes(x = Calcium, y = Mortality), data = mortality_water) + geom_smooth(mapping = aes(x = Calcium, y = Mortality), data = mortality_water, method = "lm", se = FALSE)
Function | What does it do? |
---|---|
linear_fit <- lm(response ~ explanatory, data = data_frame) |
Fit a linear model and save the fit in an object named linear_fit |
coef(linear_fit) |
Get the estimated coefficients \(b_0\) and \(b_1\) |
predict(linear_fit) |
Get the predicted response for each observation |
residuals(linear_fit) |
Get the residuals for each observation |
geom_smooth(mapping = aes(...), data = ..., method = "lm", se = FALSE) |
Add a line to a scatter plot |
predict
ions and residuals
in the original data frame by using mutate
.