The following R chunk loads a data set with the average number of acres burned per wildfire (in hundreds of thousands of acres) in each year from 1985 to 2012. I’ve written this code out for you, but at this point in the class it would be good if you understand what each line of code in this R chunk does. Let me know if any of this isn’t clear!
wildfires <- read_csv("https://mhc-stat140-2017.github.io/data/sdm4/Wildfires_2012.csv")
names(wildfires) <- c("num_fires", "years_since_1985", "ave_acres_burned")
wildfires <- mutate(wildfires, year = years_since_1985 + 1985)
wildfires <- arrange(wildfires, year)
head(wildfires)
## # A tibble: 6 x 4
## num_fires years_since_1985 ave_acres_burned year
## <int> <int> <dbl> <dbl>
## 1 82591 0 29.0 1985
## 2 85907 1 27.2 1986
## 3 71300 2 24.5 1987
## 4 72750 3 50.1 1988
## 5 48949 4 18.3 1989
## 6 66481 5 46.2 1990
str(wildfires)
## Classes 'tbl_df', 'tbl' and 'data.frame': 28 obs. of 4 variables:
## $ num_fires : int 82591 85907 71300 72750 48949 66481 75754 87394 58810 79107 ...
## $ years_since_1985: int 0 1 2 3 4 5 6 7 8 9 ...
## $ ave_acres_burned: num 29 27.2 24.5 50.1 18.3 ...
## $ year : num 1985 1986 1987 1988 1989 ...
years_since_1985
on the horizontal axis and ave_acres_burned
on the vertical axis.SOLUTION:
# Your code goes here
ggplot() +
geom_point(mapping = aes(x = years_since_1985, y = ave_acres_burned), data = wildfires)
SOLUTION:
Both variables are quantitative.
The relationship is fairly straight, although I could also imagine that the relationship is flat for the first 10 or so years, and the average number of acres burned starts increasing after that.
There aren’t any outliers.
The plot does not thicken – there seems to be about equal vertical spread of the points across the full range of years in our data set.
ave_acres_burned
as the response variable and years_since_1985
as the explanatory variable.SOLUTION:
# Your code goes here
linear_fit <- lm(ave_acres_burned ~ years_since_1985, data = wildfires)
geom_smooth
# Your code goes here
ggplot() +
geom_point(mapping = aes(x = years_since_1985, y = ave_acres_burned),
data = wildfires) +
geom_smooth(mapping = aes(x = years_since_1985, y = ave_acres_burned),
data = wildfires,
method = "lm",
se = FALSE
)
mutate
function to add two new variables to the wildfires
data frame: one with the predicted values from the model, and one with the residuals (we won’t need the predicted values in this lab, but we will use them in future classes and it’s good to get in the habit of saving them). Then, make a density plot of the residuals. Is the assumption that the residuals follow a nearly normal distribution satisfied?SOLUTION:
wildfires <- mutate(wildfires,
residual = residuals(linear_fit))
ggplot() +
geom_density(mapping = aes(x = residual), data = wildfires)
It looks like the distribution is bi-modal and skewed to the left slightly – the nearly normal condition is not satisfied. If you look back at the plot from part (d), it seems that the points mainly fall a bit above the line, or a bit below the line, but few points are very close to the line. This is what we’re seeing in the residuals density plot.
coef
function to print out the estimated coefficients. Write down the equation of the line of regression predicting number of acres burned from the number of years since 1985. Interpret the intercept and slope in the context of the problem. Does the interpretation of the intercept make sense?SOLUTION:
# Your code goes here
coef(linear_fit)
## (Intercept) years_since_1985
## 19.62 2.22
You could write either of the following for the equation:
Predicted Average Acres Burned = 19.62 + 2.22 (Years Since 1985)
\(\hat{y}_i = 19.62 + 2.22 x_i\)
Interpretation of intercept: The predicted average acres burned in 1985 (when the years since 1985 is 0) is 19.62 hundred thousand.
Interpretation of slope: For each additional year, the predicted average number of acres burned increases by 2.22 hundred thousand.
The interpretation of these coefficients seems reasonable, and matches the data in the plots above.
SOLUTION:
In 1985, the number of years since 1985 is 0. So the predicted average acres burned is 19.62 + 2.22 * 0 = 19.62 hundred thousand acres.
In 2018, the number of years since 1985 is 33. So the predicted average acres burned is 19.62 + 2.22 * 33 = 92.9 hundred thousand acres.
year
as the explanatory variable. Use the coef
function to obtain the estimated model coefficients, and write down the estimated regression equation. Interpret the coefficients in this model. In this model, does the interpretation of the intercept make sense? Compare the slope to what you got in part (f).SOLUTION:
# Your code goes here
linear_fit <- lm(ave_acres_burned ~ year, data = wildfires)
coef(linear_fit)
## (Intercept) year
## -4390.60 2.22
You could write either of the following for the equation:
Predicted Average Acres Burned = -4390.6 + 2.22 (Year)
\(\hat{y}_i = -4390.6 + 2.22 x_i\)
Interpretation of intercept: The predicted average acres burned in year 0 is -4390.6 hundred thousand. That doesn’t make any sense!
Interpretation of slope: For each additional year, the predicted average number of acres burned increases by 2.22 hundred thousand. This is the same as what we got in part (f).
SOLUTION:
For 1985: \(-4390.6 + 2.22 * 1985 = 16.1\). The predicted average acres burned is 16.1 hundred thousand acres.
For 2018: \(-4390.6 + 2.22 * 2018 = 89.4\). The predicted average acres burned is 89.4 hundred thousand acres.
These don’t exactly match the answers from part (g) because of rounding errors. We can get more precise results by using more digits:
options(digits = 10)
coef(linear_fit)
## (Intercept) year
## -4390.598311439 2.221770662
For 1985: -4390.598311439 + 2.221770662 * 1985 = 19.61645263
For 2018: -4390.598311439 + 2.221770662 * 2018 = 92.93488448