Wildfires (based on SDM4 5.58)

The following R chunk loads a data set with the average number of acres burned per wildfire (in hundreds of thousands of acres) in each year from 1985 to 2012. I’ve written this code out for you, but at this point in the class it would be good if you understand what each line of code in this R chunk does. Let me know if any of this isn’t clear!

library(tidyverse)  # provides read_csv(), mutate(), arrange(), and ggplot() used below

# read the data from the course website and give the variables shorter names
wildfires <- read_csv("https://mhc-stat140-2017.github.io/data/sdm4/Wildfires_2012.csv")
names(wildfires) <- c("num_fires", "years_since_1985", "ave_acres_burned")
# add a variable with the calendar year and sort the rows by year
wildfires <- mutate(wildfires, year = years_since_1985 + 1985)
wildfires <- arrange(wildfires, year)
head(wildfires)  # look at the first 6 rows
## # A tibble: 6 x 4
##   num_fires years_since_1985 ave_acres_burned  year
##       <int>            <int>            <dbl> <dbl>
## 1     82591                0             29.0  1985
## 2     85907                1             27.2  1986
## 3     71300                2             24.5  1987
## 4     72750                3             50.1  1988
## 5     48949                4             18.3  1989
## 6     66481                5             46.2  1990
str(wildfires)
## Classes 'tbl_df', 'tbl' and 'data.frame':    28 obs. of  4 variables:
##  $ num_fires       : int  82591 85907 71300 72750 48949 66481 75754 87394 58810 79107 ...
##  $ years_since_1985: int  0 1 2 3 4 5 6 7 8 9 ...
##  $ ave_acres_burned: num  29 27.2 24.5 50.1 18.3 ...
##  $ year            : num  1985 1986 1987 1988 1989 ...

(a) We suspect that there may be a trend in the number of acres burned over time. Make a scatter plot with years_since_1985 on the horizontal axis and ave_acres_burned on the vertical axis.

SOLUTION:

# Your code goes here
ggplot() +
  geom_point(mapping = aes(x = years_since_1985, y = ave_acres_burned), data = wildfires)

(b) Is a linear regression model appropriate for these data? Check for (1) quantitative variables, (2) straight enough, (3) outliers, and (4) does the plot thicken? Note that we can’t yet check whether the residuals follow a nearly normal distribution, since we haven’t yet fit the model in order to get the residuals to plot!

SOLUTION:

  1. Both variables are quantitative.

  2. The relationship is fairly straight, although I could also imagine that the relationship is flat for the first 10 or so years, with the average number of acres burned starting to increase after that (see the optional smoother check sketched after this list).

  3. There aren’t any outliers.

  4. The plot does not thicken – there seems to be about equal vertical spread of the points across the full range of years in our data set.
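
The "straight enough" check above is a visual judgment call. If you want a second opinion (this is optional, not something the problem asks for), one approach is to overlay a flexible loess smoother on the scatter plot from part (a) and see how far it bends away from a straight line:

# Optional: overlay a flexible (loess) smoother as a visual aid for the
# straight enough condition; this is not required by the problem.
ggplot() +
  geom_point(mapping = aes(x = years_since_1985, y = ave_acres_burned),
    data = wildfires) +
  geom_smooth(mapping = aes(x = years_since_1985, y = ave_acres_burned),
    data = wildfires,
    method = "loess",
    se = FALSE)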

(c) Regardless of your answer to part (b), fit the regression model using ave_acres_burned as the response variable and years_since_1985 as the explanatory variable.

SOLUTION:

# Your code goes here
linear_fit <- lm(ave_acres_burned ~ years_since_1985, data = wildfires)
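
The prompt only asks us to fit and save the model, but as an optional aside, summary() prints a fuller description of the fit (coefficient estimates, R-squared, and the residual standard error); its output is not shown here.

# Optional: a fuller description of the fitted model
summary(linear_fit)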

(d) Make a scatter plot with the regression line. You can use the same code you used for part (a) as a starter, but now add a second layer to the plot using geom_smooth.

SOLUTION:

# Your code goes here
ggplot() +
  geom_point(mapping = aes(x = years_since_1985, y = ave_acres_burned),
    data = wildfires) + 
  geom_smooth(mapping = aes(x = years_since_1985, y = ave_acres_burned),
    data = wildfires,
    method = "lm",
    se = FALSE
  )
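
A side note on the code (optional): you can avoid repeating the aesthetic mapping in both layers by putting the data and mapping in the ggplot() call, so that geom_point and geom_smooth inherit them. Both versions produce the same plot.

# Equivalent plot, with the data and mapping set once in ggplot()
ggplot(data = wildfires, mapping = aes(x = years_since_1985, y = ave_acres_burned)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)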

(e) Now that we’ve fit a tentative model, we can check whether the residuals follow a nearly normal distribution. Use the mutate function to add two new variables to the wildfires data frame: one with the predicted values from the model, and one with the residuals (we won’t need the predicted values in this lab, but we will use them in future classes and it’s good to get in the habit of saving them). Then, make a density plot of the residuals. Is the assumption that the residuals follow a nearly normal distribution satisfied?

SOLUTION:

# Your code goes here
wildfires <- mutate(wildfires,
  predicted = predict(linear_fit),
  residual = residuals(linear_fit))
ggplot() +
  geom_density(mapping = aes(x = residual), data = wildfires)

It looks like the distribution is bimodal and slightly skewed to the left, so the nearly normal condition is not satisfied. If you look back at the plot from part (d), most of the points fall either a bit above or a bit below the line, with few points very close to it; that is exactly the pattern we see in the density plot of the residuals.
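
The density plot is what the prompt asks for, but another common (and optional here) way to check the nearly normal condition is a normal quantile-quantile plot of the residuals; nearly normal residuals would fall close to a straight line.

# Optional alternative check: normal Q-Q plot of the residuals
ggplot() +
  geom_qq(mapping = aes(sample = residual), data = wildfires)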

(f) Regardless of your answer to part (e), let’s go ahead with using this model. Use the coef function to print out the estimated coefficients. Write down the equation of the line of regression predicting number of acres burned from the number of years since 1985. Interpret the intercept and slope in the context of the problem. Does the interpretation of the intercept make sense?

SOLUTION:

# Your code goes here
coef(linear_fit)
##      (Intercept) years_since_1985 
##            19.62             2.22

You could write either of the following for the equation:

Predicted Average Acres Burned = 19.62 + 2.22 (Years Since 1985)

\(\hat{y}_i = 19.62 + 2.22 x_i\)

Interpretation of intercept: The predicted average acres burned in 1985 (when the years since 1985 is 0) is 19.62 hundred thousand.

Interpretation of slope: For each additional year, the predicted average number of acres burned increases by 2.22 hundred thousand.

Both interpretations make sense here: 1985 corresponds to years_since_1985 = 0, which is within the range of our data, and the predicted value of about 19.62 hundred thousand acres matches what we see in the plots above.

(g) Using the equation you wrote down in part (f), calculate the linear model’s predicted values for 1985 and for 2018.

SOLUTION:

In 1985, the number of years since 1985 is 0. So the predicted average acres burned is 19.62 + 2.22 * 0 = 19.62 hundred thousand acres.

In 2018, the number of years since 1985 is 33. So the predicted average acres burned is 19.62 + 2.22 * 33 = 92.9 hundred thousand acres.
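
As an optional check of this arithmetic, we can also ask R for the same predictions. At this point linear_fit is still the model from part (c), with years_since_1985 as the explanatory variable, so predict() with a small new data frame should match the calculations above, up to the rounding in the coefficients we used by hand.

# Optional check of the hand calculations; 0 corresponds to 1985 and 33 to 2018
predict(linear_fit, newdata = data.frame(years_since_1985 = c(0, 33)))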

(h) Fit a new model using year as the explanatory variable. Use the coef function to obtain the estimated model coefficients, and write down the estimated regression equation. Interpret the coefficients in this model. In this model, does the interpretation of the intercept make sense? Compare the slope to what you got in part (f).

SOLUTION:

# Your code goes here
linear_fit <- lm(ave_acres_burned ~ year, data = wildfires)  # note: this overwrites the model from part (c)
coef(linear_fit)
## (Intercept)        year 
##    -4390.60        2.22

You could write either of the following for the equation:

Predicted Average Acres Burned = -4390.6 + 2.22 (Year)

\(\hat{y}_i = -4390.6 + 2.22 x_i\)

Interpretation of intercept: The predicted average acres burned in year 0 is -4390.6 hundred thousand. That doesn’t make any sense!

Interpretation of slope: For each additional year, the predicted average number of acres burned increases by 2.22 hundred thousand. This is the same slope we got in part (f): shifting the explanatory variable by a constant (year = years_since_1985 + 1985) changes the intercept but not the slope.

(i) Using the equation you wrote down in part (h), calculate the linear model’s predicted values for 1985 and for 2018. Compare to the answer you got in part (g) (you should have the same results, up to rounding errors).

SOLUTION:

For 1985: \(-4390.6 + 2.22 * 1985 = 16.1\). The predicted average acres burned is 16.1 hundred thousand acres.

For 2018: \(-4390.6 + 2.22 * 2018 = 89.4\). The predicted average acres burned is 89.4 hundred thousand acres.

These don’t exactly match the answers from part (g) because of rounding errors. We can get more precise results by using more digits:

options(digits = 10)
coef(linear_fit)
##     (Intercept)            year 
## -4390.598311439     2.221770662

For 1985: -4390.598311439 + 2.221770662 * 1985 = 19.61645263

For 2018: -4390.598311439 + 2.221770662 * 2018 = 92.93488448
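
Equivalently, and again just as an optional check, we can let R carry the full-precision coefficients through the calculation for us. At this point linear_fit holds the year model from part (h), so the predictions below (output not shown) should match the full-precision calculations above.

# Optional: the same predictions without rounding the coefficients by hand
predict(linear_fit, newdata = data.frame(year = c(1985, 2018)))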