Data were collected on the volume of users on the Northampton Rail Trail in Florence, Massachusetts. Variables in the dataset include the number of crossings on a particular day (volume, measured by a sensor near the intersection with Chestnut Street), the average of the minimum and maximum temperature in degrees Fahrenheit for that day (avgtemp), and a dichotomous indicator of whether the day was a weekday or a weekend/holiday (weekday).
RailTrail <- mutate(RailTrail, daytype = ifelse(weekday==1, "Weekday", "Wkend/Holiday"))
head(RailTrail)
## hightemp lowtemp avgtemp spring summer fall cloudcover precip volume
## 1 83 50 66.5 0 1 0 7.6 0.00 501
## 2 73 49 61.0 0 1 0 6.3 0.29 419
## 3 74 52 63.0 1 0 0 7.5 0.32 397
## 4 95 61 78.0 0 1 0 2.6 0.00 385
## 5 44 52 48.0 1 0 0 10.0 0.14 200
## 6 69 54 61.5 1 0 0 6.6 0.02 375
## weekday daytype
## 1 1 Weekday
## 2 1 Weekday
## 3 1 Weekday
## 4 0 Wkend/Holiday
## 5 1 Weekday
## 6 1 Weekday
str(RailTrail)
## 'data.frame': 90 obs. of 11 variables:
## $ hightemp : int 83 73 74 95 44 69 66 66 80 79 ...
## $ lowtemp : int 50 49 52 61 52 54 39 38 55 45 ...
## $ avgtemp : num 66.5 61 63 78 48 61.5 52.5 52 67.5 62 ...
## $ spring : int 0 0 1 0 1 1 1 1 0 0 ...
## $ summer : int 1 1 0 1 0 0 0 0 1 1 ...
## $ fall : int 0 0 0 0 0 0 0 0 0 0 ...
## $ cloudcover: num 7.6 6.3 7.5 2.6 10 ...
## $ precip : num 0 0.29 0.32 0 0.14 ...
## $ volume : int 501 419 397 385 200 375 417 629 533 547 ...
## $ weekday : Factor w/ 2 levels "0","1": 2 2 2 1 2 2 2 1 1 2 ...
## $ daytype : chr "Weekday" "Weekday" "Weekday" "Wkend/Holiday" ...
Make a scatter plot with the volume variable on the vertical axis and the avgtemp variable on the horizontal axis.
ggplot() +
geom_point(mapping = aes(x = avgtemp, y = volume), data = RailTrail)
Describe the relationship between volume and avgtemp (average of min and max temperatures). Be sure to describe the direction, form, strength, and any unusual features.
SOLUTION:
Direction: There is a positive association between temperature and the number of crossings on the bike path up until about 65 degrees, and after that there is a negative association between temperature and the number of crossings.
Form: The scatter plot shows a generally curved shape.
Strength: There is a moderately strong association between temperature and volume.
Unusual features: There are some days with unusually high or low numbers of crossings that don’t match the rest of the data well. Two days with temperatures around 60 degrees stand out as having low bike path volume, and one day around 65 degrees has unusually high bike path volume.
Find the correlation between avgtemp and volume using the cor function.
cor(RailTrail$avgtemp, RailTrail$volume)
## [1] 0.427
SOLUTION:
The correlation coefficient is 0.427, indicating a moderately strong positive association between temperature and volume. However, the scatter plot above showed that this relationship is not linear, and the correlation coefficient should only be used to describe linear relationships.
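To see why the correlation coefficient can mislead when the form is curved, here is a minimal sketch using made-up toy data (the vectors `x`, `y_linear`, and `y_curved` are hypothetical and are not part of the RailTrail dataset). Both relationships are perfectly predictable from `x`, but only the linear one is picked up by `cor`.

```r
# Hypothetical toy data for illustration; not from the RailTrail dataset
x <- -10:10

# A perfectly linear relationship
y_linear <- 2 * x + 3
r_linear <- cor(x, y_linear)

# A perfectly curved relationship, symmetric about x = 0
y_curved <- x^2
r_curved <- cor(x, y_curved)

print(r_linear)  # correlation for the linear relationship
print(r_curved)  # correlation for the curved relationship
```

The curved relationship has a correlation of essentially zero because the increasing and decreasing halves cancel out, which is a more extreme version of what happens in the rail trail data above and below about 65 degrees.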
Here are some data about the total revenue of arcades in the US (in millions of dollars) and the number of lawyers in Wyoming in each year from 2000 to 2009.
Arcades_and_Lawyers <-
read_csv("https://mhc-stat140-2017.github.io/labs/20170922_correlation/data/arcade_revenue_lawyers_Wyoming.csv")
## Parsed with column specification:
## cols(
## Year = col_integer(),
## `Total revenue generated by arcades (US, millions of dollars)` = col_integer(),
## `Lawyers in Wyoming` = col_integer()
## )
names(Arcades_and_Lawyers) <- c("year", "arcade_revenue", "lawyers_in_Wyoming")
Make a scatter plot with arcade_revenue on the horizontal axis and lawyers_in_Wyoming on the vertical axis.
SOLUTION:
ggplot() +
geom_point(mapping = aes(x = arcade_revenue, y = lawyers_in_Wyoming), data = Arcades_and_Lawyers)
SOLUTION:
There is a very strong, linear, positive association between arcade revenue and the number of lawyers in Wyoming.
Find the correlation between arcade_revenue and lawyers_in_Wyoming.
SOLUTION:
cor(Arcades_and_Lawyers$arcade_revenue, Arcades_and_Lawyers$lawyers_in_Wyoming)
## [1] 0.989
The correlation coefficient is 0.989, indicating a very strong positive association between arcade revenue and the number of lawyers in Wyoming.
SOLUTION: No! There is no reason to think that there is a causal relationship between these variables.
Go to http://www.tylervigen.com/spurious-correlations and browse through some of the plots there. Become fully and deeply convinced that a high correlation between two variables does not tell you anything about one variable causing the other.
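One common way a high correlation can appear without any causal link is that both variables simply trend in the same direction over time. The sketch below illustrates this with made-up numbers. The series `series_a` and `series_b` and their noise values are hypothetical, and the sketch is not a claim about why the arcade and lawyer data are correlated. It computes the raw correlation, then removes a linear trend in `year` from each series and computes the correlation of what is left.

```r
# Hypothetical yearly series for illustration; not real data
year <- 2000:2009
years_elapsed <- year - 2000

# Each series is its own upward trend plus unrelated, hand-picked noise
noise_a <- c(1, -1, 2, -2, 1, -1, 2, -2, 1, -1)
noise_b <- c(-1, 1, -1, 1, 2, -2, 1, -1, 0, 1)
series_a <- 100 + 5 * years_elapsed + noise_a
series_b <- 40 + 3 * years_elapsed + noise_b

# Raw correlation: dominated by the shared upward trend
r_raw <- cor(series_a, series_b)

# Remove the linear trend in year from each series, then correlate the residuals
resid_a <- resid(lm(series_a ~ year))
resid_b <- resid(lm(series_b ~ year))
r_detrended <- cor(resid_a, resid_b)

print(r_raw)
print(r_detrended)
```

In this toy example the raw correlation is very high, while the correlation of the detrended residuals is close to zero: the two series move together only because both increase with time.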