Scatterplots and correlation

A First Example

Rail Trails

Data were collected on the volume of users on the Northampton Rail Trail in Florence, Massachusetts. Variables in the dataset include the number of crossings on a particular day, measured by a sensor near the intersection with Chestnut Street (volume); the average of the minimum and maximum temperature in degrees Fahrenheit for that day (avgtemp); and a dichotomous indicator of whether the day was a weekday or a weekend/holiday (weekday).

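# Assumption: RailTrail comes from the mosaicData package.
library(mosaicData)
library(dplyr)    # provides mutate()
library(ggplot2)  # provides ggplot() and geom_point()
# Add a readable label for the weekday indicator (1 = weekday).
RailTrail <- mutate(RailTrail, daytype = ifelse(weekday == 1, "Weekday", "Wkend/Holiday"))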
head(RailTrail)
##   hightemp lowtemp avgtemp spring summer fall cloudcover precip volume
## 1       83      50    66.5      0      1    0        7.6   0.00    501
## 2       73      49    61.0      0      1    0        6.3   0.29    419
## 3       74      52    63.0      1      0    0        7.5   0.32    397
## 4       95      61    78.0      0      1    0        2.6   0.00    385
## 5       44      52    48.0      1      0    0       10.0   0.14    200
## 6       69      54    61.5      1      0    0        6.6   0.02    375
##   weekday       daytype
## 1       1       Weekday
## 2       1       Weekday
## 3       1       Weekday
## 4       0 Wkend/Holiday
## 5       1       Weekday
## 6       1       Weekday
str(RailTrail)
## 'data.frame':    90 obs. of  11 variables:
##  $ hightemp  : int  83 73 74 95 44 69 66 66 80 79 ...
##  $ lowtemp   : int  50 49 52 61 52 54 39 38 55 45 ...
##  $ avgtemp   : num  66.5 61 63 78 48 61.5 52.5 52 67.5 62 ...
##  $ spring    : int  0 0 1 0 1 1 1 1 0 0 ...
##  $ summer    : int  1 1 0 1 0 0 0 0 1 1 ...
##  $ fall      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ cloudcover: num  7.6 6.3 7.5 2.6 10 ...
##  $ precip    : num  0 0.29 0.32 0 0.14 ...
##  $ volume    : int  501 419 397 385 200 375 417 629 533 547 ...
##  $ weekday   : Factor w/ 2 levels "0","1": 2 2 2 1 2 2 2 1 1 2 ...
##  $ daytype   : chr  "Weekday" "Weekday" "Weekday" "Wkend/Holiday" ...

Make a scatter plot using the RailTrail data with the volume variable on the vertical axis and the avgtemp variable on the horizontal axis.

ggplot() +
  geom_point(mapping = aes(x = avgtemp, y = volume), data = RailTrail)

Describe the relationship between the number of crossings and avgtemp (average of min and max temperatures). Be sure to describe the direction, form, strength, and any unusual features.

SOLUTION:

Direction: There is a positive association between temperature and the number of crossings on the bike path up until about 65 degrees, and after that there is a negative association between temperature and the number of crossings.

Form: The scatter plot shows a generally curved shape.

Strength: There is a moderately strong association between temperature and volume.

Unusual features: There are some days with unusually high or low numbers of crossings that don’t match the rest of the data well. Two days with temperatures around 60 degrees stand out as having low bike path volume, and one day around 65 degrees has unusually high bike path volume.

Report and interpret the correlation between average temp and number of crossings. Use the cor function.

cor(RailTrail$avgtemp, RailTrail$volume)
## [1] 0.427

SOLUTION:

The correlation coefficient is 0.427, indicating a moderately strong positive association between temperature and volume. However, the scatter plot above shows that this relationship is not linear: it is curved, rising and then falling. The correlation coefficient should really only be used to describe linear relationships.
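
To see why the correlation coefficient can understate a curved relationship, here is a minimal sketch using simulated data, not the RailTrail measurements. The names sim_temp and sim_volume and the peak at 65 degrees are assumptions chosen only to mimic the rise-then-fall shape seen in the scatter plot.

```r
# Simulated (hypothetical) data: volume rises with temperature up to 65 degrees, then falls.
sim_temp <- seq(40, 90, by = 1)
sim_volume <- 500 - 0.5 * (sim_temp - 65)^2

# Overall correlation: the rising and falling halves cancel each other.
r_overall <- cor(sim_temp, sim_volume)

# Correlation within each half, where the trend runs in a single direction.
cool <- sim_temp <= 65
r_cool <- cor(sim_temp[cool], sim_volume[cool])
r_warm <- cor(sim_temp[!cool], sim_volume[!cool])

round(c(overall = r_overall, cool = r_cool, warm = r_warm), 3)
```

In this sketch volume is completely determined by temperature, yet the overall correlation is essentially zero because the curve is symmetric about 65 degrees. Within each half the correlation is strong, positive on the cool side and negative on the warm side. The real trail data are not this extreme, but the same cancellation is one reason a single correlation can hide part of a curved association.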

Thinking about correlation and causation (THEY ARE NOT THE SAME!!!)

Arcade Revenue and Lawyers in Wyoming

Here are some data about the total revenue of arcades in the US (in millions of dollars) and the number of lawyers in Wyoming in each year from 2000 to 2009.

library(readr)  # provides read_csv()
Arcades_and_Lawyers <-
  read_csv("https://mhc-stat140-2017.github.io/labs/20170922_correlation/data/arcade_revenue_lawyers_Wyoming.csv")
## Parsed with column specification:
## cols(
##   Year = col_integer(),
##   `Total revenue generated by arcades (US, millions of dollars)` = col_integer(),
##   `Lawyers in Wyoming` = col_integer()
## )
names(Arcades_and_Lawyers) <- c("year", "arcade_revenue", "lawyers_in_Wyoming")

Make a scatter plot with arcade_revenue on the horizontal axis and lawyers_in_Wyoming on the vertical axis.

SOLUTION:

ggplot() +
  geom_point(mapping = aes(x = arcade_revenue, y = lawyers_in_Wyoming), data = Arcades_and_Lawyers)

Describe the direction, form, and strength of the relationship between arcade revenue and the number of lawyers in Wyoming.

SOLUTION:

There is a very strong, linear, positive association between arcade revenue and the number of lawyers in Wyoming.

Calculate the correlation between arcade_revenue and lawyers_in_Wyoming.

SOLUTION:

cor(Arcades_and_Lawyers$arcade_revenue, Arcades_and_Lawyers$lawyers_in_Wyoming)
## [1] 0.989

The correlation coefficient is 0.989, indicating a very strong positive association between arcade revenue and the number of lawyers in Wyoming.
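
For reference, the number that cor returns by default is Pearson's correlation: the sum of the products of the two variables' z-scores, divided by n - 1. The sketch below checks this by hand on a small set of made-up values; the vectors x and y are hypothetical and are not taken from either dataset in this lab.

```r
# Hypothetical values for illustration only.
x <- c(45, 52, 60, 66, 71, 78, 85)
y <- c(200, 310, 390, 450, 470, 430, 380)
n <- length(x)

# Standardize each variable: subtract its mean, divide by its standard deviation.
z_x <- (x - mean(x)) / sd(x)
z_y <- (y - mean(y)) / sd(y)

# Pearson's r is the sum of the z-score products divided by n - 1.
r_manual <- sum(z_x * z_y) / (n - 1)
r_builtin <- cor(x, y)

# Changing units (here Fahrenheit to Celsius) leaves the correlation unchanged.
r_celsius <- cor((x - 32) * 5 / 9, y)

c(manual = r_manual, builtin = r_builtin, celsius = r_celsius)
```

Because the calculation works on z-scores, the correlation has no units and does not depend on the scale in which either variable is measured.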

Do the correlation and relationship that you described above mean that there is a causal relationship between arcade revenue and the number of lawyers in Wyoming?

SOLUTION: No! There is no reason to think that there is a causal relationship between these variables.
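
One plausible explanation for a correlation like this, though not something the data themselves establish, is that both variables simply changed steadily over the same period. The sketch below simulates two series that have nothing to do with each other except that both trend upward over time. All names (time_index, series_a, series_b) and all numbers are assumptions made up for illustration.

```r
set.seed(140)

# Two hypothetical series that each drift upward over time, with independent noise.
time_index <- 1:200
series_a <- 2.0 * time_index + rnorm(200, sd = 10)
series_b <- 0.5 * time_index + rnorm(200, sd = 3)

# The shared upward trend produces a high correlation between the raw series.
r_raw <- cor(series_a, series_b)

# Remove the linear time trend from each series and correlate what is left.
resid_a <- resid(lm(series_a ~ time_index))
resid_b <- resid(lm(series_b ~ time_index))
r_detrended <- cor(resid_a, resid_b)

round(c(raw = r_raw, detrended = r_detrended), 3)
```

The raw correlation is high even though neither series influences the other. Once the shared time trend is removed, the leftover noise is independent by construction, so the correlation should be close to zero.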

Spurious Correlations

Go to http://www.tylervigen.com/spurious-correlations and browse through some of the plots there. Become fully and deeply convinced that if two variables have a high correlation, that does not tell you anything about one variable causing the other.