Below are summary statistics on fuel efficiency (in miles/gallon) from random samples of cars with manual and automatic transmissions manufactured in 2012. Do these data provide strong evidence of a difference in average fuel efficiency between cars with manual and automatic transmissions? The code below measures fuel efficiency by combined mileage (the comb_mpg column).
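# load packages and data ---------------------------------------------
library(readr)    # read_csv()
library(dplyr)    # filter(), mutate()
library(ggplot2)  # ggplot(), geom_density(), geom_histogram()
fuel_eff <- read_csv("https://mhc-stat140-2017.github.io/data/misc/fuel_eff.csv")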
## Parsed with column specification:
## cols(
## .default = col_character(),
## model_yr = col_integer(),
## model_type_index = col_integer(),
## engine_displacement = col_double(),
## no_cylinders = col_integer(),
## city_mpg = col_integer(),
## hwy_mpg = col_integer(),
## comb_mpg = col_integer(),
## no_gears = col_integer()
## )
## See spec(...) for full column specifications.
# select a small sample ---------------------------------------------
man_rows <- which(fuel_eff$transmission == "M")
aut_rows <- which(fuel_eff$transmission == "A")
set.seed(3583)
man_rows_samp <- sample(man_rows, 26)
aut_rows_samp <- sample(aut_rows, 26)
fuel_eff_samp <- fuel_eff[c(man_rows_samp,aut_rows_samp), ]
fuel_eff_samp$transmission <- factor(fuel_eff_samp$transmission)
levels(fuel_eff_samp$transmission) <- c("automatic", "manual")
# density of combined mpg by transmission type
ggplot() +
  geom_density(mapping = aes(x = comb_mpg, color = transmission), data = fuel_eff_samp)
fuel_eff_man <- filter(fuel_eff_samp, transmission == "manual")
fuel_eff_aut <- filter(fuel_eff_samp, transmission == "automatic")
mean(fuel_eff_man$comb_mpg)
## [1] 22.85
sd(fuel_eff_man$comb_mpg)
## [1] 4.73
mean(fuel_eff_aut$comb_mpg)
## [1] 18.65
sd(fuel_eff_aut$comb_mpg)
## [1] 4.137
Define \(\mu_1\) to be the mean fuel efficiency among the population of all automatic transmission cars manufactured in 2012 and \(\mu_2\) to be the mean fuel efficiency among the population of all manual transmission cars manufactured in 2012.
\(H_0\): \(\mu_1 = \mu_2\), or \(\mu_1 - \mu_2 = 0\)
\(H_A\): \(\mu_1 \neq \mu_2\), or \(\mu_1 - \mu_2 \neq 0\)
Are these data paired or unpaired?
SOLUTION:
There is no indication that these data are paired. To be paired, there would have to be a situation where for each car model, we measured fuel efficiency for a version of that car model with an automatic transmission and a version with a manual transmission. I will treat them as unpaired data.
SOLUTION:
Since we’re measuring fuel efficiency for different randomly selected cars, it’s reasonable to assume that their fuel efficiencies are independent within each group and across the different groups.
SOLUTION:
Within each group, the density plot above shows that the distribution of fuel efficiency measurements is nearly normal.
SOLUTION:
How big the sample size has to be depends on how far from normal the distribution of values within each group is. Since the distributions are quite close to normal within each group, a sample size of 26 in each group is certainly large enough.
You can use the t.test() function.
# call to t.test() here
t.test(fuel_eff_aut$comb_mpg, fuel_eff_man$comb_mpg)
##
## Welch Two Sample t-test
##
## data: fuel_eff_aut$comb_mpg and fuel_eff_man$comb_mpg
## t = -3.4, df = 49, p-value = 0.001
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -6.669 -1.716
## sample estimates:
## mean of x mean of y
## 18.65 22.85
SOLUTION:
Since the p-value is 0.001339, which is less than commonly used significance levels such as \(\alpha = 0.05\) and \(\alpha = 0.01\), we can reject the null hypothesis. The data provide enough evidence to conclude that the mean fuel efficiency for automatic transmission cars is different from the mean fuel efficiency for manual transmission cars, at the \(\alpha = 0.01\) significance level.
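The test statistic, degrees of freedom, and p-value reported by t.test() can be approximately reproduced from the rounded summary statistics above. The following is a minimal sketch rather than part of the original lab; the variable names are chosen here for illustration, and because the inputs are rounded the results will differ slightly from the t.test() output.

```r
# Rounded summary statistics copied from the output above
mean_aut <- 18.65; sd_aut <- 4.137; n_aut <- 26
mean_man <- 22.85; sd_man <- 4.73;  n_man <- 26

# Each group's contribution to the squared standard error
v_aut <- sd_aut^2 / n_aut
v_man <- sd_man^2 / n_man
se_diff <- sqrt(v_aut + v_man)

# Welch t statistic for H0: mu_1 - mu_2 = 0 (automatic minus manual)
t_stat <- (mean_aut - mean_man) / se_diff

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (v_aut + v_man)^2 /
  (v_aut^2 / (n_aut - 1) + v_man^2 / (n_man - 1))

# Two-sided p-value
p_value <- 2 * pt(-abs(t_stat), df = df_welch)

c(t = t_stat, df = df_welch, p_value = p_value)
```

The degrees of freedom are not a whole number here, which is why the test is labelled a Welch test in the output rather than a pooled two-sample t-test.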
SOLUTION:
We are 95% confident that the difference between the mean fuel efficiency for the population of all automatic transmission cars and the mean fuel efficiency for the population of all manual transmission cars (automatic minus manual) is between -6.67 mpg and -1.72 mpg. In other words, automatic transmission cars average between 1.72 and 6.67 mpg less than manual transmission cars. If we were to take many samples from these populations, and use each sample to compute a 95% confidence interval for the difference in population means, about 95% of those confidence intervals would contain the true difference in the population mean fuel efficiency for automatic transmission cars and manual transmission cars.
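The "many samples" interpretation can be illustrated with a small simulation. This is only a sketch under stated assumptions: the populations are taken to be normal, and the population means and standard deviations below are hypothetical values chosen to be roughly in line with the sample statistics. They are not known facts about 2012 cars.

```r
set.seed(140)  # arbitrary seed so the sketch is reproducible

# Hypothetical population parameters (assumptions for illustration only)
mu_aut <- 18.5; sigma_aut <- 4
mu_man <- 23;   sigma_man <- 4.5
true_diff <- mu_aut - mu_man

n <- 26         # sample size per group, as in the lab
n_sims <- 2000  # number of simulated pairs of samples

# For each simulated pair of samples, compute the 95% Welch interval
# and record whether it contains the true difference
covered <- replicate(n_sims, {
  samp_aut <- rnorm(n, mean = mu_aut, sd = sigma_aut)
  samp_man <- rnorm(n, mean = mu_man, sd = sigma_man)
  ci <- t.test(samp_aut, samp_man)$conf.int
  ci[1] <= true_diff && true_diff <= ci[2]
})

# Proportion of intervals that captured the true difference
coverage <- mean(covered)
coverage
```

Under these assumptions the proportion should land close to 0.95, though any single run will vary a little from that value.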
The British Medical Journal published an article titled “Is Friday the 13th Bad for Your Health?” The article examined the number of people admitted to emergency rooms for vehicular accidents on 12 Friday evenings (6 each on the 6th and 13th). Here are the data:
friday13 <- read_csv("https://mhc-stat140-2017.github.io/data/sdm4/Friday_the_13th_Part_2.csv")
## Parsed with column specification:
## cols(
## `Year and Month` = col_character(),
## `6th` = col_integer(),
## `13th` = col_integer()
## )
names(friday13) <- c("year_month", "accidents_6th", "accidents_13th")
friday13 <- mutate(friday13, difference = accidents_13th - accidents_6th)
head(friday13)
## # A tibble: 6 x 4
## year_month accidents_6th accidents_13th difference
## <chr> <int> <int> <int>
## 1 Oct-89 9 13 4
## 2 Jul-90 6 12 6
## 3 Sep-91 11 14 3
## 4 Dec-91 11 10 -1
## 5 Mar-92 3 4 1
## 6 Nov-92 5 12 7
Is there a difference between rates of accidents on Friday the 13th and Friday the 6th?
SOLUTION:
Define \(\mu_1\) to be the mean number of accidents that occur on Friday the 13th and \(\mu_2\) to be the mean number of accidents that occur on Friday the 6th.
\(H_0\): \(\mu_1 = \mu_2\), or \(\mu_1 - \mu_2 = 0\)
\(H_A\): \(\mu_1 \neq \mu_2\), or \(\mu_1 - \mu_2 \neq 0\)
Are these data paired or unpaired?
SOLUTION:
These are paired data since for each month, we have observations of the number of accidents on two consecutive Fridays.
SOLUTION:
Each pair of observations (for the number of accidents on the 6th and on the 13th) occurs in a different month and year. There is no reason to think there would be a connection between the different months and years in this data set.
SOLUTION:
ggplot() +
geom_density(mapping = aes(x = difference), data = friday13)
ggplot() +
geom_histogram(mapping = aes(x = difference), data = friday13)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
It’s difficult to assess the form of the distribution with a sample size of only 6, but it does seem that the distribution may be skewed slightly to the left. However, it does seem like the mean will be a reasonably good summary of the center of this distribution.
SOLUTION:
A sample size of only 6 is pretty small. This sample size really might not be big enough to support reliable inference with these data.
You can use the t.test() function.
SOLUTION:
# paired t-test, reporting a 99% confidence interval
t.test(friday13$accidents_13th, friday13$accidents_6th, paired = TRUE, conf.level = 0.99)
##
## Paired t-test
##
## data: friday13$accidents_13th and friday13$accidents_6th
## t = 2.7, df = 5, p-value = 0.04
## alternative hypothesis: true difference in means is not equal to 0
## 99 percent confidence interval:
## -1.623 8.290
## sample estimates:
## mean of the differences
## 3.333
SOLUTION:
The p-value for this test is 0.042. This p-value is less than the significance cut-off of \(\alpha = 0.05\), so we can reject the null hypothesis. These data provide enough evidence to conclude that there is a difference in the mean number of accidents on Friday the 13th and Friday the 6th, at the \(\alpha = 0.05\) significance level. However, we should remain cautious about the strength of this inference given the small sample size noted above.
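A paired t-test is a one-sample t-test applied to the within-pair differences. The sketch below re-enters the six rows shown in the table above, so it runs without downloading the file, and computes the statistic by hand before checking it against t.test(). The variable names are chosen here for illustration.

```r
# Accident counts copied from the table above
accidents_6th  <- c(9, 6, 11, 11, 3, 5)
accidents_13th <- c(13, 12, 14, 10, 4, 12)
difference <- accidents_13th - accidents_6th

# One-sample t statistic for H0: mean difference = 0
n <- length(difference)
se <- sd(difference) / sqrt(n)
t_stat <- mean(difference) / se
p_value <- 2 * pt(-abs(t_stat), df = n - 1)

# The built-in one-sample test on the differences should agree
one_sample <- t.test(difference)

c(t_by_hand = t_stat,
  t_builtin = unname(one_sample$statistic),
  p_by_hand = p_value,
  p_builtin = one_sample$p.value)
```

Because only the six differences enter the calculation, the degrees of freedom are 6 - 1 = 5, as in the output above.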
SOLUTION:
We are 99% confident that the difference between the mean number of accidents that occur on Fridays the 13th and the mean number of accidents that occur on Fridays the 6th is between -1.62 and 8.29. If we took many different samples and computed a similar confidence interval based on each sample, about 99% of those confidence intervals would contain the true difference in the population mean number of accidents that occur on the 13th and on the 6th.
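The 99% interval contains 0 even though the test rejected the null hypothesis at \(\alpha = 0.05\). This is not a contradiction: a 99% interval corresponds to a test at \(\alpha = 0.01\), and the p-value of 0.042 is larger than 0.01. The sketch below computes both the 99% and the 95% interval from the six differences in the table above; ci_for() is a hypothetical helper written for this illustration.

```r
# Differences (13th minus 6th) copied from the table above
difference <- c(4, 6, 3, -1, 1, 7)
n <- length(difference)
se <- sd(difference) / sqrt(n)

# Hypothetical helper: t-based interval for the mean difference
ci_for <- function(conf_level) {
  t_star <- qt(1 - (1 - conf_level) / 2, df = n - 1)
  mean(difference) + c(-1, 1) * t_star * se
}

ci_99 <- ci_for(0.99)  # matches the interval reported by t.test() above
ci_95 <- ci_for(0.95)  # narrower interval, for comparison

rbind(ci_99, ci_95)
```

The 95% interval lies entirely above 0, which agrees with rejecting the null hypothesis at the 0.05 level, while the wider 99% interval does not rule out a difference of 0.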