There are three main goals/learning objectives for this lab:
Get experience with checking conditions and interpreting confidence intervals for proportions.
Get some experience using R to calculate confidence intervals for proportions.
Understand how the width of the confidence interval depends on various factors:
the confidence level
the proportion that is being estimated
the sample size
This lab will not be graded for credit. However, I will ask you to turn it in: there are several common errors in interpreting confidence intervals, and I want a chance to read your answers and give you feedback before you are responsible for this material on graded assignments. Please email the completed lab (Rmd file only) to me, cc-ing anyone you worked with, by 5pm on Monday, Nov 6. You will have class time on Wednesday and Friday to work on this, but we will not spend time on this lab in class on Monday. To download the Rmd file from RStudio, check the box next to the file name in the Files panel (lower right of RStudio), click “More” at the top right of that panel, choose “Export…”, and save the file to your computer. Then you can attach it to an email.
In August of 2012, news outlets ranging from the Washington Post to the Huffington Post ran a story about the rise of atheism in America. The source for the story was a poll that asked people, “Irrespective of whether you attend a place of worship or not, would you say you are a religious person, not a religious person or a convinced atheist?” The full press release for the poll, conducted by WIN-Gallup International, is found at the following address:
SOLUTION:
These are sample statistics. They were calculated based on the responses from a relatively small number of people sampled from each country (generally less than 2000 per country). To be population parameters, we would have had to find out whether or not every person in each country identified as an atheist, to compute the proportion of the entire population who identify as atheists.
SOLUTION:
To generalize the findings from the sample to the global population, we would have to know that the sampling method resulted in a sample that was representative of the population in terms of their religious views. Without knowing the details of the sampling methods, we just have to trust that WIN-Gallup International took their sample carefully.
Turn your attention to Table 6 of the press release (pages 15 and 16), which reports the sample size and response percentages for all 57 countries. While this is a useful format to summarize the data, we will base our analysis on the original data set of individual responses to the survey. Load this data set into R with the following commands.
atheism <- read_csv("https://mhc-stat140-2017.github.io/data/openintro/atheism/atheism.csv")
## Parsed with column specification:
## cols(
## nationality = col_character(),
## response = col_character(),
## year = col_integer()
## )
head(atheism)
## # A tibble: 6 x 3
## nationality response year
## <chr> <chr> <int>
## 1 Afghanistan non-atheist 2012
## 2 Afghanistan non-atheist 2012
## 3 Afghanistan non-atheist 2012
## 4 Afghanistan non-atheist 2012
## 5 Afghanistan non-atheist 2012
## 6 Afghanistan non-atheist 2012
str(atheism)
## Classes 'tbl_df', 'tbl' and 'data.frame': 88032 obs. of 3 variables:
## $ nationality: chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ response : chr "non-atheist" "non-atheist" "non-atheist" "non-atheist" ...
## $ year : int 2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
## - attr(*, "spec")=List of 2
## ..$ cols :List of 3
## .. ..$ nationality: list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ response : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ year : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## ..$ default: list()
## .. ..- attr(*, "class")= chr "collector_guess" "collector"
## ..- attr(*, "class")= chr "col_spec"
What does each row of Table 6 correspond to? What does each row of atheism correspond to?
SOLUTION:
Each row of Table 6 in the press release corresponds to a country, summarizing responses from all participants in that country.
Each row of the atheism data frame corresponds to an individual survey participant.
To investigate the link between these two ways of organizing this data, take a look at the estimated proportion of atheists in the United States. Towards the bottom of Table 6, we see that this is 5%. We can check this number using the atheism data by running the commands below. Make sure you understand what each of the commands below does after running it.
us_2012 <- filter(atheism, nationality == "United States", year == "2012")
nrow(us_2012)
## [1] 1002
head(us_2012)
## # A tibble: 6 x 3
## nationality response year
## <chr> <chr> <int>
## 1 United States non-atheist 2012
## 2 United States non-atheist 2012
## 3 United States non-atheist 2012
## 4 United States non-atheist 2012
## 5 United States non-atheist 2012
## 6 United States non-atheist 2012
table(us_2012$response)
##
## atheist non-atheist
## 50 952
table(us_2012$response) / nrow(us_2012)
##
## atheist non-atheist
## 0.0499002 0.9500998
SOLUTION:
ca_2012 <- filter(atheism, nationality == "Canada", year == "2012")
nrow(ca_2012)
## [1] 1002
head(ca_2012)
## # A tibble: 6 x 3
## nationality response year
## <chr> <chr> <int>
## 1 Canada non-atheist 2012
## 2 Canada non-atheist 2012
## 3 Canada non-atheist 2012
## 4 Canada non-atheist 2012
## 5 Canada non-atheist 2012
## 6 Canada non-atheist 2012
table(ca_2012$response)
##
## atheist non-atheist
## 90 912
table(ca_2012$response) / nrow(ca_2012)
##
## atheist non-atheist
## 0.08982036 0.91017964
As was hinted at in Exercise 1, Table 6 provides statistics, that is, calculations made from the sample of 51,927 people. What we’d like, though, is insight into the population parameters. The question “What proportion of people in your sample reported being atheists?” is answered with a statistic, while the question “What proportion of people on earth would report being atheists?” is answered with an estimate of the parameter.
Here is how we’d compute a 95% confidence interval for the proportion of atheists in the United States in 2012.
confint(binom.test(us_2012$response, conf.level = 0.95, ci.method = "wald"))
## probability of success lower upper level
## 1 0.0499002 0.03641833 0.06338206 0.95
SOLUTION:
We are 95% confident that the proportion of the entire US population who identify as atheists is between 3.64% and 6.34%. If we took many different random samples of size 1002 from the US population, and used the data in each of those samples to calculate a different 95% confidence interval, about 95% of those intervals would contain the proportion of the US population who identify as atheists.
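The repeated-sampling interpretation can be illustrated with a short simulation. This is only a minimal sketch in base R: it assumes a hypothetical true proportion of 0.05 (the real population value is unknown), and `wald_ci` is a helper name introduced here for illustration, not part of the lab's code.

```r
# Hypothetical helper: Wald confidence interval for a proportion,
# given x successes out of n trials.
wald_ci <- function(x, n, conf_level = 0.95) {
  p_hat <- x / n
  z <- qnorm(1 - (1 - conf_level) / 2)   # about 1.96 for a 95% interval
  se <- sqrt(p_hat * (1 - p_hat) / n)
  c(lower = p_hat - z * se, upper = p_hat + z * se)
}

set.seed(140)      # arbitrary seed, for reproducibility only
true_p <- 0.05     # ASSUMED population proportion (unknown in practice)
n <- 1002          # same sample size as the US sample
n_sims <- 5000     # number of simulated samples

# Number of atheists in each simulated random sample of size n
x <- rbinom(n_sims, size = n, prob = true_p)

# For each simulated sample, does its interval contain true_p?
covered <- sapply(x, function(xi) {
  ci <- wald_ci(xi, n)
  ci[["lower"]] <= true_p && true_p <= ci[["upper"]]
})

coverage <- mean(covered)
coverage   # should be close to 0.95
```

Each simulated interval either contains the assumed proportion or it does not; the 95% describes how often the procedure succeeds across many samples, not the chance that any one interval is correct.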
SOLUTION:
We need to check the following conditions:
For each person in our sample there are two outcomes (at least, two outcomes that are relevant to this analysis): each person either identifies as an atheist or does not identify as an atheist.
Each person we pick has the same probability of being an atheist (which is the proportion of all US citizens who identify as atheists)
The people in our sample are independent. There are two conditions to check here:
the sample was taken randomly
the sample size (n = 1002) is less than 10% of the size of the US population
SOLUTION:
The margin of error is the quantity we add to and subtract from \(\hat{p}\) to get the confidence interval (for a 95% confidence interval, this is approximately \(2 \cdot SE(\hat{p})\)). That means that we can calculate the margin of error from the R output above in any of three ways:
The margin of error is the total width of the confidence interval divided by two. The total width of the confidence interval is 0.0634 - 0.0364 = 0.027, so the margin of error is 0.0135, or 1.35%.
The sample proportion minus the lower confidence interval limit: 0.0499 - 0.0364 = 0.0135
The upper confidence interval limit minus the sample proportion: 0.0634 - 0.0499 = 0.0135
This margin of error computed using R is smaller than the margin of error suggested by WIN-Gallup International in the press release. It could be that their suggested margin of error takes into account a potentially more complicated sampling design than the simple random sample our method is based on.
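As a check on the arithmetic, the margin of error can also be recomputed in base R from the US counts shown earlier (50 atheists out of 1002 respondents). This is a sketch, not the lab's required method; the variable names are chosen here for illustration.

```r
# Counts taken from the table of US responses above
x <- 50      # respondents identifying as atheist
n <- 1002    # total respondents

p_hat <- x / n
se <- sqrt(p_hat * (1 - p_hat) / n)

# Margin of error using the exact normal multiplier (about 1.96) ...
me_exact <- qnorm(0.975) * se
# ... and using the rule-of-thumb multiplier of 2
me_approx <- 2 * se

# Interval limits as reported in the R output above
lower <- 0.03641833
upper <- 0.06338206

round(c(
  half_width    = (upper - lower) / 2,
  from_lower    = p_hat - lower,
  from_upper    = upper - p_hat,
  formula_exact = me_exact,
  formula_2se   = me_approx
), 4)
```

The first four values agree at about 0.0135. The rule-of-thumb version is slightly larger because it uses a multiplier of 2 rather than roughly 1.96.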
SOLUTION:
confint(binom.test(us_2012$response, conf.level = 0.9, ci.method = "wald"))
## probability of success lower upper level
## 1 0.0499002 0.03858586 0.06121454 0.9
This 90% confidence interval is narrower than the 95% confidence interval we calculated above.
This makes sense, because we’re only requiring that 90% of the confidence intervals we’d get from different samples include the true proportion of the population who identify as atheists, rather than 95% of them. Each interval can be narrower, because fewer of them need to include the population proportion.
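To see how the confidence level drives the width, here is a small base R sketch that recomputes the Wald interval width for the US sample at several confidence levels. The 99% level is added here purely for comparison; it is not part of the exercise.

```r
# US sample: 50 atheists out of 1002 respondents
x <- 50
n <- 1002
p_hat <- x / n
se <- sqrt(p_hat * (1 - p_hat) / n)

conf_levels <- c(0.90, 0.95, 0.99)

# Normal multiplier for each confidence level
z <- qnorm(1 - (1 - conf_levels) / 2)

# Total interval width is twice the margin of error
widths <- 2 * z * se

data.frame(conf_level = conf_levels,
           multiplier = round(z, 3),
           width      = round(widths, 4))
```

Only the multiplier changes between rows; the standard error is the same, so a higher confidence level always gives a wider interval for the same data.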
SOLUTION:
austria_2012 <- filter(atheism, nationality == "Austria", year == "2012")
nrow(austria_2012)
## [1] 1002
cr_2012 <- filter(atheism, nationality == "Czech Republic", year == "2012")
nrow(cr_2012)
## [1] 1000
kenya_2012 <- filter(atheism, nationality == "Kenya", year == "2012")
nrow(kenya_2012)
## [1] 1000
## Add confidence interval calculations for the proportion of the population
## who identify as atheists in each of Austria, the Czech Republic, and Kenya.
confint(binom.test(austria_2012$response, conf.level = 0.95, ci.method = "wald"))
## probability of success lower upper level
## 1 0.0998004 0.0812416 0.1183592 0.95
confint(binom.test(cr_2012$response, conf.level = 0.95, ci.method = "wald"))
## probability of success lower upper level
## 1 0.3 0.2715974 0.3284026 0.95
confint(binom.test(kenya_2012$response, conf.level = 0.95, ci.method = "wald"))
## probability of success lower upper level
## 1 0.02 0.01132287 0.02867713 0.95
SOLUTION:
Here is a table of the sample size (\(n\)), sample proportion \(\hat{p}\), and confidence interval widths for each of the US, Austria, the Czech Republic, and Kenya. I have ordered the countries from lowest sample proportion to largest sample proportion.
Country | Sample Size (n) | Sample Proportion (\(\hat{p}\)) | Confidence Interval Width |
---|---|---|---|
Kenya | 1000 | 0.02 | 0.017 |
United States | 1002 | 0.05 | 0.027 |
Austria | 1002 | 0.1 | 0.037 |
Czech Republic | 1000 | 0.3 | 0.057 |
As we can see, the sample size was similar for all four countries, but the sample proportions were different and the confidence interval widths were different. Recall that when we calculate the confidence interval, the lower bound of the interval is calculated as \(\hat{p} - 2SE(\hat{p})\) and the upper bound is calculated as \(\hat{p} + 2SE(\hat{p})\), where \[SE(\hat{p}) = \sqrt{\frac{\widehat{p}(1 - \widehat{p})}{n}}.\]
This means that the standard error of \(\hat{p}\), and therefore the confidence interval width, changes with the sample proportion. In these examples, the confidence interval width gets wider as the sample proportion increases. In general, for a given sample size, the widest intervals will be when the sample proportion is 0.5.
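The dependence on the sample proportion can be tabulated directly. The sketch below assumes a fixed sample size of 1000 (close to the samples above) and evaluates the 95% Wald interval width over a grid of hypothetical sample proportions.

```r
n <- 1000   # ASSUMED fixed sample size, similar to the countries above

# Hypothetical sample proportions to compare
p_hat <- c(0.02, 0.05, 0.10, 0.30, 0.50, 0.70, 0.90)

# Total width of the 95% Wald interval at each proportion
width <- 2 * qnorm(0.975) * sqrt(p_hat * (1 - p_hat) / n)

data.frame(p_hat, width = round(width, 3))

# The proportion in the grid that gives the widest interval
p_hat[which.max(width)]
```

The widths rise toward 0.5 and fall again afterwards, and they are symmetric around 0.5 because \(\hat{p}(1 - \hat{p})\) is unchanged when \(\hat{p}\) is swapped with \(1 - \hat{p}\).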
SOLUTION:
I would expect the interval width to be about the same for a country with 70% atheists as for a country with about 30% atheists, if the sample sizes are similar. This is because \(\hat{p}(1 - \hat{p}) = 0.21\) whether \(\hat{p}\) = 0.3 or \(\hat{p}\) = 0.7. That means the interval width should be about 0.057, similar to the interval width for the Czech Republic above.
SOLUTION:
saudi_arabia_12 <- filter(atheism, nationality == "Saudi Arabia", year == "2012")
nrow(saudi_arabia_12)
## [1] 500
south_africa_12 <- filter(atheism, nationality == "South Africa", year == "2012")
nrow(south_africa_12)
## [1] 202
## Add confidence interval calculations for the proportion of the population
## who identify as atheists in each of Saudi Arabia and South Africa.
confint(binom.test(saudi_arabia_12$response, conf.level = 0.95, ci.method = "wald"))
## probability of success lower upper level
## 1 0.05 0.03089663 0.06910337 0.95
confint(binom.test(south_africa_12$response, conf.level = 0.95, ci.method = "wald"))
## probability of success lower upper level
## 1 0.03960396 0.01270925 0.06649867 0.95
SOLUTION:
Here is a table of the sample size (\(n\)), sample proportion \(\hat{p}\), and confidence interval widths for each of South Africa, Saudi Arabia, and the US. I have ordered the countries from smallest sample size to largest sample size.
Country | Sample Size (n) | Sample Proportion (\(\hat{p}\)) | Confidence Interval Width |
---|---|---|---|
South Africa | 202 | 0.04 | 0.054 |
Saudi Arabia | 500 | 0.05 | 0.038 |
United States | 1002 | 0.05 | 0.027 |
Again, the confidence interval width depends on the standard error, \[SE(\hat{p}) = \sqrt{\frac{\widehat{p}(1 - \widehat{p})}{n}}.\]
As the sample size \(n\) increases, the standard error decreases, and as a result the width of the confidence interval decreases.
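The effect of sample size can be isolated by holding the sample proportion fixed. This sketch assumes \(\hat{p} = 0.05\) throughout and uses the three sample sizes from the table plus a hypothetical fourth, 4008, which is four times the US sample size.

```r
p_hat <- 0.05                    # ASSUMED common sample proportion
n <- c(202, 500, 1002, 4008)     # 4008 is hypothetical (4 x 1002)

# Total width of the 95% Wald interval at each sample size
width <- 2 * qnorm(0.975) * sqrt(p_hat * (1 - p_hat) / n)

data.frame(n, width = round(width, 4))

# Ratio of the width at n = 4008 to the width at n = 1002
width[4] / width[3]
```

Because \(n\) sits under a square root, quadrupling the sample size only halves the interval width, which is why the last ratio is 0.5.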
SOLUTION:
If the total width of the confidence interval is 0.04, then the margin of error is 0.02.
The margin of error for a 95% confidence interval is \(2 SE(\hat{p})\), so we get an equation like
\[0.02 = 2 \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}\]
Plugging in our best guess at the population proportion, \(p = 0.15\), for the sample proportion in this equation, we obtain
\[0.02 = 2 \sqrt{\frac{0.15(1 - 0.15)}{n}}\]
We can now solve this equation for the sample size \(n\):
\[0.01 = \sqrt{\frac{0.15(1 - 0.15)}{n}}\]
\[0.0001 = \frac{0.15(1 - 0.15)}{n}\]
\[n = \frac{0.15(1 - 0.15)}{0.0001}\]
\[n = 1275\]
We should use a sample size of about 1275 to get a confidence interval width of about 0.04 in this country.
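The same algebra can be wrapped in a small function. This is a sketch under the assumptions used above (a Wald interval with a multiplier of 2 by default); `required_n` is a hypothetical helper name, not a function from any package.

```r
# Hypothetical helper: smallest n giving the desired margin of error,
# from solving  margin = multiplier * sqrt(p * (1 - p) / n)  for n.
required_n <- function(p, margin, multiplier = 2) {
  n_raw <- p * (1 - p) * (multiplier / margin)^2
  # Round first to guard against floating-point noise, then round up
  ceiling(round(n_raw, 8))
}

# Best guess p = 0.15, margin of error 0.02 (total width 0.04)
required_n(p = 0.15, margin = 0.02)

# Same target using the exact 95% multiplier (about 1.96) instead of 2
required_n(p = 0.15, margin = 0.02, multiplier = qnorm(0.975))

# Most conservative choice: p = 0.5 gives the largest required n
required_n(p = 0.5, margin = 0.02)
```

With the default multiplier of 2 the first call reproduces the 1275 found by hand. Using 0.5 for the proportion is a common fallback when no prior guess is available, since it gives the widest interval for a given sample size.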