Goals

There are three main goals/learning objectives for this lab:

  1. Get experience with checking conditions and interpreting confidence intervals for proportions.

  2. Get some experience using R to calculate confidence intervals for proportions.

  3. Understand how the width of the confidence interval depends on various factors:

    1. the confidence level

    2. the proportion that is being estimated

    3. the sample size

Grading

This lab will not be graded for credit. However, I will ask you to turn it in; there are several common errors in interpreting confidence intervals, and I want to have a chance to read your answers and give you feedback before you are responsible for this material on graded assignments. Please email the completed lab (Rmd file only) to me, cc-ing anyone you worked with, by 5pm on Monday, Nov 6. You will have class time on Wednesday and Friday to work on this, but we will not spend time on this lab in class on Monday. To download the Rmd file from RStudio, click the check box next to the file name in the lower right panel of RStudio, then click “More” (top right of that Files panel), choose “Export…”, and save the file to your computer. Then you can attach it to an email.

Introduction

In August of 2012, news outlets ranging from the Washington Post to the Huffington Post ran a story about the rise of atheism in America. The source for the story was a poll that asked people, “Irrespective of whether you attend a place of worship or not, would you say you are a religious person, not a religious person or a convinced atheist?” The full press release for the poll, conducted by WIN-Gallup International, is found at the following address:

<https://mhc-stat140-2017.github.io/labs/20171101_p_ci/Global_INDEX_of_Religiosity_and_Atheism_PR__6.pdf>

Preliminary Questions

1. In the first paragraph of the press release, several key findings are reported. Do these percentages appear to be sample statistics (derived from the data sample) or population parameters?

SOLUTION:

These are sample statistics. They were calculated based on the responses from a relatively small number of people sampled from each country (generally less than 2000 per country). To be population parameters, we would have had to find out whether or not every person in each country identified as an atheist, to compute the proportion of the entire population who identify as atheists.

2. The title of the report is “Global Index of Religiosity and Atheism”. To generalize the report’s findings to the global human population, what must we assume about the sampling method? Does that seem like a reasonable assumption?

SOLUTION:

To generalize the findings from the sample to the global population, we would have to know that the sampling method resulted in a sample that was representative of the population in terms of their religious views. Without knowing the details of the sampling methods, we just have to trust that WIN-Gallup International took their sample carefully.

The data

Turn your attention to Table 6 of the press release (pages 15 and 16), which reports the sample size and response percentages for all 57 countries. While this is a useful format to summarize the data, we will base our analysis on the original data set of individual responses to the survey. Load this data set into R with the following commands.

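library(readr)   # provides read_csv()
library(dplyr)   # provides filter()
library(mosaic)  # provides binom.test() with the ci.method argument
atheism <- read_csv("https://mhc-stat140-2017.github.io/data/openintro/atheism/atheism.csv")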
## Parsed with column specification:
## cols(
##   nationality = col_character(),
##   response = col_character(),
##   year = col_integer()
## )
head(atheism)
## # A tibble: 6 x 3
##   nationality    response  year
##         <chr>       <chr> <int>
## 1 Afghanistan non-atheist  2012
## 2 Afghanistan non-atheist  2012
## 3 Afghanistan non-atheist  2012
## 4 Afghanistan non-atheist  2012
## 5 Afghanistan non-atheist  2012
## 6 Afghanistan non-atheist  2012
str(atheism)
## Classes 'tbl_df', 'tbl' and 'data.frame':    88032 obs. of  3 variables:
##  $ nationality: chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ response   : chr  "non-atheist" "non-atheist" "non-atheist" "non-atheist" ...
##  $ year       : int  2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
##  - attr(*, "spec")=List of 2
##   ..$ cols   :List of 3
##   .. ..$ nationality: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ response   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ year       : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   ..$ default: list()
##   .. ..- attr(*, "class")= chr  "collector_guess" "collector"
##   ..- attr(*, "class")= chr "col_spec"

3. What does each row of Table 6 correspond to? What does each row of atheism correspond to?

SOLUTION:

Each row of Table 6 in the press release corresponds to a country, summarizing responses from all participants in that country.

Each row of the atheism data frame corresponds to an individual survey participant.
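
The link between the two layouts can be sketched in base R. The data frame `toy` below is a hypothetical stand-in built from the 2012 U.S. and Canadian counts shown in this lab, not the full atheism data; collapsing one row per person into one number per country gives the kind of summary reported in Table 6.

```r
# Stand-in for the individual-level data: one row per respondent.
# Counts are the 2012 U.S. and Canada tallies shown in this lab.
toy <- data.frame(
  nationality = rep(c("United States", "Canada"), times = c(1002, 1002)),
  response = c(
    rep(c("atheist", "non-atheist"), times = c(50, 952)),
    rep(c("atheist", "non-atheist"), times = c(90, 912))
  )
)

# Collapse to one value per country, as in Table 6:
# the mean of a TRUE/FALSE vector is the proportion of TRUEs.
prop_atheist <- tapply(toy$response == "atheist", toy$nationality, mean)

# Percentages rounded to whole numbers, the way the press release reports them
pct_atheist <- round(100 * prop_atheist)
pct_atheist
```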

To investigate the link between these two ways of organizing this data, take a look at the estimated proportion of atheists in the United States. Towards the bottom of Table 6, we see that this is 5%. We can check this number using the atheism data by running the commands below. Make sure you understand what each of the commands below does after running it.

us_2012 <- filter(atheism, nationality == "United States", year == "2012")
nrow(us_2012)
## [1] 1002
head(us_2012)
## # A tibble: 6 x 3
##     nationality    response  year
##           <chr>       <chr> <int>
## 1 United States non-atheist  2012
## 2 United States non-atheist  2012
## 3 United States non-atheist  2012
## 4 United States non-atheist  2012
## 5 United States non-atheist  2012
## 6 United States non-atheist  2012
table(us_2012$response)
## 
##     atheist non-atheist 
##          50         952
table(us_2012$response) / nrow(us_2012)
## 
##     atheist non-atheist 
##   0.0499002   0.9500998

4. Using a similar series of commands, confirm the calculation of the proportion of atheist responses in our neighboring country of Canada. Does it agree with the percentage of 9% in Table 6?

SOLUTION:

ca_2012 <- filter(atheism, nationality == "Canada", year == "2012")
nrow(ca_2012)
## [1] 1002
head(ca_2012)
## # A tibble: 6 x 3
##   nationality    response  year
##         <chr>       <chr> <int>
## 1      Canada non-atheist  2012
## 2      Canada non-atheist  2012
## 3      Canada non-atheist  2012
## 4      Canada non-atheist  2012
## 5      Canada non-atheist  2012
## 6      Canada non-atheist  2012
table(ca_2012$response)
## 
##     atheist non-atheist 
##          90         912
table(ca_2012$response) / nrow(ca_2012)
## 
##     atheist non-atheist 
##  0.08982036  0.91017964

Inference on proportions

As was hinted at in Exercise 1, Table 6 provides statistics, that is, calculations made from the sample of 51,927 people. What we’d like, though, is insight into the population parameters. You answer the question “What proportion of people in your sample reported being atheists?” with a statistic, while the question “What proportion of people on earth would report being atheists?” is answered with an estimate of the parameter.

A confidence interval

Here is how we’d compute a 95% confidence interval for the proportion of atheists in the United States in 2012.

confint(binom.test(us_2012$response, conf.level = 0.95, ci.method = "wald"))
##   probability of success      lower      upper level
## 1              0.0499002 0.03641833 0.06338206  0.95
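
To see where these numbers come from, the interval can be reproduced by hand in base R. This is a sketch that assumes the "wald" method is the usual normal-approximation interval \(\hat{p} \pm z^* SE(\hat{p})\), with \(z^*\) taken from `qnorm()`; the variable names are illustrative.

```r
# Counts from the table() output for the 2012 U.S. sample
x <- 50      # atheist responses
n <- 1002    # total respondents

p_hat <- x / n
se <- sqrt(p_hat * (1 - p_hat) / n)   # standard error of p_hat
z_star <- qnorm(0.975)                # critical value for 95% confidence, about 1.96

ci <- c(lower = p_hat - z_star * se, upper = p_hat + z_star * se)
ci
```

The critical value here is about 1.96 rather than exactly 2, which is why the "2 SE" rule of thumb gives very slightly wider intervals than the R output.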

5. Interpret this confidence interval in the context of the problem.

SOLUTION:

We are 95% confident that the proportion of the entire US population who identify as atheists is between 3.64% and 6.34%. If we took many different random samples of size 1002 from the US population, and used the data in each of those samples to calculate a different 95% confidence interval, about 95% of those intervals would contain the proportion of the US population who identify as atheists.
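
The "many different random samples" idea can be checked with a small simulation. This is only a sketch: it pretends the true population proportion is 0.05 (a hypothetical value chosen to be close to the U.S. estimate), draws many samples of size 1002, and counts how often the normal-approximation interval contains that value.

```r
set.seed(140)      # arbitrary seed, for reproducibility
p_true <- 0.05     # assumed "true" population proportion (hypothetical)
n <- 1002          # sample size, as in the U.S. sample
n_sims <- 10000    # number of simulated samples

# Number of atheists in each simulated sample
x <- rbinom(n_sims, size = n, prob = p_true)

# A 95% normal-approximation interval from each sample
p_hat <- x / n
se <- sqrt(p_hat * (1 - p_hat) / n)
z_star <- qnorm(0.975)
lower <- p_hat - z_star * se
upper <- p_hat + z_star * se

# Share of intervals that contain p_true; this should land near 0.95
coverage <- mean(lower <= p_true & p_true <= upper)
coverage
```

Note that the 95% describes the procedure across repeated samples; any single interval either contains the population proportion or it does not.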

6. Write out the conditions for inference to construct a 95% confidence interval for the proportion of atheists in the United States in 2012. Are you confident all conditions are met?

SOLUTION:

We need to check the following conditions:

  1. For each person in our sample there are two outcomes (at least, two outcomes that are relevant to this analysis): each person either identifies as an atheist or does not identify as an atheist.

  2. Each person we pick has the same probability of being an atheist (which is the proportion of all US citizens who identify as atheists).

  3. The people in our sample are independent. There are two conditions to check here:

    1. the sample was taken randomly

    2. the sample size (n = 1002) is less than 10% of the size of the US population

  4. We have observed at least 10 “successes” and at least 10 “failures” in our sample. We had 50 atheists and 952 non-atheists in this sample (from the R output above).
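
The two numeric conditions above can be checked in a few lines of R. This is a rough sketch: the population figure is an approximate value for 2012 that I am assuming purely for the 10% check, and no code can verify that the sample was taken randomly.

```r
n_success <- 50    # atheists in the 2012 U.S. sample
n_failure <- 952   # non-atheists in the 2012 U.S. sample
n <- n_success + n_failure

# Assumption: rough 2012 U.S. population, used only for the 10% check
us_population <- 314e6

# At least 10 successes and at least 10 failures
success_failure_ok <- n_success >= 10 && n_failure >= 10

# Sample is less than 10% of the population
ten_percent_ok <- n < 0.10 * us_population

c(success_failure = success_failure_ok, ten_percent = ten_percent_ok)
```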

7. Although formal confidence intervals don’t show up in the report, suggestions of inference appear at the bottom of page 7: “In general, the error margin for surveys of this kind is plus or minus 3-5% at 95% confidence”. Based on the R output, what is the margin of error for the estimate of the proportion of atheists in the US in 2012?

SOLUTION:

The margin of error is the quantity we add and subtract from \(\hat{p}\) to get the confidence interval (for a 95% confidence interval, this is \(2 SE(\hat{p})\)). That means that we can calculate the margin of error from the R output above in any of three ways:

  1. The margin of error is the total width of the confidence interval divided by two. The total width of the confidence interval is 0.0634 - 0.0364 = 0.027, so the margin of error is 0.0135, or 1.35%.

  2. The sample proportion minus the lower confidence interval limit: 0.0499 - 0.0364 = 0.0135

  3. The upper confidence interval limit minus the sample proportion: 0.0634 - 0.0499 = 0.0135

This margin of error computed using R is smaller than the margin of error suggested in the WIN-Gallup International press release. It could be that their suggested margin of error takes into account a potentially more complicated sampling design than the simple random sample our method is based on.
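
The three calculations of the margin of error can be reproduced directly from the R output. The last line is an extra illustration, not something stated in the press release: for a sample of about 1000 people, the largest possible margin of error under the normal approximation (at \(\hat{p} = 0.5\)) is about 3%, which is the low end of the quoted 3-5% range. Whether the pollsters computed their figure this way is not stated.

```r
# Values copied from the 95% confidence interval output for the U.S. in 2012
p_hat <- 0.0499002
lower <- 0.03641833
upper <- 0.06338206

# The three equivalent ways of getting the margin of error
moe_three_ways <- c(
  half_width = (upper - lower) / 2,
  from_lower = p_hat - lower,
  from_upper = upper - p_hat
)
round(moe_three_ways, 4)

# Illustration only: worst-case margin of error for n = 1000,
# which occurs when the sample proportion is 0.5
moe_worst_case <- qnorm(0.975) * sqrt(0.5 * (1 - 0.5) / 1000)
round(moe_worst_case, 3)
```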

Confidence interval width and the confidence level

8. Calculate a 90% confidence interval for the proportion of atheists in the United States in 2012. Does it make sense that this confidence interval would be wider or narrower than the 95% confidence interval we already calculated?

SOLUTION:

confint(binom.test(us_2012$response, conf.level = 0.9, ci.method = "wald"))
##   probability of success      lower      upper level
## 1              0.0499002 0.03858586 0.06121454   0.9

This 90% confidence interval is narrower than the 95% confidence interval we calculated above.

This makes sense, because we’re only requiring that 90% of confidence intervals we’d get from different samples include the true proportion of the population who identify as atheists, rather than 95% of the confidence intervals. Each interval can be narrower, because fewer of them need to include the population proportion.
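
In the formula, the confidence level enters only through the critical value that multiplies the standard error. The sketch below, using the U.S. counts and the normal approximation, shows how that multiplier and the resulting interval width grow with the confidence level; the 99% row is added just for comparison.

```r
p_hat <- 50 / 1002
n <- 1002
se <- sqrt(p_hat * (1 - p_hat) / n)   # the same for every confidence level

conf_level <- c(0.90, 0.95, 0.99)
z_star <- qnorm(1 - (1 - conf_level) / 2)   # two-sided critical values
width <- 2 * z_star * se                    # total width of each interval

data.frame(conf_level, z_star = round(z_star, 3), width = round(width, 4))
```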

Confidence interval width and the proportion being estimated

9. Modify the R chunk below to calculate 95% confidence intervals for the proportion of the population who identify as atheists in Austria, the Czech Republic, and Kenya. (We should check the conditions for constructing the confidence interval as in question number 6 above – but let’s ignore that step here in order to focus our time on other issues.) Note that for each of these countries, as well as the U.S., the sample size is similar (about 1000 respondents). Then answer the questions below.

SOLUTION:

austria_2012 <- filter(atheism, nationality == "Austria", year == "2012")
nrow(austria_2012)
## [1] 1002
cr_2012 <- filter(atheism, nationality == "Czech Republic", year == "2012")
nrow(cr_2012)
## [1] 1000
kenya_2012 <- filter(atheism, nationality == "Kenya", year == "2012")
nrow(kenya_2012)
## [1] 1000
## Add confidence interval calculations for the proportion of the population
## who identify as atheists in each of Austria, the Czech Republic, and Kenya.
confint(binom.test(austria_2012$response, conf.level = 0.95, ci.method = "wald"))
##   probability of success     lower     upper level
## 1              0.0998004 0.0812416 0.1183592  0.95
confint(binom.test(cr_2012$response, conf.level = 0.95, ci.method = "wald"))
##   probability of success     lower     upper level
## 1                    0.3 0.2715974 0.3284026  0.95
confint(binom.test(kenya_2012$response, conf.level = 0.95, ci.method = "wald"))
##   probability of success      lower      upper level
## 1                   0.02 0.01132287 0.02867713  0.95

(i) Is the width of the confidence intervals the same for these three countries and the U.S.? How does this relate to the formula we learned for the margin of error used in calculating the confidence interval?

SOLUTION:

Here is a table of the sample size (\(n\)), sample proportion \(\hat{p}\), and confidence interval widths for each of the US, Austria, the Czech Republic, and Kenya. I have ordered the countries from lowest sample proportion to largest sample proportion.

| Country | Sample Size (n) | Sample Proportion (\(\hat{p}\)) | Confidence Interval Width |
|---|---|---|---|
| Kenya | 1000 | 0.02 | 0.017 |
| United States | 1002 | 0.05 | 0.027 |
| Austria | 1002 | 0.1 | 0.037 |
| Czech Republic | 1000 | 0.3 | 0.057 |

As we can see, the sample size was similar for all four countries, but the sample proportions were different and the confidence interval widths were different. Recall that when we calculate the confidence interval, the lower bound of the interval is calculated as \(\hat{p} - 2SE(\hat{p})\) and the upper bound is calculated as \(\hat{p} + 2SE(\hat{p})\), where \[SE(\hat{p}) = \sqrt{\frac{\widehat{p}(1 - \widehat{p})}{n}}.\]

This means that the standard error of \(\hat{p}\), and therefore the confidence interval width, changes with the sample proportion. In these examples, all of which have sample proportions below 0.5, the interval gets wider as the sample proportion increases. In general, for a given sample size, the interval is widest when the sample proportion is 0.5.
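
A quick way to see this pattern is to hold the sample size fixed and compute the interval width over a grid of sample proportions. This is a sketch using the normal-approximation formula with a hypothetical sample size of 1000; the grid of proportions is arbitrary.

```r
n <- 1000                       # fixed, hypothetical sample size
p <- seq(0.1, 0.9, by = 0.1)    # grid of sample proportions

# Total width of a 95% interval at each proportion
width <- 2 * qnorm(0.975) * sqrt(p * (1 - p) / n)
round(setNames(width, p), 3)

# The proportion on the grid that gives the widest interval
p_widest <- p[which.max(width)]
p_widest
```

The widths are symmetric around 0.5 because \(\hat{p}(1 - \hat{p})\) takes the same value at \(\hat{p}\) and at \(1 - \hat{p}\).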

(ii) If we surveyed a new country where 70% of the population were atheists, and obtained a sample size of about 1000 people, how wide would you expect the confidence interval to be? Would it have the same or similar width as the interval we’ve seen for one of the other countries we’ve already looked at, or would it be larger or smaller?

SOLUTION:

I would expect the interval width to be about the same for a country with 70% atheists as for a country with about 30% atheists, if the sample sizes are similar. This is because \(\hat{p}(1 - \hat{p}) = 0.21\) whether \(\hat{p}\) = 0.3 or \(\hat{p}\) = 0.7. That means the interval width should be about 0.057, similar to the interval width for the Czech Republic above.

Confidence interval width and the sample size

10. Calculate 95% Confidence Intervals for the proportion of the population who identify as atheists in Saudi Arabia and South Africa (again, skip checking the conditions for now). Note that these countries have similar proportions who identify as atheists as the U.S., but the sample sizes are different. Then answer the questions below.

SOLUTION:

saudi_arabia_12 <- filter(atheism, nationality == "Saudi Arabia", year == "2012")
nrow(saudi_arabia_12)
## [1] 500
south_africa_12 <- filter(atheism, nationality == "South Africa", year == "2012")
nrow(south_africa_12)
## [1] 202
## Add confidence interval calculations for the proportion of the population
## who identify as atheists in each of Saudi Arabia and South Africa.
confint(binom.test(saudi_arabia_12$response, conf.level = 0.95, ci.method = "wald"))
##   probability of success      lower      upper level
## 1                   0.05 0.03089663 0.06910337  0.95
confint(binom.test(south_africa_12$response, conf.level = 0.95, ci.method = "wald"))
##   probability of success      lower      upper level
## 1             0.03960396 0.01270925 0.06649867  0.95

(i) Is the width of the confidence intervals the same for these two countries and the U.S.? How does this relate to the formula we learned for the margin of error used in calculating the confidence interval?

SOLUTION:

Here is a table of the sample size (\(n\)), sample proportion \(\hat{p}\), and confidence interval widths for each of South Africa, Saudi Arabia, and the US. I have ordered the countries from smallest sample size to largest sample size.

| Country | Sample Size (n) | Sample Proportion (\(\hat{p}\)) | Confidence Interval Width |
|---|---|---|---|
| South Africa | 202 | 0.04 | 0.054 |
| Saudi Arabia | 500 | 0.05 | 0.038 |
| United States | 1002 | 0.05 | 0.027 |

Again, the confidence interval width depends on the standard error, \[SE(\hat{p}) = \sqrt{\frac{\widehat{p}(1 - \widehat{p})}{n}}.\]

As the sample size \(n\) increases, the standard error decreases, and as a result the width of the confidence interval decreases.
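
Because \(n\) sits under a square root, the width shrinks in proportion to \(1/\sqrt{n}\). The sketch below holds the sample proportion fixed at 0.05 (an assumption made to isolate the effect of \(n\)) and uses the three sample sizes from the table plus a hypothetical fourth, four times the U.S. sample size.

```r
p_hat <- 0.05                    # held fixed to isolate the effect of n
n <- c(202, 500, 1002, 4008)     # last value is hypothetical: 4 x 1002

# Total width of a 95% interval at each sample size
width <- 2 * qnorm(0.975) * sqrt(p_hat * (1 - p_hat) / n)
round(setNames(width, n), 3)

# Quadrupling the sample size cuts the width roughly in half
ratio <- width[3] / width[4]
ratio
```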

(ii) Suppose we are planning a survey of a new country, and our initial guess is that the proportion of the population who identify as atheists is about 0.15 (or 15%). If we want to ensure that the total width of a 95% confidence interval for that proportion will be no larger than 0.04, how large should our sample size be? Use the formula for the margin of error for a confidence interval based on the normal approximation. (Recall that the margin of error is half of the width of the interval!)

SOLUTION:

If the total width of the confidence interval is 0.04, then the margin of error is 0.02.

The margin of error for a 95% confidence interval is \(2 SE(\hat{p})\), so we get an equation like

\[0.02 = 2 \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}\]

Plugging in our best guess at the population proportion, \(p = 0.15\), for the sample proportion in this equation, we obtain

\[0.02 = 2 \sqrt{\frac{0.15(1 - 0.15)}{n}}\]

We can now solve this equation for the sample size \(n\):

\[0.01 = \sqrt{\frac{0.15(1 - 0.15)}{n}}\]

\[0.0001 = \frac{0.15(1 - 0.15)}{n}\]

\[n = \frac{0.15(1 - 0.15)}{0.0001}\]

\[n = 1275\]

We should use a sample size of about 1275 to get a confidence interval width of about 0.04 in this country.
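
The same algebra can be wrapped in a small R function. The helper `sample_size_for_width()` is hypothetical (it is not part of any package); it solves \(\text{margin} = z^* \sqrt{p(1 - p)/n}\) for \(n\), using the multiplier 2 by default to match the calculation above.

```r
# Hypothetical helper: sample size needed for a target total interval width.
#   p_guess: initial guess at the population proportion
#   width:   desired total width of the confidence interval
#   z_star:  multiplier on the standard error (2 matches the rule used above)
sample_size_for_width <- function(p_guess, width, z_star = 2) {
  margin <- width / 2
  z_star^2 * p_guess * (1 - p_guess) / margin^2
}

# Using the "2 SE" rule, as in the solution above
n_rule_of_two <- sample_size_for_width(p_guess = 0.15, width = 0.04)
round(n_rule_of_two)

# Using the critical value from qnorm() (about 1.96) instead of 2,
# rounding up so the interval is no wider than requested
n_exact_z <- sample_size_for_width(p_guess = 0.15, width = 0.04, z_star = qnorm(0.975))
ceiling(n_exact_z)
```

The second answer is a little smaller because 1.96 is slightly less than 2. Either way, the result is only as good as the initial guess of 0.15; a guess closer to 0.5 would call for a larger sample.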