Confidence Intervals for Population Proportions

Evan L. Ray

November 1, 2017

Warm Up

Suppose $X \sim \text{Normal}(\mu, \sigma).$

Define a new random variable $Z$ by $Z = \frac{X - \mu}{\sigma}.$

Fact: $Z$ also follows a Normal distribution. What are the mean (i.e., expected value) and variance of $Z$?

"Recall" that if $X$ is a random variable and $a$ is a number, then

$E(aX) = a E(X)$

$E(X + a) = E(X) + a$

$\text{Var}(aX) = a^2 \text{Var}(X)$, so $\text{SD}(aX) = |a| \, \text{SD}(X)$

$\text{Var}(X + a) = \text{Var}(X)$
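
A quick simulation is one way to check an answer to the warm-up. This is only a sketch: the values of `mu` and `sigma`, the number of draws, and the seed are arbitrary choices made for illustration and do not come from the slides.

```r
set.seed(1)   # arbitrary seed, only so the run is reproducible

mu <- 5       # hypothetical mean (any value works)
sigma <- 2    # hypothetical standard deviation (any positive value works)

# Draw many values of X, then standardize them to get Z
x <- rnorm(100000, mean = mu, sd = sigma)
z <- (x - mu) / sigma

mean(z)  # should be close to 0
var(z)   # should be close to 1
```

Changing `mu` and `sigma` should leave both results essentially unchanged, which is the point of standardizing.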

More Babies

The Apgar score gives a quick sense of a baby's physical health, and is used to determine whether a baby needs immediate medical care.
It ranges from 0 (critical health problems) to 10 (no health problems).
Let's try to estimate the proportion of babies in the population who have an Apgar score of 10 using a sample of $n = 300$ babies.

A New Variable…

library(dplyr)  # provides mutate() and sample_n() used on these slides
babies <- mutate(babies, apgar_eq_10 = (apgar5 == 10))
head(babies[, c("gestation", "apgar5", "apgar_eq_10")])

## # A tibble: 6 x 3
##   gestation apgar5 apgar_eq_10
##       <int>  <int>       <lgl>
## 1        41      9       FALSE
## 2        47      6       FALSE
## 3        37      9       FALSE
## 4        35      9       FALSE
## 5        37     10        TRUE
## 6        35      9       FALSE

Population Proportion

table(babies$apgar_eq_10)

## 
##  FALSE   TRUE 
## 236381  21648

table(babies$apgar_eq_10) / nrow(babies)

## 
##      FALSE       TRUE 
## 0.91610245 0.08389755

Sample Proportion

babies_sample <- sample_n(babies, size = 300)
table(babies_sample$apgar_eq_10) / nrow(babies_sample)

## 
##     FALSE      TRUE 
## 0.8866667 0.1133333

Our estimate of the population proportion based on this sample is WRONG!
Can we get a sense of how wrong it might be, using only the data in our sample?

Sampling Distribution of $\widehat{p}$

On Monday we said that if $n$ is big enough, $\widehat{p} \sim \text{Normal}\left(p, \sqrt{\frac{p(1-p)}{n}}\right)$.
In this case, the population proportion is $p = 0.084$ and $n = 300$, so $\widehat{p} \sim \text{Normal}\left(0.084, 0.016\right)$.
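
The standard deviation of 0.016 comes from plugging $p$ and $n$ into the formula. A minimal sketch of that arithmetic, using the rounded value $p = 0.084$ from this slide (the variable names are chosen here for illustration):

```r
p <- 0.084   # population proportion, rounded as on the slide
n <- 300     # sample size

# Standard deviation of the sampling distribution of p-hat
sd_p_hat <- sqrt(p * (1 - p) / n)
round(sd_p_hat, 3)
```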

Interpretation with 68-95-99.7 Rule

$\widehat{p} \sim \text{Normal}\left(0.084, 0.016\right)$

For about 68% of samples of size $n$ we could take, the sample proportion $\widehat{p}$ will be within 1 standard deviation ($\pm 0.016$) of the population proportion $p = 0.084$.
For about 95% of samples of size $n$ we could take, the sample proportion $\widehat{p}$ will be within 2 standard deviations ($\pm 0.032$) of the population proportion $p = 0.084$.
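
The 68% and 95% figures are rounded Normal probabilities. As a sketch, they can be checked with `pnorm`, here using the rounded values $p = 0.084$ and $\text{SD}(\widehat{p}) = 0.016$ from this slide:

```r
p <- 0.084         # population proportion from the slide
sd_p_hat <- 0.016  # SD of the sampling distribution from the slide

# Probability that p-hat falls within 1 SD of p
within_1_sd <- pnorm(p + 1 * sd_p_hat, mean = p, sd = sd_p_hat) -
  pnorm(p - 1 * sd_p_hat, mean = p, sd = sd_p_hat)

# Probability that p-hat falls within 2 SDs of p
within_2_sd <- pnorm(p + 2 * sd_p_hat, mean = p, sd = sd_p_hat) -
  pnorm(p - 2 * sd_p_hat, mean = p, sd = sd_p_hat)

round(c(within_1_sd, within_2_sd), 3)
```

The second probability is slightly above 0.95, which is why "2 standard deviations" is only an approximation for a 95% interval.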

A Confidence Interval

If $\widehat{p}$ is within $\pm$ 2 standard deviations of $p$, then $p$ is contained in the interval

$[\widehat{p} - 2 \, \text{SD}(\widehat{p}), \widehat{p} + 2 \, \text{SD}(\widehat{p})]$

For our sample, $\widehat{p} = 0.113$ and $2 \, \text{SD}(\widehat{p}) = 0.032$, so we are "95% confident" that the population proportion $p$ is in the interval [0.081, 0.145].
For 95% of samples, an interval constructed this way contains $p$.
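
The interval above can be reproduced with a couple of lines of arithmetic. This sketch assumes the sample proportion 0.1133333 from the "Sample Proportion" slide and the rounded standard deviation 0.016:

```r
p_hat <- 0.1133333  # sample proportion from the earlier slide
sd_p_hat <- 0.016   # SD of the sampling distribution (uses the known p)

# Interval: point estimate plus or minus 2 standard deviations
ci <- c(lower = p_hat - 2 * sd_p_hat, upper = p_hat + 2 * sd_p_hat)
round(ci, 3)
```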

95% C.I.s from 100 Different Samples
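
A simulation along these lines could generate 100 such intervals. It is a sketch under stated assumptions, not the code behind the slide: `rbinom` stands in for drawing samples of 300 babies from the population, $p = 0.084$ is the rounded population proportion, and the seed is arbitrary.

```r
set.seed(2017)  # arbitrary seed, only so the run is reproducible

p <- 0.084        # population proportion (rounded)
n <- 300          # size of each sample
n_samples <- 100  # number of samples, and so number of intervals

sd_p_hat <- sqrt(p * (1 - p) / n)

# Assumption: binomial draws approximate sampling 300 babies at random
p_hats <- rbinom(n_samples, size = n, prob = p) / n

# One interval per sample: p-hat plus or minus 2 standard deviations
lower <- p_hats - 2 * sd_p_hat
upper <- p_hats + 2 * sd_p_hat

# TRUE for each interval that contains the population proportion
covers <- (lower <= p) & (p <= upper)
mean(covers)  # proportion of intervals containing p; expect roughly 0.95
```

The exact proportion changes from run to run, but most of the 100 intervals should contain $p$.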

A Minor Problem

The 95% confidence interval from a couple of slides ago was

$[\widehat{p} - 2 \, \text{SD}(\widehat{p}), \widehat{p} + 2 \, \text{SD}(\widehat{p})]$

But $\text{SD}(\widehat{p})$ depends on the (unknown) population parameter $p$:

$\text{SD}(\widehat{p}) = \sqrt{\frac{p(1-p)}{n}}$

We can estimate $\text{SD}(\widehat{p})$ by plugging our estimate of $p$ into this formula. An estimate of the standard deviation of a sampling distribution is called a standard error:

$\text{SE}(\widehat{p}) = \sqrt{\frac{\widehat{p}(1-\widehat{p})}{n}}$

Critical Values

What if we want a 90% CI instead of a 95% CI?
We need to know: 90% of sample proportions will be within how many standard deviations of the population proportion?
This is called the critical value, and denoted by $z^*$.

Our new CI formula: $[\widehat{p} - z^* \text{SE}(\widehat{p}), \widehat{p} + z^* \text{SE}(\widehat{p})]$

Finding the Critical Value (Short Version)

For a 90% CI, the critical value is the 95th percentile of a Normal(0, 1) distribution:

qnorm(0.95, mean = 0, sd = 1)

## [1] 1.644854

More generally: for a $(1 - \alpha) \times 100\%$ CI, the critical value is the $(1 - \alpha/2)$ quantile of a Normal(0, 1) distribution:
- $\alpha = 0.1$ -> 90% CI. 1 - 0.05 = 0.95 quantile.
- $\alpha = 0.05$ -> 95% CI. 1 - 0.025 = 0.975 quantile.
- $\alpha = 0.01$ -> 99% CI. 1 - 0.005 = 0.995 quantile.
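
The three cases in the list can be computed in one call, since `qnorm` is vectorized. A minimal sketch (the variable and column names are chosen here for illustration):

```r
alpha <- c(0.10, 0.05, 0.01)

# Critical value: the (1 - alpha/2) quantile of Normal(0, 1)
z_star <- qnorm(1 - alpha / 2, mean = 0, sd = 1)

data.frame(
  conf_level = 1 - alpha,
  quantile = 1 - alpha / 2,
  z_star = round(z_star, 3)
)
```

The 95% critical value is about 1.96, which is where the "2 standard deviations" on the earlier slides comes from.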

Finding the Critical Value

$\widehat{p} \sim \text{Normal}(p, \text{SD}(\widehat{p}))$

For a 90% CI, we need the total area to the left of $p + z^* \text{SD}(\widehat{p})$ to be 0.95, in a $\text{Normal}(p, \text{SD}(\widehat{p}))$ distribution.

Let's define $Z = \frac{\widehat{p} - p}{\text{SD}(\widehat{p})}$. Then $Z \sim \text{Normal}(0, 1)$ (see warmup).

Finding the Critical Value (continued)

For a 90% CI, area to the left of $p + z^* \text{SD}(\widehat{p})$ is 0.95.
Define $Z = \frac{\widehat{p} - p}{\text{SD}(\widehat{p})}$. Then $Z \sim \text{Normal}(0, 1)$.

Area to the left of $\frac{[p + z^* \text{SD}(\widehat{p})] - p}{\text{SD}(\widehat{p})} = z^*$ is 0.95.
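
The standardization argument can be checked numerically. In this sketch the values $p = 0.084$ and $\text{SD}(\widehat{p}) = 0.016$ are taken from the earlier slides purely for illustration; any mean and standard deviation would give the same two areas.

```r
p <- 0.084         # population proportion from the earlier slides
sd_p_hat <- 0.016  # SD of the sampling distribution from the earlier slides

z_star <- qnorm(0.95, mean = 0, sd = 1)  # critical value for a 90% CI

# Area to the left of p + z* SD(p-hat) under Normal(p, SD(p-hat))
area_p_hat <- pnorm(p + z_star * sd_p_hat, mean = p, sd = sd_p_hat)

# Area to the left of z* under Normal(0, 1)
area_z <- pnorm(z_star, mean = 0, sd = 1)

c(area_p_hat, area_z)  # both are 0.95
```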

Putting it All Together

CI formula: $[\widehat{p} - z^* \text{SE}(\widehat{p}), \widehat{p} + z^* \text{SE}(\widehat{p})]$
Standard Error of $\widehat{p}$ : $\text{SE}(\widehat{p}) = \sqrt{\frac{\widehat{p}(1-\widehat{p})}{n}}$
Critical Value: $z^*$ is the 97.5th percentile of a standard normal distribution if we want a 95% CI
- Use qnorm function in R
Margin of Error: $z^* \text{SE}(\widehat{p})$ (how much we add and subtract from the point estimate $\widehat{p}$ )
Interpretation: In repeated sampling, a confidence interval constructed using this procedure contains the population parameter for 95% of samples (or whatever your confidence level is).
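
The pieces on this slide can be collected into one small function. This is a hypothetical helper written for illustration, not something from the slides or from a package; the name `wald_ci` and its arguments are assumptions. The example call uses 34 successes out of 300, which matches the sample proportion 0.1133333 seen earlier.

```r
# Hypothetical helper: confidence interval for a proportion from the
# number of successes, the sample size, and the confidence level
wald_ci <- function(successes, n, conf_level = 0.95) {
  p_hat <- successes / n                # point estimate
  se <- sqrt(p_hat * (1 - p_hat) / n)   # standard error of p-hat
  alpha <- 1 - conf_level
  z_star <- qnorm(1 - alpha / 2)        # critical value
  moe <- z_star * se                    # margin of error
  c(estimate = p_hat,
    lower = p_hat - moe,
    upper = p_hat + moe,
    margin_of_error = moe)
}

ci_95 <- wald_ci(successes = 34, n = 300)
ci_90 <- wald_ci(successes = 34, n = 300, conf_level = 0.90)
ci_95
ci_90
```

The 90% interval is narrower than the 95% interval because its critical value is smaller.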

Assumptions to Check

Two outcomes (that are relevant to this analysis)
Same probability of success
People/items in our sample are independent
- Think about how data were collected/if there is a connection between units
- 10% Condition: Sample size less than 10% of population size?
Sample size large enough to use normal approximation to the sampling distribution:
- $np \geq 10$ and $n(1 - p) \geq 10$
- … but we don't actually know $p$ !
- Check that there are at least 10 "successes" and 10 "failures" in the data set.
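
For the sample used in these slides, the last two checks could be carried out as below. This is a sketch: the count of 34 successes is inferred from the sample proportion 0.1133333 with $n = 300$, and the population size is the sum of the two counts on the "Population Proportion" slide.

```r
n <- 300
successes <- 34            # inferred: 0.1133333 * 300
failures <- n - successes  # 266

# At least 10 successes and 10 failures in the sample?
enough_data <- (successes >= 10) && (failures >= 10)

# 10% condition: is the sample less than 10% of the population?
population_size <- 236381 + 21648  # FALSE + TRUE counts from the population table
small_fraction <- (n / population_size) < 0.10

c(enough_data = enough_data, small_fraction = small_fraction)
```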

Manual Calculations in R

table(babies_sample$apgar_eq_10) / nrow(babies_sample)

## 
##     FALSE      TRUE 
## 0.8866667 0.1133333

p_hat <- 0.1133333
se_p_hat <- sqrt(p_hat * (1 - p_hat) / 300)
z_star <- qnorm(0.975, mean = 0, sd = 1)
p_hat - z_star * se_p_hat

## [1] 0.07746206

p_hat + z_star * se_p_hat

## [1] 0.1492045

Automagic Calculations in R

library(mosaic)
confint(binom.test(
  babies_sample$apgar_eq_10,
  conf.level = 0.95,
  ci.method = "wald",
  success = TRUE))

##   probability of success      lower     upper level
## 1              0.1133333 0.07746209 0.1492046  0.95
