---
title: "Confidence Intervals for Population Proportions"
author: "Evan L. Ray"
date: "November 1, 2017"
output: ioslides_presentation
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, cache = FALSE)
require(ggplot2)
require(scales)
require(dplyr)
require(tidyr)
require(readr)
require(mosaic)
```

## Warm Up

Suppose $X \sim \text{Normal}(\mu, \sigma).$ Define a new random variable $Z$ by $Z = \frac{X - \mu}{\sigma}.$

Fact: $Z$ also follows a Normal distribution. What are the mean (i.e., expected value) and standard deviation of $Z$?

"Recall" that if $X$ is a random variable and $a$ is a number, then

$E(aX) = a E(X)$

$E(X + a) = E(X) + a$

$\text{SD}(aX) = |a| \, \text{SD}(X)$

$\text{SD}(X + a) = \text{SD}(X)$

## More Babies

* The Apgar score gives a quick sense of a baby's physical health, and is used to determine whether a baby needs immediate medical care.
* It ranges from 0 (critical health problems) to 10 (no health problems).
* Let's try to estimate the proportion of babies in the population who have an Apgar score of 10, using a sample of $n = 300$ babies.

```{r, echo = FALSE, message = FALSE, warning = FALSE, cache = TRUE}
babies <- read_csv("https://mhc-stat140-2017.github.io/data/misc/babies1998/babies_dec_1998.csv")
babies <- filter(babies, !is.na(apgar5))
set.seed(1)
```

## A New Variable...
```{r, echo = TRUE} babies <- mutate(babies, apgar_eq_10 = (apgar5 == 10)) head(babies[, c("gestation", "apgar5", "apgar_eq_10")]) ``` ## Population Proportion ```{r, echo = TRUE} table(babies$apgar_eq_10) table(babies$apgar_eq_10) / nrow(babies) ``` ```{r, echo = FALSE, fig.height=0.75} pop_prop <- data.frame( x = mean(babies$apgar_eq_10), y = 0 ) ggplot() + geom_segment(mapping = aes(x = x, xend = xend, y = y, yend = yend), data = data.frame(x = 0, xend = 1, y = 0, yend = 0)) + geom_point(mapping = aes(x = x, y = y), size = 3, data = pop_prop) + xlim(c(0, 1)) + xlab("Proportion with Apgar = 10") + theme_bw(base_size = 14) + theme( axis.line.y=element_blank(), axis.line.x=element_blank(), axis.text.y=element_blank(), axis.ticks.y=element_blank(), axis.ticks.x=element_blank(), axis.title.y=element_blank(), panel.grid.minor.y=element_blank(), panel.grid.minor.x=element_blank(), panel.grid.major.y=element_blank(), panel.grid.major.x=element_line(color = "black"), panel.border = element_blank()) ``` ## Sample Proportion ```{r, echo = FALSE} set.seed(5) ``` ```{r, echo = TRUE} babies_sample <- sample_n(babies, size = 300) table(babies_sample$apgar_eq_10) / nrow(babies_sample) ``` ```{r, echo = FALSE, fig.height=0.75} sample_prop <- data.frame( x = mean(babies_sample$apgar_eq_10), y = 0 ) ggplot() + geom_segment(mapping = aes(x = x, xend = xend, y = y, yend = yend), data = data.frame(x = 0, xend = 1, y = 0, yend = 0)) + geom_point(mapping = aes(x = x, y = y), size = 3, data = pop_prop) + geom_point(mapping = aes(x = x, y = y), color = "orange", size = 3, data = sample_prop) + xlim(c(0, 1)) + xlab("Proportion with Apgar = 10") + theme_bw(base_size = 14) + theme( axis.line.y=element_blank(), axis.line.x=element_blank(), axis.text.y=element_blank(), axis.ticks.y=element_blank(), axis.ticks.x=element_blank(), axis.title.y=element_blank(), panel.grid.minor.y=element_blank(), panel.grid.minor.x=element_blank(), panel.grid.major.y=element_blank(), 
panel.grid.major.x=element_line(color = "black"), panel.border = element_blank()) ``` * Our estimate of the population proportion based on this sample is WRONG! * Can we get a sense of how wrong it might be, using only the data in our sample? ## Sampling Distribution of $\widehat{p}$ * On Monday we said that if $n$ is big enough, $$\widehat{p} \sim \text{Normal}\left(p, \sqrt{\frac{p(1-p)}{n}}\right)$$ * In this case, the population proportion is $p = 0.084$, and $n = 300$, so... $$\widehat{p} \sim \text{Normal}\left(0.084, 0.016\right)$$ ## Interpretation with 68-95-99.7 Rule $$\widehat{p} \sim \text{Normal}\left(0.084, 0.016\right)$$ ```{r, fig.height = 2} p <- .084 n <- 300 sdp <- sqrt(p*(1-p)/n) normal_mean <- p normal_sd <- sdp plot_df <- data.frame( x = seq(from = 0, to = 0.17, length = 101) ) ggplot() + stat_function(mapping = aes(x = x), fun = dnorm, args = list(mean = normal_mean, sd = normal_sd), data = plot_df) + geom_vline(xintercept = normal_mean - 3*normal_sd) + geom_vline(xintercept = normal_mean - 2*normal_sd) + geom_vline(xintercept = normal_mean - 1*normal_sd) + geom_vline(xintercept = normal_mean) + geom_vline(xintercept = normal_mean + 1*normal_sd) + geom_vline(xintercept = normal_mean + 2*normal_sd) + geom_vline(xintercept = normal_mean + 3*normal_sd) + geom_point(mapping = aes(x = x, y = y), size = 3, data = pop_prop) + geom_point(mapping = aes(x = x, y = y), color = "orange", size = 3, data = sample_prop) + xlab("Proportion with Apgar = 10") + scale_x_continuous( breaks = c(0, normal_mean + seq(from = -3, to = 3)*normal_sd, 0.170), labels = c("0.000", format(round(normal_mean + seq(from = -3, to = 3)*normal_sd, 3), 3), "0.170")) + theme_gray(base_size = 14) + theme( panel.grid.minor.x=element_blank() ) ``` * For about 68% of samples of size $n$ we could take, the sample proportion $\widehat{p}$ will be within $\pm$ 1 standard deviation ($\pm$ 0.016) of the population proportion $p = 0.084$ * For about 95% of samples of size $n$ we could take, 
the sample proportion $\widehat{p}$ will be within $\pm$ 2 standard deviations ($\pm$ 0.032) of the population proportion $p = 0.084$ ## A Confidence Interval * If $\widehat{p}$ is within $\pm$ 2 standard deviations of $p$, then $p$ is contained in the interval $$[\widehat{p} - 2 \, \text{SD}(\widehat{p}), \widehat{p} + 2 \, \text{SD}(\widehat{p})]$$ ```{r, fig.height = 2} p <- .084 n <- 300 sdp <- sqrt(p*(1-p)/n) almost_ci_df <- data.frame( y = 0, xmin = sample_prop$x - 2 * sdp, xmax = sample_prop$x + 2 * sdp, x = sample_prop$x ) normal_mean <- p normal_sd <- sdp plot_df <- data.frame( x = seq(from = 0, to = 0.17, length = 101) ) ggplot() + stat_function(mapping = aes(x = x), fun = dnorm, args = list(mean = normal_mean, sd = normal_sd), data = plot_df) + geom_vline(xintercept = normal_mean - 3*normal_sd) + geom_vline(xintercept = normal_mean - 2*normal_sd) + geom_vline(xintercept = normal_mean - 1*normal_sd) + geom_vline(xintercept = normal_mean) + geom_vline(xintercept = normal_mean + 1*normal_sd) + geom_vline(xintercept = normal_mean + 2*normal_sd) + geom_vline(xintercept = normal_mean + 3*normal_sd) + geom_errorbarh(mapping = aes(y = y, x = x, xmin = xmin, xmax = xmax), size = 1, height = 6, color = "orange", data = almost_ci_df) + geom_point(mapping = aes(x = x, y = y), size = 3, data = pop_prop) + geom_point(mapping = aes(x = x, y = y), color = "orange", size = 3, data = sample_prop) + xlab("Proportion with Apgar = 10") + scale_x_continuous( breaks = c(0, normal_mean + seq(from = -3, to = 3)*normal_sd, 0.170), labels = c("0.000", format(round(normal_mean + seq(from = -3, to = 3)*normal_sd, 3), 3), "0.170")) + theme_gray(base_size = 14) + theme( panel.grid.minor.x=element_blank() ) ``` * We are "95% Confident" that the population proportion $p$ is in the interval [0.081, 0.145]. * For 95% of samples, an interval constructed this way contains $p$. 
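The interval on this slide can be computed directly. This is a sketch of the slide's approach, which still uses the true $p = 0.084$ inside the standard deviation; a later slide replaces that with an estimate:

```{r, echo = TRUE}
# interval: sample proportion +/- 2 standard deviations of p-hat
p_hat <- mean(babies_sample$apgar_eq_10)
sd_p_hat <- sqrt(0.084 * (1 - 0.084) / 300)
c(p_hat - 2 * sd_p_hat, p_hat + 2 * sd_p_hat)
```

Rounded to three decimal places, this reproduces the interval [0.081, 0.145] quoted above.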
## 95% C.I.s from 100 Different Samples ```{r, fig.height = 5.5} p <- .084 n <- 300 sdp <- sqrt(p*(1-p)/n) set.seed(1) almost_ci_df <- do(100) * { babies_sample <- sample_n(babies, size = 300) sample_prop <- mean(babies_sample$apgar_eq_10) data.frame( y = 0, xmin = sample_prop - 2 * sdp, xmax = sample_prop + 2 * sdp, x = sample_prop ) } almost_ci_df$simulation_index <- seq_len(100) almost_ci_df$contains_truth <- (almost_ci_df$xmin < p & almost_ci_df$xmax > p) normal_mean <- p normal_sd <- sdp plot_df <- data.frame( x = seq(from = 0, to = 0.17, length = 101) ) ggplot() + geom_errorbarh(mapping = aes(y = simulation_index, x = x, xmin = xmin, xmax = xmax, color = contains_truth), size = 1, height = 4, data = almost_ci_df) + geom_vline(xintercept = normal_mean - 3*normal_sd) + geom_vline(xintercept = normal_mean - 2*normal_sd) + geom_vline(xintercept = normal_mean - 1*normal_sd) + geom_vline(xintercept = normal_mean) + geom_vline(xintercept = normal_mean + 1*normal_sd) + geom_vline(xintercept = normal_mean + 2*normal_sd) + geom_vline(xintercept = normal_mean + 3*normal_sd) + geom_point(mapping = aes(x = x, y = simulation_index, color = contains_truth), size = 3, data = almost_ci_df) + xlab("Proportion with Apgar = 10") + ylab("Sample Index") + scale_color_manual("Interval\nContains\nTrue p", breaks = c("FALSE", "TRUE"), labels = c("No", "Yes"), values = c("blue", "orange")) + scale_x_continuous( breaks = c(0, normal_mean + seq(from = -3, to = 3)*normal_sd, 0.170), labels = c("0.000", format(round(normal_mean + seq(from = -3, to = 3)*normal_sd, 3), 3), "0.170")) + theme_gray(base_size = 14) + theme( panel.grid.minor.x=element_blank() ) ``` ## A Minor Problem * The 95% confidence interval from a couple of slides ago was $$[\widehat{p} - 2 \, \text{SD}(\widehat{p}), \widehat{p} + 2 \, \text{SD}(\widehat{p})]$$ * But SD$(\widehat{p})$ depends on the (unknown) population parameter $p$: $$\text{SD}(\widehat{p}) = \sqrt{\frac{p(1-p)}{n}}$$ ## A Minor Problem * The 95% 
confidence interval from a couple of slides ago was $$[\widehat{p} - 2 \, \text{SD}(\widehat{p}), \widehat{p} + 2 \, \text{SD}(\widehat{p})]$$ * But SD$(\widehat{p})$ depends on the (unknown) population parameter $p$: $$\text{SD}(\widehat{p}) = \sqrt{\frac{p(1-p)}{n}}$$ * We can **estimate** SD$(\widehat{p})$ by plugging our **estimate** of $p$ into this formula. An estimate of the standard deviation of a sampling distribution is called a **standard error**: $$\text{SE}(\widehat{p}) = \sqrt{\frac{\widehat{p}(1-\widehat{p})}{n}}$$ ## Critical Values * What if we want a 90% CI instead of a 95% CI? * We need to know: 90% of sample means will be within how many standard deviations of the population mean? * This is called the **critical value**, and denoted by $z^*$ ```{r, fig.height = 2} set.seed(5) babies_sample <- sample_n(babies, size = 300) sample_prop <- data.frame( x = mean(babies_sample$apgar_eq_10), y = 0 ) p <- .084 n <- 300 sdp <- sqrt(p*(1-p)/n) z_star <- qnorm(1 - (.1 / 2), mean = 0, sd = 1) ci_90_df <- data.frame( y = 0, xmin = sample_prop$x - z_star * sdp, xmax = sample_prop$x + z_star * sdp, x = sample_prop$x ) normal_mean <- p normal_sd <- sdp plot_df <- data.frame( x = seq(from = 0, to = 0.17, length = 101) ) plot_df_poly <- data.frame( x = c(normal_mean - z_star * normal_sd, seq(from = normal_mean - z_star * normal_sd, to = normal_mean + z_star * normal_sd, length = 101), normal_mean + z_star * normal_sd ), density = c(0, dnorm(seq(from = normal_mean - z_star * normal_sd, to = normal_mean + z_star * normal_sd, length = 101), mean = normal_mean, sd = normal_sd), 0) ) ggplot() + geom_polygon(aes(x = x, y = density), fill = "blue", alpha = 0.4, data = plot_df_poly) + stat_function(mapping = aes(x = x), fun = dnorm, args = list(mean = normal_mean, sd = normal_sd), data = plot_df) + geom_vline(xintercept = normal_mean - z_star*normal_sd) + geom_vline(xintercept = normal_mean) + geom_vline(xintercept = normal_mean + z_star*normal_sd) + 
geom_errorbarh(mapping = aes(y = y, x = x, xmin = xmin, xmax = xmax),
    size = 1, height = 6, color = "orange", data = ci_90_df) +
  geom_point(mapping = aes(x = x, y = y), size = 3, data = pop_prop) +
  geom_point(mapping = aes(x = x, y = y), color = "orange", size = 3, data = sample_prop) +
  geom_segment(mapping = aes(x = x, y = y, xend = xend, yend = yend),
    size = 0.8, arrow = arrow(length = unit(0.08, "npc"), type = "closed"),
    data = data.frame(x = 0.11, y = 20, xend = 0.09, yend = 12)) +
  geom_label(mapping = aes(x = x, y = y, label = label),
    data = data.frame(x = 0.12, y = 20, label = "Blue Area = 0.9")) +
  xlab("Proportion with Apgar = 10") +
  scale_x_continuous(
    breaks = c(0, normal_mean + c(-z_star, 0, z_star)*normal_sd, 0.170),
    labels = c("0.000",
      expression(paste("p - ", z, "* ", "SD(", hat(p), ")")),
      "p",
      expression(paste("p + ", z, "* ", "SD(", hat(p), ")")),
      "0.170")) +
  theme_gray(base_size = 14) +
  theme(
    panel.grid.minor.x = element_blank()
  )
```

* Our new CI formula: $[\widehat{p} - z^* \text{SE}(\widehat{p}), \widehat{p} + z^* \text{SE}(\widehat{p})]$

## Finding the Critical Value (Short Version)

* For a 90% CI, the critical value is the 95th percentile of a Normal(0, 1) distribution:

```{r, echo = TRUE}
qnorm(0.95, mean = 0, sd = 1)
```

* More generally: for a $(1 - \alpha) \times 100$% CI, the critical value is the $1 - \frac{\alpha}{2}$ quantile of a Normal(0, 1) distribution:
    * $\alpha = 0.1$ $\rightarrow$ 90% CI. Use the $1 - 0.05 = 0.95$ quantile.
    * $\alpha = 0.05$ $\rightarrow$ 95% CI. Use the $1 - 0.025 = 0.975$ quantile.
    * $\alpha = 0.01$ $\rightarrow$ 99% CI. Use the $1 - 0.005 = 0.995$ quantile.
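All three critical values from the bullets above can be computed at once with a vectorized call to `qnorm`:

```{r, echo = TRUE}
# critical values for 90%, 95%, and 99% confidence intervals
alpha <- c(0.1, 0.05, 0.01)
qnorm(1 - alpha / 2, mean = 0, sd = 1)  # approximately 1.64, 1.96, 2.58
```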
## Finding the Critical Value * $\widehat{p} \sim \text{Normal}(p, \text{SD}(\widehat{p}))$ ```{r, fig.height = 2} set.seed(5) babies_sample <- sample_n(babies, size = 300) sample_prop <- data.frame( x = mean(babies_sample$apgar_eq_10), y = 0 ) p <- .084 n <- 300 sdp <- sqrt(p*(1-p)/n) z_star <- qnorm(1 - (.1 / 2), mean = 0, sd = 1) ci_90_df <- data.frame( y = 0, xmin = sample_prop$x - z_star * sdp, xmax = sample_prop$x + z_star * sdp, x = sample_prop$x ) normal_mean <- p normal_sd <- sdp plot_df <- data.frame( x = seq(from = 0, to = 0.17, length = 101) ) plot_df_poly <- data.frame( x = c(normal_mean - z_star * normal_sd, seq(from = normal_mean - z_star * normal_sd, to = normal_mean + z_star * normal_sd, length = 101), normal_mean + z_star * normal_sd ), density = c(0, dnorm(seq(from = normal_mean - z_star * normal_sd, to = normal_mean + z_star * normal_sd, length = 101), mean = normal_mean, sd = normal_sd), 0) ) ggplot() + geom_polygon(aes(x = x, y = density), fill = "blue", alpha = 0.4, data = plot_df_poly) + stat_function(mapping = aes(x = x), fun = dnorm, args = list(mean = normal_mean, sd = normal_sd), data = plot_df) + geom_vline(xintercept = normal_mean - z_star*normal_sd) + geom_vline(xintercept = normal_mean) + geom_vline(xintercept = normal_mean + z_star*normal_sd) + geom_errorbarh(mapping = aes(y = y, x = x, xmin = xmin, xmax = xmax), size = 1, height = 6, color = "orange", data = ci_90_df) + geom_point(mapping = aes(x = x, y = y), size = 3, data = pop_prop) + geom_point(mapping = aes(x = x, y = y), color = "orange", size = 3, data = sample_prop) + geom_segment(mapping = aes(x = x, y = y, xend = xend, yend = yend), size = 0.8, arrow = arrow(length = unit(0.08, "npc"), type = "closed"), data = data.frame(x = 0.11, y = 20, xend = 0.09, yend = 12)) + geom_label(mapping = aes(x = x, y = y, label = label), data = data.frame(x = 0.12, y = 20, label = "Blue Area = 0.9")) + geom_segment(mapping = aes(x = x, y = y, xend = xend, yend = yend), size = 0.8, arrow = 
arrow(length = unit(0.08, "npc"), type = "closed"), data = data.frame(x = 0.14, y = 8, xend = 0.112, yend = 2.5)) + geom_label(mapping = aes(x = x, y = y, label = label), data = data.frame(x = 0.14, y = 8, label = "Area = 0.05")) + geom_segment(mapping = aes(x = x, y = y, xend = xend, yend = yend), size = 0.8, arrow = arrow(length = unit(0.08, "npc"), type = "closed"), data = data.frame(x = 0.03, y = 8, xend = 0.055, yend = 2.5)) + geom_label(mapping = aes(x = x, y = y, label = label), data = data.frame(x = 0.03, y = 8, label = "Area = 0.05")) + xlab("Proportion with Apgar = 10") + scale_x_continuous( breaks = c(0, normal_mean + c(-z_star, 0, z_star)*normal_sd, 0.170), labels = c("0.000", expression(paste("p - ", z, "* ", "SD(", hat(p), ")")), "p", expression(paste("p + ", z, "* ", "SD(", hat(p), ")")), "0.170")) + theme_gray(base_size = 14) + theme( panel.grid.minor.x = element_blank() ) ``` * For a 90% CI, we need the total area to the left of $p + z^* \text{SD}(\widehat{p})$ to be 0.95, in a Normal($p$, SD($\widehat{p}$)) distribution. 
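As a quick numerical check of the statement above (plugging in $p = 0.084$ and SD$(\widehat{p}) \approx 0.016$ from this example):

```{r, echo = TRUE}
# area to the left of p + z* SD(p-hat), under a Normal(p, SD(p-hat)) curve
z_star <- qnorm(0.95, mean = 0, sd = 1)
pnorm(0.084 + z_star * 0.016, mean = 0.084, sd = 0.016)  # = 0.95 exactly
```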
## Finding the Critical Value (continued) * $\widehat{p} \sim \text{Normal}(p, \text{SD}(\widehat{p}))$ ```{r, fig.height = 2} set.seed(5) babies_sample <- sample_n(babies, size = 300) sample_prop <- data.frame( x = mean(babies_sample$apgar_eq_10), y = 0 ) p <- .084 n <- 300 sdp <- sqrt(p*(1-p)/n) z_star <- qnorm(1 - (.1 / 2), mean = 0, sd = 1) ci_90_df <- data.frame( y = 0, xmin = sample_prop$x - z_star * sdp, xmax = sample_prop$x + z_star * sdp, x = sample_prop$x ) normal_mean <- p normal_sd <- sdp plot_df <- data.frame( x = seq(from = 0, to = 0.17, length = 101) ) plot_df_poly <- data.frame( x = c(normal_mean - z_star * normal_sd, seq(from = normal_mean - z_star * normal_sd, to = normal_mean + z_star * normal_sd, length = 101), normal_mean + z_star * normal_sd ), density = c(0, dnorm(seq(from = normal_mean - z_star * normal_sd, to = normal_mean + z_star * normal_sd, length = 101), mean = normal_mean, sd = normal_sd), 0) ) ggplot() + geom_polygon(aes(x = x, y = density), fill = "blue", alpha = 0.4, data = plot_df_poly) + stat_function(mapping = aes(x = x), fun = dnorm, args = list(mean = normal_mean, sd = normal_sd), data = plot_df) + geom_vline(xintercept = normal_mean - z_star*normal_sd) + geom_vline(xintercept = normal_mean) + geom_vline(xintercept = normal_mean + z_star*normal_sd) + geom_errorbarh(mapping = aes(y = y, x = x, xmin = xmin, xmax = xmax), size = 1, height = 6, color = "orange", data = ci_90_df) + geom_point(mapping = aes(x = x, y = y), size = 3, data = pop_prop) + geom_point(mapping = aes(x = x, y = y), color = "orange", size = 3, data = sample_prop) + geom_segment(mapping = aes(x = x, y = y, xend = xend, yend = yend), size = 0.8, arrow = arrow(length = unit(0.08, "npc"), type = "closed"), data = data.frame(x = 0.11, y = 20, xend = 0.09, yend = 12)) + geom_label(mapping = aes(x = x, y = y, label = label), data = data.frame(x = 0.12, y = 20, label = "Blue Area = 0.9")) + geom_segment(mapping = aes(x = x, y = y, xend = xend, yend = yend), size = 
0.8, arrow = arrow(length = unit(0.08, "npc"), type = "closed"), data = data.frame(x = 0.14, y = 8, xend = 0.112, yend = 2.5)) + geom_label(mapping = aes(x = x, y = y, label = label), data = data.frame(x = 0.14, y = 8, label = "Area = 0.05")) + geom_segment(mapping = aes(x = x, y = y, xend = xend, yend = yend), size = 0.8, arrow = arrow(length = unit(0.08, "npc"), type = "closed"), data = data.frame(x = 0.03, y = 8, xend = 0.055, yend = 2.5)) + geom_label(mapping = aes(x = x, y = y, label = label), data = data.frame(x = 0.03, y = 8, label = "Area = 0.05")) + xlab("Proportion with Apgar = 10") + scale_x_continuous( breaks = c(0, normal_mean + c(-z_star, 0, z_star)*normal_sd, 0.170), labels = c("0.000", expression(paste("p - ", z, "* ", "SD(", hat(p), ")")), "p", expression(paste("p + ", z, "* ", "SD(", hat(p), ")")), "0.170")) + theme_gray(base_size = 14) + theme( panel.grid.minor.x = element_blank() ) ``` * For a 90% CI, we need the total area to the left of $p + z^* \text{SD}(\widehat{p})$ to be 0.95, in a Normal($p$, SD($\widehat{p}$)) distribution. * Let's define $Z = \frac{\widehat{p} - p}{\text{SD}(\widehat{p})}$. 
Then $Z \sim \text{Normal}(0, 1)$ (see warmup) ## Finding the Critical Value (continued) ```{r, fig.height = 1.3} set.seed(5) babies_sample <- sample_n(babies, size = 300) sample_prop <- data.frame( x = mean(babies_sample$apgar_eq_10), y = 0 ) p <- .084 n <- 300 sdp <- sqrt(p*(1-p)/n) z_star <- qnorm(1 - (.1 / 2), mean = 0, sd = 1) ci_90_df <- data.frame( y = 0, xmin = sample_prop$x - z_star * sdp, xmax = sample_prop$x + z_star * sdp, x = sample_prop$x ) normal_mean <- p normal_sd <- sdp plot_df <- data.frame( x = seq(from = 0, to = 0.17, length = 101) ) plot_df_poly <- data.frame( x = c(normal_mean - z_star * normal_sd, seq(from = normal_mean - z_star * normal_sd, to = normal_mean + z_star * normal_sd, length = 101), normal_mean + z_star * normal_sd ), density = c(0, dnorm(seq(from = normal_mean - z_star * normal_sd, to = normal_mean + z_star * normal_sd, length = 101), mean = normal_mean, sd = normal_sd), 0) ) ggplot() + geom_polygon(aes(x = x, y = density), fill = "blue", alpha = 0.4, data = plot_df_poly) + stat_function(mapping = aes(x = x), fun = dnorm, args = list(mean = normal_mean, sd = normal_sd), data = plot_df) + geom_vline(xintercept = normal_mean - z_star*normal_sd) + geom_vline(xintercept = normal_mean) + geom_vline(xintercept = normal_mean + z_star*normal_sd) + geom_segment(mapping = aes(x = x, y = y, xend = xend, yend = yend), size = 0.8, arrow = arrow(length = unit(0.08, "npc"), type = "closed"), data = data.frame(x = 0.11, y = 20, xend = 0.09, yend = 12)) + geom_label(mapping = aes(x = x, y = y, label = label), data = data.frame(x = 0.12, y = 20, label = "Blue Area = 0.9")) + geom_segment(mapping = aes(x = x, y = y, xend = xend, yend = yend), size = 0.8, arrow = arrow(length = unit(0.08, "npc"), type = "closed"), data = data.frame(x = 0.14, y = 8, xend = 0.112, yend = 2.5)) + geom_label(mapping = aes(x = x, y = y, label = label), data = data.frame(x = 0.14, y = 8, label = "Area = 0.05")) + geom_segment(mapping = aes(x = x, y = y, xend = xend, yend 
= yend), size = 0.8, arrow = arrow(length = unit(0.08, "npc"), type = "closed"), data = data.frame(x = 0.03, y = 8, xend = 0.055, yend = 2.5)) + geom_label(mapping = aes(x = x, y = y, label = label), data = data.frame(x = 0.03, y = 8, label = "Area = 0.05")) + xlab("Proportion with Apgar = 10") + scale_x_continuous( breaks = c(0, normal_mean + c(-z_star, 0, z_star)*normal_sd, 0.170), labels = c("0.000", expression(paste("p - ", z, "* ", "SD(", hat(p), ") ")), "p", expression(paste(" p + ", z, "* ", "SD(", hat(p), ")")), "0.170")) + theme_gray(base_size = 14) + theme( panel.grid.minor.x = element_blank() ) ``` * For a 90% CI, area to the left of $p + z^* \text{SD}(\widehat{p})$ is 0.95. * Define $Z = \frac{\widehat{p} - p}{\text{SD}(\widehat{p})}$. Then $Z \sim \text{Normal}(0, 1)$ ```{r, fig.height = 1.3} set.seed(5) babies_sample <- sample_n(babies, size = 300) sample_prop <- data.frame( x = mean(babies_sample$apgar_eq_10), y = 0 ) p <- .084 n <- 300 sdp <- sqrt(p*(1-p)/n) z_star <- qnorm(1 - (.1 / 2), mean = 0, sd = 1) ci_90_df <- data.frame( y = 0, xmin = sample_prop$x - z_star * sdp, xmax = sample_prop$x + z_star * sdp, x = sample_prop$x ) normal_mean <- p normal_sd <- sdp plot_df <- data.frame( x = seq(from = 0, to = 0.17, length = 101) ) plot_df_poly <- data.frame( x = c(normal_mean - z_star * normal_sd, seq(from = normal_mean - z_star * normal_sd, to = normal_mean + z_star * normal_sd, length = 101), normal_mean + z_star * normal_sd ), density = c(0, dnorm(seq(from = normal_mean - z_star * normal_sd, to = normal_mean + z_star * normal_sd, length = 101), mean = normal_mean, sd = normal_sd), 0) ) ggplot() + geom_polygon(aes(x = x, y = density), fill = "blue", alpha = 0.4, data = plot_df_poly) + stat_function(mapping = aes(x = x), fun = dnorm, args = list(mean = normal_mean, sd = normal_sd), data = plot_df) + geom_vline(xintercept = normal_mean - z_star*normal_sd) + geom_vline(xintercept = normal_mean) + geom_vline(xintercept = normal_mean + z_star*normal_sd) + 
geom_segment(mapping = aes(x = x, y = y, xend = xend, yend = yend), size = 0.8, arrow = arrow(length = unit(0.08, "npc"), type = "closed"), data = data.frame(x = 0.11, y = 20, xend = 0.09, yend = 12)) + geom_label(mapping = aes(x = x, y = y, label = label), data = data.frame(x = 0.12, y = 20, label = "Blue Area = 0.9")) + geom_segment(mapping = aes(x = x, y = y, xend = xend, yend = yend), size = 0.8, arrow = arrow(length = unit(0.08, "npc"), type = "closed"), data = data.frame(x = 0.14, y = 8, xend = 0.112, yend = 2.5)) + geom_label(mapping = aes(x = x, y = y, label = label), data = data.frame(x = 0.14, y = 8, label = "Area = 0.05")) + geom_segment(mapping = aes(x = x, y = y, xend = xend, yend = yend), size = 0.8, arrow = arrow(length = unit(0.08, "npc"), type = "closed"), data = data.frame(x = 0.03, y = 8, xend = 0.055, yend = 2.5)) + geom_label(mapping = aes(x = x, y = y, label = label), data = data.frame(x = 0.03, y = 8, label = "Area = 0.05")) + xlab("") + scale_x_continuous( breaks = c(normal_mean + c(-z_star, 0, z_star)*normal_sd), labels = c( "-z*", "0", "z*" )) + theme_gray(base_size = 14) + theme( panel.grid.minor.x = element_blank() ) ``` * Area to the left of $\frac{[p + z^* \text{SD}(\widehat{p})] - p}{\text{SD}(\widehat{p})} = z^*$ is 0.95. ## Putting it All Together * **CI formula**: $[\widehat{p} - z^* \text{SE}(\widehat{p}), \widehat{p} + z^* \text{SE}(\widehat{p})]$ * **Standard Error** of $\widehat{p}$: $\text{SE}(\widehat{p}) = \sqrt{\frac{\widehat{p}(1-\widehat{p})}{n}}$ * **Critical Value**: $z^*$ is the 97.5th percentile of a standard normal distribution if we want a 95% CI * Use `qnorm` function in R * **Margin of Error**: $z^* \text{SE}(\widehat{p})$ (how much we add and subtract from the point estimate $\widehat{p}$) * **Interpretation**: In repeated sampling, a confidence interval constructed using this procedure contains the population parameter for 95% of samples (or whatever your confidence level is). 
## Assumptions to Check

* Each observation has one of two outcomes (that are relevant to this analysis)
* Each observation has the same probability of "success"
* People/items in our sample are **independent**
    * Think about how the data were collected, and whether there is a connection between units
    * 10% Condition: is the sample size less than 10% of the population size?
* **Sample size** large enough to use the normal approximation to the sampling distribution:
    * $np \geq 10$ and $n(1 - p) \geq 10$
    * ... but we don't actually know $p$!
    * Check that there are at least 10 "successes" and 10 "failures" in the data set.

## Manual Calculations in R

```{r, echo = TRUE}
table(babies_sample$apgar_eq_10) / nrow(babies_sample)
p_hat <- mean(babies_sample$apgar_eq_10)  # 0.1133333, matching the table above
se_p_hat <- sqrt(p_hat * (1 - p_hat) / 300)
z_star <- qnorm(0.975, mean = 0, sd = 1)
p_hat - z_star * se_p_hat
p_hat + z_star * se_p_hat
```

## Automagic Calculations in R

```{r, echo = TRUE, message = FALSE}
library(mosaic)
confint(binom.test(
  babies_sample$apgar_eq_10,
  conf.level = 0.95,
  ci.method = "wald",
  success = TRUE))
```
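For comparison, base R's `stats::binom.test` reports an exact (Clopper-Pearson) interval. It should be close to, but not identical to, the Wald interval above:

```{r, echo = TRUE}
# exact (Clopper-Pearson) interval, based on the number of successes
stats::binom.test(sum(babies_sample$apgar_eq_10), n = 300)$conf.int
```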