---
title: "Population Distribution, Sample Distribution, Sampling Distribution, and Confidence Intervals"
author: "Evan L. Ray"
date: "November 17, 2017"
output: ioslides_presentation
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, cache = TRUE)
require(ggplot2)
require(scales)
require(dplyr)
require(tidyr)
require(readr)
require(mosaic)
```

```{r, echo = FALSE, message = FALSE, warning = FALSE}
babies <- read_csv("https://mhc-stat140-2017.github.io/data/misc/babies1998/babies_dec_1998.csv")
babies <- filter(babies, !is.na(gestation))

set.seed(1)
```

## Distribution of the Population

* For each possible gestation time, what proportion of babies in the population had that gestation time?

```{r, echo = FALSE, fig.height=1.5}
ggplot() +
  geom_histogram(mapping = aes(x = gestation, y = ..density..), binwidth = 1, data = babies) +
#  geom_density(mapping = aes(x = gestation, y = ..density..), bw = 1, data = babies) +
  xlim(range(babies$gestation)) +
  xlab("Gestation Time (weeks) -- Population") +
  theme(plot.margin = unit(x = c(0, 0, 0, 0), units = "cm"))
```

* Population mean: 38.8 weeks
* Population standard deviation: 2.6 weeks
* About 95% of babies in the population had gestation times between $(38.8 - 2 * 2.6)$ weeks and $(38.8 + 2 * 2.6)$ weeks

## Distribution of a Sample

* For each possible length of gestation time, what proportion of babies in the **sample** had that gestation time?

```{r, echo = TRUE}
babies_sample <- sample_n(babies, size = 30)
```

```{r, echo = FALSE, fig.height=1.25}
# Keep a copy of this sample; babies_sample is overwritten by the simulation below
orig_babies_sample <- babies_sample

ggplot() +
  geom_histogram(mapping = aes(x = gestation, y = ..density..), binwidth = 1, data = babies_sample) +
  xlim(range(babies$gestation)) +
  xlab("Gestation Time (weeks) -- Sample") +
  theme(plot.margin = unit(x = c(0, 0, 0, 0), units = "cm"))
```

* Sample mean: 38.7 weeks
* Sample standard deviation: 2.2 weeks
* About 95% of babies in the sample had gestation times between $(38.7 - 2 * 2.2)$ weeks and $(38.7 + 2 * 2.2)$ weeks

## Sampling Distribution of Sample Mean

* The **sampling distribution** is the distribution of values of the sample mean, across all different samples of a certain size $n$.
* If $n$ is large enough, then approximately $\bar{X} \sim \text{Normal}(\mu, \sigma/\sqrt{n})$, where $\mu$ is the population mean and $\sigma$ is the population standard deviation

```{r, echo = FALSE, fig.height=1.25, cache = TRUE}
# Draw 10,000 samples of size 30 and record the mean gestation time of each
sample_means <- bind_rows(
  {do(10000) * {
    babies_sample <- babies %>% sample_n(size = 30)
    data.frame(
      sample_mean = mean(babies_sample$gestation)
    )
  }} %>% select(sample_mean)
)
```

```{r, echo = FALSE, warning = FALSE, fig.height = 1.25, cache = FALSE}
ggplot() +
  geom_density(mapping = aes(x = sample_mean, y = ..density..), bw = 1, data = sample_means) +
#  geom_histogram(mapping = aes(x = sample_mean, y = ..density..), binwidth = .5, data = sample_means) +
  xlim(range(babies$gestation)) +
  xlab("Sample Means, n = 30") +
  theme(plot.margin = unit(x = c(0, 0, 0, 0), units = "cm"))
```

* Population mean: 38.8 weeks
* Population standard deviation: 2.6 weeks
* About 95% of samples of size 30 have sample mean gestation times between $(38.8 - 2 * \frac{2.6}{\sqrt{30}})$ weeks and $(38.8 + 2 * \frac{2.6}{\sqrt{30}})$ weeks

## 95% Conf. Interval for Population Mean

* (best guess of population mean) $\pm$ (margin of error)
* $\bar{x} \pm 2 s / \sqrt{n}$

```{r, echo = FALSE, fig.height=1.25, cache = TRUE}
sample_means <- bind_rows(
  {do(10000) * {
    babies_sample <- babies %>% sample_n(size = 30)
    data.frame(
      sample_mean = mean(babies_sample$gestation)
    )
  }} %>% select(sample_mean)
)
```

```{r, echo = FALSE, fig.height=1}
ggplot() +
  geom_histogram(mapping = aes(x = gestation, y = ..density..), binwidth = 1, data = orig_babies_sample) +
  xlim(range(babies$gestation)) +
  xlab("Gestation Time (weeks) -- Sample") +
  theme(plot.margin = unit(x = c(0, 0, 0, 0), units = "cm"))
```

* Sample mean: 38.7 weeks
* Sample standard deviation: 2.2 weeks
* We are "95% Confident" that the population mean gestation time is between $(38.7 - 2 * \frac{2.2}{\sqrt{30}})$ weeks and $(38.7 + 2 * \frac{2.2}{\sqrt{30}})$ weeks
* "95% Confident" means: 95% of intervals constructed this way from different samples will contain the population mean
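
As a rough check on the interval above, here is a minimal sketch in base R. It first evaluates $\bar{x} \pm 2 s / \sqrt{n}$ using the sample mean, standard deviation, and sample size reported on this slide, then estimates how often intervals built this way contain the population mean. The variable names (`x_bar`, `s`, `n`, `margin_of_error`, `ci`, `mu`, `sigma`, `covers`) are illustrative. The simulation assumes a Normal population with mean 38.8 and standard deviation 2.6 as a stand-in for the real `babies` data, so the coverage it reports is only approximate.

```{r, echo = TRUE}
# Summary statistics reported on the slide for the sample of 30 babies
x_bar <- 38.7  # sample mean (weeks)
s <- 2.2       # sample standard deviation (weeks)
n <- 30        # sample size

# Margin of error: 2 estimated standard errors of the sample mean
margin_of_error <- 2 * s / sqrt(n)
ci <- c(lower = x_bar - margin_of_error, upper = x_bar + margin_of_error)
round(ci, 1)  # about 37.9 to 39.5 weeks

# ASSUMPTION (illustration only): a Normal(38.8, 2.6) population stands in
# for the real babies data so that this chunk is self-contained.
set.seed(1)
mu <- 38.8
sigma <- 2.6
covers <- replicate(10000, {
  x <- rnorm(n, mean = mu, sd = sigma)  # one simulated sample of size n
  moe <- 2 * sd(x) / sqrt(n)            # margin of error from that sample
  (mean(x) - moe <= mu) && (mu <= mean(x) + moe)
})
mean(covers)  # proportion of simulated intervals that contain mu
```

The proportion on the last line should land close to 0.95. It will not be exact, because the multiplier 2 is a rounded value and $s$ is itself estimated from each sample.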