---
title: "Population Distribution,
Sample Distribution,
Sampling Distribution,
and Confidence Intervals"
author: "Evan L. Ray"
date: "November 17, 2017"
output: ioslides_presentation
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, cache = TRUE)
library(ggplot2)
library(scales)
library(dplyr)
library(tidyr)
library(readr)
library(mosaic)
```
```{r, echo = FALSE, message = FALSE, warning = FALSE}
babies <- read_csv("https://mhc-stat140-2017.github.io/data/misc/babies1998/babies_dec_1998.csv")
babies <- filter(babies, !is.na(gestation))
set.seed(1)
```
## Distribution of the Population
* For each possible gestation time, what proportion of babies in the population had that gestation time?
```{r, echo = FALSE, fig.height=1.5}
ggplot() +
  geom_histogram(mapping = aes(x = gestation, y = ..density..), binwidth = 1, data = babies) +
  # geom_density(mapping = aes(x = gestation, y = ..density..), bw = 1, data = babies) +
  xlim(range(babies$gestation)) +
  xlab("Gestation Time (weeks) -- Population") +
  theme(plot.margin = unit(x = c(0, 0, 0, 0), units = "cm"))
```
* Population mean: 38.8 weeks
* Population standard deviation: 2.6 weeks
* About 95% of babies in the population had gestation times between $(38.8 - 2 \times 2.6)$ weeks and $(38.8 + 2 \times 2.6)$ weeks, i.e. between 33.6 and 44.0 weeks
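
The interval in the last bullet is plain arithmetic on the two summary numbers above. A minimal sketch, with the rounded values 38.8 and 2.6 typed in by hand rather than computed from `babies`; the names `pop_mean`, `pop_sd`, and `pop_range` are illustrative and not part of the original analysis:

```{r, echo = TRUE}
# Rounded population summaries quoted on this slide (entered by hand)
pop_mean <- 38.8
pop_sd <- 2.6

# Mean plus or minus 2 standard deviations, in weeks
pop_range <- pop_mean + c(-2, 2) * pop_sd
pop_range
```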
## Distribution of a Sample
* For each possible gestation time, what proportion of babies in the **sample** had that gestation time?
```{r, echo = TRUE}
babies_sample <- sample_n(babies, size = 30)
```
```{r, echo = FALSE, fig.height=1.25}
orig_babies_sample <- babies_sample
ggplot() +
  geom_histogram(mapping = aes(x = gestation, y = ..density..), binwidth = 1, data = babies_sample) +
  xlim(range(babies$gestation)) +
  xlab("Gestation Time (weeks) -- Sample") +
  theme(plot.margin = unit(x = c(0, 0, 0, 0), units = "cm"))
```
* Sample mean: 38.7 weeks
* Sample standard deviation: 2.2 weeks
* About 95% of babies in the sample had gestation times between $(38.7 - 2 \times 2.2)$ weeks and $(38.7 + 2 \times 2.2)$ weeks, i.e. between 34.3 and 43.1 weeks
## Sampling Distribution of Sample Mean
* The **sampling distribution** is the distribution of values of the sample mean, across all different samples of a certain size $n$.
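* If $n$ is large enough, the sample mean is approximately normally distributed: $\bar{X} \sim \text{Normal}(\mu, \sigma/\sqrt{n})$, where $\mu$ and $\sigma$ are the population mean and standard deviation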
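```{r, echo = FALSE, fig.height=1.25, cache = TRUE}
# Draw 10,000 samples of size 30 and record the mean gestation time of each.
sample_means <- do(10000) * {
  babies_sample <- babies %>% sample_n(size = 30)
  data.frame(
    sample_mean = mean(babies_sample$gestation)
  )
}
# Keep only the column of sample means (do() adds bookkeeping columns).
sample_means <- sample_means %>% select(sample_mean)
```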
```{r, echo = FALSE, warning = FALSE, fig.height = 1.25, cache = FALSE}
ggplot() +
  # Use the default bandwidth: bw = 1 is wider than the spread of the sample
  # means themselves (about 0.47 weeks) and would oversmooth the curve.
  geom_density(mapping = aes(x = sample_mean, y = ..density..), data = sample_means) +
  # geom_histogram(mapping = aes(x = sample_mean, y = ..density..), binwidth = .5, data = sample_means) +
  xlim(range(babies$gestation)) +
  xlab("Sample Means, n = 30") +
  theme(plot.margin = unit(x = c(0, 0, 0, 0), units = "cm"))
```
* Population mean: 38.8 weeks
* Population standard deviation: 2.6 weeks
* About 95% of samples of size 30 have sample mean gestation times between $(38.8 - 2 \times \frac{2.6}{\sqrt{30}})$ weeks and $(38.8 + 2 \times \frac{2.6}{\sqrt{30}})$ weeks, i.e. between roughly 37.9 and 39.7 weeks
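
The standard error in the last bullet can be computed directly, and the 95% claim can be checked by simulation. The sketch below rests on a stated assumption: it draws from a synthetic Normal(38.8, 2.6) population instead of the `babies` data, so it only approximates the real gestation times. All names (`se_mean`, `mean_range`, `sim_means`, `prop_inside`) are hypothetical.

```{r, echo = TRUE}
# Rounded population summaries quoted on this slide (entered by hand)
pop_mean <- 38.8
pop_sd <- 2.6
n <- 30

# Standard deviation of the sample mean (the standard error)
se_mean <- pop_sd / sqrt(n)

# Range expected to contain about 95% of sample means
mean_range <- pop_mean + c(-2, 2) * se_mean

# Simulation check. ASSUMPTION: a synthetic normal population stands in
# for the real data.
set.seed(2)
sim_means <- replicate(10000, mean(rnorm(n, mean = pop_mean, sd = pop_sd)))

# Proportion of simulated sample means that fall inside the range
prop_inside <- mean(sim_means >= mean_range[1] & sim_means <= mean_range[2])

round(se_mean, 2)
round(mean_range, 1)
prop_inside
```

Under the normal assumption, `prop_inside` should come out close to 0.95.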
## 95% Conf. Interval for Population Mean
* (best guess of population mean) $\pm$ (margin of error)
* $\bar{x} \pm 2 s / \sqrt{n}$
```{r, echo = FALSE, fig.height=1}
ggplot() +
  geom_histogram(mapping = aes(x = gestation, y = ..density..), binwidth = 1, data = orig_babies_sample) +
  xlim(range(babies$gestation)) +
  xlab("Gestation Time (weeks) -- Sample") +
  theme(plot.margin = unit(x = c(0, 0, 0, 0), units = "cm"))
```
* Sample mean: 38.7 weeks
* Sample standard deviation: 2.2 weeks
* We are "95% Confident" that the population mean gestation time is between $(38.7 - 2 \times \frac{2.2}{\sqrt{30}})$ weeks and $(38.7 + 2 \times \frac{2.2}{\sqrt{30}})$ weeks, i.e. between roughly 37.9 and 39.5 weeks
* "95% Confident" means: about 95% of intervals constructed this way from different samples of the same size will contain the population mean
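
The interval and the meaning of "95% Confident" can both be sketched in a few lines. The first part uses the rounded sample summaries from this slide, typed in by hand. The second part is a coverage check under an assumption: samples are drawn from a synthetic Normal(38.8, 2.6) population, not from the `babies` data. The helper `covers_mean()` and the other names are hypothetical.

```{r, echo = TRUE}
# Rounded sample summaries quoted on this slide (entered by hand)
x_bar <- 38.7
s <- 2.2
n <- 30

# Interval: sample mean plus or minus 2 estimated standard errors
margin_of_error <- 2 * s / sqrt(n)
ci <- x_bar + c(-1, 1) * margin_of_error
round(ci, 1)

# Hypothetical helper: draw one sample, build the interval from that
# sample's own mean and standard deviation, and report whether it
# contains the true mean mu.
covers_mean <- function(n, mu, sigma) {
  x <- rnorm(n, mean = mu, sd = sigma)
  half_width <- 2 * sd(x) / sqrt(n)
  abs(mean(x) - mu) <= half_width
}

# ASSUMPTION: a synthetic normal population stands in for the real data.
set.seed(3)
coverage <- mean(replicate(10000, covers_mean(n = 30, mu = 38.8, sigma = 2.6)))
coverage
```

Because the multiplier 2 is a rounded value and $s$ is itself estimated from only 30 observations, `coverage` should land near, but not exactly at, 0.95.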