---
title: "Sampling Distributions"
author: "Evan L. Ray"
date: "October 27, 2017"
output: ioslides_presentation
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
require(ggplot2)
require(scales)
require(dplyr)
require(tidyr)
require(readr)
```
## Is Paul the Octopus Psychic?
Recall our procedure for hypothesis testing:
1. Collect **data**: for each of 8 trials, was the prediction correct?
2. Calculate a **sample statistic** (called the test statistic):
* $x =$ total number correct (8 in our case)
3. Obtain the **sampling distribution** of the test statistic, assuming a **null hypothesis** of no effect (in this case, assuming Paul is just guessing)
4. Calculate the **p-value**: probability of getting a test statistic "at least as extreme" as what we observed in step 2
5. If the p-value is low, reject the null hypothesis and conclude that Paul is psychic!
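As a quick check of steps 3–5, the exact p-value for Paul can be computed with base R's binomial functions (a sketch, using the $\text{Binomial}(8, 0.5)$ null model from step 3):

```{r, echo = TRUE}
# Under the null hypothesis (guessing), X ~ Binomial(8, 0.5).
# p-value: P(X >= 8), the probability of a result at least as
# extreme as the 8 correct predictions we observed.
pbinom(7, size = 8, prob = 0.5, lower.tail = FALSE)  # 0.5^8 = 0.00390625
```

A p-value of about 0.004 is well below conventional thresholds, which is why step 5 rejects the null.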
## 2 Strategies for the Sampling Dist'n
1. **Simulation**:
* Repeatedly simulate 8 trials with probability of success = 0.5. In each simulation, count the number of successes.
* As the number of simulations increases, we get a more accurate **approximation** to the sampling distribution.
2. **Probability**:
* Calculate probabilities from the sampling distribution **exactly** using a $\text{Binomial}(8, 0.5)$ model
```{r, echo = FALSE, fig.height = 1.8, fig.width = 4}
# simulate 100,000 repetitions of 8 trials with success probability 0.5
set.seed(123)
sim_results <- data.frame(x = rbinom(10^5, size = 8, prob = 0.5))
ggplot() +
geom_bar(mapping = aes(x = x, y = (..count..)/sum(..count..)), data = sim_results) +
ylab("probability") +
ggtitle("100,000 Simulations")
```
```{r, echo = FALSE, fig.height = 1.8, fig.width = 4}
exact_results <- data.frame(
x = seq(from = 0, to = 8),
probability = dbinom(x = seq(from = 0, to = 8), size = 8, prob = 0.5))
ggplot() +
geom_col(mapping = aes(x = x, y = probability),
data = exact_results) +
ggtitle("Exact Probabilities")
```
## Other Common Sample Statistics
* So, we now have 2 ways to get the sampling distribution for the **total number of successes** in $n$ trials!
* Let's discuss sampling distributions for two other common sample statistics:
* The **proportion of successes** in $n$ trials
* The **sample mean** of a quantitative variable
* For today (and possibly the rest of this class), we'll just focus on the approach using **probability**
## Sample Mean: Central Limit Theorem
* If $Y_1, Y_2, \ldots, Y_n$ are independent observations from a population having mean $\mu$ and finite standard deviation $\sigma$, then the sampling distribution of $\bar{Y} = \frac{1}{n}\sum_{i=1}^n Y_i$ is approximately Normal($\mu$, $\sigma/\sqrt{n}$) for large enough $n$.
(On the board: derive the mean and standard deviation of $\bar{Y}$)
* You will explore the part about an approximately normal distribution in a lab.
* For quantitative variables, there's no universal rule for how big $n$ must be -- it depends on how skewed the population distribution is (see lab)
* But remember that we don't want to calculate means anyway if the distribution is very skewed!
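The CLT claim above can be sketched in a few lines of simulation. (The $\text{Exponential}(1)$ population here is an assumption chosen for illustration; it has $\mu = 1$ and $\sigma = 1$, and is right-skewed.)

```{r, echo = TRUE}
# Draw 10,000 sample means, each from n = 40 Exponential(rate = 1)
# observations.  The CLT predicts these sample means are approximately
# Normal(mu = 1, sigma / sqrt(n) = 1 / sqrt(40)).
set.seed(42)
sample_means <- replicate(10000, mean(rexp(40, rate = 1)))
mean(sample_means)  # should be close to mu = 1
sd(sample_means)    # should be close to 1 / sqrt(40), about 0.158
```

Because the exponential distribution is skewed, this also illustrates that a moderate $n$ can already give a roughly normal sampling distribution.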
## Estimating the Success Probability
* Suppose we want to estimate the proportion $p$ of US households who own the home they live in.
* We take a sample of size $n$ and count the number of households in our sample who own their home:
$$X \sim \text{Binomial}(n, p)$$
* How can we estimate $p$ using $X$?
## Sampling distribution of $\hat{p}$
* We will estimate the probability of success using
$$\hat{p} = \frac{X}{n}$$
* Remember that we can write $X$ as a sum of independent Bernoulli Random Variables:
$X = X_1 + X_2 + \cdots + X_n$
* So $\hat{p} = \frac{X}{n} = \frac{1}{n} \sum_i X_i$ is a sample mean of independent Bernoulli random variables
* Since $\hat{p} = \frac{1}{n} \sum_i X_i$, the Central Limit Theorem tells us the approximate sampling distribution of $\hat{p}$, for large enough $n$.
## Sampling distribution of $\hat{p}$
* For a single Bernoulli random variable,
* $E(X_i) = p$
* $SD(X_i) = \sqrt{p(1 - p)}$
* The CLT says that for large enough $n$, the sampling distribution of $\hat{p}$ is approximately $$\hat{p} \sim \text{Normal}(p, \sqrt{p(1 - p)/n})$$
* For estimating a proportion/probability $p$, we say $n$ is large enough if the **success/failures** condition is satisfied:
* $np \geq 10$ and $n(1 - p) \geq 10$
* See lab for more
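Putting the pieces together with hypothetical numbers ($n = 100$, $p = 0.65$, chosen for illustration only):

```{r, echo = TRUE}
# Hypothetical example: n = 100 households, true ownership rate p = 0.65
n <- 100
p <- 0.65
# success/failure condition: both counts should be at least 10
c(n * p, n * (1 - p))  # 65 and 35 -- condition satisfied
# approximate sampling distribution of p-hat: Normal(p, sqrt(p(1 - p)/n))
se <- sqrt(p * (1 - p) / n)
se  # about 0.0477
# e.g., the approximate probability that p-hat is at most 0.6:
pnorm(0.6, mean = p, sd = se)
```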