---
title: "Sampling Distributions"
author: "Evan L. Ray"
date: "October 27, 2017"
output: ioslides_presentation
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
require(ggplot2)
require(scales)
require(dplyr)
require(tidyr)
require(readr)
```

## Is Paul the Octopus Psychic?

Recall our procedure for hypothesis testing:

1. Collect **data**: for each of 8 trials, was the prediction correct?
2. Calculate a **sample statistic** (called the test statistic):
    * $x =$ total number correct (8 in our case)
3. Obtain the **sampling distribution** of the test statistic, assuming a **null hypothesis** of no effect (in this case, assuming Paul is just guessing)
4. Calculate the **p-value**: the probability of getting a test statistic "at least as extreme" as what we observed in step 2
5. If the p-value is low, reject the null hypothesis and conclude that Paul is psychic!

## 2 Strategies for the Sampling Dist'n

1. **Simulation**:
    * Repeatedly simulate 8 trials with probability of success = 0.5. In each simulation, count the number of successes.
    * As the number of simulations increases, we get a more accurate **approximation** to the sampling distribution.
2. **Probability**:
    * Calculate probabilities from the sampling distribution **exactly** using a $\text{Binomial}(8, 0.5)$ model
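As a sketch of step 4 under the probability approach (this chunk is not from the original slides): with 8 correct predictions out of 8, the only outcome "at least as extreme" as what we observed is 8 itself, so the exact p-value comes straight from the $\text{Binomial}(8, 0.5)$ model.

```{r}
# P(X >= 8) when X ~ Binomial(8, 0.5):
# pbinom(7, ...) gives P(X <= 7), so the complement is P(X = 8)
p_value <- 1 - pbinom(7, size = 8, prob = 0.5)
p_value  # 0.5^8 = 0.00390625
```

This is well below conventional cutoffs such as 0.05, which is why the procedure rejects the null hypothesis that Paul is just guessing.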
```{r, echo = FALSE, fig.height = 1.8, fig.width = 4}
sim_n <- 10^5
set.seed(123)
sim_results <- data.frame(
  x = rbinom(sim_n, size = 8, prob = 0.5)
)

ggplot() +
  geom_bar(mapping = aes(x = x, y = (..count..)/sum(..count..)),
           data = sim_results) +
  ylab("probability") +
  ggtitle("100,000 Simulations")
```

```{r, echo = FALSE, fig.height = 1.8, fig.width = 4}
exact_results <- data.frame(
  x = seq(from = 0, to = 8),
  probability = dbinom(x = seq(from = 0, to = 8), size = 8, prob = 0.5))

ggplot() +
  geom_col(mapping = aes(x = x, y = probability), data = exact_results) +
  ggtitle("Exact Probabilities")
```
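The agreement between the two plots can also be checked numerically (a sketch, not a chunk from the original slides): compare the relative frequency of each outcome across the simulations to the exact probabilities from `dbinom`.

```{r}
set.seed(123)
sims <- rbinom(10^5, size = 8, prob = 0.5)

# relative frequency of each outcome 0..8 across the simulations;
# factor() guarantees all 9 outcomes appear in the table
sim_probs <- table(factor(sims, levels = 0:8)) / length(sims)
exact_probs <- dbinom(0:8, size = 8, prob = 0.5)

# largest discrepancy between simulated and exact probabilities
max(abs(as.numeric(sim_probs) - exact_probs))
```

With 100,000 simulations the largest discrepancy is small; it shrinks further as the number of simulations grows.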
## Other Common Sample Statistics

* So, we now have 2 ways to get the sampling distribution for the **total number of successes** in $n$ trials!
* Let's discuss sampling distributions for two other common sample statistics:
    * The **proportion of successes** in $n$ trials
    * The **sample mean** of a quantitative variable
* For today (and possibly the rest of this class), we'll focus on the approach using **probability**

## Sample Mean: Central Limit Theorem

* If $Y_1, Y_2, \ldots, Y_n$ are independent observations from a population having mean $\mu$ and finite standard deviation $\sigma$, then the sampling distribution of $\bar{Y} = \frac{1}{n}\sum_{i=1}^n Y_i$ is approximately $\text{Normal}(\mu, \sigma/\sqrt{n})$ for large enough $n$.

(On the board: derive the mean and standard deviation of $\bar{Y}$)

* You will explore the approximately normal distribution part in a lab.
* For quantitative variables, there's no fixed rule for how big $n$ must be -- it depends on how skewed the distribution is (see lab)
* But remember that we don't want to calculate means anyway if the distribution is very skewed!

## Estimating the Success Probability

* Suppose we want to estimate the proportion $p$ of US households who own the home they live in.
* We take a sample of size $n$ and count the number of households in our sample who own their home:
$$X \sim \text{Binomial}(n, p)$$
* How can we estimate $p$ using $X$?

## Sampling distribution of $\hat{p}$

* We will estimate the probability of success using
$$\hat{p} = \frac{X}{n}$$
* Remember that we can write $X$ as a sum of independent Bernoulli random variables: $X = X_1 + X_2 + \cdots + X_n$
* So $\hat{p} = \frac{X}{n} = \frac{1}{n} \sum_i X_i$ is a sample mean of independent Bernoulli random variables
* Since $\hat{p} = \frac{1}{n} \sum_i X_i$, the Central Limit Theorem tells us the approximate sampling distribution of $\hat{p}$ for large enough $n$.
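The CLT's claims about the mean and standard deviation of $\bar{Y}$ can be checked by simulation (a sketch, not from the original slides; the Exponential(1) population is an assumed example, chosen so that $\mu = 1$ and $\sigma = 1$).

```{r}
set.seed(42)
n <- 40

# 10,000 sample means, each computed from a sample of size n
# drawn from an Exponential(1) population (mu = 1, sigma = 1)
sample_means <- replicate(10^4, mean(rexp(n, rate = 1)))

mean(sample_means)  # close to mu = 1
sd(sample_means)    # close to sigma / sqrt(n) = 1 / sqrt(40), about 0.158
```

A histogram of `sample_means` would also look roughly normal even though the Exponential population is strongly right-skewed -- the behavior you will explore in the lab.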
## Sampling distribution of $\hat{p}$

* For a single Bernoulli random variable,
    * $E(X_i) = p$
    * $SD(X_i) = \sqrt{p(1 - p)}$
* The CLT says that for large enough $n$, the sampling distribution of $\hat{p}$ is approximately
$$\hat{p} \sim \text{Normal}\left(p, \sqrt{p(1 - p)/n}\right)$$
* For estimating a proportion/probability $p$, we say $n$ is large enough if the **success/failure** condition is satisfied:
    * $np \geq 10$ and $n(1 - p) \geq 10$
* See lab for more
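As a sketch of checking the success/failure condition and computing the approximate sampling distribution (the sample size and homeownership rate below are made-up illustrative values, not from the slides):

```{r}
n <- 200
p <- 0.65  # hypothetical true proportion of homeowners

# success/failure condition: both expected counts should be at least 10
c(successes = n * p, failures = n * (1 - p))

# CLT approximation: p-hat ~ Normal(p, sqrt(p * (1 - p) / n))
sd_phat <- sqrt(p * (1 - p) / n)
sd_phat
```

Since $np = 130$ and $n(1 - p) = 70$ both exceed 10, the normal approximation is considered reasonable here.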