---
title: "Hypothesis Tests for Population Means"
author: "Evan L. Ray"
date: "November 13, 2017"
output: ioslides_presentation
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, cache = FALSE)
require(ggplot2)
require(scales)
require(dplyr)
require(tidyr)
require(readr)
require(mosaic)
```

## Outline of Hypothesis Tests (Again)

1) **Collect Data**: (For each of 8 attempts, was Paul's prediction right?)
2) Calculate a **test statistic**: $x = 8$ (observed number correct)
3) Write down **hypotheses**:
    * **Null Hypothesis**: Paul was just guessing: $p = 0.5$
    * **Alternative Hypothesis**: Paul is psychic: $p > 0.5$
4) **Sampling Distribution** of the test statistic, assuming the null hypothesis is true.
5) **p-value**: probability of getting a test statistic at least as extreme as what we observed, assuming the null hypothesis is true.
6) **Conclusion**: Compare the p-value to the significance level $\alpha$. If the p-value is small, it's unlikely that Paul would get 8/8 right if he were just guessing, so we reject the null hypothesis.

## Example: Body Temperatures

```{r, echo = FALSE}
# Read the body temperature data: temperature, sex (coded 1/2), heart rate
bodytemp = read.table('http://www.amstat.org/publications/jse/datasets/normtemp.dat.txt')
names(bodytemp) = c('temp','sex','hr')
bodytemp$sex = factor(bodytemp$sex)
levels(bodytemp$sex) = c("Males","Females")
```

* It's generally believed that the average body temperature is 98.6 degrees Fahrenheit (37 degrees Celsius).
* Let's investigate with measurements of the temperatures of 130 adults.

```{r, fig.height=2}
ggplot() +
  geom_density(mapping = aes(x = temp), data = bodytemp)
```

* Hypotheses:
    * $H_0$: $\mu = 98.6$
    * $H_A$: $\mu \neq 98.6$
* What should our test statistic be?

## A Key Result from Last Class

* $\bar{X} \sim \text{Normal}(\mu, \sigma / \sqrt{n})$
    * Across all samples, on average the sample mean is equal to the population mean $\mu$.
    * The standard deviation of $\bar{X}$ is $\frac{1}{\sqrt{n}}$ times the standard deviation $\sigma$ of values in the population.
* $$\frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \sim \text{Normal}(0, 1)$$
    * $\frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$ is the distance of $\bar{X}$ from $\mu$, in units of $SD(\bar{X})$.
* $$\frac{\bar{X} - \mu}{s / \sqrt{n}} \sim t_{n-1} \text{ (replace $\sigma$ with its estimate, $s$).}$$
    * $\frac{\bar{X} - \mu}{s / \sqrt{n}}$ is the distance of $\bar{X}$ from $\mu$, in units of $SE(\bar{X})$.

## Test Statistic for a Mean

* Let's define our test statistic to be
$$t = \frac{\bar{X} - \mu_0}{s / \sqrt{n}} \text{, where}$$
$\mu_0$ is the value of $\mu$ specified in $H_0$ (98.6 in this case).
* How far was the sample mean from the hypothesized population mean, in units of our best guess at the standard deviation of $\bar{X}$?
* If the null hypothesis is true, then
$$t = \frac{\bar{X} - \mu_0}{s / \sqrt{n}} \sim t_{n - 1}$$

## Conditions to Check

* Observations are **independent**
* Population is **nearly normal** (unimodal, approximately symmetric)...
* ...and **sample size** $n$ is large enough (how big depends on how asymmetric the distribution is)

## Back to Body Temperatures

```{r, fig.height=2}
ggplot() +
  geom_density(mapping = aes(x = temp), data = bodytemp)
```

Assumptions for hypothesis tests about means:

* Independence
* Data distribution is nearly normal (unimodal and symmetric)
* Sufficient sample size

## Hypotheses

* Null Hypothesis ($H_0$): $\mu = 98.6$ (where $\mu$ is the population mean temperature)
* Alternative Hypothesis ($H_A$): $\mu \neq 98.6$

## Test Statistic

```{r, echo = TRUE}
nrow(bodytemp)
mean(bodytemp$temp)
sd(bodytemp$temp)
```

$$
t = \frac{\bar{X} - \mu_0}{s / \sqrt{n}} = \frac{98.249 - 98.6}{0.733 / \sqrt{130}} \approx -5.460
$$

(This uses the rounded values of $\bar{X}$ and $s$; the unrounded values give $t \approx -5.455$, which is the value used in the p-value calculation.)

## Test Statistic in R

$$
t = \frac{\bar{X} - \mu_0}{s / \sqrt{n}}
$$

```{r, echo = TRUE}
n <- nrow(bodytemp)          # sample size
x_bar <- mean(bodytemp$temp) # sample mean
s <- sd(bodytemp$temp)       # sample standard deviation
mu_0 <- 98.6                 # value of mu under the null hypothesis
t <- (x_bar - mu_0) / (s / sqrt(n))
t
```

## P-value

* Probability of getting a test statistic at least as extreme as what we observed, assuming the null
  hypothesis is true.
* "At least as extreme" in either direction, since $H_A: \mu \neq 98.6$
* $t \sim t_{129}$ (since $n = 130$ and the degrees of freedom are $n - 1$)

```{r, echo = FALSE, fig.height=4, fig.width=7}
plot_df <- data.frame(
  x = seq(from = -6, to = 6, length.out = 101)
)

# Density of the t distribution with 129 degrees of freedom, with vertical
# lines at the observed test statistic and its mirror image
ggplot() +
#  geom_polygon(aes(x = x, y = density), fill = "blue", alpha = 0.4, data = plot_df2) +
  stat_function(mapping = aes(x = x), fun = dt, args = list(df = 129), data = plot_df) +
  geom_vline(xintercept = t) +
  geom_vline(xintercept = -t)
```

## Calculation of p-value

```{r, echo = TRUE}
pt(-5.455, df = 129)     # probability to the left of -5.455
1 - pt(5.455, df = 129)  # probability to the right of 5.455
```

* The combined p-value is the sum of these two tail probabilities: 0.000000241

## Alternative Calculation in R

```{r, echo = TRUE}
t.test(bodytemp$temp, mu = 98.6, alternative = "two.sided")
```

## Conclusion

* Compare the p-value to the significance level $\alpha$. For example, if $\alpha = 0.001$ then
$$0.000000241 < 0.001 \text{, so we reject the null hypothesis.}$$
* The data provide enough evidence to conclude that the mean temperature is not 98.6 degrees F, at the $\alpha = 0.001$ significance level.

## From Wikipedia

"The range for normal human body temperatures, taken orally, is 36.8 $\pm$ 0.5 °C (98.2 $\pm$ 0.9 °F). This means that any oral temperature between 36.3 and 37.3 °C (97.3 and 99.1 °F) is likely to be normal. The normal human body temperature is often stated as 36.5-37.5 °C (97.7-99.5 °F). In adults a review of the literature has found a wider range of 33.2-38.2 °C (91.8-100.8 °F) for normal temperatures, depending on the gender and location measured."

* https://en.wikipedia.org/wiki/Human_body_temperature
* Never cite Wikipedia
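
## Sketch: p-value from Summary Statistics

The test statistic and two-sided p-value steps from the earlier slides can be combined into one small function that needs only the summary statistics, not the raw data. This is a minimal sketch, not part of the original analysis: the function name `t_test_from_summary` and its argument names are hypothetical, and the inputs are the rounded values of $\bar{X}$ and $s$ shown on the Test Statistic slide, so the result should be close to, but not exactly the same as, the `t.test()` output.

```{r, echo = TRUE}
# Hypothetical helper (not from the original slides): two-sided one-sample
# t test computed from summary statistics only.
t_test_from_summary <- function(x_bar, s, n, mu_0) {
  se <- s / sqrt(n)                   # estimated standard error of the sample mean
  t_stat <- (x_bar - mu_0) / se       # distance from mu_0, in standard errors
  df <- n - 1                         # degrees of freedom
  # Add up both tails; by symmetry this is twice the lower tail at -|t|
  p_value <- 2 * pt(-abs(t_stat), df = df)
  list(t = t_stat, df = df, p_value = p_value)
}

# Rounded summary statistics for the body temperature data (assumed inputs)
result <- t_test_from_summary(x_bar = 98.249, s = 0.733, n = 130, mu_0 = 98.6)
result$t
result$p_value
```

Using `2 * pt(-abs(t_stat), df)` handles both positive and negative test statistics, which avoids having to decide by hand which tail is "left" and which is "right".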