---
title: "Sampling"
author: "Evan L. Ray"
date: "October 4, 2017"
output: ioslides_presentation
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
require(ggplot2)
require(dplyr)
require(tidyr)
require(readr)
```

## The Course So Far

Describing a data set:

| Variable Type(s) | Plot | Description/Model |
|-------|------|-------------------|
| 1 Categorical | Bar | (Marginal) distribution |
| 2 Categorical | Bar | Joint Distribution, Conditional Distribution |
| 1 Quantitative | Histogram or Density | mean, median, quantiles, standard deviation, variance, IQR, normal model |
| 2 Quantitative | Scatter Plot | correlation, linear model |
| 1 Categorical, 1 Quantitative | Density Plot or Box Plot | summary statistics of the quantitative variable for each level of the categorical variable; model later in this course or future classes |

## Goal for the rest of the class

* **Use data from a Sample to learn about a Population**

## Goal for the rest of the class

* **Use data from a Sample to learn about a Population**
* Example:
    * **Question**: How many cats does the average household have?
    * **Population**: households in the United States
    * **Sample**: a few chosen households
    * **Population Parameter**: a number summarizing the distribution (in the population) of values for a particular variable (mean number of cats across all US households)
    * **Sample Statistic**: a number summarizing the distribution (in the sample) of values for a particular variable (mean number of cats in the households in our sample)
* Our Hope: The sample statistic will be a good guess of the population parameter.

## How Do We Get Our Sample?


* Simple Random Sample
* Stratified Sampling
* Systematic Sampling
* Cluster Sampling

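
## Sampling Methods: A Sketch in R

The four sampling methods named above can be illustrated on a small made-up population. This is only a sketch under stated assumptions: the `population` data frame, its columns (`region`, `neighborhood`, `num_cats`), the Poisson cat counts, and the sample sizes are all hypothetical and chosen for illustration, not taken from real household data.

```{r sampling-methods-sketch, echo=TRUE}
set.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical population: 1000 households in 4 regions and 50 neighborhoods
population <- data.frame(
  household_id = 1:1000,
  region       = rep(c("Northeast", "South", "Midwest", "West"), each = 250),
  neighborhood = rep(1:50, each = 20),
  num_cats     = rpois(1000, lambda = 0.7)  # made-up cat counts
)

# Simple random sample: 100 households, each equally likely to be chosen
srs <- population[sample(nrow(population), size = 100), ]

# Stratified sample: a simple random sample of 25 households within each region
strata <- split(population, population$region)
stratified <- do.call(rbind, lapply(strata, function(s) s[sample(nrow(s), 25), ]))

# Systematic sample: random starting point, then every 10th household
start <- sample(10, 1)
systematic <- population[seq(from = start, to = nrow(population), by = 10), ]

# Cluster sample: pick 5 neighborhoods at random and keep every household in them
chosen <- sample(unique(population$neighborhood), 5)
cluster <- population[population$neighborhood %in% chosen, ]

# Sample statistic (mean number of cats) under each method
sapply(
  list(srs = srs, stratified = stratified, systematic = systematic, cluster = cluster),
  function(d) mean(d$num_cats)
)
```

Each method here selects 100 households, so the resulting sample means can be compared with the population parameter `mean(population$num_cats)`.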

## Bias

* For the sample statistic to be a good guess of the population parameter, the sample needs to be representative of the population.
* Definition: Sampling methods that tend to over-emphasize or under-emphasize some characteristics of the population are **biased**.
* Common sources of bias:
    * **Sample Volunteers/Convenience Sampling**: just include people in the sample who are easy to recruit
    * **Bad Sampling Frame/Undercoverage**: only choose your sample from among a subset of the population
    * **Nonresponse**: some people selected for your sample choose not to respond
    * **Response bias**: your phrasing or survey design encourages people to answer a certain way

## Sampling Variability

* Every sample you take is different!
    * Imagine taking 10 different samples of households in the US
    * Each group of households you select will have different numbers of cats
    * So each sample will have a different mean number of cats per household.
* Definition: The **sampling distribution** is the distribution of values of the sample statistic that you would get from all possible samples of a given size. (We will explore this more in the lab today and in Chapter 17.)
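
## Sampling Variability: A Simulation Sketch

The idea of a sampling distribution can be approximated by simulation. The sketch below assumes a hypothetical population of 10,000 households whose cat counts are drawn from a Poisson distribution (an assumption made only for illustration), takes 10 samples of 50 households to show that the sample means differ, and then repeats the process 5000 times. Since it draws many samples rather than all possible ones, the histogram only approximates the sampling distribution.

```{r sampling-distribution-sketch, echo=TRUE}
set.seed(2017)  # fixed seed so the sketch is reproducible

# Hypothetical population: number of cats in each of 10,000 households
cats <- rpois(10000, lambda = 0.7)
population_mean <- mean(cats)  # the population parameter

# Ten different samples of 50 households give ten different sample means
ten_means <- replicate(10, mean(sample(cats, size = 50)))
ten_means

# Many repeated samples approximate the sampling distribution of the mean
sample_means <- replicate(5000, mean(sample(cats, size = 50)))

# Histogram of the sample means, with the population mean marked by a line
hist(sample_means, breaks = 30,
     main = "Approximate sampling distribution",
     xlab = "Sample mean number of cats (n = 50)")
abline(v = population_mean, lwd = 2)
```

The sample means scatter around the population mean; how widely they scatter depends on the sample size, which can be explored by changing `size = 50`.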