October 4, 2017

The Course So Far

Describing a data set:

Variable Type(s) | Plot | Description/Model
1 Categorical | Bar | (Marginal) distribution
2 Categorical | Bar | Joint distribution, conditional distribution
1 Quantitative | Histogram or Density | Mean, median, quantiles, standard deviation, variance, IQR, normal model
2 Quantitative | Scatter Plot | Correlation, linear model
1 Categorical, 1 Quantitative | Density Plot or Box Plot | Summary statistics of the quantitative variable for each level of the categorical variable; model later in this course or future classes

Goal for the rest of the class

  • Use data from a Sample to learn about a Population
  • Example:
    • Question: How many cats does the average household have?
    • Population: households in the United States
    • Sample: a few chosen households
  • Population Parameter: a number summarizing the distribution (in the population) of values for a particular variable (mean number of cats across all US households)
  • Sample Statistic: a number summarizing the distribution (in the sample) of values for a particular variable (mean number of cats in the households in our sample)
  • Our Hope: The sample statistic will be a good guess of the population parameter.
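The distinction between a population parameter and a sample statistic can be sketched in a few lines of Python. This is only an illustration: the population below is synthetic (the cat counts are made up, not real US data), and the names `population`, `sample`, `population_mean`, and `sample_mean` are hypothetical.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: number of cats in each of 100,000 households.
# These counts are synthetic and chosen only for illustration.
population = [rng.choice([0, 0, 0, 0, 0, 0, 1, 1, 2, 3]) for _ in range(100_000)]

# Population parameter: mean number of cats across ALL households.
population_mean = statistics.mean(population)

# Sample: 200 households chosen at random, without replacement.
sample = rng.sample(population, 200)

# Sample statistic: mean number of cats in the sampled households only.
sample_mean = statistics.mean(sample)

print(f"population parameter (mean): {population_mean:.3f}")
print(f"sample statistic (mean):     {sample_mean:.3f}")
```

In practice the population parameter is unknown, since we cannot survey every household. The sketch computes it only because the population here is simulated, which lets us check how close the guess is.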

How Do We Get Our Sample?

Simple Random Sample

Stratified Sampling

Systematic Sampling

Cluster Sampling

Bias

  • For the sample statistic to be a good guess of the population parameter, the sample needs to be representative of the population.
  • Definition: Sampling methods that tend to over-emphasize or under-emphasize some characteristics of the population are biased.
  • Common sources of bias:
    • Volunteer Samples/Convenience Sampling: the sample includes only people who volunteer or are easy to recruit
    • Bad Sampling Frame/Undercoverage: only choose your sample from among a subset of the population
    • Nonresponse: some people selected for your sample choose not to respond
    • Response bias: your phrasing or survey design encourages people to answer a certain way
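One source of bias from the list above, undercoverage, can be simulated as a rough sketch. Everything here is assumed for illustration: the population is synthetic, the split into `houses` and `apartments` is invented, and `average_estimate` is a hypothetical helper.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the sketch is reproducible

# Hypothetical synthetic population: 60% of households live in houses,
# 40% in apartments, and houses tend to have more cats.
houses = [rng.choice([0, 1, 1, 2, 3]) for _ in range(60_000)]      # expected value 1.4
apartments = [rng.choice([0, 0, 0, 1, 1]) for _ in range(40_000)]  # expected value 0.4
population = houses + apartments
true_mean = statistics.mean(population)

def average_estimate(frame, n=100, repeats=500):
    """Average the sample mean over many random samples drawn from `frame`."""
    means = [statistics.mean(rng.sample(frame, n)) for _ in range(repeats)]
    return statistics.mean(means)

# Good sampling frame: every household can be selected.
unbiased = average_estimate(population)

# Bad sampling frame (undercoverage): only houses can be selected.
biased = average_estimate(houses)

print(f"true population mean:         {true_mean:.3f}")
print(f"average estimate, full frame: {unbiased:.3f}")
print(f"average estimate, houses only: {biased:.3f}")
```

The point of the sketch is that the houses-only estimates miss in the same direction every time, so taking more or larger samples from the bad frame would not fix the problem.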

Sampling Variability

  • Every sample you take is different!
  • Imagine taking 10 different samples of households in the US
  • Each group of households you select will have different numbers of cats
  • So each sample will have a different mean number of cats per household.
  • Definition: The sampling distribution is the distribution of values of the sample statistic that you would get from all possible samples of a given size. (We will explore this more in the lab today and in Chapter 17.)
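A sampling distribution can be approximated by simulation. The sketch below draws many random samples rather than all possible samples, so it only approximates the definition above. The population is again synthetic, and `sample_means`, `means_small`, and `means_large` are hypothetical names.

```python
import random
import statistics

rng = random.Random(2)  # fixed seed so the sketch is reproducible

# Hypothetical synthetic population: cats per household.
population = [rng.choice([0, 0, 0, 0, 0, 0, 1, 1, 2, 3]) for _ in range(100_000)]

def sample_means(n, repeats=2_000):
    """Draw `repeats` random samples of size `n`; return each sample's mean."""
    return [statistics.mean(rng.sample(population, n)) for _ in range(repeats)]

# Approximate sampling distributions of the mean for two sample sizes.
means_small = sample_means(n=25)
means_large = sample_means(n=400)

# The spread of each list measures how much the statistic varies by sample.
spread_small = statistics.stdev(means_small)
spread_large = statistics.stdev(means_large)

print(f"population mean:                 {statistics.mean(population):.3f}")
print(f"center of sample means (n=25):   {statistics.mean(means_small):.3f}")
print(f"spread of sample means (n=25):   {spread_small:.3f}")
print(f"spread of sample means (n=400):  {spread_large:.3f}")
```

Each entry in `means_small` or `means_large` is one value of the sample statistic. A histogram of either list would show the shape of the approximate sampling distribution.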