---
title: "Scatter Plots and Correlation"
author: "Evan L. Ray"
date: "September 20, 2017"
output: ioslides_presentation
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
require(ggplot2)
require(dplyr)
require(tidyr)
require(readr)
```

## Warmup with a neighbor (~5 min)

* What are the observational units, variable(s), and variable type(s)?
* What did the code I used to make the plot look like?

```{r, echo=FALSE, message=FALSE, fig.height = 4, fig.width=4}
data(iris)
iris <- transmute(iris,
  sepal_length = Sepal.Length,
  sepal_width = Sepal.Width,
  petal_length = Petal.Length,
  petal_width = Petal.Width,
  species = Species)
ggplot() +
  geom_point(mapping = aes(x = petal_length, y = petal_width), data = iris) +
  ggtitle("Petal Length (cm) vs. Petal Width (cm)\nfor 150 Iris Flowers")
```

![](Iris_germanica_wikipedia.jpg)

(image source: Wikipedia)

## Summarizing Scatter Plots

* Recall: we summarize the distribution of **one continuous variable** with:
    * **center** (mean, median)
    * **spread** (standard deviation, IQR)
    * **shape** (symmetric/skewed, unimodal/bimodal/multimodal)
    * **unusual features** (gaps, outliers)
* For **two continuous variables**, describe:
    * **direction** (positive association, negative association)
    * **shape** (linear, curved)
    * **unusual features** (gaps, outliers)

## Describe the relationship...

```{r, echo=TRUE, message=FALSE, fig.height = 4, fig.width=4}
ggplot() +
  geom_point(mapping = aes(x = petal_length, y = petal_width), data = iris) +
  ggtitle("Petal Length (cm) vs. Petal Width (cm)\nfor 150 Iris Flowers")
```

* **direction** (positive association, negative association)
* **shape** (linear, curved)
* **unusual features** (gaps, outliers)

## Coloring by Species...

```{r, echo=TRUE, message=FALSE, fig.height = 3.75, fig.width=4}
ggplot() +
  geom_point(mapping = aes(x = petal_length, y = petal_width, color = species),
    data = iris) +
  ggtitle("Petal Length (cm) vs. Petal Width (cm)\nfor 150 Iris Flowers")
```

## Just the versicolor species

```{r, echo = TRUE, fig.height = 3.5, fig.width=4}
# Keep only the 50 versicolor flowers
versicolor <- filter(iris, species == "versicolor")
ggplot() +
  geom_point(mapping = aes(x = petal_length, y = petal_width), data = versicolor) +
  ggtitle("Petal Length (cm) vs. Petal Width (cm)\nfor 50 Versicolor Flowers")
```

## Units I understand: 1 cm = 0.3937 in

```{r, echo = TRUE, eval = FALSE}
versicolor <- mutate(versicolor,
  petal_length_in = petal_length * 0.3937,
  petal_width_in = petal_width * 0.3937)
ggplot() +
  geom_point(mapping = aes(x = petal_length_in, y = petal_width_in),
    data = versicolor) +
  ggtitle("Petal Length (in) vs. Petal Width (in)")
```

```{r, echo = FALSE, fig.height = 3.25, fig.width=4}
ggplot() +
  geom_point(mapping = aes(x = petal_length, y = petal_width), data = versicolor) +
  ggtitle("Petal Length (cm) vs. Petal Width (cm)")
```

```{r, echo = FALSE, fig.height = 3.25, fig.width=4}
versicolor <- mutate(versicolor,
  petal_length_in = petal_length * 0.3937,
  petal_width_in = petal_width * 0.3937)
ggplot() +
  geom_point(mapping = aes(x = petal_length_in, y = petal_width_in),
    data = versicolor) +
  ggtitle("Petal Length (in) vs. Petal Width (in)")
```


## Shape of Plot Doesn't Depend on Units

```{r, echo = TRUE, eval = FALSE}
versicolor <- mutate(versicolor,
  z_score_length = (petal_length - mean(petal_length))/sd(petal_length),
  z_score_width = (petal_width - mean(petal_width))/sd(petal_width))
ggplot() +
  geom_point(mapping = aes(x = z_score_length, y = z_score_width),
    data = versicolor) +
  ggtitle("Petal Length vs. Petal Width")
```

```{r, echo = FALSE, fig.height = 3.25, fig.width=4}
ggplot() +
  geom_point(mapping = aes(x = petal_length, y = petal_width), data = versicolor) +
  ggtitle("Petal Length (cm) vs. Petal Width (cm)")
```

```{r, echo = FALSE, fig.height = 3.25, fig.width=4}
versicolor <- mutate(versicolor,
  z_score_length = (petal_length - mean(petal_length))/sd(petal_length),
  z_score_width = (petal_width - mean(petal_width))/sd(petal_width))
ggplot() +
  geom_point(mapping = aes(x = z_score_length, y = z_score_width),
    data = versicolor) +
  ggtitle("Petal Length vs. Petal Width")
```

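
One way to check that standardizing removes the units is to compute $z$-scores from the same measurements in centimeters and in inches and compare them. The block below is a minimal standalone sketch, not part of the slides' own pipeline: it reads the built-in dataset via `datasets::iris` with its original column names, and `versicolor_raw`, `z_score`, `length_cm`, and `length_in` are names made up for this illustration.

```r
# Standalone sketch: uses the built-in dataset directly (original column
# names), not the renamed `iris` data frame created earlier in the slides.
versicolor_raw <- subset(datasets::iris, Species == "versicolor")

# Hypothetical helper: convert a numeric vector to z-scores.
z_score <- function(x) (x - mean(x)) / sd(x)

length_cm <- versicolor_raw$Petal.Length
length_in <- length_cm * 0.3937  # same conversion factor as on the slides

# Rescaling multiplies both the deviations from the mean and the standard
# deviation by 0.3937, so the factor cancels in the z-score.
# all.equal() compares the two vectors up to floating-point rounding.
all.equal(z_score(length_cm), z_score(length_in))
```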

## Correlation

* The (almost) average of products of $z$-scores: $r = \frac{\sum_{i=1}^n z^x_{i} z^y_{i}}{n - 1}$

```{r, echo = FALSE, fig.height = 3.5, fig.width=4}
versicolor <- mutate(versicolor,
  z_product_sign = ifelse((z_score_length * z_score_width) > 0,
    "Positive", "Negative"))
ggplot() +
  geom_point(mapping = aes(x = z_score_length, y = z_score_width,
      color = z_product_sign),
    data = versicolor) +
  geom_vline(xintercept = 0) +
  geom_hline(yintercept = 0) +
  scale_color_manual(values = c("orange", "blue")) +
  ggtitle("Petal Length vs. Petal Width")
```

* Using $z$-scores instead of original units means $-1 \leq r \leq 1$
* For the versicolor petal measurements, $r \approx 0.79$

## Calculation in R

```{r, echo = TRUE}
cor(versicolor$petal_length, versicolor$petal_width)
```
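
To connect `cor()` back to the formula $r = \frac{\sum_{i=1}^n z^x_{i} z^y_{i}}{n - 1}$, the correlation can also be computed by hand from the $z$-scores. This is a minimal standalone sketch under stated assumptions: it reads `datasets::iris` with its original column names rather than the renamed data frame from the slides, and `versicolor_raw`, `z_x`, `z_y`, `r_by_hand`, and `r_builtin` are illustrative names.

```r
# Standalone sketch: uses the built-in dataset directly (original column
# names), not the renamed `iris` data frame created earlier in the slides.
versicolor_raw <- subset(datasets::iris, Species == "versicolor")
x <- versicolor_raw$Petal.Length
y <- versicolor_raw$Petal.Width
n <- length(x)

# z-scores for each variable
z_x <- (x - mean(x)) / sd(x)
z_y <- (y - mean(y)) / sd(y)

# The "(almost) average" of the products: divide by n - 1, not n
r_by_hand <- sum(z_x * z_y) / (n - 1)
r_builtin <- cor(x, y)

# The two values agree up to floating-point rounding
all.equal(r_by_hand, r_builtin)

# Converting both variables from cm to inches leaves r unchanged
all.equal(cor(x * 0.3937, y * 0.3937), r_builtin)
```

Dividing by $n - 1$ matches the denominator that `sd()` uses, which is why the hand calculation and `cor()` agree.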