The Bechdel Test

A movie passes the Bechdel test if it satisfies 3 rules:

  1. it has at least two women;
  2. the women talk to each other; and
  3. they talk to each other about something or someone other than a man.

The Bechdel test originated in a comic strip by Alison Bechdel (image source: http://dykestowatchoutfor.com/the-rule).

Data Source

The data we’re going to work with today have been gathered from a variety of sources by several people. The Bechdel test ratings themselves are from www.bechdeltest.com, where the general public can rate movies according to whether they pass or fail the Bechdel test. Some additional information about the movies comes from www.the-numbers.com. These data were the basis of an article on the topic at http://fivethirtyeight.com/features/the-dollar-and-cents-case-against-hollywoods-exclusion-of-women/. The data have since been added to the fivethirtyeight package for R. I took those data and scraped some additional information about the movies from imdb.com, such as the MPAA rating, run time, and ratings from IMDB users. Note that this is not a random sample of movies: which movies made it into the data set was basically determined by which movies were rated by users of www.bechdeltest.com. That means any findings from your analysis in this lab are only tentative, and they may not hold for movies in general.

Initial Data Exploration

The following R chunk loads the data and sets the factor levels for categorical variables:

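library(readr)    # read_csv()
library(dplyr)    # %>%, mutate(), summarize(), group_by(), arrange()
library(ggplot2)  # plotting functions used below

movies <- read_csv("https://mhc-stat140-2017.github.io/data/bechdel/bechdel.csv") %>%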
  mutate(
    bechdel_test = factor(
      bechdel_test,
      ordered = TRUE,
      levels = c("nowomen", "notalk", "men", "dubious", "ok")),
    bechdel_test_binary = factor(
      bechdel_test_binary,
      ordered = TRUE,
      levels = c("FAIL", "PASS")),
    mpaa_rating = factor(
      mpaa_rating,
      ordered = TRUE,
      levels = c("UNRATED", "NOT RATED", "G", "PG", "TV-PG", "PG-13", "TV-14", "R", "NC-17"))
  )
## Warning: 185 problems parsing 'https://mhc-stat140-2017.github.io/data/
## bechdel/bechdel.csv'. See problems(...) for more details.
dim(movies)
## [1] 1794   14
names(movies)
##  [1] "year"                "title"               "bechdel_test"       
##  [4] "bechdel_test_binary" "budget"              "domgross"           
##  [7] "intgross"            "budget_2013"         "domgross_2013"      
## [10] "intgross_2013"       "imdb_rating"         "num_imdb_ratings"   
## [13] "mpaa_rating"         "run_time_min"
head(movies)
## Source: local data frame [6 x 14]
## 
##   year            title bechdel_test bechdel_test_binary    budget
## 1 2013        21 & Over       notalk                FAIL  13000000
## 2 2012         Dredd 3D           ok                PASS  45000000
## 3 2013 12 Years a Slave       notalk                FAIL  20000000
## 4 2013           2 Guns       notalk                FAIL  61000000
## 5 2013               42          men                FAIL  40000000
## 6 2013         47 Ronin          men                FAIL 225000000
## Variables not shown: domgross (int), intgross (int), budget_2013 (int),
##   domgross_2013 (int), intgross_2013 (int), imdb_rating (dbl),
##   num_imdb_ratings (int), mpaa_rating (fctr), run_time_min (int)
str(movies)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1794 obs. of  14 variables:
##  $ year               : int  2013 2012 2013 2013 2013 2013 2013 2013 2013 2013 ...
##  $ title              : chr  "21 & Over" "Dredd 3D" "12 Years a Slave" "2 Guns" ...
##  $ bechdel_test       : Ord.factor w/ 5 levels "nowomen"<"notalk"<..: 2 5 2 2 3 3 2 5 5 2 ...
##  $ bechdel_test_binary: Ord.factor w/ 2 levels "FAIL"<"PASS": 1 2 1 1 1 1 1 2 2 1 ...
##  $ budget             : int  13000000 45000000 20000000 61000000 40000000 225000000 92000000 12000000 13000000 130000000 ...
##  $ domgross           : int  25682380 13414714 53107035 75612460 95020213 38362475 67349198 15323921 18007317 60522097 ...
##  $ intgross           : int  42195766 40868994 158607035 132493015 95020213 145803842 304249198 87324746 18007317 244373198 ...
##  $ budget_2013        : int  13000000 45658735 20000000 61000000 40000000 225000000 92000000 12000000 13000000 130000000 ...
##  $ domgross_2013      : int  25682380 13611086 53107035 75612460 95020213 38362475 67349198 15323921 18007317 60522097 ...
##  $ intgross_2013      : int  42195766 41467257 158607035 132493015 95020213 145803842 304249198 87324746 18007317 244373198 ...
##  $ imdb_rating        : num  5.9 7.1 8.1 6.7 7.5 6.3 5.3 7.8 5.7 4.9 ...
##  $ num_imdb_ratings   : int  64520 217487 501013 168308 70755 125401 175360 227378 29984 168942 ...
##  $ mpaa_rating        : Ord.factor w/ 9 levels "UNRATED"<"NOT RATED"<..: 8 8 8 8 6 6 8 8 6 6 ...
##  $ run_time_min       : int  93 95 134 109 128 128 98 123 107 100 ...

I think most of the variables are self-explanatory, but a couple require explanation. The bechdel_test variable has five levels:

  1. “nowomen” means there are not at least two women in the movie;
  2. “notalk” means there are at least two women in the movie, but they don’t talk to each other;
  3. “men” means there are at least two women in the movie, but they only talk to each other about men;
  4. “dubious” means there was some disagreement among users of bechdeltest.com about whether or not the movie passed the test;
  5. “ok” means that the movie passes the test.

The bechdel_test_binary variable has two levels:

  1. “PASS” means that the movie passed the test (i.e., its value for bechdel_test is “ok”);
  2. “FAIL” means it did not pass the test (i.e., its value for bechdel_test is something other than “ok”).
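
The relationship between the two variables can be sketched in code. This is only an illustration on a small made-up data frame (`toy` and its titles are hypothetical, and this is not necessarily how the column in the data set was actually created); it follows the definitions just given, where only “ok” counts as a pass.

```r
library(dplyr)

# Hypothetical toy data: one made-up movie for each detailed test result
toy <- data.frame(
  title = c("Movie A", "Movie B", "Movie C", "Movie D", "Movie E"),
  bechdel_test = c("nowomen", "notalk", "men", "dubious", "ok"),
  stringsAsFactors = FALSE
)

# Collapse the five detailed results into PASS / FAIL:
# only "ok" is a pass, everything else is a fail
toy <- toy %>%
  mutate(
    bechdel_test_binary = factor(
      if_else(bechdel_test == "ok", "PASS", "FAIL"),
      ordered = TRUE,
      levels = c("FAIL", "PASS"))
  )

toy
```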

Univariate Exploration

Make an appropriate plot of the imdb_rating variable

The imdb_rating variable is quantitative (look at the output from str(movies) above). That means either a density plot or a histogram would be appropriate.

ggplot() +
  geom_histogram(mapping = aes(x = imdb_rating), data = movies)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot() +
  geom_density(mapping = aes(x = imdb_rating), data = movies)

Note that in the code to make those two plots, everything is the same except for the type of geometry.
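
The message printed under the histogram suggests picking a bin width instead of relying on the default of 30 bins. Here is a minimal sketch of how that could look. The data frame `toy` just holds the first ten ratings shown in the str(movies) output above, and the bin width of 0.5 is an arbitrary choice for illustration, not a recommendation.

```r
library(ggplot2)

# Toy data: the first ten imdb_rating values from the str() output above
toy <- data.frame(
  imdb_rating = c(5.9, 7.1, 8.1, 6.7, 7.5, 6.3, 5.3, 7.8, 5.7, 4.9)
)

# binwidth = 0.5 makes each bar cover half a rating point
p <- ggplot() +
  geom_histogram(mapping = aes(x = imdb_rating), data = toy, binwidth = 0.5)

p
```

With the full data set you would pass data = movies instead and try a few different bin widths to see how the shape of the histogram changes.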

Calculate the mean, median, standard deviation and IQR of the imdb_rating variable

You should use the summarize() function so that the result is stored in a data frame.

summarize(movies,
  mean_imdb_rating = mean(imdb_rating),
  median_imdb_rating = median(imdb_rating),
  sd_imdb_rating = sd(imdb_rating),
  iqr_imdb_rating = IQR(imdb_rating)
  )
## Source: local data frame [1 x 4]
## 
##   mean_imdb_rating median_imdb_rating sd_imdb_rating iqr_imdb_rating
## 1         6.757246                6.8      0.9259628             1.2

Plot of bechdel_test

Make an appropriate plot of the bechdel_test variable.

The bechdel_test variable is categorical, so we make a bar plot.

ggplot() +
  geom_bar(aes(x = bechdel_test), data = movies)
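
The bar heights are just counts of movies in each category, so the same information can be tabulated. A small sketch, assuming the dplyr package is loaded; `toy` holds only the six bechdel_test values printed by head(movies) above, so the counts are not those of the full data set.

```r
library(dplyr)

# Toy data: the bechdel_test values of the six movies shown by head(movies)
toy <- data.frame(
  bechdel_test = c("notalk", "ok", "notalk", "notalk", "men", "men"),
  stringsAsFactors = FALSE
)

# count() returns one row per category, with the number of rows in column n
counts <- toy %>%
  count(bechdel_test)

counts
```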

Multivariate Exploration

Make an appropriate plot using the imdb_rating and bechdel_test variables

The imdb_rating variable is quantitative, and the bechdel_test variable is categorical. For that combination, we could make box plots, or use density curves with the different levels of bechdel_test associated with different colors.

ggplot() +
  geom_boxplot(mapping = aes(x = bechdel_test, y = imdb_rating), data = movies)

ggplot() +
  geom_density(mapping = aes(x = imdb_rating, color = bechdel_test),
    data = movies)

Calculate the mean and standard deviation of the imdb_rating variable, but do it separately for each level of the bechdel_test variable

You should use the group_by() function and the summarize() function so that the result is stored in a data frame.

movies %>%
  group_by(bechdel_test) %>%
  summarize(
    mean_imdb_rating = mean(imdb_rating),
    sd_imdb_rating = sd(imdb_rating))
## Source: local data frame [5 x 3]
## 
##   bechdel_test mean_imdb_rating sd_imdb_rating
## 1      nowomen         6.912057      0.9668413
## 2       notalk         6.896693      0.9014872
## 3          men         6.820619      0.8656889
## 4      dubious         6.780986      0.9773269
## 5           ok         6.621295      0.9215932

How would you make the plot below?

This is a bar plot based on two categorical variables: bechdel_test_binary on the horizontal axis and bechdel_test used for the fill color. You can tell how those two variables were used in making the plot from the labels on the horizontal axis and on the legend for the colors.

ggplot() +
  geom_bar(mapping = aes(x = bechdel_test_binary, fill = bechdel_test),
    data = movies)

Make an appropriate plot using the budget_2013 and imdb_rating variables

Both of these variables are quantitative, so we make a scatter plot (with geometry of type point).

ggplot() +
  geom_point(mapping = aes(x = budget_2013, y = imdb_rating), data = movies)
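
Budgets vary a lot (from 12 million to 225 million dollars in the first ten rows of the str(movies) output alone), so points may pile up at the low end of the horizontal axis. One possible variation, offered as a sketch and not as the expected answer, is to put the budget on a log scale and make the points semi-transparent. The data frame `toy` contains only those first ten rows, and alpha = 0.5 is an arbitrary choice.

```r
library(ggplot2)

# Toy data: the first ten budget_2013 and imdb_rating values from str(movies)
toy <- data.frame(
  budget_2013 = c(13000000, 45658735, 20000000, 61000000, 40000000,
                  225000000, 92000000, 12000000, 13000000, 130000000),
  imdb_rating = c(5.9, 7.1, 8.1, 6.7, 7.5, 6.3, 5.3, 7.8, 5.7, 4.9)
)

# alpha makes overlapping points easier to see;
# scale_x_log10() spreads out the small budgets
p <- ggplot() +
  geom_point(mapping = aes(x = budget_2013, y = imdb_rating),
    data = toy, alpha = 0.5) +
  scale_x_log10()

p
```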

Additional exploration

Make a few more plots of your choosing. Try different plot types and see what relationships you can find among the variables in the data set. Add new R chunks as needed.

Here are a few more plots:

Is there a connection between MPAA rating (“G”, “R”, etc.) and Bechdel test results?

Both of these variables are categorical, so let’s use a bar plot.

ggplot() +
  geom_bar(mapping = aes(x = mpaa_rating, fill = bechdel_test), data = movies)

The relative proportion of different Bechdel test results seems fairly consistent across the different MPAA ratings categories.

Here’s another variation on this plot, where each bar is scaled to 1, and the plot shows the proportion of movies with each result for the Bechdel test within each MPAA ratings category.

ggplot() +
  geom_bar(mapping = aes(x = mpaa_rating, fill = bechdel_test),
    position = "fill",
    data = movies)

It does look like the distribution of Bechdel test results is similar across different MPAA rating categories, at least for the categories where we had a decent sample size (G, PG, PG-13, R).
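
The proportions shown by position = "fill" can also be computed directly, which makes it easier to compare categories. The sketch below uses a made-up data frame (`toy` and its values are hypothetical) and assumes dplyr is loaded: it counts movies for each combination of the two variables, then divides by the total within each MPAA rating.

```r
library(dplyr)

# Hypothetical toy data: four made-up movies in each of two MPAA categories
toy <- data.frame(
  mpaa_rating = c("PG", "PG", "PG", "PG", "R", "R", "R", "R"),
  bechdel_test = c("ok", "ok", "men", "notalk", "ok", "notalk", "notalk", "men"),
  stringsAsFactors = FALSE
)

props <- toy %>%
  count(mpaa_rating, bechdel_test) %>%  # movies per combination
  group_by(mpaa_rating) %>%             # totals are taken within each rating
  mutate(prop = n / sum(n)) %>%
  ungroup()

props
```

Within each value of mpaa_rating the prop column sums to 1, just like each bar in the scaled plot.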

Is there a connection between Bechdel test results and international gross earnings?

Let’s use the variable with international gross earnings, inflation-adjusted to 2013 dollars.

ggplot() +
  geom_boxplot(mapping = aes(x = bechdel_test, y = intgross_2013), data = movies)
## Warning: Removed 14 rows containing non-finite values (stat_boxplot).

ggplot() +
  geom_density(mapping = aes(x = intgross_2013, color = bechdel_test), data = movies)
## Warning: Removed 14 rows containing non-finite values (stat_density).

If you look really closely at those plots, it seems like movies that pass the Bechdel test might tend to earn a little less than movies that fail the Bechdel test. But the first box plot shows that the highest-earning movie in our data set did pass the Bechdel test. What was that movie?

arrange(movies, desc(intgross_2013))
## Source: local data frame [1,794 x 14]
## 
##    year                                          title bechdel_test
## 1  1973                                   The Exorcist           ok
## 2  1975                                           Jaws       notalk
## 3  1982                    E.T.: The Extra-Terrestrial      dubious
## 4  1993                                  Jurassic Park           ok
## 5  1980 Star Wars: Episode V - The Empire Strikes Back      nowomen
## 6  1994                                  The Lion King       notalk
## 7  1972                                  The Godfather       notalk
## 8  2003  The Lord of the Rings: The Return of the King       notalk
## 9  1999      Star Wars: Episode I - The Phantom Menace           ok
## 10 1978                                         Grease           ok
## ..  ...                                            ...          ...
## Variables not shown: bechdel_test_binary (fctr), budget (int), domgross
##   (int), intgross (int), budget_2013 (int), domgross_2013 (int),
##   intgross_2013 (int), imdb_rating (dbl), num_imdb_ratings (int),
##   mpaa_rating (fctr), run_time_min (int)

The Exorcist! Surprising to me…
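
Printing the whole sorted data frame works, but the earnings column itself ends up among the “variables not shown”. A sketch of one way to pull out just the top row with the columns of interest, assuming dplyr is loaded; the data frame `toy`, its titles, and its earnings are all made up.

```r
library(dplyr)

# Hypothetical toy data; titles and earnings are made up
toy <- data.frame(
  title = c("Movie A", "Movie B", "Movie C"),
  bechdel_test = c("notalk", "ok", "men"),
  intgross_2013 = c(120, 450, NA),
  stringsAsFactors = FALSE
)

top <- toy %>%
  arrange(desc(intgross_2013)) %>%                # largest first; NA goes last
  select(title, bechdel_test, intgross_2013) %>%  # keep only what we need
  head(1)                                         # first row = top earner

top
```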

Anyway, does this become clearer if we look at just the binary pass/fail measure?

ggplot() +
  geom_boxplot(mapping = aes(x = bechdel_test_binary, y = intgross_2013), data = movies)
## Warning: Removed 14 rows containing non-finite values (stat_boxplot).

ggplot() +
  geom_density(mapping = aes(x = intgross_2013, color = bechdel_test_binary), data = movies)
## Warning: Removed 14 rows containing non-finite values (stat_density).

Yeah, it does look like movies that pass the test earn slightly less than movies that fail the test, in our data set. The difference is not huge though.
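
A numerical summary can back up what the plots suggest. The warnings above say that 14 rows were dropped for intgross_2013, presumably because of missing values, so the summary functions need na.rm = TRUE. A minimal sketch on made-up numbers (`toy` is hypothetical, and its values say nothing about the real data):

```r
library(dplyr)

# Hypothetical toy data with one missing earnings value
toy <- data.frame(
  bechdel_test_binary = c("FAIL", "FAIL", "FAIL", "PASS", "PASS", "PASS"),
  intgross_2013 = c(100, 300, NA, 80, 150, 260),
  stringsAsFactors = FALSE
)

earnings <- toy %>%
  group_by(bechdel_test_binary) %>%
  summarize(
    n_with_gross = sum(!is.na(intgross_2013)),  # movies with earnings recorded
    median_intgross_2013 = median(intgross_2013, na.rm = TRUE),
    mean_intgross_2013 = mean(intgross_2013, na.rm = TRUE)
  )

earnings
```

Earnings distributions tend to have a long right tail, so the median is probably the more informative of the two summaries here.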

Bechdel test results over time

ggplot() +
  geom_histogram(mapping = aes(x = year, fill = bechdel_test_binary),
    data = movies)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot() +
  geom_histogram(mapping = aes(x = year, fill = bechdel_test_binary),
    position = "fill",
    data = movies)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Similar to the bar charts above, position = "fill" scales the values to show the proportion of movies in each category of bechdel_test_binary within each bin of years on the horizontal axis.

It does seem like a larger proportion of movies pass the test now than in the past, generally. But we’re still at less than 50% passing the test in recent years.
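
To read the proportions off as numbers, one option is to group the movies by decade and compute the share that pass within each decade. This is a sketch under assumptions: `toy` is a small made-up data frame, the `decade` column is introduced here for illustration, and grouping by decade is just one of many reasonable ways to bin the years.

```r
library(dplyr)

# Hypothetical toy data: a few made-up year / result pairs
toy <- data.frame(
  year = c(1973, 1975, 1978, 1993, 1994, 1999, 2012, 2013),
  bechdel_test_binary = c("PASS", "FAIL", "PASS", "PASS",
                          "FAIL", "PASS", "PASS", "FAIL"),
  stringsAsFactors = FALSE
)

by_decade <- toy %>%
  mutate(decade = floor(year / 10) * 10) %>%  # e.g. 1973 becomes 1970
  group_by(decade) %>%
  summarize(
    n_movies = n(),
    # the mean of a TRUE/FALSE vector is the proportion of TRUEs
    prop_pass = mean(bechdel_test_binary == "PASS")
  )

by_decade
```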