The Bechdel Test

A movie passes the Bechdel test if it satisfies three rules:

  1. it has at least two women;
  2. the women talk to each other; and
  3. they talk to each other about something or someone other than a man.

The Bechdel test originated in a comic by Alison Bechdel (image source: http://dykestowatchoutfor.com/the-rule).

Data Source

The data we’re going to work with today have been gathered from a variety of sources by several people. The Bechdel test ratings themselves are from www.bechdeltest.com, where the general public can rate movies according to whether they pass or fail the Bechdel test. Some additional information about the movies comes from www.the-numbers.com. These data were the basis of an article on the topic at http://fivethirtyeight.com/features/the-dollar-and-cents-case-against-hollywoods-exclusion-of-women/. The data have since been added to the fivethirtyeight package for R. I took those data and scraped some additional information about the movies from imdb.com, such as the MPAA rating, run time, and ratings from IMDB users. Note that this is not a random sample of movies: which movies made it into the data set was basically determined by which movies were rated by users of www.bechdeltest.com. Because of that, findings from your analysis in this lab may not generalize to all movies and should be treated as tentative.

Initial Data Exploration

The following R chunk loads the data and sets the factor levels for categorical variables:

library(readr)  # provides read_csv()
library(dplyr)  # provides %>% and mutate()
movies <- read_csv("https://mhc-stat140-2017.github.io/data/bechdel/bechdel.csv") %>%
  mutate(
    bechdel_test = factor(
      bechdel_test,
      ordered = TRUE,
      levels = c("nowomen", "notalk", "men", "dubious", "ok")),
    bechdel_test_binary = factor(
      bechdel_test_binary,
      ordered = TRUE,
      levels = c("FAIL", "PASS")),
    mpaa_rating = factor(
      mpaa_rating,
      ordered = TRUE,
      levels = c("UNRATED", "NOT RATED", "G", "PG", "TV-PG", "PG-13", "TV-14", "R", "NC-17"))
  )
## Parsed with column specification:
## cols(
##   year = col_integer(),
##   title = col_character(),
##   bechdel_test = col_character(),
##   bechdel_test_binary = col_character(),
##   budget = col_integer(),
##   domgross = col_double(),
##   intgross = col_double(),
##   budget_2013 = col_integer(),
##   domgross_2013 = col_integer(),
##   intgross_2013 = col_double(),
##   imdb_rating = col_double(),
##   num_imdb_ratings = col_integer(),
##   mpaa_rating = col_character(),
##   run_time_min = col_integer()
## )
dim(movies)
## [1] 1794   14
names(movies)
##  [1] "year"                "title"               "bechdel_test"       
##  [4] "bechdel_test_binary" "budget"              "domgross"           
##  [7] "intgross"            "budget_2013"         "domgross_2013"      
## [10] "intgross_2013"       "imdb_rating"         "num_imdb_ratings"   
## [13] "mpaa_rating"         "run_time_min"
head(movies)
## # A tibble: 6 x 14
##    year            title bechdel_test bechdel_test_binary    budget
##   <int>            <chr>        <ord>               <ord>     <int>
## 1  2013        21 & Over       notalk                FAIL  13000000
## 2  2012         Dredd 3D           ok                PASS  45000000
## 3  2013 12 Years a Slave       notalk                FAIL  20000000
## 4  2013           2 Guns       notalk                FAIL  61000000
## 5  2013               42          men                FAIL  40000000
## 6  2013         47 Ronin          men                FAIL 225000000
## # ... with 9 more variables: domgross <dbl>, intgross <dbl>,
## #   budget_2013 <int>, domgross_2013 <int>, intgross_2013 <dbl>,
## #   imdb_rating <dbl>, num_imdb_ratings <int>, mpaa_rating <ord>,
## #   run_time_min <int>
str(movies)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1794 obs. of  14 variables:
##  $ year               : int  2013 2012 2013 2013 2013 2013 2013 2013 2013 2013 ...
##  $ title              : chr  "21 & Over" "Dredd 3D" "12 Years a Slave" "2 Guns" ...
##  $ bechdel_test       : Ord.factor w/ 5 levels "nowomen"<"notalk"<..: 2 5 2 2 3 3 2 5 5 2 ...
##  $ bechdel_test_binary: Ord.factor w/ 2 levels "FAIL"<"PASS": 1 2 1 1 1 1 1 2 2 1 ...
##  $ budget             : int  13000000 45000000 20000000 61000000 40000000 225000000 92000000 12000000 13000000 130000000 ...
##  $ domgross           : num  25682380 13414714 53107035 75612460 95020213 ...
##  $ intgross           : num  4.22e+07 4.09e+07 1.59e+08 1.32e+08 9.50e+07 ...
##  $ budget_2013        : int  13000000 45658735 20000000 61000000 40000000 225000000 92000000 12000000 13000000 130000000 ...
##  $ domgross_2013      : int  25682380 13611086 53107035 75612460 95020213 38362475 67349198 15323921 18007317 60522097 ...
##  $ intgross_2013      : num  4.22e+07 4.15e+07 1.59e+08 1.32e+08 9.50e+07 ...
##  $ imdb_rating        : num  5.9 7.1 8.1 6.7 7.5 6.3 5.3 7.8 5.7 4.9 ...
##  $ num_imdb_ratings   : int  64520 217487 501013 168308 70755 125401 175360 227378 29984 168942 ...
##  $ mpaa_rating        : Ord.factor w/ 9 levels "UNRATED"<"NOT RATED"<..: 8 8 8 8 6 6 8 8 6 6 ...
##  $ run_time_min       : int  93 95 134 109 128 128 98 123 107 100 ...
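
To see what the levels and ordered = TRUE arguments do when the factors are created, here is a minimal sketch on a small made-up vector. The four values and the names ratings, bechdel_levels, and x are illustrative assumptions, not rows or variables from the data set. It uses only base R.

```r
# Hypothetical toy values, not rows from the movies data
ratings <- c("ok", "notalk", "men", "ok")

# Same level order as the bechdel_test variable
bechdel_levels <- c("nowomen", "notalk", "men", "dubious", "ok")
x <- factor(ratings, levels = bechdel_levels, ordered = TRUE)

levels(x)      # levels keep the order given, not alphabetical order
as.integer(x)  # integer codes: the position of each value within levels
x < "ok"       # ordered factors allow comparisons between levels
table(x)       # counts appear in level order, including unused levels
```

The integer codes are the numbers that str() prints after the level names, so a code of 2 for bechdel_test corresponds to "notalk" and a code of 5 corresponds to "ok".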

I think most of the variables are self-explanatory, but a couple require explanation. The bechdel_test variable has five levels:

  1. “nowomen” means there are fewer than two women in the movie;
  2. “notalk” means there are at least two women in the movie, but they don’t talk to each other;
  3. “men” means there are at least two women in the movie, but they only talk to each other about men;
  4. “dubious” means there was some disagreement among users of bechdeltest.com about whether or not the movie passed the test;
  5. “ok” means that the movie passes the test.

The bechdel_test_binary variable has two levels:

  1. “PASS” means that the movie passed the test (i.e., its value for bechdel_test is “ok”);
  2. “FAIL” means it did not pass the test (i.e., its value for bechdel_test is something other than “ok”).
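
The relationship between the two variables can be sketched on a few made-up values. This is only an illustration of the rule described above, not necessarily how the column in the data file was created; the toy vector and the names five_level and two_level are assumptions. It uses only base R.

```r
# Hypothetical toy values, not rows from the movies data
five_level <- factor(
  c("nowomen", "ok", "dubious", "men", "ok"),
  levels = c("nowomen", "notalk", "men", "dubious", "ok"),
  ordered = TRUE)

# "ok" maps to PASS; every other level maps to FAIL
two_level <- factor(
  ifelse(five_level == "ok", "PASS", "FAIL"),
  levels = c("FAIL", "PASS"),
  ordered = TRUE)

# Cross-tabulate to check that only the "ok" row has counts under PASS
table(five_level, two_level)
```

Running the same kind of cross-tabulation on movies$bechdel_test and movies$bechdel_test_binary is one way to check that the real data follow this rule.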

Univariate Exploration

Make an appropriate plot of the imdb_rating variable

# Your code goes here

Calculate the mean, median, standard deviation and IQR of the imdb_rating variable

You should use the summarize() function so that the result is stored in a data frame.
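
As a reminder of the general pattern, here is a minimal sketch of summarize() on a made-up data frame. The data frame toy, its score column, and the output column names are hypothetical, and the statistics shown are deliberately different from the ones this exercise asks for.

```r
library(dplyr)

# Hypothetical toy data; `score` is a made-up column
toy <- data.frame(score = c(2, 4, 4, 7, 9))

# Each argument to summarize() has the form new_column_name = summary expression
toy_summary <- toy %>%
  summarize(
    n_rows  = n(),
    lowest  = min(score),
    highest = max(score)
  )

toy_summary  # a data frame with one row and one column per statistic
```

If a column contains missing values, functions such as min() and max() return NA unless you pass na.rm = TRUE.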

# Your code goes here

Plot of bechdel_test

Make an appropriate plot of the bechdel_test variable.

# Your code goes here

Multivariate Exploration

Make an appropriate plot using the imdb_rating and bechdel_test variables

# Your code goes here

Calculate the mean and standard deviation of the imdb_rating variable, but do it separately for each level of the bechdel_test variable

You should use the group_by() function and the summarize() function so that the result is stored in a data frame.
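
A minimal sketch of the grouped pattern on a made-up data frame follows. The data frame toy, its group and score columns, and the output column names are hypothetical, and the statistics shown differ from the ones this exercise asks for.

```r
library(dplyr)

# Hypothetical toy data with one categorical and one numeric column
toy <- data.frame(
  group = c("a", "a", "b", "b", "b"),
  score = c(2, 4, 4, 7, 9)
)

# group_by() marks the groups; summarize() then computes one row per group
toy_by_group <- toy %>%
  group_by(group) %>%
  summarize(
    n_rows       = n(),
    median_score = median(score)
  )

toy_by_group  # one row for group "a" and one row for group "b"
```

The result has one row per level of the grouping variable, so grouping by a five-level factor would give up to five rows.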

# Your code goes here

How would you make the plot below?

# Your code goes here

Make an appropriate plot using the budget_2013 and imdb_rating variables

# Your code goes here

Additional exploration

Make a few more plots of your choosing. Try different plot types and see what relationships you can find among the variables in the data set. Add new R chunks as needed.