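A movie passes the Bechdel test if it satisfies 3 rules: (1) it has at least two women in it, (2) who talk to each other, (3) about something other than a man.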
The Bechdel test originated in this comic by Alison Bechdel (image source http://dykestowatchoutfor.com/the-rule):
The data we’re going to work with today have been gathered from a variety of sources by several people. The Bechdel test ratings themselves are from www.bechdeltest.com, where the general public can rate movies according to whether they pass or fail the Bechdel test. Some additional information about the movies comes from www.the-numbers.com. These data were the basis of an article on the topic at http://fivethirtyeight.com/features/the-dollar-and-cents-case-against-hollywoods-exclusion-of-women/. The data have since been added to the fivethirtyeight package for R. I took those data and scraped some additional information about the movies from imdb.com, including the MPAA rating, run time, and ratings from IMDB users. Note that this is not a random sample of movies: which movies made it into the data set was basically determined by which movies were rated by users of www.bechdeltest.com. That means any findings from your analysis in this lab are only tentative and may not generalize to movies as a whole.
The following R chunk loads the data and sets the factor levels for categorical variables:
library(readr)    # read_csv()
library(dplyr)    # %>%, mutate(), group_by(), summarize(), arrange()
library(ggplot2)  # all of the plots below

# Read the data and store the categorical variables as ordered factors
movies <- read_csv("https://mhc-stat140-2017.github.io/data/bechdel/bechdel.csv") %>%
  mutate(
    bechdel_test = factor(
      bechdel_test,
      ordered = TRUE,
      levels = c("nowomen", "notalk", "men", "dubious", "ok")),
    bechdel_test_binary = factor(
      bechdel_test_binary,
      ordered = TRUE,
      levels = c("FAIL", "PASS")),
    mpaa_rating = factor(
      mpaa_rating,
      ordered = TRUE,
      levels = c("UNRATED", "NOT RATED", "G", "PG", "TV-PG", "PG-13", "TV-14", "R", "NC-17"))
  )
## Warning: 185 problems parsing 'https://mhc-stat140-2017.github.io/data/
## bechdel/bechdel.csv'. See problems(...) for more details.
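The warning above points to problems(), which readr provides for listing the values it could not parse. The sketch below does not touch the lab's file; it uses a tiny made-up CSV string (the column names a and b and all values are assumptions for illustration) with one entry that cannot be read as an integer.

```r
library(readr)

# Hypothetical two-column CSV; "x" in the second data row is not a valid integer.
toy_csv <- "a,b\n1,2\nx,4\n"

# Ask for two integer columns ("ii"). readr warns about the bad value
# and stores NA in its place.
toy <- read_csv(toy_csv, col_types = "ii")

# One row per parsing problem, describing where it occurred and what was found.
parse_issues <- problems(toy)
parse_issues
```

Calling problems(movies) in the same way should list the 185 issues mentioned in the warning.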
dim(movies)
## [1] 1794 14
names(movies)
## [1] "year" "title" "bechdel_test"
## [4] "bechdel_test_binary" "budget" "domgross"
## [7] "intgross" "budget_2013" "domgross_2013"
## [10] "intgross_2013" "imdb_rating" "num_imdb_ratings"
## [13] "mpaa_rating" "run_time_min"
head(movies)
## Source: local data frame [6 x 14]
##
## year title bechdel_test bechdel_test_binary budget
## 1 2013 21 & Over notalk FAIL 13000000
## 2 2012 Dredd 3D ok PASS 45000000
## 3 2013 12 Years a Slave notalk FAIL 20000000
## 4 2013 2 Guns notalk FAIL 61000000
## 5 2013 42 men FAIL 40000000
## 6 2013 47 Ronin men FAIL 225000000
## Variables not shown: domgross (int), intgross (int), budget_2013 (int),
## domgross_2013 (int), intgross_2013 (int), imdb_rating (dbl),
## num_imdb_ratings (int), mpaa_rating (fctr), run_time_min (int)
str(movies)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1794 obs. of 14 variables:
## $ year : int 2013 2012 2013 2013 2013 2013 2013 2013 2013 2013 ...
## $ title : chr "21 & Over" "Dredd 3D" "12 Years a Slave" "2 Guns" ...
## $ bechdel_test : Ord.factor w/ 5 levels "nowomen"<"notalk"<..: 2 5 2 2 3 3 2 5 5 2 ...
## $ bechdel_test_binary: Ord.factor w/ 2 levels "FAIL"<"PASS": 1 2 1 1 1 1 1 2 2 1 ...
## $ budget : int 13000000 45000000 20000000 61000000 40000000 225000000 92000000 12000000 13000000 130000000 ...
## $ domgross : int 25682380 13414714 53107035 75612460 95020213 38362475 67349198 15323921 18007317 60522097 ...
## $ intgross : int 42195766 40868994 158607035 132493015 95020213 145803842 304249198 87324746 18007317 244373198 ...
## $ budget_2013 : int 13000000 45658735 20000000 61000000 40000000 225000000 92000000 12000000 13000000 130000000 ...
## $ domgross_2013 : int 25682380 13611086 53107035 75612460 95020213 38362475 67349198 15323921 18007317 60522097 ...
## $ intgross_2013 : int 42195766 41467257 158607035 132493015 95020213 145803842 304249198 87324746 18007317 244373198 ...
## $ imdb_rating : num 5.9 7.1 8.1 6.7 7.5 6.3 5.3 7.8 5.7 4.9 ...
## $ num_imdb_ratings : int 64520 217487 501013 168308 70755 125401 175360 227378 29984 168942 ...
## $ mpaa_rating : Ord.factor w/ 9 levels "UNRATED"<"NOT RATED"<..: 8 8 8 8 6 6 8 8 6 6 ...
## $ run_time_min : int 93 95 134 109 128 128 98 123 107 100 ...
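I think most of the variables are self-explanatory, but a couple require explanation. The bechdel_test variable has five levels:
“nowomen”: there are fewer than two women in the movie
“notalk”: there are at least two women, but they don’t talk to each other
“men”: the women talk to each other, but only about men
“dubious”: it is doubtful whether the movie passes
“ok”: the movie satisfies all three rules
The bechdel_test_binary variable has two levels:
“PASS” (bechdel_test is “ok”)
“FAIL” (bechdel_test is something other than “ok”)
The imdb_rating variable is quantitative (look at the output from str(movies) above). That means either a density plot or a histogram would be appropriate.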
ggplot() +
  geom_histogram(mapping = aes(x = imdb_rating), data = movies)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot() +
  geom_density(mapping = aes(x = imdb_rating), data = movies)
Note that in the code to make those two plots, everything is the same except for the type of geometry.
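The stat_bin() message above asks for an explicit bin width. As a sketch, the version below sets binwidth = 0.5; that value is my assumption, not something the lab prescribes, and toy_movies is a small made-up data frame standing in for movies so the chunk runs on its own.

```r
library(ggplot2)

# Made-up ratings standing in for movies$imdb_rating.
toy_movies <- data.frame(
  imdb_rating = c(5.9, 7.1, 8.1, 6.7, 7.5, 6.3, 5.3, 7.8, 5.7, 4.9)
)

# Same histogram as above, but with half-point bins chosen explicitly,
# which also silences the stat_bin() message.
rating_hist <- ggplot() +
  geom_histogram(
    mapping = aes(x = imdb_rating),
    data = toy_movies,
    binwidth = 0.5
  )
rating_hist
```

Trying a few different bin widths is worthwhile, since the apparent shape of the distribution can change with the choice.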
You should use the summarize() function so that the result is stored in a data frame.
summarize(movies,
  mean_imdb_rating = mean(imdb_rating),
  median_imdb_rating = median(imdb_rating),
  sd_imdb_rating = sd(imdb_rating),
  iqr_imdb_rating = IQR(imdb_rating)
)
## Source: local data frame [1 x 4]
##
## mean_imdb_rating median_imdb_rating sd_imdb_rating iqr_imdb_rating
## 1 6.757246 6.8 0.9259628 1.2
Make an appropriate plot of the bechdel_test variable.
The bechdel_test variable is categorical, so we make a bar plot.
ggplot() +
  geom_bar(mapping = aes(x = bechdel_test), data = movies)
The imdb_rating variable is quantitative, and the bechdel_test variable is categorical. For that combination, we could make box plots, or use density curves with the different levels of bechdel_test shown in different colors.
ggplot() +
  geom_boxplot(mapping = aes(x = bechdel_test, y = imdb_rating), data = movies)
ggplot() +
  geom_density(mapping = aes(x = imdb_rating, color = bechdel_test),
    data = movies)
You should use the group_by() function and the summarize() function so that the result is stored in a data frame.
movies %>%
  group_by(bechdel_test) %>%
  summarize(
    mean_imdb_rating = mean(imdb_rating),
    sd_imdb_rating = sd(imdb_rating))
## Source: local data frame [5 x 3]
##
## bechdel_test mean_imdb_rating sd_imdb_rating
## 1 nowomen 6.912057 0.9668413
## 2 notalk 6.896693 0.9014872
## 3 men 6.820619 0.8656889
## 4 dubious 6.780986 0.9773269
## 5 ok 6.621295 0.9215932
This is a bar plot based on two categorical variables: bechdel_test_binary on the horizontal axis and bechdel_test used for the fill color. You can tell how those two variables were used in making the plot from the labels on the horizontal axis and on the legend for the colors.
ggplot() +
  geom_bar(mapping = aes(x = bechdel_test_binary, fill = bechdel_test),
    data = movies)
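The relationship between these two variables can also be read off a table of counts. A minimal sketch, using a small made-up data frame in place of movies (the rows are invented; only the level names come from the lab data):

```r
library(dplyr)

# Invented rows; only the level names match the lab data.
toy_movies <- tibble(
  bechdel_test = c("nowomen", "notalk", "men", "dubious", "ok", "ok", "notalk"),
  bechdel_test_binary = c("FAIL", "FAIL", "FAIL", "FAIL", "PASS", "PASS", "FAIL")
)

# Count the movies in each combination of the two variables.
combo_counts <- count(toy_movies, bechdel_test_binary, bechdel_test)
combo_counts
```

Given how bechdel_test_binary is defined, running the same count() on movies should show “ok” as the only bechdel_test level paired with “PASS”.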
Both budget_2013 and imdb_rating are quantitative, so we make a scatter plot (with geometry of type point).
ggplot() +
  geom_point(mapping = aes(x = budget_2013, y = imdb_rating), data = movies)
Make a few more plots of your choosing. Try different plot types and see what relationships you can find among the variables in the data set. Add new R chunks as needed.
Here are a few more plots:
Both mpaa_rating and bechdel_test are categorical, so let’s use a bar plot.
ggplot() +
  geom_bar(mapping = aes(x = mpaa_rating, fill = bechdel_test), data = movies)
The relative proportion of different Bechdel test results seems fairly consistent across the different MPAA ratings categories.
Here’s another variation on this plot, where each bar is scaled to 1, and the plot shows the proportion of movies with each result for the Bechdel test within each MPAA ratings category.
ggplot() +
  geom_bar(mapping = aes(x = mpaa_rating, fill = bechdel_test),
    position = "fill",
    data = movies)
It does look like the distribution of Bechdel test results is similar across different MPAA rating categories, at least for the categories where we had a decent sample size (G, PG, PG-13, R).
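To judge which categories have a decent sample size, the counts behind each bar can be tabulated. This is a sketch on invented rows (toy_movies is a stand-in for movies), so the counts it prints are not the lab's.

```r
library(dplyr)

# Invented rows standing in for movies$mpaa_rating.
toy_movies <- tibble(
  mpaa_rating = c("G", "PG", "PG", "PG-13", "PG-13", "PG-13", "R", "R", "NC-17")
)

# Number of movies in each MPAA category, largest first.
rating_counts <- count(toy_movies, mpaa_rating, sort = TRUE)
rating_counts
```

Categories with only a handful of movies give noisy proportions in the scaled bar plot, so they deserve less weight when comparing bars.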
Let’s use intgross_2013, the international gross adjusted for inflation to 2013 dollars.
ggplot() +
  geom_boxplot(mapping = aes(x = bechdel_test, y = intgross_2013), data = movies)
## Warning: Removed 14 rows containing non-finite values (stat_boxplot).
ggplot() +
  geom_density(mapping = aes(x = intgross_2013, color = bechdel_test), data = movies)
## Warning: Removed 14 rows containing non-finite values (stat_density).
If you look reaaaaaallly closely at those plots, it seems like movies that pass the Bechdel test might tend to earn less than movies that fail the Bechdel test, on average. But from the first box plot, the highest-earning movie in our data set did pass the Bechdel test. What was that movie?
arrange(movies, desc(intgross_2013))
## Source: local data frame [1,794 x 14]
##
## year title bechdel_test
## 1 1973 The Exorcist ok
## 2 1975 Jaws notalk
## 3 1982 E.T.: The Extra-Terrestrial dubious
## 4 1993 Jurassic Park ok
## 5 1980 Star Wars: Episode V - The Empire Strikes Back nowomen
## 6 1994 The Lion King notalk
## 7 1972 The Godfather notalk
## 8 2003 The Lord of the Rings: The Return of the King notalk
## 9 1999 Star Wars: Episode I - The Phantom Menace ok
## 10 1978 Grease ok
## .. ... ... ...
## Variables not shown: bechdel_test_binary (fctr), budget (int), domgross
## (int), intgross (int), budget_2013 (int), domgross_2013 (int),
## intgross_2013 (int), imdb_rating (dbl), num_imdb_ratings (int),
## mpaa_rating (fctr), run_time_min (int)
The Exorcist! Surprising to me…
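To pull out only the top row and the columns of interest rather than scanning the full printout, slice_max() (available in dplyr 1.0.0 and later) can be combined with select(). The data frame below is a made-up stand-in for movies; its titles and gross figures are invented, not the lab's values.

```r
library(dplyr)

# Invented rows; the NA mimics movies with no recorded gross.
toy_movies <- tibble(
  title = c("Movie A", "Movie B", "Movie C"),
  bechdel_test = c("ok", "notalk", "men"),
  intgross_2013 = c(950, 1200, NA)
)

# Keep the single row with the largest gross, then only the columns we care about.
top_movie <- toy_movies %>%
  slice_max(intgross_2013, n = 1) %>%
  select(title, bechdel_test, intgross_2013)
top_movie
```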
Anyway, does this become clearer if we look at just the binary pass/fail measure?
ggplot() +
  geom_boxplot(mapping = aes(x = bechdel_test_binary, y = intgross_2013), data = movies)
## Warning: Removed 14 rows containing non-finite values (stat_boxplot).
ggplot() +
  geom_density(mapping = aes(x = intgross_2013, color = bechdel_test_binary), data = movies)
## Warning: Removed 14 rows containing non-finite values (stat_density).
Yeah, it does look like movies that pass the test earn slightly less than movies that fail the test, in our data set. The difference is not huge though.
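A numeric summary can back up what the plots suggest. The sketch below compares median grosses by group; na.rm = TRUE is needed because some movies have no recorded gross (the plots above dropped 14 such rows). The data are invented stand-ins for movies, so the output says nothing about the real films.

```r
library(dplyr)

# Invented rows; one missing gross to show why na.rm = TRUE matters.
toy_movies <- tibble(
  bechdel_test_binary = c("FAIL", "FAIL", "FAIL", "PASS", "PASS", "PASS"),
  intgross_2013 = c(100, 300, NA, 80, 120, 200)
)

# Group size, number of missing grosses, and median gross per test result.
gross_by_result <- toy_movies %>%
  group_by(bechdel_test_binary) %>%
  summarize(
    n_movies = n(),
    n_missing = sum(is.na(intgross_2013)),
    median_intgross_2013 = median(intgross_2013, na.rm = TRUE)
  )
gross_by_result
```

I used the median here because the gross distributions in the plots are strongly right-skewed, which makes the mean sensitive to a few blockbusters.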
ggplot() +
  geom_histogram(mapping = aes(x = year, fill = bechdel_test_binary),
    data = movies)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot() +
  geom_histogram(mapping = aes(x = year, fill = bechdel_test_binary),
    position = "fill",
    data = movies)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Similar to the bar charts above, position = "fill" scales the values to show the proportion of movies in each category of bechdel_test_binary within each bin of years on the horizontal axis.
It does seem like a larger share of movies pass the test now than in the past, generally. But we’re still at less than 50% passing the test in recent years.
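The share of passing movies can also be computed directly instead of read off the plot. A sketch, grouping by decade (my choice of bin) on invented rows standing in for movies:

```r
library(dplyr)

# Invented rows; only the variable names and level names match the lab data.
toy_movies <- tibble(
  year = c(1975, 1978, 1983, 1999, 2004, 2008, 2011, 2013),
  bechdel_test_binary = c("FAIL", "FAIL", "FAIL", "PASS", "FAIL", "PASS", "PASS", "FAIL")
)

# Proportion of movies passing the test within each decade.
pass_by_decade <- toy_movies %>%
  mutate(decade = 10 * (year %/% 10)) %>%
  group_by(decade) %>%
  summarize(
    n_movies = n(),
    prop_pass = mean(bechdel_test_binary == "PASS")
  )
pass_by_decade
```

Keeping n_movies next to prop_pass makes it easy to see which decades rest on only a few rated movies.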