Naturalists often want to estimate the sizes of populations that are difficult to measure directly. The capture/recapture method (also referred to as mark/recapture) allows one to estimate, for example, the number of fish in a lake. The idea is also the basis of methods that the Census Bureau has developed that can be used in adjusting population figures obtained in the decennial census (although this is controversial).
There are a few things I’d like you to learn from this lab:
What is the capture/recapture method?
What are some potential limitations of the method, and how do they relate to sampling?
How do methodological limitations of a sampling method affect estimates of quantities in the population we sampled from?
Simulation studies are a valuable tool for studying statistical methods.
Note: In this lab, we will use a few R commands you haven’t seen before. I’d like you to try to work through and understand the R code used in this lab, but you won’t be specifically responsible for knowing the new commands used in this lab for future work. The main purpose is to use R as a tool for learning about the statistical ideas having to do with sampling in objective 3 above.
We will explore the capture/recapture method in the context of estimating the total number of goldfish in a pond. Our exploration will have four phases:
Simulation of the method using goldfish crackers to make the ideas concrete.
Simulation of the method using R.
Modified simulations in R to explore how failures in the sampling process could affect the estimates of population size.
(If time) Simulation in R to explore how the sample sizes drawn affect the estimates of population size.
Pour 100 Cheddar goldfish onto a plate. This represents the population of fish in the pond. For the rest of this exercise, we’re going to pretend that we don’t know the total number of fish in the pond and we’re going to try to estimate that number.
Draw a random sample of 30 fish from the lake. These are the ``captured’’ fish. We’re going to mark each of these fish with an identifying tag so that we will know we’ve captured this fish before if we see it again in the next step. If these were real fish, we would use physical tags, but in this simulation we’ll mark the fish by making them change color. Replace each of the 30 fish you “captured” with a fish of the other flavor. (The new goldfish replace the old ones – don’t put the old ones back into the pond!)
Stir the fish around in the pond so that the marked and unmarked fish are well mixed together. Then draw a new random sample of 30 fish from the lake. When you do this, close your eyes so that you don’t introduce any bias into your sampling – each fish should have an equal chance of being selected. Count the number of fish in the new sample that are marked (call this number \(m\)).
If the fish were well mixed in between the Capture and Recapture steps, then it’s reasonable to estimate the overall proportion of fish in the lake that have been marked by the proportion of fish sampled in the Recapture step that were marked. Let’s call the overall proportion of fish in the pond that were marked \(p\). Since we sampled 30 fish in the Recapture step and \(m\) of those were marked, our estimate of that proportion is
\(\hat{p} = \frac{m}{30}\)
Denote the total number of fish by \(N\); this is what we want to estimate. We know that we marked 30 fish, because we captured 30 fish in the Capture step. Therefore, the true proportion of fish that were marked is \(p = \frac{30}{N}\). Setting this equal to our estimate of \(p\) from step 4 and rearranging, we can estimate \(N\) by
\(\widehat{N} = \frac{30}{\widehat{p}}\)
The following R chunks contain code that does steps 1 through 5 in the physical simulation above. After running the code in each chunk, look at the objects that have been created in the “Global Environment” pane and make sure you understand what happened (even if you wouldn’t have been able to write the code to do it).
We first define three variables that will determine the total number of fish in the pond, how many fish we catch and mark in the Capture step, and how many we catch in the Recapture step. Then we create a data frame called pond_population with two variables in it:
Run the chunk below, and then take a look at the pond_population data frame in your Global Environment if you’re not sure what it looks like.
N <- 100 # number of fish in the pond
C <- 30 # number of fish to sample in the Capture step
R <- 30 # number of fish to sample in the Recapture step
# Make a data frame to represent the population of all fish in the pond.
pond_population <- data.frame(
fish_id = seq(from = 1, to = N),
marked = rep(0, N)
)
In the capture step, we capture C fish and mark them.
In R, we simulate capturing the fish by randomly sampling \(C\) of the fish ids. Then, we update the “marked” variable for those fish, setting it equal to 1.
Run the R chunk below, then look at the captured_fish_ids variable (for example, by typing captured_fish_ids into the console) and the updated pond_population data frame in your Global Environment.
# Capture Step: capture C fish and mark them
captured_fish_ids <- sample(pond_population$fish_id, size = C)
pond_population$marked[captured_fish_ids] <- 1
In the recapture step, we capture R fish. Among those \(R\) fish, we count the number that were marked.
In R, we again simulate capturing the fish by randomly sampling \(R\) of the fish ids. Then, we count the number of those fish that were captured by adding up the values of the “marked” variable for those captured fish, and save that count in a variable called m. Remember that in steps 1 and 2 above we set up the marked variable so that it is 0 for unmarked fish and 1 for marked fish. That means that if I add up these values for the fish in my sample, I get a count of how many fish in the sample were marked.
Run the R chunk below, then look at the recaptured_fish_ids and m.
## Recapture step: capture R fish. Of those, count how many were marked
recaptured_fish_ids <- sample(pond_population$fish_id, size = R)
m <- sum(pond_population$marked[recaptured_fish_ids])
We can use R to calculate the estimated proportion \(\hat{p} = \frac{m}{R}\) as follows. Go ahead and run this chunk.
## Estimate proportion of fish in the pond that were marked in the Capture step:
## The number of fish sampled in the Recapture step that were marked divided by
## the total number of fish sampled in the Recapture step.
p_hat <- m / R
We can use R to calculate the estimated total number of fish in the pond, \(\widehat{N} = \frac{C}{\widehat{p}}\), as follows. Go ahead and run this chunk.
## Estimate the total number of fish in the pond:
## The number of fish that were marked in the Capture step divided by the
## estimated proportion of fish that were
N_hat <- C / p_hat
We’ve now done one simulation!
It’s great that we can do the simulation once, but we really want to do it hundreds or thousands of times and save all of the results. Running the above chunks thousands of times by hand would be tedious; let’s automate this process.
That’s what the R chunk below does. This R chunk puts together all of the code in the previous 5 chunks, with two additions: 1. The line simulation_results <- do(1000) * {
says to “do everything between the following curly braces 1000 times”, and save the results in a new data frame called simulation_results. The code between the curly braces does one simulation, so effectively we’re repeating the simulation 1000 times. 2. Near the end of the R chunk, after doing the computations for one simulation, we make a data frame that stores the p_hat and N_hat values obtained in that simulation run. R will automatically assemble these data frames from the different simulation runs, so the final result (what gets stored in the simulation_results) is a data frame with 1000 rows (one for each simulation run) and two variables (the p_hat and N_hat from each simulation run).
Go ahead and run this R chunk, and then take a look at the simulation_results data frame that gets created. (If you get an error – did you run the R chunk on lines 19-24 of this Rmd document?)
N <- 100 # number of fish in the pond
C <- 30 # number of fish to sample in the Capture step
R <- 30 # number of fish to sample in the Recapture step
simulation_results <- do(1000) * {
## Setup step: make a data frame representing all of the fish in the pond
pond_population <- data.frame(
fish_id = seq(from = 1, to = N),
marked = rep(0, N)
)
## Capture step: capture C fish and add a variable to the data frame
## with a 1 if the fish was marked and a 0 if it was not
captured_fish_ids <- sample(pond_population$fish_id, size = C)
pond_population$marked[captured_fish_ids] <- 1
## Recapture step: capture R fish. Of those, count how many were marked
recaptured_fish_ids <- sample(pond_population$fish_id, size = R)
m <- sum(pond_population$marked[recaptured_fish_ids])
## Estimate proportion of fish in the pond that were marked in the Capture step:
## The number of fish sampled in the Recapture step that were marked divided by
## the total number of fish sampled in the Recapture step.
p_hat <- m / R
## Estimate the total number of fish in the pond:
## The number of fish that were marked in the Capture step divided by the
## estimated proportion of fish that were
N_hat <- C / p_hat
data.frame(p_hat = p_hat, N_hat = N_hat)
}
SOLUTION:
# Your code goes here
SOLUTION:
So far, everything we’ve done has assumed that we were equally likely to capture any of the fish in the Capture and Recapture steps, and that we could definitely tell whether or not each fish we caught in the Recapture step had been previously caught and marked in the Capture step. The estimates of the proportion of fish in the pond that were marked in the Capture step and the total population size that we calculated depend on these assumptions. What happens if these assumptions are wrong, and some fish are more or less likely to be captured or we can’t tell if a fish we see in the Recapture step was previously marked? Here, we will modify our simulation study in order to explore a couple of ways this could happen and what the effects on our estimates of population size are.
In the wild, it happens that some animals are more likely to be captured (“trap-happy”) and other may be less likely to be captured (“trap-shy”). We can represent this in our R code above by adding a prob
argument when we call the sample
function. This argument should be a vector of length \(N\) giving the relative weights for how likely each fish is to be captured. For example, suppose the first 20 fish are ten times as likely to be captured as the remaining 80 fish. We can use the following lines to sample fish with these weights (after running this chunk, if you’re not sure what the sampling_weights vector looks like, take a look at it by typing sampling_weights into the console):
sampling_weights <- c(rep(10, 20), rep(1, 80))
sample(pond_population$fish_id, size = C, prob = sampling_weights)
## [1] 10 13 5 14 9 15 12 38 90 93 18 3 19 77 73 11 4 36 2 87 8 52 31
## [24] 64 88 6 7 20 1 62
The R chunk below contains the original code from section 2 (b) above, to do the simulation study assuming all of the fish are equally likely to be captured. Modify the calls to the sample()
function in both the Capture and Recapture steps to use unequal sampling weights for the different fish by setting a prob =
argument.
N <- 100 # number of fish in the pond
C <- 30 # number of fish to sample in the Capture step
R <- 30 # number of fish to sample in the Recapture step
simulation_results <- do(1000) * {
## Setup step: make a data frame representing all of the fish in the pond
pond_population <- data.frame(
fish_id = seq(from = 1, to = N),
marked = rep(0, N)
)
## Capture step: capture C fish and add a variable to the data frame
## with a 1 if the fish was marked and a 0 if it was not
captured_fish_ids <- sample(pond_population$fish_id, size = C)
pond_population$marked[captured_fish_ids] <- 1
## Recapture step: capture R fish. Of those, count how many were marked
recaptured_fish_ids <- sample(pond_population$fish_id, size = R)
m <- sum(pond_population$marked[recaptured_fish_ids])
## Estimate proportion of fish in the pond that were marked in the Capture step:
## The number of fish sampled in the Recapture step that were marked divided by
## the total number of fish sampled in the Recapture step.
p_hat <- m / R
## Estimate the total number of fish in the pond:
## The number of fish that were marked in the Capture step divided by the
## estimated proportion of fish that were
N_hat <- C / p_hat
data.frame(p_hat = p_hat, N_hat = N_hat)
}
SOLUTION:
# Your code goes here
SOLUTION:
It might also be the case that fish that were marked in the Capture step are more or less likely to be caught again in the Recapture step. For example, maybe the experience of being tagged was traumatic and the fish will strenuously avoid being caught again. Or maybe you treated the fish you caught really well, and the ones who were captured the first time will try to be captured again. We can again represent this in the simulation study by modifying the sampling weights. Now, we’ll only need to modify the weights in the Recapture step, for just those fish who were captured in the Capture step.
If the fish that were tagged are one tenth as likely to be recaptured than the fish that were not tagged, we can use the following code to calculate the weights to use in the Recapture step:
sampling_weights <- rep(1, 100)
sampling_weights[pond_population$marked == 1] <- 0.1
The R chunk below contains the original code from section 2 (b) above, to do the simulation study assuming all of the fish are equally likely to be captured. Modify the call to the sample()
function Recapture step to reflect the scenario where the fish that were marked in the Capture step are less likely to be caught in the Recapture step. You will need to put the calculation of the sampling weights in just before the Recapture step, and modify the call to the sample()
function in the Recapture step (but not the Capture step).
N <- 100 # number of fish in the pond
C <- 30 # number of fish to sample in the Capture step
R <- 30 # number of fish to sample in the Recapture step
simulation_results <- do(1000) * {
## Setup step: make a data frame representing all of the fish in the pond
pond_population <- data.frame(
fish_id = seq(from = 1, to = N),
marked = rep(0, N)
)
## Capture step: capture C fish and add a variable to the data frame
## with a 1 if the fish was marked and a 0 if it was not
captured_fish_ids <- sample(pond_population$fish_id, size = C)
pond_population$marked[captured_fish_ids] <- 1
## Recapture step: capture R fish. Of those, count how many were marked
recaptured_fish_ids <- sample(pond_population$fish_id, size = R)
m <- sum(pond_population$marked[recaptured_fish_ids])
## Estimate proportion of fish in the pond that were marked in the Capture step:
## The number of fish sampled in the Recapture step that were marked divided by
## the total number of fish sampled in the Recapture step.
p_hat <- m / R
## Estimate the total number of fish in the pond:
## The number of fish that were marked in the Capture step divided by the
## estimated proportion of fish that were
N_hat <- C / p_hat
data.frame(p_hat = p_hat, N_hat = N_hat)
}
SOLUTION:
# Your code goes here
SOLUTION:
So far, we have used a sample size of 30 fish in each of the Capture and Recapture steps. What if the boss tells us that we now only have enough money to catch 10 or 20 fish in each phase. How much will that impact our estimates of population size? Can we make an argument that our sample sizes should be even larger, maybe 40 fish? Repeat the above simulation study, but using different numbers of fish caught to the Capture and Recapture steps:
Modify the code in the R chunk below to mark \(C = 10\) fish in the Capture step, and catch \(R = 10\) fish in the Recapture step
N <- 100 # number of fish in the pond
C <- 30 # number of fish to sample in the Capture step
R <- 30 # number of fish to sample in the Recapture step
simulation_results <- do(1000) * {
## Setup step: make a data frame representing all of the fish in the pond
pond_population <- data.frame(
fish_id = seq(from = 1, to = N),
marked = rep(0, N)
)
## Capture step: capture C fish and add a variable to the data frame
## with a 1 if the fish was marked and a 0 if it was not
captured_fish_ids <- sample(pond_population$fish_id, size = C)
pond_population$marked[captured_fish_ids] <- 1
## Recapture step: capture R fish. Of those, count how many were marked
recaptured_fish_ids <- sample(pond_population$fish_id, size = R)
m <- sum(pond_population$marked[recaptured_fish_ids])
## Estimate proportion of fish in the pond that were marked in the Capture step:
## The number of fish sampled in the Recapture step that were marked divided by
## the total number of fish sampled in the Recapture step.
p_hat <- m / R
## Estimate the total number of fish in the pond:
## The number of fish that were marked in the Capture step divided by the
## estimated proportion of fish that were
N_hat <- C / p_hat
data.frame(p_hat = p_hat, N_hat = N_hat)
}
SOLUTION:
# Your code goes here
Modify the code in the R chunk below to mark \(C = 20\) fish in the Capture step, and catch \(R = 20\) fish in the Recapture step
N <- 100 # number of fish in the pond
C <- 30 # number of fish to sample in the Capture step
R <- 30 # number of fish to sample in the Recapture step
simulation_results <- do(1000) * {
## Setup step: make a data frame representing all of the fish in the pond
pond_population <- data.frame(
fish_id = seq(from = 1, to = N),
marked = rep(0, N)
)
## Capture step: capture C fish and add a variable to the data frame
## with a 1 if the fish was marked and a 0 if it was not
captured_fish_ids <- sample(pond_population$fish_id, size = C)
pond_population$marked[captured_fish_ids] <- 1
## Recapture step: capture R fish. Of those, count how many were marked
recaptured_fish_ids <- sample(pond_population$fish_id, size = R)
m <- sum(pond_population$marked[recaptured_fish_ids])
## Estimate proportion of fish in the pond that were marked in the Capture step:
## The number of fish sampled in the Recapture step that were marked divided by
## the total number of fish sampled in the Recapture step.
p_hat <- m / R
## Estimate the total number of fish in the pond:
## The number of fish that were marked in the Capture step divided by the
## estimated proportion of fish that were
N_hat <- C / p_hat
data.frame(p_hat = p_hat, N_hat = N_hat)
}
SOLUTION:
# Your code goes here
Modify the code in the R chunk below to mark \(C = 40\) fish in the Capture step, and catch \(R = 40\) fish in the Recapture step
N <- 100 # number of fish in the pond
C <- 30 # number of fish to sample in the Capture step
R <- 30 # number of fish to sample in the Recapture step
simulation_results <- do(1000) * {
## Setup step: make a data frame representing all of the fish in the pond
pond_population <- data.frame(
fish_id = seq(from = 1, to = N),
marked = rep(0, N)
)
## Capture step: capture C fish and add a variable to the data frame
## with a 1 if the fish was marked and a 0 if it was not
captured_fish_ids <- sample(pond_population$fish_id, size = C)
pond_population$marked[captured_fish_ids] <- 1
## Recapture step: capture R fish. Of those, count how many were marked
recaptured_fish_ids <- sample(pond_population$fish_id, size = R)
m <- sum(pond_population$marked[recaptured_fish_ids])
## Estimate proportion of fish in the pond that were marked in the Capture step:
## The number of fish sampled in the Recapture step that were marked divided by
## the total number of fish sampled in the Recapture step.
p_hat <- m / R
## Estimate the total number of fish in the pond:
## The number of fish that were marked in the Capture step divided by the
## estimated proportion of fish that were
N_hat <- C / p_hat
data.frame(p_hat = p_hat, N_hat = N_hat)
}
SOLUTION:
# Your code goes here
Compare the distributions of population size estimates in each scenario and with the scenario of \(C = 30\) and \(R = 30\) that we started off with. How much does the sample size affect the sampling distribution of population size estimates?
SOLUTION: