The hardware and bandwidth for this mirror is donated by METANET, the Webhosting and Full Service-Cloud Provider.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]metanet.ch.

Sampling alleles, genetic drift and effective population size

Matheus Januario, Andressa Viol, and Daniel Rabosky

Jan 2024

Learning objectives

  1. Genetic drift: qualitative expectation
  2. Genetic drift: building intuition
  3. Dissecting the variability in outcomes
  4. Genetic drift and heterozygosity decay
  5. Effective population size
  6. Historical note

For this tutorial, you must install the plot3D R package.

install.packages("plot3D")

Then, you must load it, as well as our package:

library(evolved)
library(plot3D)

# Let's also store our par() configs so 
# we can restore them whenever we change it in this tutorial
oldpar <- par(no.readonly = TRUE)  

popgen_drift.R

Introduction

In previous labs, we introduced a simple way of applying probability rules to allele change through a small number of generations. In this lab, we will extrapolate qualitatively and quantitatively from those insights, and see what happens to genetic diversity over time.

Genetic drift: qualitative expectation

As you may know, Peter Buri did an experiment on genetic drift in 1956. He started with many flask populations of Drosophila melanogaster and applied a very simple method to keep track of genetic drift. In his own words (Buri, 1956):

“The method consists in observations on the changes in gene frequency between generations in each of a large number of small cultures in which the size of the breeding population is controlled. Each initial culture founds a line in which each successive generation is initiated with a random sample of uniform size taken from among the flies of the preceding generation. The form is that of a fairly large experimental population divided into a number of very small completely isolated subunits.”

Thus, the experiment involves sampling adults at random from a flask and placing them in a new flask where they will mate. Their offspring reach adulthood and are then themselves sampled, thus producing a new generation. All the while, we keep track of relative frequencies of genotypes through time.

Now, imagine you are in Buri’s lab and will run his experiment using 1000 flasks (he tracked 212 in total). When individuals from a flask become able to reproduce, you randomly and uniformly (i.e. all individuals are equally likely to be sampled) pick 8 virgin females from the same flask and generation to mate only once with 8 random males from the same flask and generation. Each mating is successful and generates many viable offspring.

Assume males and females do not diverge in genotypic frequencies. Recall also that Drosophila melanogaster is a diploid organism.

If you are not familiar with inserting an image into an RMarkdown file:

First, make sure your image is inside the current working directory, i.e. inside the reference folder that R will use when you knit the RMarkdown file. Then, you can use the following syntax, replacing file_name.jpg with the specific name of your image.

![](file_name.jpg)

Genetic drift: building intuition

Now we will do a computer simulation of what Buri did. We will use the rbinom() function to draw allele counts from a population of a diploid organism through time. The function does this by applying the following algorithm:

  1. Start with a certain probability \(p\) of sampling a specific allele (e.g., \(A_1\)) based on that allele’s frequency in the population.
  2. Generate the next generation by taking random draws from a binomial distribution given probability \(p\) of sampling copies of allele \(A_1\).
  3. Recalculate the new allele frequency. You have just observed drift!

Remember that the binomial distribution applies to binary processes and their number of “successes”. Our binary process here is the sampling of alleles – as the organisms are diploid, each of their 2 alleles will either be \(A_1\) or \(A_2\), with no exceptions. Our number of “successes” is the total number of \(A_1\) alleles (i.e., the allele frequency, \(p\)) in the next flask.

So, a single flask in the first generation of our experiment can be simulated as:

N <- 32 #population size
n_alleles <- 2*N 
p_gen0 <- 0.25 #Frequency of allele A1 in the first gen
p_gen1 <- rbinom(1, n_alleles, p_gen0) / n_alleles
p_gen1
## [1] 0.21875

popgen_drift.R

Now, to do the next generation, we just have to redo this process, but this time using \(p_1\), the allele frequency in generation 1, as the new probability of sampling \(A_1\) alleles:

p_gen2 <- rbinom(1, n_alleles, p_gen1) / n_alleles
p_gen2
## [1] 0.265625

popgen_drift.R

What would the third generation be then?

p_gen3 <- rbinom(1, n_alleles, p_gen2) / n_alleles
p_gen3
## [1] 0.28125

popgen_drift.R

The fourth would be:

p_gen4 <- rbinom(1, n_alleles, p_gen3) / n_alleles
p_gen4
## [1] 0.359375

popgen_drift.R

The fifth would be:

p_gen5 <- rbinom(1, n_alleles, p_gen4) / n_alleles
p_gen5
## [1] 0.359375

popgen_drift.R

What is happening with the code and with the result at each generation?

Let’s plot all results we have up to now. Note that your results will be different because you are generating different random numbers than us here.

generations <- seq(from = 0, to = 5, by = 1)
p_through_time <- c(p_gen0, p_gen1, p_gen2, p_gen3, p_gen4, p_gen5)
plot(generations, p_through_time, type="l", lwd = 2, col = "darkorchid3",
     ylab = "p", xlab = "generations", las = 1)

popgen_drift.R

Repeat the above lines of code a few times to get a feeling for what is happening. Do all simulations behave the same way? Can you identify different “archetypes” of dynamics?

Instead of tediously copying and pasting code in the console, we can run large numbers of simulations quite easily using the function WFDriftSim():

#We will simulate five flasks through 10 generations:
WFDriftSim(Ne = 32, n.gen = 10, p0 = 0.5, n.sim = 5)

popgen_drift.R

Run the above chunk of code above many times to get a feeling for it. How does changing Ne, n.gen, p0, and n.sim affect your simulations?

A function to calculate if a simulation’s alleles have fixed.

# R will not return anything in the console when you run this, but 
# after running you should have this function in your R session
# as an object. Note that if you name another object as "isFixed"
# the function will be overwritten.

isFixed <- function(p, tol = 0.00000000001){
    
    if (p <= tol | p >= (1 - tol)){
        return(TRUE)
    } else{
        return(FALSE)
    }

}

# Assuming your current allele frequency is stored in the p_t object, the
# line below uses the function created above to test if p_t is equal to 
# zero OR equal to one

popgen_drift.R

isFixed(p_t)
# The function will return `TRUE` if p_t is equal to zero or one.

popgen_drift.R


Dissecting the variability in outcomes

If our simulations contain many independent flasks (i.e. when n.sim > 40), it is hard to keep track of the whole history of allele frequency change in each flask. This is when the histogram to the right of the line plot in the output of WFDriftSim becomes particularly helpful. Take a look at the histogram section of the output plot after typing the following into your console:

WFDriftSim(Ne = 32, n.gen = 10, p0 = 0.5, n.sim = 100)

popgen_drift.R

popgen_drift.R


Genetic drift and heterozygosity decay

In lecture, we learned an analytic formula to calculate heterozygosity decay over time due to drift.

Example on how to store (and handle) your simulation results:

data = WFDriftSim(Ne = 8, n.gen = 50, p0 = 0.5, n.sim = 50,
                  print.data = T, plot.type = "none")

#then, if we want to see the 2nd to 5th generations of 
# the simulations number 14, 17 and 18, we type:
sims <- c(14,17,18)
gens <- 3:6
subset_data <- data[sims, gens]
subset_data
##         gen2   gen3   gen4   gen5
## sim14 0.3750 0.3750 0.1875 0.1250
## sim17 0.4375 0.6250 0.6875 0.8125
## sim18 0.6250 0.6875 0.6250 0.5000
#and we can also plot those results:

#first opening an empty plot:
plot(NA, ylab = "Alleleic frequency", xlab = "Generation",
     xlim = c(2, 5), ylim = c(0, 1))

#creating a set of colors to paint our lines
cols <- rainbow(n = nrow(subset_data))

#then we finally add the lines to our plot:
for (i in 1:nrow(subset_data)) {
  lines(x = gens - 1, y = subset_data[i,], col = cols[i])
}

popgen_drift.R

Notice that the actual generation number does not match the column numbers!


Effective population size

We know that in real populations, the census size might be different from the effective population size.

\[\begin{equation} N_e = \frac{1}{\frac{1}{t}\sum_{i=1}^t \frac{1}{N_i}} \end{equation}\]

popgen_drift.R

Some food for thought:

Assuming an equal number of females and males, an equal number of generations, and bottlenecks of equal magnitude (equally drastic changes in census size), which of the options below has the greatest influence on \(N_e\) given equation (1)? Why?

  1. Having a large census size prior to the bottleneck
  2. Experiencing a bottleneck that is very brief
  3. Having a large census size after the bottleneck

A nice in-depth discussion of what the effective population size means can be found in Waples (2022).


Historical note

Kimura’s (1955) diffusion approximation for the distribution of allele frequencies in populations under drift is both elegant and difficult. Understanding his derivation requires some fairly sophisticated mathematics. However, there is a much easier way to generate and visualize the expectation for the distribution of allele frequencies across populations under drift, if we assume the population size is reasonably small. The solution described below involves a Markov chain model of drift.

Fundamentally, the probability model we will use is a binomial sampling model. This is nothing more than the “coin-toss” distribution: what is the probability of getting (say) \(k\) heads out of \(n\) coin tosses, given that the probability of heads on any given toss is \(p\)? More formally we denote this probability by \(P(k|n; p)\).

Under a binomial sampling model, we can estimate the probability \(P(k|n; p)\) using the binomial probability, or:

\[\begin{equation} P(k | n; p) = \binom{n}{k} p^k (1-p)^{n-k} \end{equation}\]

Let’s say we have a population with alleles \(A\) and \(a\), with respective frequencies \(p\) and \(1 - p\). The relationship between the binomial model and genetic drift is that, for a finite population, we can use this to compute the exact probability that we will sample \(k\) alleles of a given type in a given generation, given an initial frequency \(p\). So, for example, consider Buri’s genetic drift data, where \(2N = 32\), and where the initial allele frequencies were equal (\(p = 0.5\)). Given this starting configuration at time \(t_0\) (e.g., \(n = 32\), \(p = 0.5\)), we can use the binomial probability to compute the exact probability of sampling \(k\) alleles in generation \(t_1\).

I’m going to switch notation here. For shorthand, I’m going to use the expression \(P_{i, k}\) to denote the probability that a population with \(k\) type \(A\) alleles has \(i\) copies after one generation of drift (sampling). This necessarily depends on a particular population size. For example, \(P_{13, 16}\) is the probability that a given generation of random mating in a Wright-Fisher population gives 13 \(A\) alleles given that the population currently has 16 such alleles, but obviously, this probability is critically dependent on the population size. For the Buri data, with \(2N = 32\), the probability \(P_{13 , 16}\) corresponds to binomial parameters \(p = 0.5\) and \(n = 32\). But this probability would obviously differ for any other \(n\).

Now, imagine forming a matrix of \(2N + 1\) columns and \(2N + 1\) rows. Here, each element \((i, k)\) gives the probability – the transition probability – associated with sampling \(i\) alleles of type \(A\) in generation \(t+1\) given that you are currently at \(k\) \(A\) alleles in generation \(t\). The matrix has \(2N + 1\) rows and columns because we also have an extra row and column to account for the state of \(0\) alleles.

\[\begin{bmatrix} P_{0,0} & P_{0,1} & P_{0,2} & \cdots & P_{0, n}\\ P_{1,0} & P_{1,1} & P_{1,2} & \cdots & P_{1, n}\\ P_{2,0} & P_{2,1} & P_{2,2} & \cdots & P_{2, n}\\ P_{3,0} & P_{3,1} & P_{3,2} & \cdots & P_{3, n}\\ \vdots & \vdots & \vdots & \ddots & \vdots \\ P_{n,0} & P_{n,1} & P_{n,2} & \cdots & P_{n, n}\\ \end{bmatrix}\]

Of course, the probability of going from \(0\) \(A\) alleles to any number of alleles greater than \(0\) is also \(0\) (because 0 is an absorbing state – after the allele is lost in the population, it cannot increase in frequency). So the first column is necessarily zero, with the exception of \(P_{0,0}\), which is \(1\) (if you have \(0\) \(A\) alleles at time \(t\), you will have \(0\) \(A\) alleles at time \(t + 1\) with probability \(1\)). The same general idea holds for the last column: if you are at the maximum number of \(A\) alleles, e.g., \(n\) \(A\) alleles, then you can’t sample \(2\) in the following generation. In our example, with \(2N = 32\), both \(0\) and \(32\) are absorbing states. So the matrix is:

\[\begin{bmatrix} 1 & P_{0,1} & P_{0,2} & \cdots & 0\\ 0 & P_{1,1} & P_{1,2} & \cdots & 0\\ 0 & P_{2,1} & P_{2,2} & \cdots & 0\\ 0 & P_{3,1} & P_{3,2} & \cdots & 0\\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & P_{n,1} & P_{n,2} & \cdots & 1\\ \end{bmatrix}\]

Let’s call this matrix \(P\). It is a transition matrix, and it has some really nice properties. This matrix includes the probabilities for all possible pairwise transitions between states. As such, the columns of the matrix must sum to \(1\). Once we’ve defined this matrix, we can fill in all of the individual elements with binomial probabilities. So, for example, assuming \(2N = 32\), we can fill in \(P_{3, 5}\) by first noting that the frequency of \(A\) at time \(t\) is \(p = \frac{5}{32}\).

\[\begin{equation} P_{3,5} = \binom{32}{3} \binom{5}{32}^3 (1-\frac{5}{32})^{32-3} \end{equation}\]

The amazing thing about the transition matrix \(P\) is that we can use it to compute the distribution of alleles after a generation of random mating, given some initial vector of allele counts \(N_0\). Here, \(N_0\) is a vector with the property that each element \(i\) of the vector gives the number of populations having \(i\) alleles. For example, \(N_0[0]\) gives the number of initial populations with \(0\) \(A\) alleles. In the Buri data, with roughly 100 populations with equal allele frequencies, we would have \(N_0[16] = 100\) (100 populations, each with 16 \(A\) alleles at time \(t = 0\)) and all other elements of the vector equal to zero.

We can now construct a Markov chain, to compute the probability distribution of populations with a given allele count, as

\[\begin{equation} N_1 = PN_0 \end{equation}\]

where \(P N_0\) involves matrix multiplication of the column vector \(N_0\) by the square matrix \(P\). \(N_1\) is the mean number of populations in each allele count category. For example, \(N_1[7]\) is the number of populations in generation \(t = 1\) that have \(7\) alleles (remember: the starting state was given by the vector \(N_0\)). More generally, we have:

\[\begin{equation} N_{t+1} = PN_t \end{equation}\]

And we can see where this is going. It’s a recursive relationship whereby:

\[\begin{equation} N_{t+2} = P (P N_0) \end{equation}\]

and

\[\begin{equation} N_{t+3} = P ( P (P N_0)) \end{equation}\]

Etc etc. Thus, computing the distribution of allele frequencies can easily be done by iterating the equation given above, provided we have \(P\) and \(N_0\).

The R code below implements this model and uses it to compute the distribution of populations for each possible number of alleles, using Buri’s Drosophila study as a guide.

# A simple genetic drift simulator as a Markov process
N       <- 16
popsize <- 2*N

# Now we define a transition matrix, such that:
# Each element of the transition matrix is a transition probability
#   Specifically: element Pi,j is the probability of going from
#   j alleles to i alleles in a single sampling step 
#  

tmat <- matrix(0, nrow = popsize + 1, ncol = popsize + 1) # along rows: current allele count; along cols: future allele count

# Fill in elements with probabilities from the binomial distribution
# If you have j alleles in generation t-1, then this defines the probability
# of sampling (0,1....2N) alleles in the next generation, under a binomial 
# sampling process:

# Note also that first column and row are absorbing states of 0

# Here is the vector of frequencies:
probvec <- (1:popsize) / popsize  

# What is the probability of getting to 0 allele copies, given 
# that you have 1? This is P01, and the transition probability 
# would be computed using 1/2N (=1/32) as the sampling probability 
# for the allele. 
# Here, this is computed using the function dbinom, which is a shortcut 
#   to the analytical binomial probability:
dbinom(1, popsize, prob = 1/32)
## [1] 0.3737345
# Now we will set up the full matrix by iterating this over rows:

for (ii in 1:nrow(tmat)){
    tmat[ii, 2:ncol(tmat)] <- dbinom(ii - 1, size = popsize, prob = probvec)
}

# Note: we ignore the first column and let almost all elements equal zero, 
# because the probability of sampling k alleles given that you currently have zero
# is zero everywhere except the special case of P00, which is 1 
# (if you have zero now, you will have zero in the next gen with probability 1)
# To address this, we manually set element P00 equal to 1:
tmat[1,1] <- 1

#Now, all the columns should sum to 1:
colSums(tmat)
##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
# Pt <- tmat^2*P0 

# Initial vector of population frequencies:
# Like Buri, suppose we do an experiment with 100 populations
# each with equal allele frequencies (p = q = 0.5) initially.
p_init <- rep(0, popsize + 1)

# note, in this indexing:
#   p_init[1] is the number of populations with 0 alleles
#   p_init[2] is the number with 1 allele
#   p_init[popsize + 1] is the number with 2N alleles

# we want the initial population to have 2N/2 = N p alleles (for 50:50 proportion of alleles)
#  so we fill in the N+1 element of p_init vector with the 
#  number of populations

p_init[N+1] <- 100

popgen_drift.R

#Now, we will introduce drift in our simulation
# after 1 generation of random mating in Wright-Fisher (WF) population, 
#  the distribution of populations is:
tmat %*% p_init
 
# and after 2 generations, it is:
tmat %*% (tmat %*% p_init)

# and after 3 generations, it is:
tmat %*% (tmat %*% (tmat %*% p_init))

popgen_drift.R

# This is easy enough to iterate.
# Here, we will make a matrix to hold 
# the results, for up to ngen generations

ngen    <- 20
freqmat <- matrix(NA, nrow=popsize+1, ncol=ngen)

curr_pop <- p_init 

for (ii in 1:ngen){
    
    # recompute the new pop frequency distribution
    curr_pop <- tmat %*% curr_pop
    # store it:
    freqmat[,ii] <- curr_pop
}



plot3D::persp3D(x = c(0, probvec), z=freqmat, theta=45, phi=20, contour =F,
                xlab="Allele frequency", zlab="Number of populations",
                ylab="Generations")

# Plotting selected generations:
########## 
plot.new()
par(oma=c(1,1,1,1), mfrow=c(2,2), mar=c(2,2,2,0))

pfx <- function(gen){
    plot.new()
    plot.window(xlim=c(0,33), ylim=c(0, 15))
    axis(1, at=seq(-4, 36, by=4))
    axis(2, at=seq(-2, 16, by=2), las=1)
    lines(x=c(16,16), y=c(0, 18), lwd=5, col="gray60")
    title(main = paste("After", gen, ifelse(gen == 1, "gen", "gens")))
}

# Expectation after 1 generation of drift:
pfx(1)
points(x=0:32, y=freqmat[,1], pch=21, bg="red", cex=1.5)

# Expectation after 5 generations of drift
pfx(5)
points(x=0:32, y=freqmat[,5], pch=21, bg="red", cex=1.5)

# Expectation after 10 generations of drift
pfx(10)
points(x=0:32, y=freqmat[,10], pch=21, bg="red", cex=1.5)

# Expectation after 15 generations of drift
pfx(15)
points(x=0:32, y=freqmat[,15], pch=21, bg="red", cex=1.5)

# Restoring old par() configs:
par(oldpar)

popgen_drift.R

The above visually resembles figure 6 in Buri (1956) – but please remember this is an analytical expectation.

A little more on the history of this mathematical model can be found in the initial parts of Tran et al (2013) – but note that the mathematics that is developed in the later parts of the paper may be challenging. The original references on the Wright-Fisher population model are Fisher (1922) and Wright (1931).


References:

Buri, P. (1956). Gene frequency in small populations of mutant Drosophila. Evolution, 367-402.

Fisher RA (1922) On the dominance ratio. Proc. R. Soc. Edinb 42:321–341

Kimura, M. (1955). Solution of a process of random genetic drift with a continuous model. Proceedings of the National Academy of Sciences of the United States of America, 41(3), 144.

Tran, T. D., Hofrichter, J., & Jost, J. (2013). An introduction to the mathematical structure of the Wright–Fisher model of population genetics. Theory in Biosciences, 132(2), 73-82. [good for the historical review, math can be challenging]

Waples, R. S. (2022). What is Ne, anyway?. Journal of Heredity, 113(4), 371–379.

Wright S (1931) Evolution in Mendelian populations. Genetics 16:97–159

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.