Tuesday, January 29, 2013

Hardy-Weinberg Genotype Frequencies

An important principle in population genetics is the Hardy-Weinberg principle (also called H-W equilibrium or the H-W law), which "describes the equilibrium state of a single locus in a randomly mating diploid population that is free of other evolutionary forces such as mutation, migration, and genetic drift" [Population Genetics: A Concise Guide by John Gillespie, 2nd edition]. This means that, in an ideal world, the allele frequencies and genotype frequencies at a single locus are in equilibrium and stay constant from generation to generation; in the real world, outside forces such as mutation, migration, genetic drift (see my post on simulating genetic drift) or natural selection perturb them.

Let's try to put some notation to all this.  If we have two alleles $A$ and $a$ with genotype frequencies
\begin{eqnarray*}
P(AA) & = & u \\
P(Aa) & = & v \\
P(aa) & = & w
\end{eqnarray*}
then without any more assumptions, the allele frequencies are easily estimated:
\[ P(A) = u + 0.5v = p \hspace{2in} P(a) = w + 0.5v = q \]
If we have the allele frequencies, we need a few more assumptions to calculate the genotype frequencies.  This is where the Hardy-Weinberg principle comes into play.

H-W Assumptions:

  1. diploid organism, infinite population, discrete generations
  2. random mating
  3. no outside forces at play (e.g. no selection, no migration, no mutation)
  4. equal (or unequal*) initial genotype frequencies in the two sexes

When a population is in H-W equilibrium, the two alleles that make up a genotype can be thought of as randomly sampled from the population.  Thus, we can estimate genotype frequencies from allele frequencies as follows:
\begin{eqnarray*}
P(AA) & = & P(A) P(A) = p^2 \\
P(Aa) & = & 2P(A) P(a) = 2pq\\
P(aa) & = & P(a) P(a) = q^2
\end{eqnarray*}
*In dioecious species (each individual is either male or female), H-W equilibrium is reached in two generations when the initial genotype frequencies differ between the sexes, and in one generation when they are equal (a sketch of the unequal case follows the R code below).

Rcode:
Assume equal genotype frequencies in the two sexes, and show that H-W equilibrium is attained in one generation.
> # Start with two alleles A, a with respective allele frequencies p.0, q.0
> p.0 = 0.2
> q.0 = 1 - p.0

> # After one round of mating
> # Calculate genotype frequencies (Assumes HWE)
> AA = p.0*p.0
> AA
[1] 0.04

> Aa = 2*p.0*q.0
> Aa
[1] 0.32

> aa = q.0*q.0
> aa
[1] 0.64


> # Calculate new allele frequencies (Does not assume HWE)
> p.1 = AA + 0.5*Aa 
> p.1
[1] 0.2

> q.1 = 1 - p.1
> q.1
[1] 0.8

> # After two rounds of mating 
> # Calculate genotype frequencies (Assumes HWE)
> AA = p.1*p.1
> AA
[1] 0.04

> Aa = 2*p.1*q.1
> Aa
[1] 0.32

> aa = q.1*q.1
> aa
[1] 0.64

Therefore, we see the genotype frequencies after one round of mating were equal to the genotype frequencies after two rounds of mating (i.e. H-W equilibrium is attained).
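The starred case above, a dioecious species whose two sexes start with unequal allele frequencies, can be sketched the same way (the 0.2 and 0.6 starting frequencies below are made up for illustration):

# Hypothetical, unequal frequencies of allele A in males and females
p.m <- 0.2
p.f <- 0.6

# Generation 1: each offspring draws one allele from each sex,
# so the genotype frequencies are NOT yet in H-W proportions
AA.1 <- p.m * p.f                              # 0.12
Aa.1 <- p.m * (1 - p.f) + p.f * (1 - p.m)      # 0.56
aa.1 <- (1 - p.m) * (1 - p.f)                  # 0.32

# The allele frequency in generation 1 is the parental average, equal in both sexes
p.1 <- AA.1 + 0.5 * Aa.1                       # (p.m + p.f) / 2 = 0.4

# Generation 2: both sexes now share p.1, so H-W proportions hold from here on
c(AA = p.1^2, Aa = 2 * p.1 * (1 - p.1), aa = (1 - p.1)^2)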

Now this is all nice, but the H-W assumptions are almost never met in the real world. For example, there may be small population sizes or deviations from random mating (e.g. assortative mating, inbreeding), and there are more than likely outside forces at play such as mutation, migration and selection. To test for deviations from H-W equilibrium, we can use the $\chi^2$ goodness-of-fit test or exact tests (better for small sample sizes) in standard software tools such as R or PLINK.

Example: If we have two alleles $A$ and $a$, we can compare the observed genotype counts with values expected under H-W equilibrium:

\[ \begin{array}{ccc}
\mbox{Genotype} & \mbox{Observed} & \mbox{Expected} \\
\hline
AA & n_{AA} & np^2 \\
Aa & n_{Aa} & 2np(1-p) \\
aa & n_{aa} & n(1-p)^2
\end{array} \]
where $n = n_{AA} + n_{Aa} + n_{aa}$ and $p = P(A) = (2n_{AA} + n_{Aa}) / (2n)$.
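As an illustration, here is a minimal sketch of the $\chi^2$ goodness-of-fit test in base R (the genotype counts are made up). The statistic is computed by hand because the test has 1 degree of freedom (3 genotype classes, minus 1, minus 1 estimated allele frequency), rather than the 2 degrees of freedom chisq.test would use by default:

# Hypothetical genotype counts
n.AA <- 50; n.Aa <- 30; n.aa <- 20
n <- n.AA + n.Aa + n.aa

# Estimated frequency of allele A
p <- (2*n.AA + n.Aa) / (2*n)

# Observed vs. expected counts under H-W equilibrium
observed <- c(n.AA, n.Aa, n.aa)
expected <- c(n*p^2, 2*n*p*(1-p), n*(1-p)^2)

# Chi-square statistic and p-value on 1 degree of freedom
chisq <- sum((observed - expected)^2 / expected)
pchisq(chisq, df = 1, lower.tail = FALSE)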


Monday, January 28, 2013

Markov Chains: Stationary Distributions

In my first post on Markov chains, I introduced a particular type of stochastic process with the basic first-order Markov property (in discrete time), which we call a Markov chain. That post explained 1-step and 2-step transition probabilities and how to compute the marginal or unconditional probability of being in a particular state at time $n$. This post will be a bit more technical, but I hope to end with a concrete example (with R code) to make things more clear.

In this post, I will introduce stationary distributions by answering the following questions:
  1. What is a stationary distribution? 
  2. What properties are needed to guarantee the unique existence of the stationary distribution? 
  3. How do we compute the stationary distribution? 
Q1: A stationary distribution $\pi$ is a probability distribution such that once the Markov chain reaches it, the chain remains in that probability distribution forever. This means whenever you are interested in questions like "In the long run, what is the proportion of time the Markov chain will be in a particular state?" (as in the rain or no rain example), you're interested in computing the stationary distribution (when that is possible).

Let's start with a bit of notation.  Define $\pi_j$ as the limiting probability that the process will be in state $j$ at time $n$ or
\[ \pi_j = \lim_{n \rightarrow \infty} P_{ij}^n \]
Consider:
\begin{eqnarray*}
\pi_j = \lim_{n \rightarrow \infty} P[X_{n+1} = j] & = & \lim_{n \rightarrow \infty}  \sum_{i = 0}^{\infty} P[X_{n+1} = j | X_n = i] P[X_n = i] \\
& = & \sum_{i = 0}^{\infty}  \lim_{n \rightarrow \infty}  P_{ij} P[X_n = i] \hspace{.2in} \mbox{(Fubini's Theorem)} \\
& = & \sum_{i = 0}^{\infty} P_{ij} \pi_i
\end{eqnarray*}
which yields the defining equation of the stationary distribution:
\[ \pi_j = \sum_{i=0}^{\infty} P_{ij} \pi_i \]

Q2: This question is more difficult to answer.
Short answer: The Markov chain needs to be (1) finite, (2) aperiodic and (3) irreducible.
Longer answer: Assume $P$ is the transition probability matrix of a Markov chain with the following properties:

  1. finite = the chain has a finite number of states
  2. aperiodic = state $i$ has period $d$ if $P_{ii}^n = 0$ whenever $n$ is not divisible by $d$ and $d$ is the largest integer with this property; a state with period $d = 1$ is aperiodic, and the chain is aperiodic if all of its states are
  3. irreducible = two states that communicate belong to the same class, and classes are either disjoint or identical; if there exists only one class (every state communicates with every other), the Markov chain is irreducible

Then, there exists a unique probability distribution satisfying
\[ \pi_j = \sum_{i = 0}^{\infty} \pi_i P_{ij} \hspace{1in} j \geq 0, \hspace{.5in} \sum_j \pi_j = 1 \]
In vector notation $\pi^{'} = \pi^{'} P$ and $\pi^{'} 1 = 1$ where $\pi^{'} = (\pi_0, \ldots, \pi_s)$.

This theorem also gives us the fact that if $P$ is finite, aperiodic and irreducible, then not only does a unique stationary distribution exist, but the limiting probability $\pi_j$ that the process will be in state $j$ at time $n$ also equals the long-run proportion of time that the process will be in state $j$ [Sheldon Ross, 2007]. Or

\[ P^n \rightarrow \left[ \begin{array}{ccc}
\pi_0 & \ldots & \pi_s \\
\pi_0 & \ldots & \pi_s \\
\vdots & & \vdots \\
\pi_0 & \ldots & \pi_s
\end{array} \right]  \]

Q3: Back to our rain example. The transition probability matrix $P$ was given by

\[ P = \left[ \begin{array}{cc}
\alpha & 1-\alpha  \\
\beta & 1- \beta \\
\end{array} \right]  \]
where $\alpha = 0.7$ and $\beta = 0.4$.  Just as we computed the 2-step transition probability matrix $P^2$, we can compute the 4-step matrix $P^4$, or in general the $n$-step transition probability matrix $P^n$, and watch what happens as $n \rightarrow \infty$.

Rcode:

> trans.mat <- matrix(c(0.7, 0.3, 0.4, 0.6), 2, 2, byrow = TRUE)
> trans.mat %*% trans.mat %*% trans.mat
      [,1]  [,2]
[1,] 0.583 0.417
[2,] 0.556 0.444

> trans.mat %*% trans.mat %*% trans.mat %*% trans.mat
       [,1]   [,2]
[1,] 0.5749 0.4251
[2,] 0.5668 0.4332
> trans.mat %*% trans.mat %*% trans.mat %*% trans.mat %*% trans.mat
        [,1]    [,2]
[1,] 0.57247 0.42753
[2,] 0.57004 0.42996
> trans.mat %*% trans.mat %*% trans.mat %*% trans.mat %*% trans.mat %*% trans.mat
         [,1]     [,2]
[1,] 0.571741 0.428259
[2,] 0.571012 0.428988

We see each row is starting to converge to what's called the 'stationary distribution' as $n \rightarrow \infty$.  Thus, let's find the stationary distribution for this Markov chain.  Another way of asking this: in the long run, what is the proportion of time the process is in the rain state or the no-rain state?
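As a quick aside, chaining %*% calls gets tedious for large $n$. One alternative is the matrix power operator %^% from the expm package (assuming that package is installed):

library(expm)
trans.mat %^% 50   # by n = 50, both rows have essentially converged to (4/7, 3/7)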

Because our Markov chain is finite, aperiodic, and irreducible, we can use the set of equations $\pi_j = \sum_{i = 0}^{\infty} \pi_i P_{ij}$ and $ \sum_j \pi_j = 1$.

\begin{eqnarray*}
\pi_0 & = & \alpha \pi_0 + \beta \pi_1 \\
\pi_1 & = & (1-\alpha) \pi_0 + (1-\beta) \pi_1 \\
\pi_0 + \pi_1 & = & 1 \\
\end{eqnarray*}
Solving for $\pi_0$ and $\pi_1$ yields the solutions
\begin{eqnarray*}
\pi_0 & = &  \frac{\beta}{ 1 + \beta - \alpha} \\
\pi_1 & = & \frac{1-\alpha}{1 + \beta - \alpha}
\end{eqnarray*}
Plugging in $\alpha = 0.7$ and $\beta = 0.4$, we see the long-run proportion of time it will rain is $\pi_0 = 4/7 \approx 0.571$ and the long-run proportion of time it won't rain is $\pi_1 = 3/7 \approx 0.429$. This matches the limiting probabilities we calculated above.
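We can also solve for the stationary distribution numerically. Here is a minimal sketch in base R that stacks the balance equations $\pi^{'} = \pi^{'} P$ with the constraint $\sum_j \pi_j = 1$ and solves the resulting system by least squares:

alpha <- 0.7
beta <- 0.4
P <- matrix(c(alpha, 1 - alpha, beta, 1 - beta), 2, 2, byrow = TRUE)

# pi' P = pi' is equivalent to (t(P) - I) pi = 0; append the constraint sum(pi) = 1
A <- rbind(t(P) - diag(2), rep(1, 2))
b <- c(0, 0, 1)
qr.solve(A, b)   # returns (pi_0, pi_1) = (4/7, 3/7)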

Hopefully this was helpful in understanding what a stationary distribution is and how to compute one.  Stay tuned for second-order Markov chains and a great example using the RHmm package in R.

Tuesday, January 15, 2013

Easy Introduction to Markov Chains in R

Markov chains are an important concept in probability and many other areas of research.  In this post, I present the basic Markov property and then a few examples, with R code, to show how Markov chains work.  Most of the background can be found in one of my favorite books on this topic, Introduction to Probability Models by Sheldon Ross.

We define a stochastic process $\{X_n, n = 0,1,2, \ldots \}$ that takes on a finite or countable number of possible values.  Let the possible values be nonnegative integers (i.e. $X_n \in \mathbb{Z}_+ $).  If $X_n = i$ then the process is said to be in state $i$ at time $n$.

Now, we define the 'Markov property' in discrete time:
\[ P_{ij} = P[X_{n+1} = j | X_n = i, X_{n-1} = i_{n-1}, \ldots, X_0 = i_0] = P[X_{n+1} = j | X_n = i] \hspace{.2in} \forall i, j \in \mathbb{Z}_+ \]
Such a stochastic process is known as a Markov chain.  We call $P_{ij}$ a 1-step transition probability because we moved from time $n$ to time $n+1$.  It is a first-order Markov chain because the probability of being in state $j$ at the $(n+1)$st time point only depends on the state at the $n$th time point.  Since probabilities are nonnegative and the process must make a transition into some state, we have these properties:
  1. $P_{ij} \geq 0, \hspace{.2in} \forall i,j \geq 0$
  2. $\sum_{j \geq 0} P_{ij} = 1, \hspace{.2in} i = 0, 1, 2, \ldots$
The $n$-step transition probability $P_{ij}^n$ is defined as
\[ P_{ij}^n = P[X_{n+k} = j | X_k = i] \hspace{.2in} \forall n \geq 0, \hspace{.2in} i,j \geq 0 \]
An important set of equations, the Chapman-Kolmogorov equations, allows us to compute these $n$-step transition probabilities.  They state
\[ P_{ij}^{n+m} = \sum_{k=0}^{\infty} P_{ik}^n P_{kj}^m \hspace{.5in} \forall n, m \geq 0, \hspace{.2in} \forall i, j \geq 0\]
Let's try an example.  Consider two states: 0 = rain and 1 = no rain.  Define two probabilities $\alpha$ = the probability it will rain tomorrow given it rained today and $\beta$ = the probability it will rain tomorrow given it didn't rain today.  We are interested in asking what is the probability it will rain the day after tomorrow given it rained today?

First, we must set up the transition probability matrix. Translating $\alpha$ and $\beta$ into transition probabilities gives $\alpha = P_{00} = P[X_{n+1} = 0 | X_n = 0]$ and $\beta = P_{10} = P[X_{n+1} = 0 | X_n = 1]$.  Because each row must sum to 1, this leads to the transition probability matrix:
\[ P = \left[ \begin{array}{cc}
\alpha & 1-\alpha  \\
\beta & 1- \beta \\
\end{array} \right]  \]
If we assign numbers to $\alpha$ = 0.7 and $\beta$= 0.4, then we can compute the probability it will rain the day after tomorrow (aka 2 steps) given it rained today.
\[ P^2 = \left[ \begin{array}{cc}
\alpha & 1-\alpha  \\
\beta & 1- \beta \\
\end{array} \right]  \left[ \begin{array}{cc}
\alpha & 1-\alpha  \\
\beta & 1- \beta \\
\end{array} \right]  = \left[ \begin{array}{cc}
.61 & .39 \\
.52 & .48 \\
\end{array} \right]  \]
R code:
> trans.mat <- matrix(c(0.7, 0.3, 0.4, 0.6), 2,2, byrow = TRUE)
> trans.mat %*% trans.mat
     [,1] [,2]
[1,] 0.61 0.39
[2,] 0.52 0.48

These probabilities are conditional probabilities because they depend on what happened the previous day.  We can also ask: 'What is the unconditional probability of rain at time $n$?'  This can be found by computing the unconditional or marginal distribution of the state at time $n$:
\[  P[X_n = j] = \sum_{i = 0}^{\infty} P[X_n = j| X_0 = i] P[X_0 = i] = \sum_{i = 0}^{\infty} P_{ij}^n \alpha_i \]
where $\alpha_i = P[X_0 = i]$, $\forall i \geq 0$ and note $\sum_{i = 0}^{\infty} \alpha_i = 1$ because we must start in some state.

Back to the example. If we let $\alpha_0 = .4$ and $\alpha_1 = 0.6$ in our rain example, then the unconditional or marginal probability it will rain in two days is
\[ P[X_2 = 0] = 0.4 P_{00}^2 + 0.6 P_{10}^2 = 0.4 (0.61) + 0.6 (0.52) = 0.556 \]
Rcode:
> twostep.trans <- trans.mat %*% trans.mat 
> init.dist <- matrix(c(0.4, 0.6), 1, 2)
> init.dist %*% twostep.trans
      [,1]  [,2]
[1,] 0.556 0.444
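As a sanity check (this simulation is my own sketch; the function name simulate.two.steps is made up for illustration), we can estimate the same marginal probability by simulating the chain:

# Estimate P[X_2 = 0] by simulation (state 1 = rain, state 2 = no rain)
simulate.two.steps <- function(n.sims, trans.mat, init.dist) {
  final.states <- numeric(n.sims)
  for (s in 1:n.sims) {
    x <- sample(1:2, 1, prob = init.dist)         # draw X_0 from the initial distribution
    for (step in 1:2) {
      x <- sample(1:2, 1, prob = trans.mat[x, ])  # transition using row x of P
    }
    final.states[s] <- x
  }
  mean(final.states == 1)                         # proportion of runs ending in rain
}

set.seed(123)
simulate.two.steps(1e5, trans.mat, c(0.4, 0.6))   # should be close to 0.556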

There are many more fun questions to ask, but for now I hope this was a basic introduction to discrete time Markov chains. Stay tuned for a blogpost about stationary distributions and second-order Markov chains!