Monday, April 30, 2012

Freely Available Machine Learning Data sets

I came across a great resource for anyone in statistics / machine learning! It's called the UCI Machine Learning Repository.  The website contains over 200 downloadable data sets, and the collection is even organized by task (classification, regression, clustering, etc.), data type (multivariate, time-series, ... ), area (biological, business, ... ), and number of covariates.   A very valuable resource.



Tuesday, April 17, 2012

Gaddy Getz - April 18th 4pm

Gaddy Getz is giving a talk at 4pm in the medical center tomorrow!  It should be a great talk for everyone interested in cancer genomics.

Wednesday, April 11, 2012

Using BioMart and biomaRt

I've been wanting to explore BioMart and the corresponding R package biomaRt, which is part of the Bioconductor suite.  I recently came across this blog post explaining a bit more about the strengths of the package.

biomaRt is a package that interfaces with a large number of databases implemented by the BioMart suite.  You don't need to know SQL, just R.  Examples of BioMart databases include Ensembl and HapMap.  As the blog post above says, "The concept is simple. You have a set of identifiers that describe a biological object, such as a gene. These are called filters. They have values – for example, HGNC symbols. You want to retrieve other identifiers – attributes – for your objects."  

First, we must install the R library biomaRt. 
source("http://bioconductor.org/biocLite.R")
biocLite("biomaRt")


We use the useMart() function to interface with a particular database.  To see the available marts, use listMarts(). Within the database, we need to pick a particular dataset.  You can see what datasets are available using the function listDatasets().  If we want to extract particular attributes from the database, we need to know what attributes are available.  This can be found using the listAttributes() function.  Finally, we use the getBM() function to actually extract the information.
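
To make the workflow concrete, here is a minimal exploration session (just a sketch; the mart and dataset names are the same ones used in the examples below):

library(biomaRt)

# See which marts (databases) BioMart exposes
listMarts()

# Connect to the Ensembl mart and choose the human gene dataset
mart <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")

# Browse the datasets, filters, and attributes available in this mart
head(listDatasets(mart))
head(listFilters(mart))
head(listAttributes(mart))

# The general query pattern is then getBM(attributes, filters, values, mart)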

Next, I will consider two examples. 

Example 1: Randomly sample n = 500 HGNC gene symbols from the human genome

# Load library
library(biomaRt)

# Define biomart object
mart <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")
# listDatasets(mart)
# listAttributes(mart)

# Extract information from biomart
results <- getBM(attributes = c("hgnc_symbol"), mart = mart)

# Randomly sample the gene name list
N <- 500
sample.hgnc <- sample(results$hgnc_symbol, N)

Sample results
> head(sample.hgnc)
[1] "LINC00293" "C6orf223"  "PRMT5-AS1" "SYT11"     "FLNB"      "SNORA49"  


Example 2: Given a list of RefSeq IDs, convert them to HGNC symbols or UniProt/Swiss-Prot IDs
library(biomaRt)

# Define biomart object
mart <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")

# Read in file with gene names
genes <- read.csv("refseq.csv")

# Extract information from biomart: RefSeq -> HGNC symbol
results.hgnc <- getBM(attributes = c("refseq_mrna", "hgnc_symbol"), filters = "refseq_mrna", values = genes[,1], mart = mart)

# Extract information from biomart: RefSeq -> UniProt/Swiss-Prot ID
results.uniprot <- getBM(attributes = c("refseq_mrna", "uniprot_swissprot"), filters = "refseq_mrna", values = genes[,1], mart = mart)
# set uniqueRows = TRUE/FALSE to control whether duplicate rows are returned

# Match the RefSeq names with the Uniprot names (getBM does not preserve the input order)
matched <- match(genes[,1], results.uniprot[,1])
cbind(genes, Uniprot = results.uniprot[matched, 2])

Sample results
     refseq Uniprot
1 NM_023018  O95544
2 NM_178545  Q8NDY8
3 NM_033467  Q495T6
4 NM_004402  O76075
5 NM_018198  Q9NVH1
6 NM_018198  Q9NVH1

After working through all this, I've quickly learned that biomaRt is a very powerful tool for bioinformatics.  

Tuesday, April 10, 2012

Steve Stigler gives a talk at Rice University

As part of the celebration of the 25th anniversary of the Department of Statistics at Rice University, Steve Stigler gave a talk yesterday titled 'How a Statistical Idea Saved Darwin's Theory and Much More'. He began his talk with some old pictures of how the department was founded, including pictures of the founding mathematics department in 1927.


Then Steve discussed some very influential mathematicians and probabilists who contributed to important ideas in statistics, such as 'The Rule of Three'.  In brief: if the following relationship holds,
\[ \frac{A}{B} = \frac{C}{D} \]
and you know any three of the four quantities (A, B, C, D), then you can easily solve for the fourth (for example, $D = BC/A$).  In 1855, Charles Darwin was even quoted as saying, "I have no faith in anything short of actual measurement and the Rule of Three".  Karl Pearson suggested this be the motto of the journal Biometrika.  For a full account of the story, see Stigler (2012) in Biometrika.

Interestingly, Francis Galton discovered that if the data are perfectly correlated, then the Rule of Three holds; when the data are not perfectly correlated, the idea of a conditional distribution must be introduced.  The example that led to this discovery involved an anthropologist studying skeletons at a grave site.  He let $T$ = the length of a man's thigh bone and $H$ = the man's height.  After measuring several complete skeletons, he computed $m_T$ = the mean thigh bone length and $m_H$ = the mean height.  The anthropologist then wanted to infer heights from thigh bone lengths, so he assumed that
\[ \frac{m_T}{m_H} = \frac{T}{H} \]
What Galton discovered is that this formula would only work if the data were perfectly correlated.  Because the assumption did not hold, Galton stumbled onto what we now call correlation.  This led to the idea of expectations of conditional distributions, and hence the regression line!  A very interesting story.
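
In modern notation, the object Galton's insight leads to is the best linear predictor (the regression line) of $H$ given $T$:
\[ E[H \mid T] = m_H + \rho \frac{\sigma_H}{\sigma_T} (T - m_T), \]
where $\rho$ is the correlation between thigh bone length and height; this reduces to the simple proportion above only when $\rho = 1$ and $\sigma_H / \sigma_T = m_H / m_T$.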

Steve ended his talk with some encouraging words about the Statistics Department at Rice. He showed this picture, which was taken at Rice when Emile Borel visited.



The night ended with a nice dinner in Duncan Hall with pictures, awards, and a lovely cake.


Congratulations to the Department of Statistics at Rice!  I'm excited to see what the next 25 years have in store.

Wednesday, April 4, 2012

Applying the EM Algorithm: Binomial Mixtures

Last month I made a post about the EM algorithm and how to estimate confidence intervals for the parameter estimates it produces.  In this post, I give the code for estimating the parameters of a binomial mixture and their confidence intervals.

Consider the following problem:  We have two coins. The probability of observing a head is $p$ with the first coin and $q$ with the second. We do not observe which coin was used; we only observe $Y_i \in \{0,1\}$, which records whether the $i$-th flip is a head or a tail.   Let $Z_i \in \{0,1\}$ be the missing indicator of which coin was used on each flip: the first coin is used with probability $P(Z_i = 0) = \pi$ and the second with probability $P(Z_i = 1) = 1-\pi$. The complete data are then $X = (Y, Z)$.  By the law of total probability,
\[ P[Y] =  P[Y | Z = 0] P[Z=0] + P[Y| Z = 1] P[Z = 1] \]
Therefore, the complete likelihood is given by
\[ L_0(\theta | X) = \prod_i [p^{y_i}(1-p)^{(1-y_i)} \pi ]^{1-z_i} [q^{y_i}(1-q)^{(1-y_i)}  (1-\pi)  ]^{z_i} \]
where $\theta = (\pi, p, q)$.  In the E-step, we compute the posterior probability that the first coin was used, conditional on the observed data and the current parameter estimates $\theta^{(t)}$:
\[ w_i = P(Z_i = 0 \mid Y_i, \theta^{(t)}) = \frac{ (p^{(t)})^{y_i}(1-p^{(t)})^{1-y_i}\, \pi^{(t)} }{ (p^{(t)})^{y_i}(1-p^{(t)})^{1-y_i}\, \pi^{(t)} + (q^{(t)})^{y_i}(1-q^{(t)})^{1-y_i}\, (1-\pi^{(t)}) }  \]
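Writing out the expected complete-data log-likelihood $Q(\theta \mid \theta^{(t)}) = E_{Z \mid Y, \theta^{(t)}}[\log L_0(\theta \mid X)]$ explicitly (a quick intermediate step; note $w_i$ is the expected value of $1 - Z_i$) gives
\[ Q(\theta \mid \theta^{(t)}) = \sum_{i=1}^n w_i \left[ y_i \log p + (1-y_i)\log(1-p) + \log \pi \right] + (1-w_i) \left[ y_i \log q + (1-y_i)\log(1-q) + \log(1-\pi) \right] \]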
Taking the gradient of $Q(\theta \mid \theta^{(t)})$ with respect to the three parameters $\pi, p, q$, setting it to zero, and solving yields the M-step updates
\begin{eqnarray*}
\hat{\pi} & = & \frac{1}{n} \sum_{i = 1}^n w_i \\
\hat{p} & = &  \frac{  \sum_{i = 1}^n w_i y_i }{  \sum_{i = 1}^n w_i} \\
\hat{q} & = &  \frac{  \sum_{i = 1}^n (1-w_i) y_i }{  \sum_{i = 1}^n (1-w_i)}
\end{eqnarray*}

The confidence intervals associated with the EM estimates can be found by calculating the observed information
\[ I(\hat{\theta}) = - \left. \frac{\partial^2 Q}{\partial \theta^2} \right|_{\theta = \hat{\theta}} \]
A $100(1-\alpha)\%$ confidence interval for each component of $\theta$ is then
\[ \left[ \hat{\theta} - Z_{\alpha/2} \frac{1}{\sqrt{I(\hat{\theta})}} ,\; \hat{\theta} + Z_{\alpha/2} \frac{1}{\sqrt{I(\hat{\theta})}} \right] \]



Here is my R code:
## Expectation Step
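# Returns w_i: the posterior probability that observation i came from the first coin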
estep <- function(obs,pi,p,q){
  pi_estep <- (pi*dbinom(obs,1,p)) / ( pi*dbinom(obs,1,p) + (1-pi)*dbinom(obs,1,q) )
  return(pi_estep)
}


## Maximization Step
mstep <- function(obs,e.step){
  
  # estimate pi
  pi_temp <- mean(e.step)
  
  # estimate p,q
  p_temp <- sum(obs*e.step) / sum(e.step)
  q_temp <- sum(obs*(1-e.step)) / sum(1-e.step)
  
  list(pi_temp,p_temp,q_temp)   
}



## EM Algorithm
em.algo <- function(obs,pi_inits,p_inits,q_inits,maxit=1000,tol=1e-6){
  # Initial parameter estimates
  flag <- 0
  pi_cur <- pi_inits; p_cur <- p_inits; q_cur <- q_inits
   
  # Iterate between expectation and maximization steps
  for(i in 1:maxit){
    cur <- c(pi_cur,p_cur,q_cur)
    new <- mstep(obs,estep(obs, pi_cur, p_cur, q_cur))
    pi_new <- new[[1]]; p_new <- new[[2]]; q_new <- new[[3]]
    new_step <- c(pi_new,p_new,q_new)
    
    # Stop iteration if the difference between the current and new estimates is less than a tolerance level
    if( all(abs(cur - new_step) < tol) ){ flag <- 1; break}


    # Otherwise continue iteration
    pi_cur <- pi_new; p_cur <- p_new; q_cur <- q_new
  }
  if(!flag) warning("Didn't converge\n")
  
  list(pi_cur, p_cur, q_cur)
}



## Calculate Information matrix
Info.Mat.function <- function(obs, pi.est, p.est, q.est){
  estep.est <- estep(obs,pi.est, p.est, q.est)
  info.mat <- matrix(rep(0,9),3,3)
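  # Q separates in (pi, p, q), so the off-diagonal second derivatives are zero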
  info.mat[1,1] <- - sum(estep.est)/(pi.est^2)  - sum((1-estep.est))/((1-pi.est)^2) 
  info.mat[2,2] <- - sum(estep.est*obs)/(p.est^2) - sum(estep.est*(1-obs))/((1-p.est)^2)
  info.mat[3,3] <- - sum((1-estep.est)*obs)/(q.est^2) - sum((1-estep.est)*(1-obs))/((1-q.est)^2)
  return(-info.mat)
}


## Generate sample data
n <- 5000 
pi_true <- 0.90 # prob of using first coin
p_true <-  0.60 # the first coin has P(heads) = 0.60
q_true <-  0.50 # the second coin has P(heads) = 0.50
true <- c(pi_true,p_true,q_true)
u <- ifelse(runif(n)<pi_true, rbinom(n,1,p_true),rbinom(n,1,q_true))


## Set initial parameter estimates
pi_init = 0.70; p_init = 0.70; q_init = 0.60


## Run EM Algorithm
output <- em.algo(u,pi_init,p_init,q_init)


## Calculate Confidence Intervals
sd.out <- sqrt(diag(solve(Info.Mat.function(u,output[[1]],output[[2]],output[[3]]))))
data.frame("Truth" = true, "EM Estimate" = unlist(output), "Lower CI" = unlist(output) - qnorm(.975)*sd.out, "Upper CI" = unlist(output) + qnorm(.975)*sd.out)

Example output: 
  Truth EM.Estimate  Lower.CI  Upper.CI
1   0.9   0.8941314 0.8856034 0.9026594
2   0.6   0.6081113 0.5938014 0.6224211
3   0.5   0.4483729 0.4060064 0.4907393



Tuesday, April 3, 2012

Eliminating Highly Polymorphic Genes from Whole Exome Sequencing

Whole exome sequencing yields thousands of variants for each individual sequenced.  Many of these variants are in genes that are highly polymorphic or in regions that do not sequence well, and therefore would not be of interest when searching for a disease-causing variant.  This is because any frequently mutated gene containing many deleterious variants has a low probability of harboring the disease-causing mutation.  Also, if a gene lies in a region that does not sequence well, a high number of variants will often be reported there.  Such variants would be considered false positives.   There is great interest in detecting these false-positive signals and eliminating the corresponding variants from the list reported back from whole exome sequencing.  

A paper was just published this month in Human Mutation describing a way to do this.  The authors published several lists of genes for researchers to use in their own projects, hypothesizing that these genes are unlikely to contain the disease-causing mutation.   Such a gene list could be an incredibly useful filter for removing highly polymorphic genes or genes that simply do not sequence well.
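
In practice, the filtering step itself is straightforward once you have such a list. Here is a minimal sketch in R (the file names and column names are hypothetical placeholders, not taken from the paper):

# Hypothetical inputs: a table of called variants and the published gene list
variants <- read.csv("exome_variants.csv")            # assumes a 'gene' column with gene symbols
flagged  <- read.csv("highly_polymorphic_genes.csv")  # assumes a 'gene' column with flagged genes

# Drop variants that fall in the flagged genes
filtered <- variants[ !(variants$gene %in% flagged$gene), ]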

Monday, April 2, 2012

pizza pizza!

This week we've been attempting Italian-themed meals!  After our risotto disaster last night, I was hoping for a bit of redemption tonight with homemade pizzas.


This was my first time using a pizza stone, and it definitely won't be my last! I read through several tutorials on how to get the best results out of a pizza stone.  The stone made the crust very crispy, a kind of crispness I've never been able to get out of a regular pizza pan. Here's the dough recipe I used:

White Pizza Dough Recipe:
1 packet of active dry yeast
1 cup of warm water
2 1/2 cups all-purpose flour
1/2 teaspoon salt
2 teaspoons olive oil
1/4 cup cornmeal

Directions: Mix the yeast and warm water.  Let stand until the yeast dissolves (approximately 5-10 mins).  Mix the flour, salt, olive oil, and yeast mixture together until the mixture forms a sticky ball that has pulled away from the side of the bowl.  Add more flour if needed.  Let rise in a well-oiled bowl for 30 mins (cover with a towel).  Punch down and separate into two balls.  At this point, you can either store the pizza dough in a container or roll it out into two 12-inch pizzas.  Since the dough was for tonight, I placed the pizza stone in a cold oven to warm up to 400 degrees.


While the dough was rising, I chopped up a few toppings.  Tonight, I picked tomatoes, jalapeno peppers, basil (picked from our patio garden), parm and mozzarella cheese!  Chris added some pepperoni to his pizza too.


When the stone is hot, take it out of the oven and add a nice dusting of cornmeal. This is a critical step!  It keeps the dough from sticking to the stone.  After I rolled out the dough, I placed it on the steaming hot stone.  Because I prefer pizzas with no tomato sauce, I first added a light layer of olive oil followed by the toppings.


Bake in the oven at 400 degrees for 10 mins and then broil for 2 minutes just to give it that perfect golden-brown color.


Great texture in the dough and great toppings = great pizza!