Wednesday, April 11, 2012

Using BioMart and biomaRt

I've been wanting to explore BioMart and its companion R package, biomaRt, which is part of the Bioconductor suite.  I recently came across a blog post that explains more about the package's strengths.

biomaRt is a package that interfaces with a large number of databases built on the BioMart suite.  You don't need to know SQL, just R.  Examples of BioMart databases include Ensembl and HapMap.  As the blog post mentioned above puts it, "The concept is simple. You have a set of identifiers that describe a biological object, such as a gene. These are called filters. They have values – for example, HGNC symbols. You want to retrieve other identifiers – attributes – for your objects."

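First, we must install the biomaRt package from Bioconductor.  Current Bioconductor releases install packages through BiocManager; the older biocLite() script has been retired.
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("biomaRt")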


We use the useMart() function to connect to a particular BioMart database (a "mart").  To see the available marts, use listMarts().  Within a mart, we need to pick a particular dataset; listDatasets() shows which datasets are available.  To extract particular attributes, we need to know which ones exist; listAttributes() lists them, and listFilters() does the same for filters.  Finally, we use the getBM() function to run the query and return the results as a data frame.
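The attribute listing for a dataset can be long, so it helps to search it by keyword.  Below is a minimal sketch of that idea.  It assumes the table returned by listAttributes() has name and description columns, and it uses a small hand-written mock table in place of a live query so that it runs offline; the descriptions are illustrative and may not match what a given Ensembl release reports.  find_attributes() is a hypothetical helper, not part of biomaRt.

```r
# Mock stand-in for listAttributes(mart); rows are illustrative only
attrs <- data.frame(
  name = c("ensembl_gene_id", "hgnc_symbol", "refseq_mrna", "uniprot_swissprot"),
  description = c("Ensembl Gene ID", "HGNC symbol", "RefSeq mRNA", "UniProt/SwissProt ID"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep rows whose name or description matches the pattern
find_attributes <- function(attr_table, pattern) {
  hit <- grepl(pattern, attr_table$name, ignore.case = TRUE) |
    grepl(pattern, attr_table$description, ignore.case = TRUE)
  attr_table[hit, , drop = FALSE]
}

refseq_hits <- find_attributes(attrs, "refseq")
print(refseq_hits)

# With a live connection this would be:
# find_attributes(listAttributes(mart), "refseq")
```

The name column of a matching row is the string to pass to the attributes argument of getBM().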

Next, I will consider two examples. 

Example 1: Randomly sample n = 500 HGNC gene symbols from the human genome

# Load library
library(biomaRt)

# Define biomart object
mart <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")
# listDatasets(mart)
# listAttributes(mart)

# Extract the HGNC symbol for every gene (no filters, so the whole dataset is queried)
results <- getBM(attributes = c("hgnc_symbol"), mart = mart)

# Randomly sample N symbols from the list
N <- 500
sample.hgnc <- sample(results$hgnc_symbol, N)

Sample results
> head(sample.hgnc)
[1] "LINC00293" "C6orf223"  "PRMT5-AS1" "SYT11"     "FLNB"      "SNORA49"  
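Genes without an HGNC symbol can come back from the query as empty strings (or possibly NA, depending on the biomaRt version), so it may be worth cleaning the list before sampling.  Here is a rough sketch under that assumption, using a small mock data frame in place of the real getBM() result so that it runs offline.  The seed value is arbitrary and only there to make the draw repeatable.

```r
# Mock stand-in for the getBM() result; includes an empty string, an NA and a duplicate
results <- data.frame(
  hgnc_symbol = c("FLNB", "", "SYT11", NA, "FLNB", "SNORA49"),
  stringsAsFactors = FALSE
)

# Drop missing and empty symbols, then remove duplicates
symbols <- results$hgnc_symbol
symbols <- unique(symbols[!is.na(symbols) & nzchar(symbols)])

# Fix the random seed so the same sample is drawn on every run
set.seed(42)

# Never ask for more symbols than are available
N <- 2
sample.hgnc <- sample(symbols, min(N, length(symbols)))
print(sample.hgnc)
```

With the real results, N would be 500 as in the example above.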


Example 2: Given a list of RefSeq IDs, convert them to HGNC symbols or UniProt/Swiss-Prot IDs

# Load library
library(biomaRt)

# Define biomart object
mart <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")

# Read in a file with the RefSeq IDs in its first column
genes <- read.csv("refseq.csv")

# Option A: retrieve the HGNC symbol for each RefSeq ID
results.hgnc <- getBM(attributes = c("refseq_mrna", "hgnc_symbol"), filters = "refseq_mrna", values = genes[,1], mart = mart)

# Option B: retrieve the UniProt/Swiss-Prot ID for each RefSeq ID
results <- getBM(attributes = c("refseq_mrna", "uniprot_swissprot"), filters = "refseq_mrna", values = genes[,1], mart = mart)
# uniqueRows = TRUE (the default) drops duplicate rows; set it to FALSE to keep them

# Look up each input RefSeq ID in the results by column name
# (match() keeps only the first hit for each ID)
matched <- match(genes[,1], results$refseq_mrna)
cbind(genes, Uniprot = results$uniprot_swissprot[matched])

Sample results
     refseq Uniprot
1 NM_023018  O95544
2 NM_178545  Q8NDY8
3 NM_033467  Q495T6
4 NM_004402  O76075
5 NM_018198  Q9NVH1
6 NM_018198  Q9NVH1
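One thing to keep in mind is that match() returns only the first hit for each input ID, so a RefSeq ID that maps to several UniProt IDs loses all but one of them.  A left join with merge() keeps every mapping.  The sketch below compares the two on mock data; the IDs are made-up placeholders, not real accessions, and the column names are assumed to follow the attribute names requested from getBM().

```r
# Mock input list and mock getBM() result (all IDs are hypothetical placeholders)
genes <- data.frame(
  refseq = c("NM_TEST1", "NM_TEST2", "NM_TEST3"),
  stringsAsFactors = FALSE
)
results <- data.frame(
  refseq_mrna = c("NM_TEST1", "NM_TEST1", "NM_TEST3"),
  uniprot_swissprot = c("P_AAA", "P_BBB", "P_CCC"),
  stringsAsFactors = FALSE
)

# match(): one result per input ID, first hit only, NA when there is no hit
first_hit <- results$uniprot_swissprot[match(genes$refseq, results$refseq_mrna)]
print(first_hit)

# merge() with all.x = TRUE: a left join that keeps every mapping
# and still keeps input IDs that have no hit
all_hits <- merge(genes, results, by.x = "refseq", by.y = "refseq_mrna", all.x = TRUE)
print(all_hits)
```

Which one is preferable depends on whether the downstream analysis needs exactly one row per input ID or every known mapping.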

After working through these examples, I've quickly learned that biomaRt is a very powerful tool for bioinformatics.
