biomaRt is an R package that interfaces with the many databases built on the BioMart software suite. You don't need to know SQL, just R. Examples of BioMart databases include Ensembl and HapMap. As the blog post above says, "The concept is simple. You have a set of identifiers that describe a biological object, such as a gene. These are called filters. They have values – for example, HGNC symbols. You want to retrieve other identifiers – attributes – for your objects."
First, we must install the R library biomaRt.
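# biocLite() has been retired; on R >= 3.5 Bioconductor packages are installed through BiocManager
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("biomaRt")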
We use the useMart() function to interface with a particular database. To see the available marts, use listMarts(). Within the database, we need to pick a particular dataset. You can see what datasets are available using the function listDatasets(). If we want to extract particular attributes from the database, we need to know what attributes are available. This can be found using the listAttributes() function. Finally, we use the getBM() function to actually extract the information.
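The attribute table can be long, so a keyword search helps when looking for the right attribute name. The sketch below searches a table shaped like the output of listAttributes(), which is a data frame with name and description columns. The helper find_attributes and the small mock table are made up for illustration, so the snippet runs without a network connection; with a live connection you would pass listAttributes(mart) instead.

```r
# Hypothetical helper (not part of biomaRt): keyword search over an
# attribute table that has 'name' and 'description' columns.
find_attributes <- function(attr_table, pattern) {
  hit <- grepl(pattern, attr_table$name, ignore.case = TRUE) |
    grepl(pattern, attr_table$description, ignore.case = TRUE)
  attr_table[hit, , drop = FALSE]
}

# Mock table standing in for listAttributes(mart); rows are illustrative only.
mock_attributes <- data.frame(
  name = c("ensembl_gene_id", "hgnc_symbol", "refseq_mrna", "uniprot_swissprot"),
  description = c("Gene stable ID", "HGNC symbol", "RefSeq mRNA ID",
                  "UniProtKB/Swiss-Prot ID"),
  stringsAsFactors = FALSE
)

find_attributes(mock_attributes, "hgnc")

# With a live connection (assumes the Ensembl mart is reachable):
# mart <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")
# find_attributes(listAttributes(mart), "hgnc")
```

The same helper should work on the output of listFilters(), since that table also has name and description columns.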
Next, I will consider two examples.
Example 1: Randomly sample N = 500 HGNC gene symbols from the human genome
# Load library
library(biomaRt)
# Define biomart object
mart <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")
# listDatasets(mart)
# listAttributes(mart)
# Extract information from biomart
results <- getBM(attributes = c("hgnc_symbol"), mart = mart)
# Randomly sample the gene name list
N <- 500
sample.hgnc <- sample(results$hgnc_symbol,N)
Sample results
> head(sample.hgnc)
[1] "LINC00293" "C6orf223" "PRMT5-AS1" "SYT11" "FLNB" "SNORA49"
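One thing to keep in mind with Example 1: genes without an HGNC symbol typically come back from getBM() as empty strings, so a raw sample can contain blanks. A minimal sketch of cleaning the vector before sampling follows. The helper name sample_symbols and the mock vector are assumptions for illustration, and set.seed() is there only to make the draw repeatable.

```r
# Hypothetical helper: drop blank or missing symbols and duplicates, then sample n.
sample_symbols <- function(symbols, n, seed = NULL) {
  clean <- unique(symbols[!is.na(symbols) & nzchar(symbols)])
  if (n > length(clean)) stop("n exceeds the number of distinct symbols")
  if (!is.null(seed)) set.seed(seed)
  sample(clean, n)
}

# Mock vector standing in for results$hgnc_symbol
mock_symbols <- c("SYT11", "", "FLNB", NA, "SYT11", "PRMT5-AS1", "SNORA49")
picked <- sample_symbols(mock_symbols, 3, seed = 1)
picked
```

With the real data, the call would be sample_symbols(results$hgnc_symbol, 500).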
Example 2: Given a list of RefSeq IDs, convert them to HGNC symbols or UniProt/Swiss-Prot IDs
library(biomaRt)
# Define biomart object
mart <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")
# Read in file with gene names
genes <- read.csv("refseq.csv")
# Extract information from biomart: pick the mapping you need
# (running both calls leaves only the second one, RefSeq to Uniprot, in results)
results <- getBM(attributes = c("refseq_mrna", "hgnc_symbol"), filters = "refseq_mrna", values = genes[,1], mart = mart)
results <- getBM(attributes = c("refseq_mrna", "uniprot_swissprot"), filters = "refseq_mrna", values = genes[,1], mart = mart)
# Set uniqueRows = TRUE/FALSE in getBM() to return a unique list of IDs or not
# Match the RefSeq names with the Uniprot names
# (look up each input ID in the refseq_mrna column, not the Uniprot column)
matched <- match(genes[,1], results$refseq_mrna)
cbind(genes, Uniprot = results$uniprot_swissprot[matched])
Sample results
     refseq Uniprot
1 NM_023018  O95544
2 NM_178545  Q8NDY8
3 NM_033467  Q495T6
4 NM_004402  O76075
5 NM_018198  Q9NVH1
6 NM_018198  Q9NVH1
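The match() step keeps only the first hit for each RefSeq ID. If an ID can map to more than one UniProt entry, or to none, base R's merge() is one alternative. The sketch below uses small mock data frames in place of genes and results; the first two IDs come from the sample output, while NM_PLACEHOLDER is a made-up ID standing in for an input with no mapping.

```r
# Mock inputs standing in for the CSV file and the getBM() result
mock_genes <- data.frame(
  refseq = c("NM_023018", "NM_178545", "NM_PLACEHOLDER"),
  stringsAsFactors = FALSE
)
mock_results <- data.frame(
  refseq_mrna = c("NM_178545", "NM_023018"),
  uniprot_swissprot = c("Q8NDY8", "O95544"),
  stringsAsFactors = FALSE
)

# Left join: keep every input ID, with NA where no mapping was returned.
# An ID with several mappings would produce one row per mapping.
mapped <- merge(mock_genes, mock_results,
                by.x = "refseq", by.y = "refseq_mrna",
                all.x = TRUE, sort = FALSE)
mapped
```

Note that merge() does not promise to preserve the input row order, so look rows up by ID rather than by position.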