Wednesday, June 19, 2013

Getting Started with SAMtools and Rsamtools

Have you heard of this thing called 'samtools', but not really sure what it is or how to get started?   Maybe you are a postdoc in a wet-lab that is now starting to use whole-genome and whole-exome sequencing, but there isn't an in-house bioinformatician to help sift though the computational side of the project?  If you're like me (a statistician by training and working with genomic data), learning about tools to handle next-generation sequencing data is essential.  This tutorial is meant give a little background about samtools and then provide some technical (and hopefully helpful) information on the purpose of samtools, its relation to Rsamtools and examples of how to utilize both.


Beginning with SAMtools

What does SAM mean? How is SAM different from BAM?
The Sequence Alignment/Map (SAM) format was developed to store aligned nucleotide sequence reads produced from genome aligners (e.g. BWA) in a somewhat human readable tab-deliminted text file.  Practically speaking, this format has become the standard alignment format widely used by groups such as 1000 Genomes Project.  The Binary Alignment/Map (BAM) format is a binary version of SAM  which contains the same information and simply compresses the data to save storage space (and BAM files are faster to manipulate).  In the SAM format, there is a header section (all header rows begin with '@') and an alignment section (all alignment rows contain 11 mandatory fields + optional fields).

What is SAMtools and where did SAMtools come from? 
The SAM/BAM format and SAMtools were published by Li et al. 2009. The purpose of samtools is provide a way to manipulate and interact with the aligned sequence reads in SAM/BAM format.  Samtools does not work with the entire file of aligned reads, but rather works in a stream which allows Unix commands to be used in conjunction with the SAM files.

How do you install samtools?
The easiest way to do this to check out the most recent samtools source code from the github project page.  If you're not familiar with git or github, I would suggest this wonderful tutorial by Karl Broman. The best description of github I've heard is this tutorial, Karl writes "Github is like facebook for programmers", which couldn't be more true.   So, if you've set yourself up with a github account, then simply check out the lastest samtools source code

git clone git://github.com/samtools/samtools.git
cd samtools
make
make razip

This last line is to compile razip which is a tool to compress files and still make them randomly accessible by samtools (not normally possible with gzip files).  Finally, the last step is to make samtools accessible to all the users on the machine which I extracted from this tutorial.

sudo cp ~/current/working/directory/samtools /usr/local/samtools-v0.1.19
sudo ln -s /usr/local/samtools-v0.1.19 /usr/local/samtools

The second line makes a symbolic link to the latest version directory. When a new version of samtools becomes available (e.g. samtools-v0.1.20), you can install it in the same directory and then just update the /usr/local/samtools symbolic link.  You can check to make sure samtools was installed in your $PATH with the following command:

which samtools

If nothing happens, samtools was not properly put in your $PATH.


Examples of how to use SAMtools

There are many tutorials on samtools.  The standard format of samtools requires a command and then allows for additional options.

samtools [command] <options>

If you type just 'samtools' in the terminal window,  a list of available command (and their descriptions) will be provided.  We begin by creating a folder which will contain our SAM file and reference genome in fasta format which we used to align the reads.

mkdir MyProject
cp ~/current/working/directory/test.sam MyProject
cp ~/current/working/directory/ref.fa MyProject

samtools view
To convert from SAM to BAM format (option -b) when header is available and input is SAM format (option -S)
samtools view -Sb test.sam > test.bam

To convert from SAM to BAM format when header is not available, but ref.fa was used to map the reads
samtools view -b ref.fa test.sam > test.bam

To filter out unmapped reads from BAM files (option -F)
samtools view -bF test.bam > mappedtest.bam

samtools sort
To sort BAM file by chromosomal coordinates (default) or read names (option -n) and create a test.sorted.bam file
samtools sort test.bam test.sorted

An example of using a Unix pipe to go from the SAM format to the sorted BAM format
samtools view -Sb test.sam | samtools sort test.sorted

samtools index 
To create an indexed bam file (.bai) from the test.sorted.bam file 
samtools index test.sorted.bam

samtools faidx
To create an indexed reference genome in FASTA format (creates a .fai file)
samtools faidx ref.fa 



Examples of using SAM/BAM files 

'samtools mpileup' and 'bcftools view'
To create a Binary Call Format (bcf) file (option -u) by mapping the bases using the indexed reference genome (option -f) and call genomic variants
samtools mpileup -u -f ref.fa test.sorted.bam > test.bcf

To convert the bcf file to a human readable Variant Call Format (vcf) file
bcftools view -v -g test.bcf > test.vcf

From here, you can also check out the Integrative Genomics Viewer (IGV) to visually look at the aligned reads or try importing the BAM files into tools like Rsamtools.


Examples of how to use Rsamtools

The purpose of Rsamtools is to provide an interface between R and BAM files produced by the tools samtools, bcftools, and tabix (not discussed here). To install Rsamtools in R, use

source("http://bioconductor.org/biocLite.R")
biocLite("Rsamtools")

The main purpose is to import (indexed!) BAM files into R using the scanBam() function.  The vignette for Rsamtools can be found here.  The entire BAM file should not be read into R, but rather just the portion of the genome of interest using the what and which arguments in ScanBamParam(which=which, what=what).  There are other packages which can read in BAM files include ShortRead and GenomicRanges.

To extract the header information, use scanBamHeader().  Use filterBam() to filter reads from BAM file according to the criteria defined in ScanBamParam().

Another great use of Rsamtools is to access multiple BAM files using the BamViews class in Rsamtools. This allows you to obtain metadata by 'viewing' the BAM files rather than importing each BAM individually.


Thursday, June 13, 2013

US Supreme court says NO to patenting natural human genes

Today the US Supreme Court rendered a unanimous decision in the case Association for Molecular Pathology v. Myriad Genetics: "A naturally occurring DNA segment is a product of nature and not patent eligible merely because it has been isolated, but cDNA is patent eligible because it is not naturally occurring."





In the court ruling, Justice Clarence Thomas rejects several claims including "the idea that by isolating the gene and separating it from the surrounding chromosome, Myriad had created something new".  Thomas makes the argument cDNA is patentable because it is not a "product of nature". He writes "cDNA does not present the same obstacles to patentability as naturally occurring, isolated DNA segments. Its creation results in an exons-only molecule, which is not naturally occurring. Its order of the exons may be dictated by nature, but the lab technician unquestionably creates something new when introns are removed from a DNA sequence to make cDNA".

This led to one of my favorite tweets from this afternoon:


What does this mean for a patient seeking a genetic test (e.g. the BRCA1, BRCA2 genes)?  Competing companies offering genetic tests no longer have to pay the high licensing fees and hopefully can make the cost accessible to the consumer.






As expected, this decision has major implications for the medical genetics community.  Twitter exploded with responses from various groups in the genetics and genomics community with hashtags #cDNA and #SCOTUS trending.



Interestingly, by this afternoon Myriad released a statement claiming victory: "'We believe the Court appropriately upheld our claims on cDNA, and underscored the patent eligibility of our method claims, ensuring strong intellectual property protection for our BRACAnalysis test moving forward,' said Peter D. Meldrum, president and CEO".   Of course, not everyone agrees with this statement including Myriad's stock holders it seems...

Overall, today was a wonderful day for the genetics community and for, most importantly, the patients and their families.