Have you heard of this thing called '
samtools', but not really sure what it is or how to get started? Maybe you are a postdoc in a wet-lab that is now starting to use whole-genome and whole-exome sequencing, but there isn't an in-house bioinformatician to help sift though the computational side of the project? If you're like me (a statistician by training and working with genomic data), learning about tools to handle next-generation sequencing data is essential. This tutorial is meant give a little background about samtools and then provide some technical (and hopefully helpful) information on the purpose of samtools, its relation to Rsamtools and examples of how to utilize both.
Beginning with SAMtools
What does SAM mean? How is SAM different from BAM?
The
Sequence Alignment/Map (SAM) format was developed to store aligned nucleotide sequence reads produced from genome aligners (e.g.
BWA) in a somewhat human readable tab-deliminted text file. Practically speaking, this format has become the standard alignment format widely used by groups such as
1000 Genomes Project. The Binary Alignment/Map (BAM) format is a binary version of SAM which contains the same information and simply compresses the data to save storage space (and BAM files are faster to manipulate). In the SAM format, there is a header section (all header rows begin with '@') and an alignment section (all alignment rows contain 11 mandatory fields + optional fields).
What is SAMtools and where did SAMtools come from?
The SAM/BAM format and SAMtools were published by
Li et al. 2009. The purpose of samtools is provide a way to manipulate and interact with the aligned sequence reads in SAM/BAM format. Samtools does not work with the entire file of aligned reads, but rather works in a stream which allows Unix commands to be used in conjunction with the SAM files.
How do you install samtools?
The easiest way to do this to check out the most recent samtools source code from the github project page. If you're not familiar with
git or
github, I would suggest
this wonderful tutorial by Karl Broman. The best description of github I've heard is this tutorial, Karl writes "
Github is like facebook for programmers", which couldn't be more true. So, if you've set yourself up with a github account, then simply check out the lastest samtools source code
git clone git://github.com/samtools/samtools.git
cd samtools
make
make razip
This last line is to compile razip which is a tool to compress files and still make them randomly accessible by samtools (not normally possible with gzip files). Finally, the last step is to make samtools accessible to all the users on the machine which I extracted from this
tutorial.
sudo cp ~/current/working/directory/samtools /usr/local/samtools-v0.1.19
sudo ln -s /usr/local/samtools-v0.1.19 /usr/local/samtools
The second line makes a symbolic link to the latest version directory. When a new version of samtools becomes available (e.g. samtools-v0.1.20), you can install it in the same directory and then just update the /usr/local/samtools symbolic link. You can check to make sure samtools was installed in your $PATH with the following command:
which samtools
If nothing happens, samtools was not properly put in your $PATH.
Examples of how to use SAMtools
There are many
tutorials on samtools. The standard format of samtools requires a command and then allows for additional options.
samtools [command] <options>
If you type just '
samtools' in the terminal window, a list of available command (and their descriptions) will be provided. We begin by creating a folder which will contain our SAM file and reference genome in fasta format which we used to align the reads.
mkdir MyProject
cp ~/current/working/directory/test.sam MyProject
cp ~/current/working/directory/ref.fa MyProject
samtools view
To convert from SAM to BAM format (option -b) when header is available and input is SAM format (option -S)
samtools view -Sb test.sam > test.bam
To convert from SAM to BAM format when header is not available, but ref.fa was used to map the reads
samtools view -b ref.fa test.sam > test.bam
To filter out unmapped reads from BAM files (option -F)
samtools view -bF test.bam > mappedtest.bam
samtools sort
To sort BAM file by chromosomal coordinates (default) or read names (option -n) and create a test.sorted.bam file
samtools sort test.bam test.sorted
An example of using a Unix pipe to go from the SAM format to the sorted BAM format
samtools view -Sb test.sam | samtools sort test.sorted
samtools index
To create an indexed bam file (.bai) from the test.sorted.bam file
samtools index test.sorted.bam
samtools faidx
To create an indexed reference genome in FASTA format (creates a .fai file)
samtools faidx ref.fa
Examples of using SAM/BAM files
'samtools mpileup' and 'bcftools view'
To create a Binary Call Format (bcf) file (option -u) by mapping the bases using the indexed reference genome (option -f) and call genomic variants
samtools mpileup -u -f ref.fa test.sorted.bam > test.bcf
To convert the bcf file to a human readable Variant Call Format (vcf) file
bcftools view -v -g test.bcf > test.vcf
From here, you can also check out the
Integrative Genomics Viewer (IGV) to visually look at the aligned reads or try importing the BAM files into tools like Rsamtools.
Examples of how to use Rsamtools
The purpose of
Rsamtools is to provide an interface between R and BAM files produced by the tools
samtools,
bcftools, and
tabix (not discussed here). To install Rsamtools in R, use
source("http://bioconductor.org/biocLite.R")
biocLite("Rsamtools")
The main purpose is to import (indexed!) BAM files into R using the
scanBam() function. The vignette for Rsamtools can be found
here. The entire BAM file should not be read into R, but rather just the portion of the genome of interest using the
what and
which arguments in
ScanBamParam(which=which, what=what). There are other packages which can read in BAM files include ShortRead and GenomicRanges.
To extract the header information, use
scanBamHeader(). Use
filterBam() to filter reads from BAM file according to the criteria defined in
ScanBamParam().
Another great use of Rsamtools is to access multiple BAM files using the
BamViews class in Rsamtools. This allows you to obtain metadata by 'viewing' the BAM files rather than importing each BAM individually.