Thursday, February 6, 2014

Creating A New Ground for 'Data Science' outside of just Statistics or Computer Science

In a recent AMSTAT News piece, Terry Speed wrote a fantastic and inspirational article on the field of statistics. He began by summarizing some themes from the IYS 2013 "The Future of Statistical Science" workshop he attended. Among the themes he noted were "what an excellent job statisticians are doing", particularly in the areas of "genomics, cancer biology, the study of diet, the environment and climate, in risk and regulation, neuroimaging, confidentiality and privacy and autism research", but he saw a lack of representation from "social, agricultural, government, business and industrial statistics".

He openly acknowledged our field of statistics faces many challenges (I direct you to the link for the complete list). In particular: (1) statistics departments are grappling with the debate of absorbing statistics into the incredibly popular emerging field of "data science" and (2) statistics departments are not able to deliver the type of graduates that companies such as Google, Apple and Amazon want, which "perhaps involves adopting a more engineering approach to our work". Yet, students across the world are seeking out majors and/or courses in statistics and "data science" through many different venues, e.g. taking courses at universities, via MOOCs, or through online tutorials. Should statistics be renamed "Data Science"? What is the overlap? What is the best way to train "data scientists"?

Terry's response was essentially that data science is not the same as statistics and he saw "no evidence that data science ... has any prospect of replacing our discipline" because statistics "is far wider and deeper than data science".  He encourages statisticians to embrace this emerging field of data science and not fear it: "As with mathematics more generally, we are in this business for the long term. Let's not lose our nerve."

These ideas were all echoed at a symposium I attended last Friday called 'Paths to Precision Medicine: The Role of Statistics' hosted by the Department of Biostatistics at Harvard School of Public Health. At the end of the symposium, there was a panel discussion on 'Education of Future Statisticians in the Big Data Era' with Giovanni Parmigiani, Corsee Sanders, Rafael Irizarry and Marc Pfeffer as the panelists.

Giovanni said that transformations in technology are a defining characteristic of the "Big Data Era" and that no single field is able to take on "Big Data" as a whole by claiming it is a subset of what they do (similar to the argument made by Terry). Not computer science. Not electrical engineering. Not statistics. When it comes to creating a curriculum to train individuals who are seeking to become "data scientists" in the Big Data Era, he says we should "teach less and do more". Rather than starting with a predefined idea of what we should be teaching and/or adding more things to a "data science" curriculum, we should let these individuals dive right into projects as early as possible (with mentorship!). This will also help develop essential skills such as communication and teamwork which cannot be taught in the classroom.

Furthermore, Rafa proposed creating a new, uncharted territory between statistics and computer science to train individuals who are seeking to become data scientists, with as much emphasis as each individual wants in the direction of statistics, computer science or other fields. This will more than likely require faculty to step out of their comfort zones and possibly even connect with faculty outside their departments to jointly teach courses. When it comes to data science, there are many people who believe data science is more about the computing than the statistics, and others who would emphasize the statistics more than the computing. Regardless of the degree of emphasis of the various fields on data science, the one thing I do know is that I agree with Jeff Leek's view that "the key word in 'Data Science' is not Data, it is Science". We should be emphasizing the science (whether it is statistics, computer science, hacking skills, etc.) in our data science curriculums and allow students a say in how diverse a training they want. As Rafa said, "the faculty in statistics and biostatistics departments are becoming increasingly more diverse and we should make these graduate students more diverse too."

Thursday, January 9, 2014

Easy introduction to meta-analyses in R

My incredibly intelligent younger sister is in the middle of her third year of a PhD program in clinical psychology.  While I have always considered myself to be logical, analytical and drawn to the more science and math-oriented topics, she has always been more artistic, intuitive and definitely the writer in the family.  Part of her curriculum has included several statistics courses (which she aced!) so you can imagine the statistician in me is beaming with pride!

Recently, she started searching for potential thesis topics and mentioned meta-analyses were particularly interesting to her. To help her, and hopefully other non-statisticians who are interested in performing a meta-analysis, I have put together a short tutorial on running a simple meta-analysis in R.

Performing a search for the words 'meta-analysis' or 'meta analysis' on CRAN and Bioconductor currently yields 34 and 6 available R packages, respectively. Some are meant for a specific type of data (e.g. genomic data such as microarrays), but in general the majority of these R packages are meant for combining summary statistics of discrete or continuous data extracted from a set of carefully selected published studies.  

Before starting a meta-analysis, there are many important questions to be answered such as
  1. How to pick which studies to include in the meta-analysis? What are possible biases in selecting the studies?
  2. What effect are you interested in measuring? What data needs to be extracted from the papers? 
  3. What type of meta-analysis should be performed? 
  4. What software tools/packages are available to perform a meta-analysis? 
  5. How do I interpret the results from the output of the software tool used? How do I know if the meta-analysis yielded anything statistically valid and significant?
I want to preface this tutorial with the statement: the first two questions are extremely important and should be answered before starting any meta-analysis.  Because an entire course could focus on meta-analyses, I've limited the focus of this tutorial to discussing the last three questions: (1) basic types of meta-analyses, (2) statistical tools/packages available to perform the meta-analysis and (3) interpreting the results.   

Generally speaking there are four types of meta-analyses: 

  • univariate meta-analysis
    • n studies comparing two treatments (e.g. case/control) with a dichotomous outcome
  • multivariate meta-analysis
    • n studies comparing two treatments with multiple outcomes
  • meta-regression
    • n studies comparing two treatments with a dichotomous outcome but can investigate the impact of additional "moderator" or explanatory variables (e.g. year of study) on the outcome
  • network meta-analysis (also known as multiple treatment meta-analysis)
    • n studies comparing multiple treatments with a dichotomous outcome
All of these types of meta-analyses can easily be run in R with freely available packages such as meta, mvmeta, mvtmeta, metafor, rmeta and gemtc.   

For example, here is a brief summary of the meta R package 
  • Description: Simple package to estimate fixed-effects and random-effects models for binary and continuous data in a univariate meta-analysis. Meta-regression is also available. 
  • Documentation: http://cran.r-project.org/web/packages/meta/index.html
  • Useful Notes: Use metabin() for binary data and metacont() for continuous data.  With continuous data you can estimate the mean difference; with binary data you can estimate the risk ratio, odds ratio, risk difference or arcsine difference using the "sm = " argument.  Try print(), summary(), forest(), funnel(), labbe() and metabias() for analyzing the results from the meta-analysis. Use metareg() for meta-regression.
Simulated data example
Consider a univariate meta-analysis with n = 10 studies comparing two treatments (drug A and drug B) and a dichotomous outcome (e.g. death, no death).   Estimate an overall odds ratio of death in the drug B group relative to the drug A group.  

In the first study, 109 individuals were in the control group who received drug A and 107 individuals were in the case group who received drug B.  Of the 109 individuals who received drug A, 14 died, compared to 52 of the 107 individuals who received drug B.  The odds ratio for study 1 is 6.42 with a 95% confidence interval of (3.26, 12.63), which is statistically significant.  Below is a forest plot of all the studies used in the meta-analysis.


Based on the simulated data, the odds ratio of death using drug B relative to using drug A is 8.15, which is statistically significant because the 95% confidence interval of (6.44, 10.30) using a fixed-effects model does not contain the value 1.
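The study 1 numbers quoted above can be reproduced by hand. Below is a minimal sketch in base R (no packages required) of the standard odds ratio and Wald confidence interval formulas; the meta package's metabin() reports these same per-study quantities.

```r
# Study 1 counts from above (drug B = case group, drug A = control group)
deaths_B <- 52; n_B <- 107   # deaths and total on drug B
deaths_A <- 14; n_A <- 109   # deaths and total on drug A

# Odds ratio of death on drug B relative to drug A
or_study1 <- (deaths_B / (n_B - deaths_B)) / (deaths_A / (n_A - deaths_A))

# Wald 95% confidence interval, computed on the log odds ratio scale
se_log_or <- sqrt(1/deaths_B + 1/(n_B - deaths_B) + 1/deaths_A + 1/(n_A - deaths_A))
ci <- exp(log(or_study1) + c(-1, 1) * qnorm(0.975) * se_log_or)

round(or_study1, 2)  # 6.42
round(ci, 2)         # 3.26 12.63
```

With the full 2x2 counts for all 10 studies in a data frame, metabin(event.e, n.e, event.c, n.c, sm = "OR") produces the per-study and pooled estimates, and forest() draws the forest plot.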

For a further discussion of the differences between a fixed-effects and a random-effects model, Wikipedia has a fairly easy-to-understand description.  The main thing to understand is that if your studies are considered to be "heterogeneous", then you will need to use a random-effects model; otherwise you should use a fixed-effects model.  The way to decide which model to use is with the Cochran Q test or the $I^2$ statistic.  In the forest plot, the heterogeneity test was performed, in which the null hypothesis is that there is no study heterogeneity and the fixed-effects model should be used.  Because the p-value (p = 0.6586) was greater than a significance level of $\alpha = 0.05$, we fail to reject the null hypothesis and use the fixed-effects model for the meta-analysis.
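To make the decision rule concrete: under the null of no heterogeneity, the Cochran Q statistic follows a chi-squared distribution with n - 1 degrees of freedom, and Higgins' $I^2 = \max(0, (Q - df)/Q)$. The base-R sketch below recovers a Q value consistent with the reported p = 0.6586 for the 10 simulated studies (the Q value is derived here, not read off the plot):

```r
n_studies <- 10
df <- n_studies - 1

# Q statistic consistent with the reported heterogeneity p-value of 0.6586
Q <- qchisq(0.6586, df = df, lower.tail = FALSE)

# Round trip: p-value of the Cochran Q test
p_value <- pchisq(Q, df = df, lower.tail = FALSE)

# Higgins' I^2: proportion of total variability due to heterogeneity
I2 <- max(0, (Q - df) / Q)

p_value  # 0.6586 > 0.05, so use the fixed-effects model
I2       # 0: no evidence of between-study heterogeneity
```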

For an overview of the R packages CRAN has to offer, a Task View dedicated specifically to meta-analyses is available. Another good resource is from the 2013 UseR conference.  
Note: there are also many other software tools available outside of R which may be of interest: MetaEasy in Excel and similar functions in Stata, SPSS and SAS.  

Wednesday, January 1, 2014

Mulled Wine

Since winter is in full swing here in Boston (reports are saying we should expect 12 inches of snow tomorrow), I thought a warm festive drink might be fun to try!  A mulled wine filled with spices, orange flavors and honey sounded just right.



Ina Garten has a pretty good mulled wine recipe, but I also found a similar second recipe which looked to have some potential.  After "mulling" over some other recipes, I settled on a combination of both.  This mulled wine recipe is mostly based on the following three ingredients: wine, brandy and apple cider for sweetness:



Hopefully you will find this wine as tasty as I did.  Also, I hope your holidays were filled with family and happiness and I wish everyone a Happy New Years!

Ingredients:
- 1 bottle (750 ml) of red wine (e.g. carmenere or cabernet sauvignon)
- 1 cup apple cider
- 1/4 cup brandy
- 1/4 cup honey
- 2 cinnamon sticks
- 4 whole cloves
- 4 whole allspice
- 1 orange zested and sliced

Recipe:
1) Combine wine, apple cider and brandy in a stock pot. 


2) Zest the orange and add to the pot.



3) Add the cinnamon sticks, whole cloves, and allspice to the wine mixture. Add the slices of orange.



4) Heat mixture slowly to just before boiling. Let simmer for 10-15 minutes. Afterwards, use a colander to strain away the oranges and spices.  


Serve the mulled wine warm with some fresh orange slices! 

Monday, October 14, 2013

Simple Cutout Sugar Cookies: Ghosts and Pumpkin Pi

Fall is in full swing and (it's hard to believe) even Halloween is just around the corner.  I'm starting to get in the spirit by getting out the Halloween decorations and making yummy treats!  Cutout sugar cookies are some of my favorites to make in the holiday season because they are so versatile.  Rolling out the dough can be tricky if the dough is not right, so this has been my favorite go-to recipe for cutout sugar cookies.   As for the icing, the usual ingredients are confectioners' sugar, food coloring and water.  I found a recipe for the icing which hardens nicely and contains vanilla extract and light corn syrup (gives it an extra sheen). Here are my ghosts and pumpkin 'pi'!  And yes, that is a mouse :)



Ingredients:

For the cookies:
- 6 tbsp of soft unsalted butter
- 1/2 cup sugar
- 1 large egg
- 1/2 tsp vanilla extract
- 1 1/2 cups all-purpose flour
- 1/2 tsp baking powder
- 1/2 tsp salt
- chocolate chips, mini chocolate chips (optional)

For the icing:
- 1 cup confectioners' sugar
- 1 tbsp light corn syrup
- 1/2 tsp vanilla extract
- food coloring

Recipe:

1) Preheat the oven to 350 degrees. In one bowl, mix the butter, sugar, egg and vanilla.  In a second bowl, combine the flour, baking powder and salt.  Add the dry ingredients to the wet and mix gently.  You can add more flour if the dough seems too sticky, but be careful not to add too much. Wrap the dough in plastic wrap and let it rest for one hour.

2) Sprinkle flour on a clean surface.  With a rolling pin, roll out the dough to a thickness of 1/4-inch.  If the dough is sticking to the rolling pin, sprinkle more flour on the dough.




3) With your favorite cookie cutters, cut out the cookies and place them on a baking sheet.




4) The recipe says to bake the cookies for 8-12 mins.  My cookies were definitely done at 9 mins.   Let the cookies cool completely before adding the icing.  


5) To prepare the icing, mix the sugar, corn syrup, food coloring and vanilla.  If the icing still seems thick, add 1-2 drops of water at a time.

6) Decorate your cookies!  My ghosts were made with plain white icing, mini chocolate chips for the eyes and a regular-sized chocolate chip for the mouth.  The pumpkins begin with orange icing and are finished with a bit of pipe work.





Sunday, September 8, 2013

Boston or Bust

This past week Chris and I celebrated our first month of being official Bostonians.  For those who don't know, while finishing my PhD at Rice this spring, I accepted a postdoctoral research fellow position starting in August in Rafael Irizarry's lab in the Biostatistics & Computational Biology department at the Dana-Farber Cancer Institute with a joint appointment in the Harvard School of Public Health.  Since graduating in May, this summer has been such a whirlwind of events.  We did quite a bit of traveling in July and spent about a week driving from Texas to Massachusetts.  Some additional good news is Chris just finished his second week at his new job as a Validation Engineer at Bose!  Now that things are starting to settle down a bit, I decided to show some highlights of our trip and a few fun outings we have had exploring around Boston.

To begin, I will be upfront and say no matter how you move, moving is a pain in the butt.  The only part of moving I liked was seeing the 'Podzilla' move the POD around!  Seriously, it's one of the coolest things. :)



We began our very long drive by stopping to visit some family members in Texas & Louisiana.


Next up was Mississippi.  We only spent a short time there, but we found Elvis who sang our blues away about leaving such good friends and family behind.



In Tennessee we stopped at Corky's BBQ for some 'Memphis-style' BBQ which consisted of ribs with a very dry rub & an enormous amount of sauce on top.  I think I prefer North Carolina or Texas style BBQ, but overall it was a good experience! 



Next up was West Virginia, Maryland & Pennsylvania.  We had a wonderful time staying with some of my family in PA!  At the end of the night, we even enjoyed some s'mores under the beautiful Pennsylvania night sky. I can't remember the last time I made s'mores over a real fire pit, but it is something I wish I could do more often.


The next day, we drove through Dallas, PA where my dad grew up!  This is a picture of the house when my dad lived there:


And this was the picture the day we drove by.  So pretty!


The next day was New York, Connecticut & finally Massachusetts!  After our 1900+ mile drive, we rewarded ourselves with clam chowder the night we arrived.  We're determined to find the best New England clam chowder in Boston, so suggestions are welcome.  


Since arriving in Boston, we've spent most of our time settling into the new apartment, but we have made a few excursions to explore the city.  We've tried a pretty famous local BBQ place called Blue Ribbon BBQ and found some very excited LSU alumni fans!  One of my favorite outings has been to visit Cape Cod. On our drive there, we saw so many windmills!   The beaches were also phenomenal.



Hopefully I'll be able to keep documenting a few of our adventures.  :) 

Friday, August 30, 2013

Comparing Methods to Detect Somatic Mutations

A nice comparison of four methods (VarScan, SomaticSniper, JointSNVMix and Strelka) to detect somatic single nucleotide variants (SNVs) was published this week in Bioinformatics.  This is clearly a very hot topic in the analysis of cancer genomes because, as I was reading through the article, I could think of at least three other methods published in 2013 alone. This particular article was received by Bioinformatics in March 2013, accepted in June and published online in July, so the more recent methods could not have made it into this comparison paper, but I would have loved to see the authors' take on the three methods discussed in this post.

For those less familiar with the topic, you might ask why this is such an important topic.  One of the first steps in analyzing cancer sequencing data is to accurately detect the genetic variants with some level of sensitivity and specificity.  The second step would be to understand the genetic (or epigenetic) effects, followed by associating the mutational profiles with outcomes.  Two genetic variants of interest in matched normal and tumor samples are somatic and germline single nucleotide polymorphisms (SNPs), where
  • 'Somatic' refers to a variant observed in tumor sample and not observed in the normal sample and the normal allele matches the reference allele
  • 'Germline' refers to the same variant observed in both the normal and tumor samples but the normal allele does not match the reference allele
One straightforward approach is the "subtraction method" in which variants are called separately from normal and tumor samples using a tool such as GATK.  Next, you just subtract the high quality variant calls from the tumor and the high quality 'no-calls' in normal leaving the somatic calls.
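As a toy illustration of the subtraction idea (all positions and calls below are made up for the example), the set difference of tumor and normal call sets can be sketched in R:

```r
# Hypothetical high-quality variant calls (chromosome:position), e.g. from GATK,
# called separately on the tumor and the matched normal sample
tumor_calls  <- c("chr1:100", "chr1:250", "chr2:75", "chr7:1400")
normal_calls <- c("chr1:250", "chr7:1400")

# 'Subtraction method': tumor calls not seen in the normal are labeled somatic
somatic_candidates <- setdiff(tumor_calls, normal_calls)
somatic_candidates  # "chr1:100" "chr2:75"
```

In practice the comparison is done on filtered genotype calls with quality thresholds, not bare positions, and that dependence on per-sample call quality is exactly where the approach becomes fragile.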

The main problem with this approach is that, when detecting somatic variants from next-generation sequencing data, the "sensitive and accurate detection of somatic single-nucleotide variants (sSNVs) from next-generation sequencing data is still a challenging informatic problem because of differing levels of purity and numbers of subclonal populations represented in tumor samples" [Hansen et al. (2013)].  This implies the 'subtraction method' may not be the best approach to detect true somatic and true germline variants because the somatic variants called by a given method may vary depending on tumor purity or clonality.
  • Carter et al. (2012), Koboldt et al. (2012) and Hansen et al. (2013) agree "the necessary read depth coverage from the NGS data to detect a mutation accurately depends on 
    • the prevalence of the mutation in that sample which depends on 
      • levels of copy number variation, 
      • levels of stromal contamination and 
      • prevalence of mutation among subclones within the sample".
    • The authors note: "Current software does not account for how deep coverage should be to attain a specific sensitivity and accuracy".
  • Cibulskis et al. (2013) writes "the sensitivity and specificity of any somatic mutation–calling method varies along the genome and depends on several factors, including 
    • the depth of sequence coverage in the tumor and a patient-matched normal sample, 
    • the local sequencing error rate, 
    • the allelic fraction of the mutation and
    • the evidence thresholds used to declare a mutation". 
    • The authors note: "Characterizing how sensitivity and specificity depend on these factors is necessary for designing experiments with adequate power to detect mutations at a given allelic fraction, as well as for inferring the mutation frequency along the genome, which is a key parameter for understanding mutational processes and significance analysis".
Recently, several groups have developed methods specifically to deal with this problem (detecting somatic variants using next-generation sequencing data).  Rather than calling variants separately from the normal and tumor samples, one idea is to simultaneously analyze the reads from the matched normal and tumor samples together, resulting in increased confidence in the germline and somatic variant calls.  Some studies have shown this can lead to better discrimination of germline variants from sequencing errors.  I refer you to the comparison paper published in Bioinformatics this week for a detailed discussion of VarScan, SomaticSniper, JointSNVMix and Strelka. In this post, I will discuss three more recently published tools to detect somatic mutations: Shimmer, Cake and MuTect.

Shimmer [Hansen et al. (2013)]
  • Method: Uses Fisher's exact test at each genomic position covered by both the tumor and normal sample and corrects for multiple testing.  
  • Documentation: https://github.com/nhansen/Shimmer/blob/master/DOCUMENTATION
  • Input: samtools BAM files of the matched normal and tumor samples and reference genome
  • Output: Shimmer will produce a list of somatic variants and somatic indels.  Additionally, Shimmer will provide the output in the ANNOVAR format (and will annotate the file if ANNOVAR is locally installed on the system).
  • The reference paper shows Shimmer is superior to other methods in the presence of large amounts of stromal contamination, heterogeneity and copy number alterations, with comparable accuracy at high tumor purity.  
Installation and Usage: 
Shimmer is available in a perl script on github:
$ git clone git://github.com/nhansen/Shimmer.git
$ cd Shimmer
$ perl Build.PL
$ ./Build
$ ./Build test
$ ./Build install

Basic usage:
$ shimmer.pl [normal_bamfile.bam] [tumor_bamfile.bam] --ref reference.fa 


Cake [Rashid et al. (2013)]
  • Method: Uses the set of mutations called by at least two of four callers (VarScan2, CaVEMan, Bambino and SAMtools mpileup), then applies post-processing filters. 
  • Documentation: http://cakesomatic.sourceforge.net/Cake_User_Guide.pdf
  • Input: samtools BAM files of the matched normal and tumor samples and reference genome
  • Output: Cake generates several output files/directory depending on what the user defined, but it will report the somatic variants (from each individual caller) in a VCF format. 
  • The Cake_User_Guide.pdf provides extensive documentation on how to download and install Cake, VarScan2, CaVEMan and Bambino. It also provides documentation on how to install all the prerequisites such as samtools, vcftools, tabix, and additional perl modules.  
Installation and Usage: 
Cake is available in a perl script here: http://sourceforge.net/projects/cakesomatic/files/
$ mkdir cake
$ cd cake
$ wget http://sourceforge.net/projects/cakesomatic/files/Cake_1.0.tar.gz
$ tar xfz Cake_1.0.tar.gz

Basic Usage:
$ perl run_cake_somatic_pipeline.pl -s [Sample_File] -species human -callers pileup,varscan,bambino, somatic sniper -o [output_directory] -separator [,|\t] -mode [FULL|CALLING|FILTERING] 


MuTect [Cibulskis et al. (2013)]
  • Method: Pre-processes the aligned reads from the matched normal and tumor samples separately.  Calculates two log odds ratios (LODs) - see below.  Applies an additional set of post-processing filters.   
    • 1) LODs of the tumor containing a non-reference allele for a given position 
    • 2) If the position is called non-reference, the LODs of the normal sample containing the reference allele
  • Documentation: http://www.broadinstitute.org/cancer/cga/mutect_run
  • Input: samtools BAM files of the matched normal and tumor samples and wiggle file containing the coverage output
  • Output: A list of somatic calls (use grep to filter the calls not considered confident by MuTect)
  • Reference paper compares MuTect to SomaticSniper, JointSNVMix and Strelka  
  • Can use Cosmic & dbSNP databases for annotation
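To get a feel for the tumor LOD score, here is a deliberately simplified sketch in base R: it compares a binomial likelihood of the observed variant-supporting reads under a mutation at allele fraction f against the likelihood under sequencing error alone. This captures only the flavor of MuTect's first LOD; the real model works read-by-read with per-base qualities, and every number below (error rate, depth, read counts) is hypothetical.

```r
err <- 0.001            # assumed per-base sequencing error rate
depth <- 50; alt <- 10  # hypothetical read depth and variant-supporting reads
f <- alt / depth        # estimated allele fraction

# Likelihood under a mutation at allele fraction f: a read shows the variant
# either truly (prob f, not mis-read) or as an error on a reference allele
lik_mut <- dbinom(alt, depth, f * (1 - err) + (1 - f) * err)

# Likelihood under no mutation: all variant reads arise from sequencing error
lik_ref <- dbinom(alt, depth, err)

lod_tumor <- log10(lik_mut / lik_ref)
lod_tumor  # large positive value: strong evidence for a real variant
```

A site is declared a candidate when the LOD exceeds a threshold; the follow-up normal LOD then checks that the matched normal looks reference, which is what separates somatic from germline calls.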

Installation and Usage: 
To run, must have java installed.  MuTect is available from here: http://www.broadinstitute.org/cancer/cga/mutect_download and should be run in the command line.

Basic usage:
$ java -Xmx2g -jar muTect-XXXX-XX-XX.jar --analysis_type MuTect --reference_sequence <reference> --input_file:normal [normal_bamfile.bam] --input_file:tumor [tumor_bamfile.bam] --out [output_file.out] --coverage_file [coverages.wig.txt]

To only see the 'confident' calls from MuTect, use grep to filter for lines not containing REJECT
$ grep -v REJECT [my.output_file.out]



Three additional tools (from 2011 and 2012):
  • deepSNV [Gerstung et al. (2012)] 
    • Based on a likelihood ratio test at each genomic position and corrects for multiple testing.
    • Uses samtools with heavy physical memory requirements
  • CaVEMan [Stephens et al. 2012]
    • Rashid et al. (2013) says this method is similar to SAMtools mpileup which "computes the genotype likelihood of nucleotide positions in tumor and normal genome sequences by use of an expectation-maximization method"
  • Bambino [Edmonson et al. 2011]
    • Bayesian scoring model (as opposed to Fisher's exact testing) which compares alternate allele frequency between tumor and normal samples
    • Similar to VarScan2 (except Bayesian)

Wednesday, June 19, 2013

Getting Started with SAMtools and Rsamtools

Have you heard of this thing called 'samtools' but are not really sure what it is or how to get started?   Maybe you are a postdoc in a wet lab that is now starting to use whole-genome and whole-exome sequencing, but there isn't an in-house bioinformatician to help sift through the computational side of the project?  If you're like me (a statistician by training, working with genomic data), learning about tools to handle next-generation sequencing data is essential.  This tutorial is meant to give a little background about samtools and then provide some technical (and hopefully helpful) information on the purpose of samtools, its relation to Rsamtools and examples of how to use both.


Beginning with SAMtools

What does SAM mean? How is SAM different from BAM?
The Sequence Alignment/Map (SAM) format was developed to store aligned nucleotide sequence reads produced by genome aligners (e.g. BWA) in a somewhat human-readable tab-delimited text file.  Practically speaking, this format has become the standard alignment format, widely used by groups such as the 1000 Genomes Project.  The Binary Alignment/Map (BAM) format is a binary version of SAM which contains the same information and simply compresses the data to save storage space (BAM files are also faster to manipulate).  In the SAM format, there is a header section (all header rows begin with '@') and an alignment section (all alignment rows contain 11 mandatory fields plus optional fields).
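As a quick illustration of the alignment section (the record below is entirely made up), the 11 mandatory tab-delimited fields of one SAM line can be split apart in R, and the bitwise FLAG field inspected:

```r
# A fabricated single-read SAM record with the 11 mandatory fields:
# QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL
sam_line <- "read1\t0\tchr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\tFFFFFFFFFF"

fields <- strsplit(sam_line, "\t")[[1]]
length(fields)  # 11 mandatory fields

# The FLAG is a bit field; bit 0x4 set would mean the read is unmapped
flag <- as.integer(fields[2])
bitwAnd(flag, 4) == 0  # TRUE: this read is mapped
```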

What is SAMtools and where did SAMtools come from? 
The SAM/BAM format and SAMtools were published by Li et al. 2009. The purpose of samtools is to provide a way to manipulate and interact with the aligned sequence reads in SAM/BAM format.  Samtools does not work with the entire file of aligned reads at once, but rather works on a stream, which allows Unix commands to be used in conjunction with SAM files.

How do you install samtools?
The easiest way to do this is to check out the most recent samtools source code from the github project page.  If you're not familiar with git or github, I suggest this wonderful tutorial by Karl Broman, which also contains the best description of github I've heard: "Github is like facebook for programmers", which couldn't be more true.   So, if you've set yourself up with a github account, simply check out the latest samtools source code

git clone git://github.com/samtools/samtools.git
cd samtools
make
make razip

This last line is to compile razip which is a tool to compress files and still make them randomly accessible by samtools (not normally possible with gzip files).  Finally, the last step is to make samtools accessible to all the users on the machine which I extracted from this tutorial.

sudo cp ~/current/working/directory/samtools /usr/local/samtools-v0.1.19
sudo ln -s /usr/local/samtools-v0.1.19 /usr/local/samtools

The second line makes a symbolic link to the latest version directory. When a new version of samtools becomes available (e.g. samtools-v0.1.20), you can install it in the same directory and then just update the /usr/local/samtools symbolic link.  You can check to make sure samtools was installed in your $PATH with the following command:

which samtools

If nothing is printed, samtools was not properly added to your $PATH.


Examples of how to use SAMtools

There are many tutorials on samtools.  The standard format of samtools requires a command and then allows for additional options.

samtools [command] <options>

If you type just 'samtools' in the terminal window, a list of available commands (and their descriptions) will be printed.  We begin by creating a folder which will contain our SAM file and the reference genome in FASTA format which we used to align the reads.

mkdir MyProject
cp ~/current/working/directory/test.sam MyProject
cp ~/current/working/directory/ref.fa MyProject

samtools view
To convert from SAM to BAM format (option -b) when header is available and input is SAM format (option -S)
samtools view -Sb test.sam > test.bam

To convert from SAM to BAM format when the header is not available, but ref.fa was used to map the reads (option -t takes the FASTA index produced by 'samtools faidx ref.fa')
samtools view -bt ref.fa.fai test.sam > test.bam

To filter out unmapped reads (flag 0x4) from BAM files (option -F skips reads with the given flag bits set)
samtools view -b -F 4 test.bam > mappedtest.bam

samtools sort
To sort BAM file by chromosomal coordinates (default) or read names (option -n) and create a test.sorted.bam file
samtools sort test.bam test.sorted

An example of using a Unix pipe to go from the SAM format to the sorted BAM format
samtools view -Sb test.sam | samtools sort - test.sorted

samtools index 
To create an indexed bam file (.bai) from the test.sorted.bam file 
samtools index test.sorted.bam

samtools faidx
To create an indexed reference genome in FASTA format (creates a .fai file)
samtools faidx ref.fa 



Examples of using SAM/BAM files 

'samtools mpileup' and 'bcftools view'
To compute genotype likelihoods and create a Binary Call Format (bcf) file (option -u for uncompressed output), using the indexed reference genome (option -f)
samtools mpileup -u -f ref.fa test.sorted.bam > test.bcf

To call variants (option -c) and convert the bcf file to a human-readable Variant Call Format (vcf) file, keeping only variant sites (option -v)
bcftools view -vcg test.bcf > test.vcf

From here, you can also check out the Integrative Genomics Viewer (IGV) to visually look at the aligned reads or try importing the BAM files into tools like Rsamtools.


Examples of how to use Rsamtools

The purpose of Rsamtools is to provide an interface between R and BAM files produced by the tools samtools, bcftools, and tabix (not discussed here). To install Rsamtools in R, use

source("http://bioconductor.org/biocLite.R")
biocLite("Rsamtools")

The main purpose is to import (indexed!) BAM files into R using the scanBam() function.  The vignette for Rsamtools can be found here.  The entire BAM file should not be read into R; rather, just the portion of the genome of interest should be imported using the what and which arguments in ScanBamParam(which=which, what=what).  Other packages which can read in BAM files include ShortRead and GenomicRanges.

To extract the header information, use scanBamHeader().  Use filterBam() to filter reads from BAM file according to the criteria defined in ScanBamParam().

Another great use of Rsamtools is to access multiple BAM files using the BamViews class in Rsamtools. This allows you to obtain metadata by 'viewing' the BAM files rather than importing each BAM individually.