Statistical [R]ecipes: August 2013

A nice comparison of four methods (VarScan, SomaticSniper, JointSNVMix and Strelka) to detect somatic single nucleotide variants (SNVs) was published this week in Bioinformatics. This is clearly a very hot topic in the analysis of cancer genomes because as I was reading through the article, I was thinking to myself, I can think of at least three other methods which have been published in 2013 alone. This particular article was received by Bioinformatics in March 2013, accepted in June and published first online in July, therefore the more recent methods could not have made it in into this comparison paper, but I would have loved to see the authors take on the three methods discussed in this post.

For those less familiar with the topic, you might ask why is this such an important topic? One of the first steps in analyzing cancer sequencing data, is to accurately detect the genetic variants with some level of sensitivity and specificity. The second step would be to understand the genetic (or epigenetic) effect followed by associating the mutational profiles with outcome. Two genetic variants of interest in matched normal and tumor samples are called somatic and germline single nucleotide polymorphisms (SNPs) where

'Somatic' refers to a variant observed in tumor sample and not observed in the normal sample and the normal allele matches the reference allele
'Germline' refers to the same variant observed in both the normal and tumor samples but the normal allele does not match the reference allele

One straightforward approach is the "subtraction method" in which variants are called separately from normal and tumor samples using a tool such as GATK. Next, you just subtract the high quality variant calls from the tumor and the high quality 'no-calls' in normal leaving the somatic calls.

The main problem with this approach is when detecting somatic variants from the next-generation sequencing data the "sensitive and accurate detection of somatic singlenucleotide variants (sSNVs) from next-generation sequencing data is still a challenging informatic problem because of differing levels of purity and numbers of subclonal populations represented in tumor samples" [Hansen et al. (2013)]. This implies the 'subtraction method' may not be the best approach to detect true somatic and true germline variants because the somatic variants called from a given method may vary depending on tumor purity or clonality.

Carter et al. (2012), Koboldt et al. (2012) and Hansen et al. (2013) agree "the necessary read depth coverage from the NGS data to detect a mutation accurately depends on

the prevalence of the mutation in that sample which depends on

levels of copy number variation,
levels of stromal contamination and
prevalence of mutation among subcones within the sample".

The authors note: "Current software does not account for the how deep coverage should be to attain a specific sensitivity and accuracy".

Cibulskis et al. (2013) writes "the sensitivity and specificity of any somatic mutation–calling method varies along the genome and depends on several factors, including

the depth of sequence coverage in the tumor and a patient-matched normal sample,
the local sequencing error rate,
the allelic fraction of the mutation and
the evidence thresholds used to declare a mutation".
The authors note: "Characterizing how sensitivity and specificity depend on these factors is necessary for designing experiments with adequate power to detect mutations at a given allelic fraction, as well as for inferring the mutation frequency along the genome, which is a key parameter for understanding mutational processes and significance analysis".

Recently, several groups have developed methods specifically to deal with this problem (detecting somatic variants using next-generation sequencing data). Rather than calling variants separately from the normal and tumor samples, one idea is to simultaneously analyze the reads from the matched normal and tumor samples together resulting in increased confidence in the germline and somatic variants. Some studies have shown this can lead to better discrimination of germline variants from sequencing errors. I refer you to the comparison paper published in Bioinformatics this week for a detailed discussion on VarScan, SomaticSniper, JointSNVMix and Strelka. In this post, I will discuss three more recently published tools to detect somatic mutations Shimmer, Cake and MuTect.

Shimmer [Hansen et al. (2013)]

Method: Uses Fisher's exact test at each genomic position covered by both the tumor and normal sample and corrects for multiple testing.
Documentation: https://github.com/nhansen/Shimmer/blob/master/DOCUMENTATION
Input: samtools BAM files of the matched normal and tumor samples and reference genome
Output: Shimmer will produce a list of somatic variants and somatic indels. Additionally, Shimmer will provide the output in the ANNOVAR format (and will annotate the file if ANNOVAR is locally installed on the system).
Reference paper shows Shimmer is superior to other methods in large amounts of stromal contamination, heterogenity and copy number alterations, but comparable accuracy in high tumor purity.

Installation and Usage:
Shimmer is available in a perl script on github:
$ git clone git://github.com/nhansen/Shimmer.git
$ cd Shimmer
$ perl Build.PL
$ ./Build
$ ./Build test
$ ./Build install

Basic usage:
$ shimmer.pl [normal_bamfile.bam] [tumor_bamfile.bam] --ref reference.fa

Cake [Rashid et al. (2013)]

Method: Uses set of mutations called from at least two callers (VarScan2, CaVEMan, Bambino and SAMtools mpileup). Applies post-processing filters.
Documentation: http://cakesomatic.sourceforge.net/Cake_User_Guide.pdf
Input: samtools BAM files of the matched normal and tumor samples and reference genome
Output: Cake generates several output files/directory depending on what the user defined, but it will report the somatic variants (from each individual caller) in a VCF format.
The Cake_User_Guide.pdf provides extensive documentation on how to download and install Cake, VarScan2, CaVEMan and Bambino. It also provides documentation on how to install all the prerequistes such as samtools, vcftools, tabix, and additional perl modules.

Installation and Usage:
Cake is available in a perl script here: http://sourceforge.net/projects/cakesomatic/files/
$ mkdir cake
$ cd cake
$ wget http://sourceforget.net/projects/cakesomatic/files/Cake_1.0.tar.gz
$ tar xfz Cake_1.0.tar.gz

Basic Usage:
$ perl run_cake_somatic_pipeline.pl -s [Sample_File] -species human -callers pileup,varscan,bambino, somatic sniper -o [output_directory] -separator [,|\t] -mode [FULL|CALLING|FILTERING]

MuTect [Cibulskis et al. (2013)]

Method: Pre-processes the aligned reads from the matched normal and tumor samples separately. Calculates two log odds ratios (LODs) - see below. Applies an additional set of post-processing filters.

1) LODs of the tumor containing a non-reference allele for a given position
2) If the position is called non-reference, the LODs of the normal sample containing the reference allele

Documentation: http://www.broadinstitute.org/cancer/cga/mutect_run
Input: samtools BAM files of the matched normal and tumor samples and wiggle file containing the coverage output
Output: A list of somatic calls (use grep to filter the calls not considered confident by MuTect)
Reference paper compares MuTect to SomaticSniper, JointSNVMix and Strelka
Can use Cosmic & dbSNP databases for annotation

Installation and Usage:
To run, must have java installed. MuTect is available from here: http://www.broadinstitute.org/cancer/cga/mutect_download and should be run in the command line.

Basic usage:
$ java -Xmx2g -jar muTect-XXXX-XX-XX.jar --analysis_type MuTect --reference_sequence <reference> --input_file: normal [normal_bamfile.bam] --input_file:tumor [tumor_bamfile.bam] --out [output_file.out] --coverage_file [coverages.wig.txt]

To only see the 'confident' calls from MuTect, use grep to filter for lines not containing REJECT
$ grep -v REJECT [my.output_file.out]

Three additional tools (from 2011 and 2012):

deepSNV [Gerstung et al. (2012)]

Based on a likelihood ratio test at each genomic position and corrects for multiple testing.
Uses samtools with heavy physical memory requirements

CaVEMan [Stephens et al. 2012]

Rashid et al. (2013) says this method is similar to SAMtools mpileup which "computes the genotype likelihood of nucleotide positions in tumor and normal genome sequences by use of an expectation-maximization method"

Bambino [Edmonson et al. 2011]

Bayesian scoring model (as opposed to Fisher's exact testing) which compares alternate allele frequency between tumor and normal samples
Similar to VarScan2 (except Bayesian)

Statistical [R]ecipes

Friday, August 30, 2013

Comparing Methods to Detect Somatic Mutations