For those less familiar with the topic, you might ask why is this such an important topic? One of the first steps in analyzing cancer sequencing data, is to accurately detect the genetic variants with some level of sensitivity and specificity. The second step would be to understand the genetic (or epigenetic) effect followed by associating the mutational profiles with outcome. Two genetic variants of interest in matched normal and tumor samples are called somatic and germline single nucleotide polymorphisms (SNPs) where
- 'Somatic' refers to a variant observed in tumor sample and not observed in the normal sample and the normal allele matches the reference allele
- 'Germline' refers to the same variant observed in both the normal and tumor samples but the normal allele does not match the reference allele
The main problem with this approach is when detecting somatic variants from the next-generation sequencing data the "sensitive and accurate detection of somatic singlenucleotide variants (sSNVs) from next-generation sequencing data is still a challenging informatic problem because of differing levels of purity and numbers of subclonal populations represented in tumor samples" [Hansen et al. (2013)]. This implies the 'subtraction method' may not be the best approach to detect true somatic and true germline variants because the somatic variants called from a given method may vary depending on tumor purity or clonality.
- Carter et al. (2012), Koboldt et al. (2012) and Hansen et al. (2013) agree "the necessary read depth coverage from the NGS data to detect a mutation accurately depends on
- the prevalence of the mutation in that sample which depends on
- levels of copy number variation,
- levels of stromal contamination and
- prevalence of mutation among subcones within the sample".
- The authors note: "Current software does not account for the how deep coverage should be to attain a specific sensitivity and accuracy".
- Cibulskis et al. (2013) writes "the sensitivity and specificity of any somatic mutation–calling method varies along the genome and depends on several factors, including
- the depth of sequence coverage in the tumor and a patient-matched normal sample,
- the local sequencing error rate,
- the allelic fraction of the mutation and
- the evidence thresholds used to declare a mutation".
- The authors note: "Characterizing how sensitivity and specificity depend on these factors is necessary for designing experiments with adequate power to detect mutations at a given allelic fraction, as well as for inferring the mutation frequency along the genome, which is a key parameter for understanding mutational processes and significance analysis".
Shimmer [Hansen et al. (2013)]
- Method: Uses Fisher's exact test at each genomic position covered by both the tumor and normal sample and corrects for multiple testing.
- Documentation: https://github.com/nhansen/Shimmer/blob/master/DOCUMENTATION
- Input: samtools BAM files of the matched normal and tumor samples and reference genome
- Output: Shimmer will produce a list of somatic variants and somatic indels. Additionally, Shimmer will provide the output in the ANNOVAR format (and will annotate the file if ANNOVAR is locally installed on the system).
- Reference paper shows Shimmer is superior to other methods in large amounts of stromal contamination, heterogenity and copy number alterations, but comparable accuracy in high tumor purity.
Shimmer is available in a perl script on github:
$ git clone git://github.com/nhansen/Shimmer.git
$ cd Shimmer
$ perl Build.PL
$ ./Build
$ ./Build test
$ ./Build install
Basic usage:
$ shimmer.pl [normal_bamfile.bam] [tumor_bamfile.bam] --ref reference.fa
Cake [Rashid et al. (2013)]
- Method: Uses set of mutations called from at least two callers (VarScan2, CaVEMan, Bambino and SAMtools mpileup). Applies post-processing filters.
- Documentation: http://cakesomatic.sourceforge.net/Cake_User_Guide.pdf
- Input: samtools BAM files of the matched normal and tumor samples and reference genome
- Output: Cake generates several output files/directory depending on what the user defined, but it will report the somatic variants (from each individual caller) in a VCF format.
- The Cake_User_Guide.pdf provides extensive documentation on how to download and install Cake, VarScan2, CaVEMan and Bambino. It also provides documentation on how to install all the prerequistes such as samtools, vcftools, tabix, and additional perl modules.
Cake is available in a perl script here: http://sourceforge.net/projects/cakesomatic/files/
$ mkdir cake
$ cd cake
$ wget http://sourceforget.net/projects/cakesomatic/files/Cake_1.0.tar.gz
$ tar xfz Cake_1.0.tar.gz
Basic Usage:
$ perl run_cake_somatic_pipeline.pl -s [Sample_File] -species human -callers pileup,varscan,bambino, somatic sniper -o [output_directory] -separator [,|\t] -mode [FULL|CALLING|FILTERING]
MuTect [Cibulskis et al. (2013)]
- Method: Pre-processes the aligned reads from the matched normal and tumor samples separately. Calculates two log odds ratios (LODs) - see below. Applies an additional set of post-processing filters.
- 1) LODs of the tumor containing a non-reference allele for a given position
- 2) If the position is called non-reference, the LODs of the normal sample containing the reference allele
- Documentation: http://www.broadinstitute.org/cancer/cga/mutect_run
- Input: samtools BAM files of the matched normal and tumor samples and wiggle file containing the coverage output
- Output: A list of somatic calls (use grep to filter the calls not considered confident by MuTect)
- Reference paper compares MuTect to SomaticSniper, JointSNVMix and Strelka
- Can use Cosmic & dbSNP databases for annotation
Installation and Usage:
To run, must have java installed. MuTect is available from here: http://www.broadinstitute.org/cancer/cga/mutect_download and should be run in the command line.
Basic usage:
$ java -Xmx2g -jar muTect-XXXX-XX-XX.jar --analysis_type MuTect --reference_sequence <reference> --input_file: normal [normal_bamfile.bam] --input_file:tumor [tumor_bamfile.bam] --out [output_file.out] --coverage_file [coverages.wig.txt]
To only see the 'confident' calls from MuTect, use grep to filter for lines not containing REJECT
$ grep -v REJECT [my.output_file.out]
Three additional tools (from 2011 and 2012):
- deepSNV [Gerstung et al. (2012)]
- Based on a likelihood ratio test at each genomic position and corrects for multiple testing.
- Uses samtools with heavy physical memory requirements
- CaVEMan [Stephens et al. 2012]
- Rashid et al. (2013) says this method is similar to SAMtools mpileup which "computes the genotype likelihood of nucleotide positions in tumor and normal genome sequences by use of an expectation-maximization method"
- Bambino [Edmonson et al. 2011]
- Bayesian scoring model (as opposed to Fisher's exact testing) which compares alternate allele frequency between tumor and normal samples
- Similar to VarScan2 (except Bayesian)