The ability to sequence tumor genomes rapidly and inexpensively is opening important new avenues for cancer research. One of the main objectives when sequencing tumor genomes is to identify the somatic alterations that play a relevant role in developing and maintaining the cancer phenotype. However, the analysis of these data is hindered by the large number of mutations detected in tumors (often on the order of thousands) and the large molecular heterogeneity observed between tumor samples.
As part of the International Cancer Genome Consortium (ICGC), for the last 2-3 years I have been co-leading (together with Lincoln Stein) a working group focused on discussing how to analyze these data. The group is formed by 48 members from 10 different countries, and we have held a teleconference nearly every month. We have now written up the results of these discussions in a perspective manuscript that has been published in the current issue of Nature Methods.
In this manuscript we divide the process of identifying functional and driver variants into three independent, but interrelated, computational approaches:
Approach 1: mutation mapping and annotation
This approach consists of mapping mutations to functional genomic features, identifying their consequences and comparing them to catalogs of previously reported mutations.
In the manuscript we recommend the use of Sequence Ontology terms to annotate the consequences of mutations, and we list a number of tools available for this step.
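To make the mapping step concrete, here is a minimal sketch of how a mutation position can be assigned a coarse Sequence Ontology term by looking up the genomic feature it falls in. It is only an illustration: the `Feature` type, the `build_index` and `annotate` helpers, the gene name and the coordinates are all hypothetical, the features are assumed not to overlap, and real annotators resolve much finer consequences (for example missense versus synonymous changes) per transcript.

```python
from bisect import bisect_right
from typing import NamedTuple


class Feature(NamedTuple):
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive
    name: str
    kind: str   # e.g. "exon" or "intron" (hypothetical labels)


# Coarse mapping from feature kind to a Sequence Ontology term name.
SO_TERM_BY_KIND = {
    "exon": "coding_sequence_variant",
    "intron": "intron_variant",
}


def build_index(features):
    """Group features by chromosome, sorted by start, for binary search."""
    by_chrom = {}
    for f in sorted(features, key=lambda f: (f.chrom, f.start)):
        by_chrom.setdefault(f.chrom, []).append(f)
    return {c: ([f.start for f in fs], fs) for c, fs in by_chrom.items()}


def annotate(index, chrom, pos):
    """Return a coarse SO term for a position (assumes non-overlapping features)."""
    starts, feats = index.get(chrom, ([], []))
    i = bisect_right(starts, pos) - 1  # last feature starting at or before pos
    if i >= 0 and pos <= feats[i].end:
        return SO_TERM_BY_KIND.get(feats[i].kind, "sequence_variant")
    return "intergenic_variant"


if __name__ == "__main__":
    # Made-up coordinates for a made-up gene.
    index = build_index([
        Feature("chr1", 100, 200, "GENE_A", "exon"),
        Feature("chr1", 201, 300, "GENE_A", "intron"),
    ])
    for pos in (150, 250, 50):
        print(pos, annotate(index, "chr1", pos))
```

Using a controlled vocabulary for the output is the point of the recommendation: results from different annotators become directly comparable.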
List of tools to annotate variants to genomic features that may be affected using Sequence Ontology Terms:
Approach 2: assessing the functional impact of mutations
This step consists in applying bioinformatics approaches to assess the functional impact of mutations at the level of proteins or regulatory regions.
We review existing methods to assess the functional impact of variants in protein-coding regions. Some of them were not specifically designed to rank cancer somatic mutations by their likelihood of being drivers, but have nevertheless been widely employed to this end (such as SIFT and PolyPhen2). Only a few methods have been designed and tested specifically to differentiate between cancer driver and passenger mutations (CHASM, MutationAssessor and transFIC). One of the main challenges in properly benchmarking these methods is the difficulty of collecting well-curated sets of driver and passenger mutations (last year I wrote a post about this problem). Recently, I learnt about a newly published tool, fatHMM, designed to differentiate between driver, passenger and germline variants. fatHMM is not included in the manuscript as it was not available while we were writing it, but I think it is worth mentioning here.
List of methods to assess the functional effect of nsSNVs that can be used in a high-throughput manner:
- Logre (LogR Pfam Evalue)
Approach 3: finding signs of positive selection across a cohort of tumor samples
This step involves analyzing the mutational patterns across a cohort of tumors to identify genes or pathways that show signs that mutations in them have been positively selected. For example, MutSigCV and MuSiC assess whether genes contain a higher number of mutations than expected given the background mutation rate. The most challenging part of this type of approach is to correctly estimate the background mutation rate, which has been addressed in a recent publication by the Getz group. OncodriveFM instead measures the functional impact of mutations and then assesses whether the observed somatic mutations in a gene are biased towards high functional impact. Other approaches include InVex, ActiveDriver or the assessment of non-synonymous versus synonymous mutations.
A few days ago we published OncodriveCLUST, which identifies drivers by exploiting the positional clustering of mutations. In other words, it identifies genes in which mutations tend to cluster in certain residues or protein regions.
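The idea of testing whether a gene carries more mutations than expected can be sketched with a simple Poisson model. This is not how MutSigCV or MuSiC work internally; it is a toy version that assumes a single uniform background rate per base pair, which is exactly the simplification that makes background estimation the hard part. The function names and the numbers in the example are hypothetical.

```python
import math


def poisson_sf(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam), using only the standard library."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i    # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def burden_pvalue(observed, gene_length_bp, n_samples, rate_per_bp):
    """P-value for seeing at least `observed` mutations in a gene across a cohort.

    Assumes (unrealistically) one uniform background rate for every gene
    and every sample.
    """
    expected = gene_length_bp * n_samples * rate_per_bp
    return poisson_sf(observed, expected)


if __name__ == "__main__":
    # Made-up numbers: a 1.5 kb gene, 100 tumors, 1 mutation per Mb per tumor.
    print(burden_pvalue(observed=5, gene_length_bp=1500,
                        n_samples=100, rate_per_bp=1e-6))
```

In practice the expected count would differ per gene and per sample, and p-values across all genes would need correction for multiple testing.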
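A rough way to see what positional clustering means is to ask how many mutations fall inside the densest short stretch of a protein, and how often randomly placed mutations would do as well. The sketch below does this with a permutation test. It is not the OncodriveCLUST algorithm, only an illustration of the general idea; the function names, the window size and the uniform null model are my own assumptions.

```python
import random


def max_window_count(positions, window):
    """Largest number of mutations within any stretch of `window` residues."""
    pos = sorted(positions)
    best, left = 0, 0
    for right in range(len(pos)):
        # Shrink from the left until the stretch spans fewer than `window` residues.
        while pos[right] - pos[left] >= window:
            left += 1
        best = max(best, right - left + 1)
    return best


def clustering_pvalue(positions, protein_length, window=5, n_perm=1000, seed=0):
    """Empirical p-value that mutations cluster more than uniform placement would."""
    rng = random.Random(seed)
    observed = max_window_count(positions, window)
    hits = 0
    for _ in range(n_perm):
        simulated = [rng.randint(1, protein_length) for _ in positions]
        if max_window_count(simulated, window) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Made-up example: 20 mutations piled on two adjacent residues.
    hotspot = [100] * 15 + [101] * 5
    print(clustering_pvalue(hotspot, protein_length=500))
```

A uniform null ignores the fact that some residues are more mutable than others, so a real method needs a more careful background.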
List of tools to identify driver genes from a cohort of cancer patients
Integrative pipelines that facilitate the use of methods across approaches
There are at least two pipelines that make it easier to apply the various tools presented in the manuscript.
CRAVAT maps mutations to their consequences on protein-coding genes and assesses their involvement in cancer and disease using CHASM and VEST.
IntOGen-mutations provides a pipeline to perform the whole process, including mapping mutations using the Ensembl VEP, reporting their functional impact on proteins using MutationAssessor, SIFT, PolyPhen2 and transFIC, and identifying genes with signs of positive selection across a cohort using OncodriveFM and OncodriveCLUST.
It is important to stress that none of these tools and approaches can directly identify the mutations involved in tumor development. Instead, they are intended to prioritize candidates for follow-up experiments that may demonstrate their involvement in the cancer phenotype.
Abel Gonzalez-Perez#, Ville Mustonen#, Boris Reva#, Graham R.S. Ritchie#, Pau Creixell, Rachel Karchin, Miguel Vazquez, J. Lynn Fink, Karin S. Kassahn, John V. Pearson, Gary Bader, Paul C. Boutros, Lakshmi Muthuswamy, B.F. Francis Ouellette, Jüri Reimand, Rune Linding, Tatsuhiro Shibata, Alfonso Valencia, Adam Butler, Serge Dronov, Paul Flicek, Nick B. Shannon, Hannah Carter, Li Ding, Chris Sander, Josh M. Stuart, Lincoln Stein and Nuria Lopez-Bigas for the ICGC Mutation Pathways and Consequences Subgroup of the Bioinformatics Analyses Working Group.
Nature Methods. 2013 10, 723–72
#These authors contributed equally to this work