Projects that sequence the genomes of a cohort of tumor samples face the challenge of deciding which somatic mutations are relevant to tumor development (drivers). The exome of an individual tumor sample normally contains a few dozen somatic mutations, most of which are thought to be passengers, i.e., they do not contribute to the tumor phenotype. Very often, cancer resequencing projects rank somatic variants either with established tools that assess the functional impact of individual mutations (e.g., SIFT, PolyPhen2, Mutation Assessor) or by their recurrence across tumor samples. They also draw on accumulated knowledge by focusing on mutations that appear in known cancer genes. Only a few bioinformatics tools are available to rank somatic mutations according to their likelihood of promoting tumorigenesis. Amongst them are CanPredict and CHASM. (Here is a comprehensive review.)
We have taken a different approach to ranking cancer somatic mutations. First, we observed that genes with dissimilar functions show different tolerance to germline variants, measured as the distribution of functional impact scores of variants accepted during human evolution. (We have called this the baseline tolerance of genes.) Then, we used this baseline tolerance to transform the functional impact scores of mutations provided by three well-known tools, and we compared the performance of the transformed score and the original score in separating sets of variants enriched for driver mutations (positives) from sets enriched for passenger mutations or polymorphisms (negatives). Benchmarking tools that predict cancer drivers is a challenge in itself, as there is no trusted and unbiased set of driver and passenger mutations, so we decided to assemble nine proxy datasets for the evaluation. We found a marked trend for transFIC to outperform the original functional impact scores, as shown in this graph (where we estimate performance with the Matthews correlation coefficient):
Performance of transFIC and the original scores in differentiating drivers from passengers. Note that the Matthews correlation coefficient (MCC) is systematically higher for transformed scores than for original scores.
We have called our approach transFIC (transformed Functional Impact scores in Cancer). The paper where we present transFIC has just been published in Genome Medicine. You can try transFIC with your own cancer mutations at our webserver.
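To make the idea of transforming a score by the baseline tolerance more concrete, here is a small sketch. It assumes, purely for illustration, that the baseline tolerance of a group of functionally similar genes is summarized by the mean and standard deviation of the functional impact scores of its germline variants, and that the transformation is a z-score-like standardization against that distribution. The function names and the numbers are hypothetical; the paper describes the actual procedure.

```python
from statistics import mean, stdev


def baseline_tolerance(germline_scores):
    """Summarize a gene group's germline score distribution as (mean, std).

    Assumption for this sketch: two summary statistics are enough to
    describe the baseline tolerance of the group.
    """
    return mean(germline_scores), stdev(germline_scores)


def transform_score(raw_score, baseline):
    """Standardize a raw functional impact score against a baseline."""
    mu, sigma = baseline
    if sigma == 0:
        raise ValueError("baseline distribution has zero spread")
    return (raw_score - mu) / sigma


# Made-up germline functional impact scores for two gene groups.
tolerant_group = [1.0, 2.0, 3.0, 4.0, 5.0]    # tolerates high-impact variants
intolerant_group = [0.0, 0.5, 1.0, 1.5, 2.0]  # tolerates only low-impact variants

raw = 3.0  # the same raw score for a somatic mutation in either group
in_tolerant = transform_score(raw, baseline_tolerance(tolerant_group))
in_intolerant = transform_score(raw, baseline_tolerance(intolerant_group))
print(f"tolerant group: {in_tolerant:.2f}, intolerant group: {in_intolerant:.2f}")
```

Under these assumptions, the same raw score is unremarkable in a gene group that routinely accepts high-impact germline variants, but stands out in a group that does not, which is the intuition behind ranking mutations by a transformed rather than a raw score.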
Apart from the transFIC approach, the paper contains some ideas that we believe are worth explaining further, so we will dedicate new posts to them. They are:
- What is the baseline tolerance? How do we measure it? Why are there differences across gene sets?
- How do we compute the baseline tolerance for different genes?
- How did we assemble the test datasets? How useful are they to test similar approaches?
- How should the transFIC of somatic mutations be interpreted?
How to identify cancer drivers from tumors somatic mutations?
Transfic and Oncodrive-fm: Tools for the analysis of cancer sequencing data
Systematic identification of cancer drivers across tumor types in IntOGen
Cancer mutations: separating the wheat from the chaff (Post at BioMed Central blog).