How to identify cancer drivers from tumor somatic mutations?

Cartoon representing genomic alterations in a tumor cell. Image from NCI.

I have recently seen several presentations by groups that systematically explore alterations in cancer genomes, and they all deliver the same message: one of the main challenges their projects face is identifying the genes and pathways involved in tumor development (drivers). Very good methods based on the recurrence of somatic mutations have been developed to identify cancer drivers (see, for example, MutSig and the Significantly Mutated Gene (SMG) test from MuSiC). They rely on the assumption that genes that exhibit more mutations than expected by chance are putative drivers. Even though these methods are successful in identifying clear cancer drivers, they also face some known challenges. For instance, the background mutation rate is hard to estimate accurately, and important genes that are mutated in only a small number of tumors may be overlooked. Besides, these methods treat all mutations that may affect the protein sequence equally, when their impact on protein function clearly differs.
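To make the recurrence idea concrete, here is a minimal Python sketch of a binomial test for one gene. It is not how MutSig or MuSiC work internally; it assumes a single uniform background mutation rate per base, which is exactly the quantity that is hard to estimate in practice. The function name, its parameters and the toy numbers are all hypothetical.

```python
import math


def recurrence_pvalue(n_mutations, gene_length, n_tumors, background_rate):
    """Upper-tail binomial p-value: P(X >= n_mutations).

    Assumption: every base of the gene in every tumor mutates independently
    with the same probability `background_rate` (0 < background_rate < 1).
    """
    n = gene_length * n_tumors  # number of "trials" (bases x tumors)
    p = background_rate
    if n_mutations <= 0:
        return 1.0

    def log_pmf(i):
        # log of C(n, i) * p^i * (1 - p)^(n - i), computed in log space
        return (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                + i * math.log(p) + (n - i) * math.log1p(-p))

    total = 0.0
    for i in range(n_mutations, n + 1):
        term = math.exp(log_pmf(i))
        total += term
        # Beyond the mode the terms only shrink; stop once they are negligible.
        if i > n * p and term < total * 1e-17:
            break
    return min(1.0, total)


# Hypothetical toy numbers: a 1500-base gene, 100 tumors, rate of 1e-6 per base.
p_recurrent = recurrence_pvalue(6, 1500, 100, 1e-6)  # gene mutated 6 times
p_rare = recurrence_pvalue(1, 1500, 100, 1e-6)       # gene mutated once
print(p_recurrent, p_rare)
```

With these toy numbers the gene mutated six times stands out clearly, while the gene mutated once does not, even if that single mutation is highly damaging. This is the kind of gene a purely recurrence-based test may overlook.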

 

Some time ago we thought that a good way to address these challenges would be to use the Functional Impact Bias (FM bias) observed in genes across a cohort of tumor samples. In other words, we wanted to estimate how the accumulation of mutations with high functional impact on each gene deviates from the average observed in all tumor samples.

Schema depicting the basic principles of methods to identify cancer drivers from somatic mutations in a cohort of tumors. The heatmaps on the left represent the mutations in genes (rows) in a set of tumors (columns). Recurrence analyses identify drivers by computing the mutation rate of each gene in the cohort and spotting those that have a higher rate than the background. The FM-bias approach uses the functional impact scores of somatic mutations and computes the bias towards the accumulation of functional variants.

 

The hypothesis behind the FM bias is that genes that are not relevant to tumor development may receive somatic mutations by virtue of their length or the spontaneous mutation rate of their genomic region, but these mutations, because no positive selection acts on them, will be spread across the whole scale of functional impact. This is the case, for instance, of the second gene in the figure above, which, although relatively frequently mutated, does not systematically receive highly functional mutations. Cancer driver genes, on the other hand, will accumulate highly functional mutations, as those are the ones that alter the function of the encoded protein and confer a selective advantage on the cell. For instance, genes with a low frequency of mutations that harbor only highly functional mutations (the fifth gene in the figure above) are putative driver genes. These genes might be missed by the recurrence-based approach but may be picked up by Oncodrive-fm. (Compare both heatmaps in the figure and the p-values that result from applying both approaches to gene number five in the toy example.)
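The reasoning above can be sketched as a simple resampling test. This is only an illustration of the idea, not the Oncodrive-fm implementation: it assumes each mutation already carries a functional impact score between 0 and 1 (higher meaning more damaging), and it compares the mean score of a gene against the means of random draws of the same size from all mutations in the cohort. The function name, the scores and the gene labels are made up.

```python
import random
from statistics import mean


def fm_bias_pvalue(gene_scores, all_scores, n_samples=10000, seed=0):
    """Empirical p-value for a bias towards high functional impact.

    Assumption: the null distribution is built by drawing, with replacement,
    as many scores as the gene has mutations from the pooled cohort scores.
    """
    rng = random.Random(seed)
    observed = mean(gene_scores)
    k = len(gene_scores)
    hits = 0
    for _ in range(n_samples):
        sample = rng.choices(all_scores, k=k)
        if mean(sample) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_samples + 1)


# Hypothetical toy cohort: functional impact scores of the mutations per gene.
cohort = {
    "gene1": [0.2, 0.4, 0.1],
    "gene2": [0.1, 0.9, 0.3, 0.5, 0.2, 0.7, 0.4, 0.6],  # frequent, mixed impact
    "gene3": [0.3, 0.5],
    "gene4": [0.6, 0.2, 0.4, 0.1],
    "gene5": [0.95, 0.9, 0.98],                          # rare, high impact
}
pooled = [score for scores in cohort.values() for score in scores]

p_gene2 = fm_bias_pvalue(cohort["gene2"], pooled)
p_gene5 = fm_bias_pvalue(cohort["gene5"], pooled)
print(p_gene2, p_gene5)
```

In this toy cohort the frequently mutated gene2 shows no bias, because its scores are spread over the whole scale, whereas the rarely mutated gene5 does, because all of its mutations score high.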

 

We have just developed a method to measure the FM bias, which we called Oncodrive-fm. We have tested it on three large-scale oncogenomics datasets that contain large catalogs of tumor somatic variants. It was encouraging to see that most highly recurrent, well-known cancer genes exhibit a very significant FM bias. In addition, Oncodrive-fm successfully identified putative cancer drivers with low recurrence, which are overlooked by recurrence-based methods. Therefore, our recommendation for identifying cancer drivers from somatic mutations in a cohort of tumors is to use both approaches, recurrence and FM bias, as they are complementary. The manuscript describing Oncodrive-fm has been accepted for publication in Nucleic Acids Research (Abel Gonzalez-Perez and Nuria Lopez-Bigas, In Press) and will soon be available.

 

We plan to prepare tutorials on the tool very soon and release them through the blog. Meanwhile, you can take a look at the Oncodrive-fm web page, where you can download a Perl script to compute the FM bias of cancer genomics datasets.