How to evaluate the performance of computational methods to identify driver mutations?
We have recently published transFIC, a computational method to assess the functional impact of somatic cancer mutations (see this post). To evaluate the performance of transFIC we needed a dataset of driver and passenger mutations. However, we faced a common problem in this field: no such dataset exists that is both trustworthy and unbiased. Thus, it was a challenge to properly evaluate the performance of transFIC and compare it to other methods with similar aims.
Previously used datasets are based either on the recurrence of mutations found in tumor samples, for instance in the COSMIC database (Reva et al., 2011; Gonzalez-Perez and Lopez-Bigas, 2011), or on manually curated sets of cancer driver mutations (Carter et al., 2009). However, each of these datasets has its own biases. In particular, they are enriched for mutations in well-known genes that have been widely studied in cancer.
After thinking about this problem for a long time, we came up with a solution. Instead of trying to collect the perfect dataset of drivers and passengers (which doesn’t exist yet), we decided to gather several proxy datasets from different sources, under the assumption that each has its own biases and errors, including an unknown percentage of misclassified mutations. For this reason, instead of focusing on the net performance of each method on a particular dataset, we measured the systematic improvement of one method over the others across datasets. If one method outperforms the others on all proxy datasets, we can be confident that the improvement is not due to the particular bias of one data source.
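The idea of comparing methods across several imperfect proxy datasets can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used in the transFIC paper: the performance measure (a rank-based AUC), the function names `auc` and `consistent_improvement`, the method labels and all the scores below are made up for the example.

```python
from typing import Dict, List, Tuple

# (scores given to putative drivers, scores given to putative passengers)
Scores = Tuple[List[float], List[float]]


def auc(pos: List[float], neg: List[float]) -> float:
    """Fraction of (driver, passenger) pairs in which the driver gets the
    higher score; ties count as half. Assumed metric, chosen for simplicity."""
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def consistent_improvement(
    datasets: Dict[str, Dict[str, Scores]], candidate: str, baseline: str
) -> Tuple[Dict[str, float], bool]:
    """Return the per-dataset AUC difference (candidate minus baseline) and
    whether the candidate is better on every proxy dataset."""
    deltas = {
        name: auc(*by_method[candidate]) - auc(*by_method[baseline])
        for name, by_method in datasets.items()
    }
    return deltas, all(d > 0 for d in deltas.values())


# Hypothetical toy scores for two proxy datasets and two methods.
toy = {
    "proxy_A": {
        "new_method": ([0.9, 0.8, 0.7], [0.2, 0.3, 0.75]),
        "old_method": ([0.9, 0.4, 0.3], [0.2, 0.5, 0.6]),
    },
    "proxy_B": {
        "new_method": ([0.6, 0.9], [0.1, 0.5]),
        "old_method": ([0.6, 0.2], [0.1, 0.5]),
    },
}

deltas, consistent = consistent_improvement(toy, "new_method", "old_method")
print(deltas, consistent)
```

The absolute AUC on any single proxy dataset is not interpreted here, since each dataset contains an unknown share of mislabeled mutations. Only the sign of the difference, and whether it holds on every dataset, is used.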
We think that these datasets and this approach to evaluate methods to identify cancer drivers may be useful to other researchers interested in developing methods with aims similar to transFIC. For this reason we provide the datasets for download.
The nine datasets used to evaluate transFIC are described in the following table:
HumVar: Dataset of disease-related SNVs and neutral polymorphisms gathered by Adzhubei et al., 2010.
WG: (Whole genome) Dataset of somatic mutations pooled from different tumor exome-sequencing projects (see transFIC paper).