Classifying cancer driver genes into Loss of Function and Activating roles.

We developed OncodriveROLE, a machine-learning-based approach that classifies cancer driver genes into Activating or Loss of Function roles in cancer development. Here you can download the code of the method and browse the results of applying OncodriveROLE to two recently published lists of driver genes (HCDs and Cancer5000) in the respective tabs Plots, Gene classification and performance. You can adjust the cut-offs with the sliders on the left, download the results for the selected cut-offs, or download the classifier directly to use with your own data. For further information, please refer to the manuscript.

Download classification

OncodriveROLE classifier


The OncodriveROLE.RData classifier is an R object that can easily be loaded and applied to your own driver catalog and data set. The only dependencies are R itself and the R package randomForest.


1. Open an R session and load the randomForest library
library(randomForest)

2. Prepare the feature data, or download the example feature data (HCD list), and load it
testset <- read.delim("testset.txt")
# choose the row names you prefer (sym or ensg)
rownames(testset) <- testset$sym

3. Download the classifier OncodriveROLE.RData, then load and use it as follows:
load("OncodriveROLE.RData")
result <- oncodriveROLE.classify(testset)
The result should look something like this:

        prob.Act  prob.LoF  OncodriveROLE
CFH     0.918     0.082     Activating
MATK    0.8515    0.1485    Activating
CSDE1   0.2515    0.7485    Loss of Function
BRCA1   0.002     0.998     Loss of Function
HGF     0.4235    0.5765    No class
BCLAF1  0.9225    0.0775    Activating
RANBP3  0.0365    0.9635    Loss of Function
TRIO    0.9655    0.0345    Activating
CDH1    0.0035    0.9965    Loss of Function
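
The labels in the last column depend on the probability cut-offs. The following is a minimal sketch of how such cut-offs could be applied to the two probability columns. It assumes a cut-off of 0.7 for both classes, chosen only because it reproduces the example labels (the cut-offs actually used by OncodriveROLE are not stated here), and assign_role is a hypothetical helper, not part of the classifier.

```r
# Probabilities copied from the example output above.
probs <- data.frame(
  prob.Act = c(0.918, 0.8515, 0.2515, 0.002, 0.4235, 0.9225, 0.0365, 0.9655, 0.0035),
  prob.LoF = c(0.082, 0.1485, 0.7485, 0.998, 0.5765, 0.0775, 0.9635, 0.0345, 0.9965),
  row.names = c("CFH", "MATK", "CSDE1", "BRCA1", "HGF",
                "BCLAF1", "RANBP3", "TRIO", "CDH1")
)

# Hypothetical helper: the 0.7 defaults are an assumption for illustration.
assign_role <- function(prob_act, prob_lof, act_cutoff = 0.7, lof_cutoff = 0.7) {
  ifelse(prob_act >= act_cutoff, "Activating",
         ifelse(prob_lof >= lof_cutoff, "Loss of Function", "No class"))
}

probs$role <- assign_role(probs$prob.Act, probs$prob.LoF)

# Count how many genes fall into each class at these cut-offs.
print(table(probs$role))
```

Raising either cut-off moves borderline genes such as HGF, and eventually others, into "No class", which is the trade-off the sliders expose.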

Pre-computed features

The OncodriveROLE classifier was built and applied using the TCGA pan-cancer 17 mutation and copy number alteration (CNA) datasets, which include data from the following TCGA projects: BLCA, BRCA, COADREAD, GBM, HNSC, KIRC, LAML, LGG, LUAD, LUSC, OV, PRAD, SKCM, STAD, THCA, UCEC.
Here we provide the pre-computed values of the three features used for classification. Be aware that applying the classifier to passenger (non-driver) genes may yield apparently good classifications, but these are nothing more than false positives.

HCD features: download
Cancer5000 features: download
PAM count: download

The feature data must contain the following three columns, optionally along with identifier columns:

MUTS_trunc_mutfreq  MUTS_clusters_miss_VS_pam  CNA_cbs_logratio_GvL
0.0580              1.079                      1
0.0666              0                          -1.17609
0.1739              0.7781                     0.9542
0.2637              0.6020                     -0.6989
0.1308              1.2430                     0.6368
0.0763              1.1139                     0.3521
0.1724              0.3010                     -0.8450
0.0656              0.6020                     1.7075
0.4107              0.1760                     -1.3222
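
Before running the classifier on your own data, it may help to confirm that the three required columns are present. The sketch below shows one way to do this. check_features is a hypothetical helper (not part of OncodriveROLE), the gene identifiers are placeholders, and the requirement that the columns be numeric is an assumption based on the example values above.

```r
# Column names taken from the feature table above.
required_cols <- c("MUTS_trunc_mutfreq",
                   "MUTS_clusters_miss_VS_pam",
                   "CNA_cbs_logratio_GvL")

# Hypothetical helper: stops with an informative message if the feature
# table is missing a required column or a required column is not numeric.
check_features <- function(features) {
  missing_cols <- setdiff(required_cols, colnames(features))
  if (length(missing_cols) > 0) {
    stop("Missing feature columns: ", paste(missing_cols, collapse = ", "))
  }
  is_num <- vapply(features[required_cols], is.numeric, logical(1))
  if (!all(is_num)) {
    stop("Non-numeric feature columns: ",
         paste(required_cols[!is_num], collapse = ", "))
  }
  invisible(TRUE)
}

# Small example table; "sym" values are placeholders, feature values are
# the first two rows of the table above.
features <- data.frame(
  sym = c("GENE_A", "GENE_B"),
  MUTS_trunc_mutfreq = c(0.0580, 0.0666),
  MUTS_clusters_miss_VS_pam = c(1.079, 0),
  CNA_cbs_logratio_GvL = c(1, -1.17609)
)

check_features(features)
```

Extra identifier columns such as sym are ignored by the check, in line with the note that they are optional.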

All Ensembl identifiers refer to release 70 (v70) of the Ensembl database.
