We developed OncodriveROLE, a machine-learning-based approach that classifies cancer driver genes by their role in cancer development: Activating or Loss of Function. Here you can download the code of the method and browse the results of applying OncodriveROLE to two recently published lists of driver genes (HCDs and Cancer5000) in the respective tabs Plots, Gene classification and performance. You may adjust the cut-offs with the sliders on the left, download the results for the selected cut-offs, or download the classifier directly to use with your own data. For further information, please refer to the manuscript.
The OncodriveROLE.RData classifier is an R object that can easily be loaded and applied to your own driver catalog and data set. The only dependencies are R itself and the R library randomForest.
1. Open an R session and load the randomForest library:
library(randomForest)
2. Prepare the feature data, or download the example feature data (HCD list), and load it:
testset <- read.delim("testset.txt")
# choose whichever row names you prefer (sym or ensg)
rownames(testset) <- testset$sym
3. Download the classifier, load it, and use it as follows:
load("OncodriveROLE.RData")
result <- oncodriveROLE.classify(testset)
The result should look something like this:
         prob.Act  prob.LoF  OncodriveROLE
CFH      0.918     0.082     Activating
MATK     0.8515    0.1485    Activating
CSDE1    0.2515    0.7485    Loss of Function
BRCA1    0.002     0.998     Loss of Function
HGF      0.4235    0.5765    No class
BCLAF1   0.9225    0.0775    Activating
RANBP3   0.0365    0.9635    Loss of Function
TRIO     0.9655    0.0345    Activating
CDH1     0.0035    0.9965    Loss of Function
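The "No class" label for HGF suggests that a gene is assigned a role only when one of its two probabilities is high enough. A minimal sketch of such a rule in R follows. The function name assign_role and the cut-off of 0.7 are assumptions chosen for illustration; they are not necessarily the rule or the values used by OncodriveROLE.

```r
# Hypothetical helper: turn class probabilities into role labels.
# The cut-off of 0.7 is an illustrative assumption, not the published value.
assign_role <- function(prob_act, prob_lof, cutoff = 0.7) {
  ifelse(prob_act >= cutoff, "Activating",
         ifelse(prob_lof >= cutoff, "Loss of Function", "No class"))
}

# Probabilities for three genes copied from the example output
probs <- data.frame(
  prob.Act = c(0.918, 0.2515, 0.4235),
  prob.LoF = c(0.082, 0.7485, 0.5765),
  row.names = c("CFH", "CSDE1", "HGF")
)
probs$role <- assign_role(probs$prob.Act, probs$prob.LoF)
print(probs)
```

Raising the cut-off in a rule like this leaves more genes unclassified, while lowering it assigns a role to more borderline genes.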
The OncodriveROLE classifier has been constructed and applied based on the TCGA pan-cancer 17 mutational and CNA datasets, including data from the following cancer types:
BLCA, BRCA, COADREAD, GBM, HNSC, KIRC, LAML, LGG, LUAD, LUSC, OV, PRAD, SKCM, STAD, THCA, UCEC.
Here we provide the pre-computed values of the three features used for classification. Be aware that applying the classifier to passenger (non-driver) genes may yield apparently good classifications, but these are nothing more than false positives.
The feature data must contain the three following columns along with optional identifier columns:
MUTS_trunc_mutfreq  MUTS_clusters_miss_VS_pam  CNA_cbs_logratio_GvL
0.0580              1.079                      1
0.0666              0                          -1.17609
0.1739              0.7781                     0.9542
0.2637              0.6020                     -0.6989
0.1308              1.2430                     0.6368
0.0763              1.1139                     0.3521
0.1724              0.3010                     -0.8450
0.0656              0.6020                     1.7075
0.4107              0.1760                     -1.3222
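Before running the classifier on your own data, it may help to confirm that the three required columns are present and numeric. The sketch below shows one way to do this in R. The helper name check_features is hypothetical, and the check for missing values is a cautious assumption, since the text does not say whether the classifier tolerates them.

```r
# Column names taken from the feature description
required_cols <- c("MUTS_trunc_mutfreq",
                   "MUTS_clusters_miss_VS_pam",
                   "CNA_cbs_logratio_GvL")

# Hypothetical helper: stop early if a feature column is missing,
# non-numeric, or contains missing values (the NA check is an assumption).
check_features <- function(features) {
  missing_cols <- setdiff(required_cols, colnames(features))
  if (length(missing_cols) > 0) {
    stop("Missing feature columns: ", paste(missing_cols, collapse = ", "))
  }
  for (col in required_cols) {
    if (!is.numeric(features[[col]])) stop("Column is not numeric: ", col)
    if (anyNA(features[[col]])) stop("Column contains NA values: ", col)
  }
  invisible(TRUE)
}

# First two rows of the example feature values
example <- data.frame(
  MUTS_trunc_mutfreq = c(0.0580, 0.0666),
  MUTS_clusters_miss_VS_pam = c(1.079, 0),
  CNA_cbs_logratio_GvL = c(1, -1.17609)
)
check_features(example)
```

Identifier columns such as sym or ensg are ignored by this check, so it can be run on the data frame as loaded.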
All Ensembl identifiers correspond to version 70 of the Ensembl database.
If an error message appears below instead of a plot image, try opening the website in Firefox.