Version 3 of IntOGen!

We have worked hard over the last few months to release the new version of IntOGen, version 03, which we have published today. We are very proud of this new release, as it includes some major improvements. Although the look and functionality of the web service do not change, the data content and, therefore, the results have changed substantially. The changes can be summarized as follows:

The Pipeline: Christian has re-written the pipeline from scratch. The following are some of the new features of the current version:

  • The way studies are organized has changed (see below), and the pipeline is able to select and process them automatically based on their annotations.
  • Christian has developed a workflow management system called Wok. It is written in Python and is quite flexible, allowing fast development and supporting many different execution platforms. Both the state of execution and the logs generated by the jobs can be tracked through a web interface. As a result, the IntOGen pipeline is now much easier to run and validate.
  • In the new pipeline we have automated some processes that were performed manually before, such as the preprocessing of expression and Progenetix CNA data, the creation of the different databases involved in IntOGen and the generation of reports for quality control of the results.

Oncogenomic Data: We changed the oncogenomic data substantially. Alba, Michi, Christian and I have worked hard on organizing the data and annotating it thoroughly. Overall, the new data is of higher quality and better annotated. Here are the changes:

  • We created an ontology to annotate new data entities, i.e. samples and assays.
  • All the oncogenomic data was re-annotated using the ontology.
  • Our main source of CNA data is Progenetix. The current version of Progenetix excludes poor-resolution data and contains only CGH and aCGH data. In IntOGen v03 we have adopted a similar strategy, so the number of CNA studies has been reduced because poor-resolution ones have been excluded (see Table 1).
  • We added 50 new transcriptomic studies. We excluded some breast cancer studies because we detected assays common to multiple studies, and such duplicated data biases the results. Some studies were also excluded because they lacked the minimum clinical information (see Table 1).
  • In the case of some studies that existed in the previous version, some assays have been omitted due to lack of proper annotation.
  • We have updated COSMIC data with the help of Xavi. IntOGen v03 now includes data from COSMIC v53.
  • IntOGen v03 uses version 60 of Ensembl. This affects the mapping of the probes on a platform to Ensembl genes, which in turn changes the results.
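
As an illustration of the duplicate-assay check mentioned above, here is a minimal sketch in Python of how assays shared between studies could be detected. This is not the IntOGen pipeline code: the function name `find_shared_assays`, the study and assay identifiers, and the dictionary layout are all hypothetical and chosen only for the example.

```python
from collections import defaultdict


def find_shared_assays(studies):
    """Return the assay IDs that appear in more than one study.

    `studies` maps a study ID to the list of assay IDs it contains
    (a hypothetical layout, not the IntOGen data model).
    The result maps each shared assay ID to the sorted list of studies using it.
    """
    studies_by_assay = defaultdict(set)
    for study_id, assay_ids in studies.items():
        for assay_id in assay_ids:
            studies_by_assay[assay_id].add(study_id)
    # Keep only the assays that were seen in at least two different studies.
    return {
        assay_id: sorted(study_ids)
        for assay_id, study_ids in studies_by_assay.items()
        if len(study_ids) > 1
    }


# Made-up example data: assay_3 is reused by two studies.
studies = {
    "study_A": ["assay_1", "assay_2", "assay_3"],
    "study_B": ["assay_3", "assay_4"],
    "study_C": ["assay_5"],
}

shared = find_shared_assays(studies)
print(shared)  # {'assay_3': ['study_A', 'study_B']}
```

Studies flagged this way can then be reviewed, so that the same assay is not counted more than once when results are combined.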

As a result of all these changes in data content, some results in IntOGen v03 are different from those in v02. We recommend that all IntOGen users move to v03 of the data, as it is of higher quality.

                 Studies                  ICD-O Topographies       ICD-O Morphologies
                 version 2   version 3    version 2   version 3    version 2   version 3
Transcriptomic   73          118          20          24           30          47
Genomic          388         188          57          36           115         59
Total            461         306          57          39           124         85

Table 1. Comparison of the data content of versions 2 and 3 of IntOGen.

I would also like to remind you that we recently included ICGC somatic mutations in IntOGen. Look at this post for further information about browsing ICGC data in IntOGen.