Understanding the hierarchy of your GO Terms subset

The other day I downloaded the cancer-affected Gene Ontology (GO) terms for up- and down-regulation in a few tissues from IntOGen via its Biomart interface. Since I was only interested in the GO Cellular Component (CC) terms, I added a file containing all the GO CC terms as a filter for the Biomart export.

So then… what do you do when you have a list of GO terms? Even with only 100 GO terms, it is quite hard to get an idea of which compartments are affected. To understand the list better, you have to identify the more general terms that are affected. Here I quickly explain how I solved this problem.

  • Ingredient 1: With the help of the BioStar community I found an easy way to query for the descendants of a given GO term. With that I can work out the hierarchical relationships between my GO terms.
  • Ingredient 2: I needed to build a hierarchical tree of my GO terms. For this tree representation I adapted the Node class and tree function from this blog entry.
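To make Ingredient 1 concrete, here is a minimal sketch of the "direct descendants within my subset" query. It runs against an in-memory SQLite database that only mimics the two tables the query touches. The rows, the `TOY:` accessions and the helper name `direct_children` are invented for illustration, and SQLite uses `?` placeholders where MySQLdb uses `%s`.

```python
import sqlite3

# Toy stand-in for the GO database: only the tables and columns that the
# descendant query touches. All rows and accessions are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE term (id INTEGER PRIMARY KEY, acc TEXT, name TEXT);
    CREATE TABLE graph_path (term1_id INTEGER, term2_id INTEGER, distance INTEGER);
""")
conn.executemany("INSERT INTO term VALUES (?, ?, ?)", [
    (1, "TOY:1", "cell projection"),
    (2, "TOY:2", "neuron projection"),
    (3, "TOY:3", "axon"),
    (4, "TOY:4", "ruffle"),
])
# The toy graph_path lists ancestor/descendant pairs with their distance,
# including each term paired with itself at distance 0.
conn.executemany("INSERT INTO graph_path VALUES (?, ?, ?)", [
    (1, 1, 0), (2, 2, 0), (3, 3, 0), (4, 4, 0),
    (1, 2, 1), (1, 4, 1), (2, 3, 1),
    (1, 3, 2),
])


def direct_children(conn, parent_name, subset):
    """Return (acc, name) of terms in `subset` exactly one step below the parent."""
    # One placeholder per subset term for the IN (...) clause.
    placeholders = ", ".join("?" for _ in subset)
    query = (
        "SELECT DISTINCT descendant.acc, descendant.name "
        "FROM term "
        "INNER JOIN graph_path ON term.id = graph_path.term1_id "
        "INNER JOIN term AS descendant ON descendant.id = graph_path.term2_id "
        "WHERE term.name = ? AND graph_path.distance = 1 "
        "AND descendant.name IN (%s) "
        "ORDER BY descendant.name" % placeholders
    )
    return conn.execute(query, [parent_name] + list(subset)).fetchall()


# "ruffle" is left out of the subset on purpose, so it is never reported.
subset = ["cell projection", "neuron projection", "axon"]
for name in subset:
    print(name, "->", direct_children(conn, name, subset))
```

Restricting the distance to 1 drops both the term itself and indirect descendants such as "axon" under "cell projection", so every result row is a parent/child edge of the tree.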

So I mixed the two ingredients in a Python script that re-creates the ontology hierarchy for my subset of GO terms and prints it. This way I can see which trees my terms collapse into and get a structured overview.
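The core idea of re-creating the hierarchy can be sketched without any database: given parent/child pairs, every term that never appears as a child becomes a top-level entry, and everything else is printed below its parent. This is a simplified illustration, not the script itself. The function names `build_forest` and `render` and the hard-coded edges are assumptions made for the example, and it relies on the edges being free of cycles.

```python
from collections import defaultdict


def build_forest(edges, terms):
    """Group children by parent and find the terms that have no parent."""
    children = defaultdict(list)
    has_parent = set()
    for parent, child in edges:
        children[parent].append(child)
        has_parent.add(child)
    roots = sorted(t for t in terms if t not in has_parent)
    return roots, children


def render(names, children, level=0):
    """Return the indented lines for `names` and everything below them."""
    lines = []
    for name in names:
        lines.append("    " * level + "| " + name)
        lines.extend(render(sorted(children[name]), children, level + 1))
    return lines


# Hypothetical parent/child pairs, e.g. as returned by a descendant query.
edges = [
    ("cell projection", "neuron projection"),
    ("cell projection", "ruffle"),
    ("neuron projection", "axon"),
    ("neuron projection", "dendrite"),
    ("dendrite", "dendritic spine"),
]
terms = ["axon", "cell projection", "cell surface", "dendrite",
         "dendritic spine", "neuron projection", "ruffle"]

roots, children = build_forest(edges, terms)
lines = render(roots, children)
print("\n".join(lines))
```

A term with several parents in the subset shows up once under each of them, which is usually what you want when reading an ontology as a tree.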

Of course, this is most helpful if your list is relatively small. Also, once I removed all the part-terms (the ones that follow the pattern “Some-compartment part”), the resulting tree structure became less repetitive.
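Removing the part-terms can be as simple as a name filter. The sketch below assumes that a part-term is any name ending in the separate word "part"; the helper name `drop_part_terms` and the sample names are only for illustration.

```python
import re

# A part-term ends with the separate word "part", e.g. "nuclear part".
PART_SUFFIX = re.compile(r"\s+part$")


def drop_part_terms(names):
    """Keep only the term names that do not end in ' part'."""
    return [name for name in names if not PART_SUFFIX.search(name)]


names = ["nuclear part", "nucleus", "cell projection part", "cell projection"]
print(drop_part_terms(names))
```

Because the pattern requires whitespace before "part", names that merely end in those four letters are left alone.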

The output of the Python script looks like this (an excerpt):

| integral to membrane
    | integral to plasma membrane
        | integrin complex
        | voltage-gated potassium channel complex
| cell projection
    | ruffle
    | microvillus
    | neuron projection
        | axon
        | dendrite
            | dendritic spine
        | growth cone
| membrane fraction
    | synaptosome
    | vesicular fraction
        | microsome
| proteasome complex
    | proteasome core complex
| cell surface
    | external side of plasma membrane
| chromosome
    | condensed chromosome
        | condensed chromosome, centromeric region
            | condensed chromosome kinetochore

Finally, here is the Python script, if you are interested:

# settings
###########################################################3
filename = "intogen-combinations.go-cc.significant.unique.nopart.tsv"
go_term_col = 2     # the column with the go term names
header_length = 1   # nb of rows header is occupying in the file

# database access
###########################################################3
import MySQLdb
db = MySQLdb.connect(host="mysql.ebi.ac.uk",
                     user="go_select",
                     db="go_latest",
                     passwd="amigo",
                     port=4085)
cur = db.cursor()

# Create class and functions needed for tree representation
###########################################################3

# constants
ROOT_NODE = "the_root_node"
ORPHAN_NODE = "an_orphan_node"

class Node:
    def __init__(self, n, s):
        self.id = n
        self.title = s
        self.children = []

def add_node(tree, nodeId, title, parentId=ROOT_NODE):
    # use the tree passed in as argument, not the global treeMap
    newNode = False
    if nodeId not in tree:
        newNode = True
        tree[nodeId] = Node(nodeId, title)
    else:
        tree[nodeId].id = nodeId
        tree[nodeId].title = title
        # the node was parked under the root until now; detach it there
        # only if it is about to get a real parent
        if parentId != ROOT_NODE and tree[nodeId] in tree[ROOT_NODE].children:
            tree[ROOT_NODE].children.remove(tree[nodeId])

    # unknown parent: create it and hang it under the root for now
    if parentId not in tree and parentId != ROOT_NODE:
        tree[parentId] = Node(parentId, parentId)
        tree[ROOT_NODE].children.append(tree[parentId])

    if parentId != ROOT_NODE or newNode:
        tree[parentId].children.append(tree[nodeId])

def print_map(node, lvl=0):
    # sort siblings by title; key= and print(...) work in Python 2 and 3
    for n in sorted(node.children, key=lambda child: child.title):
        print('\t' * lvl + "| " + n.title)
        if n.children:
            print_map(n, lvl+1)

treeMap = {}
Root = Node(ROOT_NODE, ROOT_NODE)
treeMap[Root.id] = Root

# read the file with your go term names
###########################################################3
go_terms = []
# the with-block closes the file once all lines are read
with open(filename, "r") as f:
    for line in f:
        cols = line.rstrip().split("\t")
        go_terms.append(cols[go_term_col-1])
# drop the header rows
del go_terms[:header_length]

# query descendants and reconstruct tree
###########################################################3
# one %s placeholder per GO term for the IN (...) clause
placeholders = ', '.join('%s' for unused in go_terms)
# %%s survives the string formatting as %s, so the term name is passed as a
# bound parameter and names containing quotes cannot break the query;
# distance = 1 keeps only the direct children
query = "SELECT DISTINCT descendant.acc, descendant.name \
    FROM \
     term \
     INNER JOIN graph_path ON (term.id=graph_path.term1_id) \
     INNER JOIN term AS descendant ON (descendant.id=graph_path.term2_id) \
    WHERE term.name=%%s \
     AND distance = 1 \
     AND descendant.name IN (%s);" % placeholders
for t in go_terms:
    cur.execute(query, [t] + go_terms)
    output = cur.fetchall()
    if len(output) == 0:
        add_node(treeMap,t,t)
        #if t == "mitochondrial membrane": print t,"added","solo"
    for o in output:
        add_node(treeMap, o[1],o[1],t)
        #if o[1] == "mitochondrial membrane": print o[1],"added","aschildof",t

cur.close()
db.close()

print_map(Root)