

A biophysicist teaches himself how to code

Well, let’s continue in my theme of hashing out something quick and dirty to get something done. Today I needed to pull some protein names out of an XML file. I had done something like it before (extracting crystallization conditions out of PDB files), but of course I had managed to lose the previous script.

The Python module I used for parsing the XML the first time around was ElementTree, and so I decided to use that to see if things would come back to me. The tricky part for me with ElementTree is stepping through the hierarchy inherent in the XML file. Anyway, code first then I’ll walk through it.

#! /usr/bin/env python
# Parse the XML output of STRING and print each
# interacting protein's short label and full name
import xml.etree.ElementTree as ET

# ElementTree writes a namespaced tag as '{namespace}tag'
ns = '{net:sf:psidev:mi}'

tree = ET.parse('cdc20_interactions.xml')

# Map each interactor's short label to its full name
interactors = {}
# iter() is the current name for the older getiterator()
for item in tree.iter(ns+'interactor'):
    shortname = item.find(ns+'names').findtext(ns+'shortLabel')
    fullname = item.find(ns+'names').findtext(ns+'fullName')
    interactors[shortname] = fullname
for interactor in interactors:
    print(interactor+','+interactors[interactor])

It’s pretty straightforward, although with some coding-beginner quirks. Since I’m not clear on how ElementTree deals with namespaces, I just define the namespace as a string (ns) early on and prepend it to every tag name. I then set up an empty dictionary to hold the key:value pairs. The for loop uses the ‘getiterator’ function (called ‘iter’ in newer versions of ElementTree) to search the XML tree for every tag labeled ‘interactor’. For each one, we use item.find to look for a child called ‘names’ and .findtext to get the text inside a child of ‘names’ called ‘shortLabel’. The next line does pretty much the same thing for ‘fullName’. We then assign these to the dictionary as a key:value pair and finally print the whole thing.
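
On the namespace question: ElementTree's find, findtext and findall also accept a dictionary that maps a prefix of your choosing to the namespace, which saves gluing the string onto every tag. Here is a minimal sketch of that approach. The SAMPLE XML is a made-up, heavily trimmed stand-in for the STRING output, and the ‘mi’ prefix and the read_interactors helper are names I invented for illustration, so treat it as one possible way to do it rather than the way.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed stand-in for the STRING output; the real
# file has more elements and attributes than are shown here.
SAMPLE = """\
<entrySet xmlns="net:sf:psidev:mi">
  <entry>
    <interactorList>
      <interactor id="1">
        <names>
          <shortLabel>CDC20</shortLabel>
          <fullName>cell division cycle 20</fullName>
        </names>
      </interactor>
      <interactor id="2">
        <names>
          <shortLabel>MAD2L1</shortLabel>
        </names>
      </interactor>
    </interactorList>
  </entry>
</entrySet>
"""

# Map a prefix of our own choosing ('mi') to the namespace.
NS = {'mi': 'net:sf:psidev:mi'}


def read_interactors(root):
    """Return {shortLabel: fullName} for every interactor under root."""
    names = {}
    # './/' searches all descendants, not just direct children.
    for item in root.findall('.//mi:interactor', NS):
        # A path like 'a/b' steps down through the hierarchy in one call.
        short = item.findtext('mi:names/mi:shortLabel', namespaces=NS)
        # default='' avoids None when an interactor has no fullName.
        full = item.findtext('mi:names/mi:fullName', default='', namespaces=NS)
        if short is not None:
            names[short] = full
    return names


root = ET.fromstring(SAMPLE)
interactors = read_interactors(root)
for short in sorted(interactors):
    print(short + ',' + interactors[short])
```

The path form also sidesteps a weak spot in chaining item.find(...).findtext(...): if an interactor has no ‘names’ child, find returns None and the chained call raises an AttributeError, whereas findtext with a path just returns the default.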

So, that’s it. 23 lines including lots of whitespace and comments, and a hell of a lot easier than yanking the information out by hand.

