Skip to content

Biostumblematic

A biophysicist teaches himself how to code

I’m not sure if this behavior is normal, but I find that I learn a new system best by choosing a real-life problem that I need to solve and applying the new method in order to solve it. This inevitably means that I’ll probably be doing things in a non-efficient way (since I’m a noobie), but code can always be refined later.

Here is the problem I have in front of me today: I have a series of proteins from which I’d like to isolate (via cloning) a certain domain. The cDNA clones of the full length proteins are available from the IMAGE consortium. Unfortunately these aren’t completely “clean” cDNAs; there tends to be some extraneous sequence on both ends of the gene.

The plan of action goes something like this:
The starting materials are the cDNA sequence, the amino acid sequence of the protein, and the residue ranges of the domain of interest. So what I’d like to do is to check each frame of the cDNA to find the one matching the translated protein sequence, then extract just the cDNA coding for the domain I’d like to isolate. I can then (independently) design PCR primers for this domain.

You’re probably thinking that this could be done manually (and of course that’s true), but I find this painstaking work. Also it gives me a chance to play around with the SeqIO functions of BioPython a bit 🙂

Enough introduction, let’s get to work. The protein I’ll be using for this exercise is IKBKB. This is a 756 amino acid protein; I’ll be trying to get the cDNA for the protein kinase domain from residues 15-300. The IMAGE clone ID is 5784717.

Baby step 1 – find the ORF we’re interested in

#! /usr/bin/env python

# https://biostumblematic.wordpress.com

# Extraction of the cDNA
# for a given protein domain

import re
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC

input_cdna = raw_input('Paste your cDNA sequence >> ')
input_search = raw_input('What are the first amino acids of the protein? >> ')

cdna = Seq(input_cdna, IUPAC.unambiguous_dna)

i=0
while i < 3:
    frame = cdna&#91;i:150&#93;
    trans = frame.translate()
    orf_find = re.search(input_search, str(trans))
    if orf_find:
        orf_frame = i+1
    else:
        pass
    i += 1
print 'The protein is coded in frame '+str(orf_frame)
&#91;/sourcecode&#93;
Given the input of the cDNA and the first 4 residues (MSWS), this outputs the right answer:
<pre>The protein is coded in frame 3</pre>
Note that I'm only checking the first 50 residues of the cDNA (see line 19) Hopefully this is enough to catch the protein of interest (it would be a lot of extraneous 5' sequence if not).

Obviously this is not enough sexy for Biopython.  It's cumbersome to have to type in the starting sequence of the protein you're interested in, so why don't we let SeqIO handle that for us via the SwissProt code?  Also, we'll change a couple of things to enable automation of the full list later.

First, I made a file called 'test.csv' which has a single line consisting of SwissProt ID,cDNA:
<pre>O14920,atagccccggg[...]</pre>
Then I modified the script like so:

import re, csv
from Bio import ExPASy, SeqIO
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC

reader = csv.reader(open('test.csv'))
for row in reader:
    input_prot = row[0]
    get_prot = ExPASy.get_sprot_raw(input_prot)
    prot_obj = SeqIO.read(get_prot, "swiss")
    get_prot.close()
    prot_seq = prot_obj.seq
    prot_start = prot_seq[0:4]

    cdna = Seq(row[1], IUPAC.unambiguous_dna)

    i=0
    while i < 3:
        frame = cdna&#91;i:150&#93;
        trans = frame.translate()
        orf_find = re.search(str(prot_start), str(trans))
        if orf_find:
            orf_frame = i+1
        else:
            pass
        i += 1
    print 'The protein is coded in frame '+str(orf_frame)
&#91;/sourcecode&#93;
Biopython grabs the protein sequence from the web using the SwissProt ID.  The prot_start variable takes just the first few residues and uses that as the search term for the regular expression later on.  Now there is no command line input, as everything is done via the CSV file.  This will iterate over lines in the CSV file to do multiple proteins.  Right now, however, we would just get a long list of "The protein is coded in frame X" lines, which is less than useful.  Time to take care of that.

In this case the domain is annotated in SwissProt already.  This means that I <em>could</em> use the <a href="http://biopython.org/DIST/docs/tutorial/Tutorial.html#chapter:swiss_prot">built-in parsing function</a> of BioPython to select the domain, however I have some custom annotations for other proteins in my list that make this not a good idea in this case.  Instead let's just make some minor modifications to our input CSV and existing script.  The new CSV includes the start and stop residues of interest:
<pre>O14920,15,300,atagccccgggttt[...]</pre>
Now I just modify the top of the script to take into account the new structure of the CSV:

reader = csv.reader(open('test.csv'))
for row in reader:
    input_prot = row[0]
    get_prot = ExPASy.get_sprot_raw(input_prot)
    prot_obj = SeqIO.read(get_prot, "swiss")
    get_prot.close()
    prot_seq = prot_obj.seq
    prot_domain = prot_seq[int(row[1])-1:int(row[2])]
    cdna = Seq(row[3], IUPAC.unambiguous_dna)

and adjust what happens if the script finds a match to the domain sequence:

        if orf_find:
            trans_split = re.split('('+str(prot_domain)+')', str(trans))
            cdna_start = len(trans_split[0])*3
            cdna_stop = cdna_start + len(trans_split[1])*3
            cdna_extracted = frame[cdna_start:cdna_stop]
            print cdna_extracted

And this gets the job done! This prints a cDNA sequence which, when translated back, matches the domain of interest.

The last part there feels sloppy. All I’m doing is counting the number of amino acids that come out of the translation before the start of the domain, then multiplying by three, and getting the cDNA start from this. I feel like there must should be a way to transition more effectively between protein and cDNA sequence.

The entire script, slightly modified to write out the results to a new CSV file, is available over on GitHub. I hope you found the post interesting and look forward to your comments.