Skip to content

Biostumblematic

A biophysicist teaches himself how to code

For some reason it seems that every program which will output a percentage of the identity between two proteins will also align them itself – therefore screwing up any alignment which you’ve already made. I knocked up a short script over the weekend which will read in a FASTA-formatted alignment and output the percent identity of all of the proteins in it to the first one in the file.

I couldn’t find a built-in way to do this all in BioPython, but I did use it to parse the seqences out of the alignment. The rest of the work is just brute force string crunching.

#!/usr/bin/env python
# https://biostumblematic.wordpress.com

import string
from Bio import AlignIO

# change input.fasta to match your alignment
input_handle = open("input.fasta", "rU")
alignment = AlignIO.read(input_handle, "fasta")

j=0 # counts positions in first sequence
i=0 # counts identity hits 
for record in alignment:
    for amino_acid in record.seq:
        if amino_acid == '-':
            pass
        else:
            if amino_acid == alignment[0].seq[j]:
                i += 1
        j += 1
    j = 0
    seq = str(record.seq)
    gap_strip = seq.replace('-', '')
    percent = 100*i/len(gap_strip)
    print record.id+' '+str(percent)
    i=0

I didn’t implement similarity here, but it gets the basic job done. This script is available on GitHub as seqhomology.py

Advertisements

%d bloggers like this: