Biostumblematic

A biophysicist teaches himself how to code

I managed to find a paper in which some of the analysis I’ve been working on had been done. Unfortunately the raw results of the analysis were just that – raw. Specifically they had been dumped into a 6.8 MB text file as a supplement to the paper.

In order to extract the information I was interested in, and to prove to people who read this that I don’t solve all of my problems with Python, I thought I’d share the quick code I used.

First of all, I wanted all of the lines that reported proteins from humans. This turned out to be workable by running:

grep 'Homo sapiens' infile.txt > outfile.txt

This gave me a long list which helpfully had each line starting with the NCBI GI number for the protein of interest. Extracting just the GI numbers involved:

cut -c 1-11 outfile.txt > GI_list.txt

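then to trim the whitespace (writing to a new file, since redirecting the output back onto the input file would empty it before sed gets to read it):

sed 's/^[ \t]*//;s/[ \t]*$//' GI_list.txt > GI_list_trimmed.txt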

(this last one took some help from the handy sed one-liners page)
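
For what it's worth, the three steps can probably be chained into a single pipeline with no intermediate files. The following is only a sketch: it assumes, as above, that the GI number sits in the first 11 characters of each line, and the function name extract_human_gi is made up for illustration. It uses [[:space:]] rather than \t because not every sed understands \t.

```shell
#!/bin/sh
# Hypothetical helper: print the trimmed GI numbers of the human entries in a file.
# Assumes the GI number occupies the first 11 characters of each line.
extract_human_gi() {
    # 1. keep only the lines mentioning Homo sapiens
    # 2. take the first 11 characters (the GI field)
    # 3. strip leading and trailing whitespace
    grep 'Homo sapiens' "$1" \
        | cut -c 1-11 \
        | sed 's/^[[:space:]]*//;s/[[:space:]]*$//'
}

# Usage: extract_human_gi infile.txt > GI_list.txt
```

Because nothing is read from and written to the same file, the truncation problem never comes up.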

The entire process took about a quarter of the time I’ve just spent writing it up, and I now have a nicely-formatted 11 kB file which I can use as input to my next round of tasks.
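
Before feeding a list like this into anything else, a quick sanity check doesn't hurt. The sketch below counts the lines and reports how many are not purely digits, which is what a bare GI number should be; the function name check_gi_list is hypothetical.

```shell
#!/bin/sh
# Hypothetical check: report the line count and how many lines are not bare integers.
check_gi_list() {
    # grep -c '' counts every line, including empty ones
    total=$(grep -c '' "$1")
    # -v inverts the match, so this counts lines that are NOT all digits;
    # "|| true" because grep exits non-zero when that count is 0
    bad=$(grep -cvE '^[0-9]+$' "$1" || true)
    echo "lines: $total, non-numeric: $bad"
}

# Usage: check_gi_list GI_list.txt
```

A non-zero second number would suggest that some GI fields are wider than 11 characters or that whitespace survived the trimming.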