Tuesday, May 6, 2008

Python for text processing

I'm planning to teach a course on bioinformatics next spring. We're going to use this book by Higgs and Attwood. It has great coverage of the important topics, teaches some of the math, and is very clearly written. I'm working up a bunch of examples (maybe too many) showing the power of Python for modeling and simulation. To simplify the graphics, I usually have Python save the data to text files, which I then open and plot using R.

Here is an example for text processing. We start with the most famous novel in English. Here is a scan of the first page of chapter 1 from my bound copy:



I love the layout. I got the text from Project Gutenberg (previous link) and modified it by stripping out some header text. We use Python to open the file and count every combination of two adjacent letters. The counts for all 26 x 26 possible pairs are saved in a dictionary, like this:
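As a rough illustration of the counting step, here is a minimal sketch. It is not the original code from this post; the function name count_digrams is made up for the example, and it assumes that case is ignored and that pairs containing a space, digit, or punctuation mark are simply skipped.

```python
from string import ascii_lowercase


def count_digrams(text):
    """Count adjacent letter pairs in text (hypothetical helper).

    Assumptions: case is ignored, and any pair that includes a
    non-letter (space, digit, punctuation) is not counted.
    """
    # Start every one of the 26 x 26 possible pairs at zero.
    counts = {a + b: 0 for a in ascii_lowercase for b in ascii_lowercase}
    text = text.lower()
    # Pair each character with the one that follows it.
    for first, second in zip(text, text[1:]):
        pair = first + second
        if pair in counts:
            counts[pair] += 1
    return counts


demo = count_digrams("The then")
print(len(demo), demo["th"], demo["he"], demo["en"])
```

For the real text, one would read the whole file into a string first and pass that string to the function.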



After writing the results to disk, we open the file with R and plot it. The combined code weighs in at a hefty 70 lines or so. The plot looks pretty impressive. The size of each circle is proportional to the logarithm of the count.
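The step of writing the results to disk might look something like the sketch below. The file layout (one tab-separated line per pair: first letter, second letter, count) and the name write_counts are assumptions made for illustration, not the format actually used for the plot.

```python
import io


def write_counts(counts, handle):
    """Write one tab-separated line per letter pair (assumed layout)."""
    for pair in sorted(counts):
        # Columns: first letter, second letter, count.
        handle.write(f"{pair[0]}\t{pair[1]}\t{counts[pair]}\n")


# Demonstrate with an in-memory buffer; a real run would pass an open file.
buffer = io.StringIO()
write_counts({"th": 2, "he": 1}, buffer)
print(buffer.getvalue(), end="")
```

A tab-separated file like this can be read in R with read.table. Note that pairs with a count of zero need special handling before taking the logarithm.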



Update: I realize now that this might not be exactly what you want. One might be more interested in how the frequency of the second letter varies, given the first letter. For that, we need to normalize by the frequency of the first letter.
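One way to do that normalization is sketched here, under an assumption: the frequency of the first letter is taken to be the total count of all pairs that start with that letter, so the values for each first letter sum to 1. The name conditional_frequencies is hypothetical.

```python
from collections import defaultdict


def conditional_frequencies(counts):
    """Estimate the frequency of the second letter given the first.

    Assumption: each pair count is divided by the total count of
    pairs sharing the same first letter. First letters that never
    start a pair are left out to avoid dividing by zero.
    """
    totals = defaultdict(int)
    for pair, n in counts.items():
        totals[pair[0]] += n
    return {
        pair: n / totals[pair[0]]
        for pair, n in counts.items()
        if totals[pair[0]] > 0
    }


freqs = conditional_frequencies({"th": 3, "te": 1, "he": 2, "qa": 0})
print(freqs["th"], freqs["te"], freqs["he"])
```

With this scaling, a rare first letter such as q no longer looks unimportant just because it is rare: what is compared is how its followers are distributed.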