Thursday, May 8, 2008

Scripting NCBI using eUtils and Python

Using the urllib2 module in the Python standard library, we could send requests directly to the same Entrez server that handles normal interactive queries. But that would not be nice. Instead, NCBI asks that we use eUtils. In addition, large jobs (more than 100 requests) should be run after hours, and we should make no more than one request every 3 seconds. Finally, they would like to have (but do not require) an email address and other information to help them help people who run really big jobs. That isn't necessary for us. We just have to figure out how eUtils works, and limit our requests.
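Just to show what an eUtils request looks like over plain HTTP, here is a minimal sketch with urllib2. This is not the code from my scripts; the tool name and email address are placeholders, and a timestamp check enforces the one-request-every-3-seconds rule.

# A minimal sketch (not the code from my scripts) of a raw eUtils request.
# The tool name and email address below are placeholders.
import time
import urllib
import urllib2

EUTILS = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/'
last_time = 0.0

def eutils_request(script, **params):
    global last_time
    wait = 3.0 - (time.time() - last_time)
    if wait > 0:
        time.sleep(wait)          # no more than one request every 3 seconds
    last_time = time.time()
    params.setdefault('tool', 'myscript')         # optional, but polite
    params.setdefault('email', 'me@example.com')  # placeholder address
    url = EUTILS + script + '?' + urllib.urlencode(params)
    return urllib2.urlopen(url).read()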

To limit requests to NCBI, I use a class from Biopython called RequestLimiter. The first example is almost the same as the one I posted the other day, but it adds the RequestLimiter. It uses EFetch to grab some FASTA-formatted sequences from GenBank; the previous post showed how to use the Biopython Fasta parser on the resulting data. Here we do a bit of random testing to see how robust the code is: I pick ids at random and ask for them. It seems to work fine. Occasionally we see this:

# Error: 106820 is confidential: access denied
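Rather than paste the whole script, here is a stripped-down sketch of the idea, with a plain time.sleep() standing in for Biopython's RequestLimiter. The database and the range the random GI numbers are drawn from are arbitrary choices for illustration.

# A stripped-down sketch of the first example.  time.sleep() stands in
# for Biopython's RequestLimiter; db and the id range are arbitrary.
import random
import time
import urllib
import urllib2

EUTILS = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/'

def efetch_fasta(gi, db='protein'):
    params = {'db': db, 'id': gi,
              'rettype': 'fasta', 'retmode': 'text'}
    url = EUTILS + 'efetch.fcgi?' + urllib.urlencode(params)
    return urllib2.urlopen(url).read()

# pick some ids at random and ask for them
for i in range(3):
    gi = str(random.randint(100000, 200000))
    print efetch_fasta(gi)
    time.sleep(3)     # stay under one request every 3 seconds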

The second example extends this to use WebEnv. It has two parts: first we use ESearch to send a request to NCBI, but we ask them to save the results. They send back a kind of cookie called WebEnv that lets us access those results later. I use the xml.etree module, which I mentioned briefly before, to parse the XML returned by ESearch. One important note: xml.etree is only in the standard library as of Python 2.5 (which ships with Leopard); with earlier versions you'll have to install ElementTree yourself, which should be easy. Once we have the value of the cookie, we use EFetch to get the data. Much more complicated scenarios are possible and are described elsewhere, but I haven't tried them yet. This is just to get you going.
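Here is a minimal sketch of the two-step WebEnv idea (not the full second example): ESearch with usehistory=y, a little xml.etree parsing to pull out the cookie, then EFetch against the saved result set. The search term is just an example I made up, but WebEnv, QueryKey and Count are the tags ESearch actually returns.

# A sketch of the two-step WebEnv dance.  The search term is an example.
import time
import urllib
import urllib2
from xml.etree import ElementTree   # standard library as of Python 2.5

EUTILS = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/'

# step 1:  ESearch, asking NCBI to hold the results for us
search = {'db': 'protein',
          'term': 'cytochrome c AND Homo sapiens[orgn]',   # example query
          'usehistory': 'y'}
xml = urllib2.urlopen(EUTILS + 'esearch.fcgi?' +
                      urllib.urlencode(search)).read()
tree = ElementTree.fromstring(xml)
webenv = tree.findtext('WebEnv')
query_key = tree.findtext('QueryKey')
print 'found', tree.findtext('Count'), 'records'

time.sleep(3)   # be polite between requests

# step 2:  EFetch, using the WebEnv "cookie" to retrieve a few records
fetch = {'db': 'protein',
         'WebEnv': webenv,
         'query_key': query_key,
         'rettype': 'fasta',
         'retmode': 'text',
         'retstart': 0,
         'retmax': 5}
print urllib2.urlopen(EUTILS + 'efetch.fcgi?' +
                      urllib.urlencode(fetch)).read()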