Thursday, March 3, 2011

Handling large sequence sets (4)

Just to recap, I'm doing a project involving a paper:
Gawronski et al 2009 PMID 19805314

They reported an STM study of Haemophilus infuenzae in the murine lung. The data we're working with consists of > 500,000 Illumina reads from the library (before in vivo selection), which we previously mapped to the genome sequence and processed to remove duplicated targets.

The results are in unique.txt (8.7 MB), in a format like this:


5859 706010
5871 844100
..


where the first element is the sequence id and the second is the genome position. I wrote a script (consolidate.py) to process this info into a list of genome positions and the count of reads recovered for each site. That's in ordered.txt, in a format like this:


1010 8
1011 4
1016 1
..


The next step is to get information about the known genes. I should use PyCogent for this, but the example I wrote up for the Cookbook isn't working for this right now. I'll have to look into that issue but I'm in a hurry so.. I go to the Genome project page and the Genes link and download a file called gene_result.txt. It looks like this:


1. accA
acetyl-CoA carboxylase carboxyltransferase subunit alpha[Haemophilus influenzae Rd KW20]
Other Aliases: HI0406
Genomic context: Chromosome
Annotation: NC_000907.1 (427035..427982, complement)
ID: 949505

2. accB
acetyl-CoA carboxylase biotin carboxyl carrier protein subunit[Haemophilus influenzae Rd KW20]
Other Aliases: HI0971
Genomic context: Chromosome
Annotation: NC_000907.1 (1028137..1028604)
ID: 949974
..


I wrote a simple script (extract_genes.py) to parse that info into this format:


accA pep acetyl-CoA_carboxylase_carboxyltransferase_subunit_alpha 427035 427982 -
accB pep acetyl-CoA_carboxylase_biotin_carboxyl_carrier_protein_subunit 1028137 1028604 +
accC pep acetyl-CoA_carboxylase_biotin_carboxylase_subunit 1028779 1030125 +
..


The second element indicates whether the gene encodes RNA or pep (protein). That is in genes.txt. My totals are not quite matching the official numbers, but this is only a demo so we're moving on.

The next step is to use the gene info to assign a gene to each hit, if it's in a gene. The script to do that is assign.py, and the results are in assigned.txt and they look like this:


1010 8 gapdH
1011 4 gapdH
1016 1 gapdH
1023 1
1024 6
..


The next step is to reverse the association. Rather than position => number of hits => gene, I want gene => position => number of hits. The script for that is reverse.py and the results are in genehits.txt. They look like this:


accA
427048 5
427289 2
427942 13
427953 2
427958 1

accB
1028551 3
1028602 1


Now (finally) we can do something interesting. In analyze.py, we examine various statistics for each gene. At first, we will look only at genes that suffered hits. The results are sorted by the number of hits and secondarily by gene name. The end of the list looks like this:


oapA        1011   0.41  412   2.49  2513     (322, 90)
iga1 1695 0.38 650 2.34 3963 (517, 133)
HI1480 157 0.23 36 67.82 10647 (34, 2)


The first two elements are the gene name and the size (in amino acids, with some off-by-one errors).

The next four elements are:

the ratio of the number of positions with hits to the size
the number of positions with hits
the ratio of the number of total hits to the size
the number of total hits.

In parentheses at the end is an attempt at quantifying the distribution of hits in each gene. The first position in the tuple indicates the number of hits mapping between 10% to 90% of the gene length, and the second position ar those at the extreme ends of the gene. On the average, the ratio of these two numbers should be about 4:1.

This one is for nostalgia (link):


hemH         324   0.26   84   1.00   323     (61, 23)


We notice that the very last gene on the list seems to be a hot spot for the transposon. At the beginning of this list are genes that are certainly essential. For example:


def          170   0.01    1   0.01     1     (1, 0)
dnaA 455 0.00 1 0.00 1 (1, 0)
engA 505 0.00 1 0.00 1 (1, 0)
fabI 263 0.00 1 0.00 1 (0, 1)
gptA 156 0.01 1 0.01 1 (1, 0)


These are all of a reasonable size, with only one hit. At a guess, these "hits" could be reads that were mis-called against the genome. And then there are ones in the middle like this:


pta          712   0.01    8   0.05    33     (1, 7)
..
adk 215 0.04 9 0.18 38 (0, 9)
..
murD 438 0.03 11 0.09 38 (1, 10)


These have a number of hits but the distribution is skewed to the ends of the gene (probably the distal end). The inside hits are probably mis-calls.

Next time, we'll do some histograms for individual genes. Also, I'd like to reproduce Fig 2 from the paper. That will require processing the data from the library after it's gone through selection.

Finally, there are a number of genes with no hits at all. If they are of a reasonable size, these are also quite likely to be essential. Here are the ones that aren't HI.. types (lots of ribosomal proteins):


acpP          77
ddl 307
folP-1 276
folP-2 276
frr 186
ftsH 636
glmM 446
lpxK 333
mviN 511
pgk 387
pheS 330
ribH 158
rpS11 130
rpS7 157
rplC 209
rplE 143
rplF 178
rplL 124
rplM 143
rplN 124
rplO 145
rplR 118
rplU 104
rplV 111
rplW 100
rplX 104
rpmA 86
rpsC 236
rpsD 207
rpsE 167
rpsF 126
rpsH 131
rpsI 131
rpsJ 104
rpsL 125
rpsM 123
rpsN 102
rpsO 90
rpsP 83
rpsS 92
secE 107
tRNA-Arg-1 77
tRNA-Glu-1 76
tRNA-His-1 76
tRNA-Ile-2 77


Zipped script files on Dropbox (here).