Python for Bioinformatics: Exploring mitochondrial genomes 2

Thursday, December 16, 2010

Exploring mitochondrial genomes 2

Continuing with the problem from the last post (here), we want to construct a plot similar to what the ape book has on p. 60 (Fig 3.5), which shows the nucleotide composition of the mitochondrial genes from the species we're looking at. The code shown below is in the same script with the listing from last time, and depends on it.

In the first part, we organize the sequence data into a dict keyed by gene name, where the value is a list of all the sequences for that gene in the data. In the second part we go through that dict, pull out the sequences for each of our target genes and count all the A, C, G, T. The data structure is a nested one: a dict of dicts. The third part is somewhat more complicated. We want to plot not the total counts for each nt, but the fraction of each. The first line does the calculation:

    values = [nD[nt]*1.0/S for nt in 'atcg']
    L = [sum(values[:j]) for j in range(1,5)]
    for z,c,f in zip([4,3,2,1],colors,L):
        b = plt.bar(i+1,f,width=1,color=c,zorder=z)

I chose the ATCG because then we can see at a glance what the GC content is for each gene. If we wanted to get slightly fancy we could sort on that. We're going to do a bar plot, so the value for each bar is not its fraction, but its fraction plus the sum of the fractions of all the nucleotides plotted previously to it. The plt.bar call looks a little convoluted: the purpose is to plot the bars in the correct z-order. In reality, we plot the bars one on top of the other, but we don't see what's underneath. We also tweak matplotlib into plotting the gene names as the tick labels.

The figure could use a key relating the colors to the nucleotides, but I'll leave that as an exercise for the reader.

Code listing:

D = dict()
targets = ['ATP6','ATP8','COX1','COX2','COX3',
           'CYTB','ND1','ND2','ND3','ND4','ND4L',
           'ND5','ND6']
for gene in targets:  D[gene] = []

for item in rest.strip().split('\n\n'):
    if 'Sequence does not exist' in item:
        continue
    title,seq = item.strip().split('\n',1)
    i = title.find('(')
    j = title.find(')')
    gene = title[i+1:j]
    D[gene].append(seq)
#---------------------------------

rD = dict()
for k in targets:
    nD = dict(zip('acgt',[0]*4))
    for seq in D[k]:
        for nt in 'acgt':
            nD[nt] += seq.count(nt)
    rD[k] = nD
#---------------------------------

colors = ['0.2','0.4','0.6','0.8']
iL = list()
for i,k in enumerate(targets):
    iL.append(i+1)
    nD = rD[k]
    S = sum(nD.values())
    values = [nD[nt]*1.0/S for nt in 'acgt']
    L = [sum(values[:j]) for j in range(1,5)]
    for z,c,f in zip([4,3,2,1],colors,L):
        b = plt.bar(i+1,f,width=1,color=c,zorder=z)

ax = plt.axes()
ax.set_xlim(1,len(targets)+1 )
ax.xaxis.set_ticks([i+0.5 for i in iL])
ax.xaxis.set_ticklabels(targets)
plt.savefig('example.png')