Friday, February 26, 2010

Why random mutation does not follow the Poisson?


In the simulation of sequence evolution from last time, there is a connection with the Poisson distribution that was also a subject of recent posts (here and here).

Consider an individual position in the sequence, say the first nucleotide. If we ask the question, what is the probability that the initial T has changed to another nucleotide after some number of trials, you can see that this should be described by the Poisson distribution. We have many cycles, and for an individual cycle the probability that the first nucleotide will change is very low (p = 0.01); after 100 cycles, on the average, the mean number of changes is 1. Therefore, we expect the number of unchanged nucleotides to be:

P(k) = λk / k! * e
P(k=0) = λ0 / 0! * e
P(k=0) = e= e-1 = 0.368


So the Hamming distance should be 1 minus 1/e, which is 0.632. But inspection of the plot, or printing out the values for each run:

   5%   0.047  A 252, C 251, G 248, T 249
10% 0.09 A 256, C 246, G 254, T 244
50% 0.366 A 246, C 260, G 248, T 246
100% 0.561 A 254, C 270, G 251, T 225
200% 0.713 A 241, C 280, G 245, T 234
5% 0.05 A 250, C 247, G 248, T 255
10% 0.097 A 245, C 252, G 252, T 251
50% 0.351 A 235, C 270, G 245, T 250
100% 0.557 A 251, C 249, G 258, T 242
200% 0.696 A 228, C 242, G 301, T 229
5% 0.047 A 252, C 243, G 252, T 253
10% 0.094 A 248, C 240, G 258, T 254
50% 0.342 A 262, C 240, G 241, T 257
100% 0.538 A 257, C 239, G 257, T 247
200% 0.692 A 257, C 246, G 258, T 239


shows that the actual value is more like 0.56.

What's going on?

As the number of mutations increases, there is an increase in the probability of targeting a position that was already mutated in a previous round. This is what we were saying before, that the non-linearity is due to this effect. Now we need to recognize that back mutation can occur, restoring the original nucleotide. This depresses the mutation frequency below that predicted from the Poisson distribution. To see that this explanation is correct, we can modify the code to mark the changed bases as lower case.

def JC_rates():
#return { 'A':'CGT', 'C':'AGT','G':'ACT', 'T':'ACG' }
return { 'A':'cgt', 'C':'agt','G':'act', 'T':'acg',
'a':'cgt', 'c':'agt','g':'act', 't':'acg' }


With this single difference, we now obtain the values predicted by the Poisson.

 100%   0.636  A 88, C 93, G 87, T 96
100% 0.634 A 91, C 97, G 86, T 92
100% 0.629 A 95, C 96, G 95, T 85


How about that?