Saturday, November 27, 2010

Neighbor-joining in Python: doing the plot


As discussed last time, I'm using Python lists to hold the information for each node on a phylogenetic tree. This seems as if it will have significant advantages, especially when it comes to tree "surgery." I'll leave that for another post. What I want to do today is actually draw a tree rooted at each of the internal nodes. Here is our example:

example_tree = { '0':['A','B','1'],
'1':['C','0','3'],
'2':['E','D','3'],
'3':['F','1','2'] }

Code to traverse the tree was shown last time (here). I've modified it slightly to use a dictionary. Previously we had this representation for each node and its parent (in the traversal from a particular root):

A:0, C:1, B:0, E:2, D:2, F:3, 1:root, 0:1, 3:1, 2:3

Now these keys and values are in a dict. The results of the traversals from each internal node as root are:

A:0, C:1, B:0, E:2, D:2, F:3, 1:1, 0:1, 3:1, 2:3
A:0, C:1, B:0, E:2, D:2, F:3, 1:0, 0:0, 3:1, 2:3
A:0, C:1, B:0, E:2, D:2, F:3, 1:3, 0:1, 3:3, 2:3
A:0, C:1, B:0, E:2, D:2, F:3, 1:3, 0:1, 3:2, 2:2

The root node is now marked by having itself as its "parent."

The drawing code works by building a repr (representation) for each node from the farthest tips working up to the root. The method was inspired by PyCogent's ancestors() function, which returns a list of ancestors (naturally enough). The output with debug enabled looks like this:

ancestors
E ['2', '3', '1', '0']
D ['2', '3', '1', '0']
F ['3', '1', '0']
2 ['3', '1', '0']
C ['1', '0']
3 ['1', '0']
A ['0']
B ['0']
1 ['0']
0 []

I simply sort on the length (largest first) and work in that order. So E's repr is just 'E'. When we process the first i_node (2), it's repr will be (E,D)2. If we're using branch lengths, it will be: (E:2.25,D:2.75)2. The distance data is in a separate dictionary:

dist = { ('A','0'):1.0,
('B','0'):4.0,
('C','1'):2.0,
('D','2'):2.75,
('E','2'):2.25,
('F','3'):4.75,
('0','1'):1.0,
('1','3'):1.25,
('2','3'):0.75 }

Here's the resulting Newick tree:

(A:1.00,B:4.00,(C:2.00,(F:4.75,(E:2.25,D:2.75)2:0.75)3:1.25)1:1.00)0;

One of the plots is shown at the top, a second one is below. Although I had it working before, be advised there's a bug somewhere that blows up R when I try to plot the tree as 'unrooted.' So now my to_do list includes the challenge of fixing that. [UPDATE: It seems to only happen when calling R from Python using RPy---probably related to the use of a Python keyword type as an argument to the R function. ] Zipped project files are here. And a final note, I've included the upgma code from before. However, we processed the nodes with a different algorithm there, so the drawing code is different, and I haven't tried to convert it to the new method yet.