Python for Bioinformatics: November 2009

Monday, November 30, 2009

Timing complementary DNA

I received an email the other day from someone in Tunisia, letting me know that he is also posting about Python and Bioinformatics (example). This particular post is about the very simple problem of generating the "bottom" strand from the "top" strand, i.e. substituting each complementary base in a list or a string representatation of a DNA sequence. There are several ways to do this, and I decided to test them for speed, recognizing of course that "premature optimization is the root of all evil" (Knuth is cited here but not referenced).

I used the standard Python module timeit. The class initializer takes two strings of "stmt" or code, and "setup", and has a method of the same name that takes the number of repetitions as an argument.

I tested four approaches:

• if ... elif ... elif ... elif
• a list comprehension with a dictionary lookup
• sequential string replacement dna = dna.replace(n1,n2)
• use of the string module method maketrans

The results for 10,000 reps were as follows:

method    time  ratio
if elif  5.427  108.9
dict     2.026   40.6
replace  0.463    9.3
trans     0.05    1.0

import timeit

s = '''
import string,random
L = list('ACGT')*250
L2 = list()
random.shuffle(L)
D = {'A':'T','C':'G','G':'C','T':'A'}
dna = ''.join(L)
'''

c1 = '''
for nt in L:
    if nt == 'A':  L2.append('T')
    elif nt == 'C':  L2.append('G')
    elif nt == 'G':  L2.append('C')
    elif nt == 'T':  L2.append('A')
'''

c2 = '''
L2 = [D[nt] for nt in L]
'''

c3 = '''
dna = dna.replace('A','T')
dna = dna.replace('C','G')
dna = dna.replace('G','C')
dna = dna.replace('T','A')
'''

tt = '''
tt = string.maketrans('ACGT','TGCA')
'''

c4 = '''
string.translate(dna,tt)
'''

results = list()
for i,c in enumerate([c1,c2,c3,c4]):
    if i == 3:  s = s + tt
    t = timeit.Timer(c,s)
    results.append(t.timeit(int(1E4)))

m = min(results)
for r in results:
    print round(r,3), 
    print str(round(r/m,1)).rjust(6)

Wednesday, November 25, 2009

Bridging from Objective-C into Python

Here is my code following bbum's implementation of a worked out example of how to call Python from Objective-C. This follows up on a question I asked on Stack Overflow.

An important part of the approach is that although we want to do this:

ObjC => Py

and we define an Abstract class in Objective-C from which the Python class inherits, so we have this:

ObjC => Py Concrete (ObjC Abstract)

we actually start from Python. This gives us the running interpreter:

Py => ObjC => Py Concrete

But the essential aspect of this method, that was really new to me, is a pair of functions that take the name of a class (as a string), and return the class object itself. In Python (using objc):

objc.lookUpClass("Abstract")

And in Objective-C:

Class c = NSClassFromString(@"Concrete");

I probably should have known about the first one. After all, it is mentioned very early in the Userguide for PyObjC, but after some searching, it is not clear to me where the latter is defined, let alone documented.

I have stripped bbum's example down a bit, and also, just pass in a dictionary to the class we will instantiate in Python.

Part One

an Objective-C class---Abstract. It contains a class method that will allocate and return an instance. This method takes a dictionary that can hold any arguments we want to pass in.

#import <Cocoa/Cocoa.h>

@interface Abstract : NSObject {
    NSMutableDictionary *mD;
}
@property (retain) NSMutableDictionary *mD;
+ myInstance: (NSMutableDictionary *)aD;
- (NSMutableDictionary *)report;
@end

#import "Abstract.h"

@implementation Abstract
@synthesize mD;

+ myInstance: (NSMutableDictionary *)aD {
    return nil;
}
- (NSMutableDictionary *)report {
    return [self mD];
}
@end

I put this code into files Abstract.h and Abstract.m in a new Xcode project that builds a bundle (what the docs call a "loadable bundle"). Make sure you select the Cocoa framework to link against. Select Build from the Build menu (the Toolbar Item is not enabled). I copied Abstract.bundle to the top-level directory that contains the Python script which will use it.

Part Two

a Python class---Concrete.py. The first section loads the Abstract bundle. This somehow makes the class definition "available" to the Objective-C runtime, so we can use a the handy method lookUpClass to actually load it. This allows us to subclass Abstract in Python.

import objc
from Foundation import NSBundle

p = '~/Desktop/ObjCPy/Abstract.bundle'
bundle = NSBundle.bundleWithPath_(p)
if not bundle.principalClass():
    print "%s: failed to load bundle." % p

Abstract = objc.lookUpClass("Abstract")

class Concrete(Abstract):

    @classmethod
    def myInstance_(self, D):
        print "creating Python instance"
        c = Concrete.new()
        self.mD = D
        self.mD['lastName'] = 'MacGillicuddy'
        return c
    
    def report(self):
        return self.mD

Part Three

another Objective-C class---Caller. This is in a separate Xcode project that is built exactly as above. The code couldn't be simpler.

#import <Cocoa/Cocoa.h>
@interface Caller : NSObject {
}
+ (NSString *) callPython;
@end

#import "Caller.h"
#import "Abstract.h"

@implementation Caller
+ (NSString *) callPython;
{
    NSMutableDictionary *temp;
    temp = [NSMutableDictionary dictionaryWithCapacity:5];
    [temp setObject:@"Meredith" forKey:@"firstName"];
    
    Class c = NSClassFromString(@"Concrete");
    AbstractClass *obj = [c myInstance:temp];
    NSLog(@"Created instance: %@", obj);
    NSLog(@"OC dict: %@", temp);
    NSLog(@"obj mD %@", [[obj mD] description]);
    NSLog(@"report %@", [[obj report] description]);
    return [NSString stringWithFormat: @"Did it!"];
}
@end

The class contains a single class method that calls the Python class method myInstance. We set one key-value pair in a dictionary on the Objective-C side, then pass it into Python. The handy method NSClassFromString makes the Python class available. We try two approaches to access the dictionary: via the getter method that is set up for us via @property and @synthesize (see Abstract), and direct access via report instead, breaking encapsulation.

Part Four

a Python script that sets everything in motion.

import sys
import objc
from Foundation import NSBundle
import Concrete

p = '~/Desktop/ObjCPy/Caller.bundle'
bundle = NSBundle.bundleWithPath_(p)
if not bundle.principalClass():
    print "%s: failed to load bundle." % p
        
caller = objc.lookUpClass("Caller")
print caller
result = caller.callPython()
print result

And the output:

$ python script.py 
<objective-c class Caller at 0x1030fb150>
creating Python instance
Python[302:d07] Created instance: <Concrete: 0x10361fdc0>
Python[302:d07] OC dict: {
    firstName = Meredith;
    lastName = MacGillicuddy;
}
Python[302:d07] obj mD (null)
Python[302:d07] report {
    firstName = Meredith;
    lastName = MacGillicuddy;
}
Did it!

The Caller class (ObjC) adds one key-value pair to the dictionary, while the Concrete class (Python) adds the second, and the dictionary is still available from the returned Python instance. One thing: the getter implemented via @property does not seem to work from Python, but an explicit report method does.

The Python Concrete class could then call any further Python code you want. Neat! Thanks, Bill.

Saturday, November 21, 2009

Objective-C calls Python names (3)

I'm working my way through some code that was gifted to me here. It explains how to call Python from Objective-C. This first two posts are here and here. The overall scheme is that we do:

Python => ObjC => Python => ObjC.

The code in the Python module does several things:

• loads the bundle containing the AbstractClass class, written in Objective-C (given its location from a path passed in on the command line)

• first calls objc.lookUpClass("AbstractClass")---so we obtain the definition of the abstract superclass, compiled and saved in the bundle

• provides the Python implementation for ConcreteClass, which inherits from AbstractClass.

• also looks up a second Objective-C class that will run things, and calls a function defined in it. This gets everything rolling in Objective-C. We come back to Python from there.

This is a somewhat unusual Python module.

One thing is that the "bundle" we load is never explicitly used. It is just available to the Objective-C "runtime" so that we can do objc.lookUpClass(). [Oops: beyond checking to see whether the bundle was loaded, that is!] More about lookUpClass another time.

There is also the fact that ConcreteClass has no __init__ function. Instead, there is a class method (in part):

    @classmethod
    def namedInstance_(self, aName):
        print "creating %s" % aName
        newInstance = ConcreteClass.new()

corresponding to this stub in the superclass:

+ namedInstance: (NSString *) aName

(which, as you remember, returned nil). So, I think what we are doing here is using PyObjC to go back and instantiate our Python class on the Objective-C side of bridge using new, which is, I suppose, where it has to be.

At the end of the module we do:

objectiveCCode = objc.lookUpClass("ObjectiveCCallingPython")
print objectiveCCode.callSomePython()

This brings us to the third class: ObjectiveCCallingPython. It uses an Objective-C call I've never heard of:

NSClassFromString()

I'll look more at these two string to class methods in another post. But the outline of what is going on is this:

• From the command line, we call the Python interpreter and give it the name of a Python module as the script to execute, as well as the path to a bundle containing two Objective-C classes.

• The Python code loads the bundle and (in some way not clear to me yet) uses that information to get two necessary class definitions. There is some magic in objc.lookUpClass().

• We then call back to Objective-C with: objectiveCCode.callSomePython()

• This Objective-C code calls a "factory" method to instantiate the Python-defined class using the PyObjC defined new method and do stuff with it from Objective-C.

Got that?

I am beginning to understand bbum's exasperation with me. On the very first page of the Userguide for PyObjC it says: "Using PyObjC is pretty simple, as it strives to make the bridge completely transparent. On the lowest level you can import the 'objc' module and use that to locate classes (objc.lookUpClass)."

There is also a fairly old tutorial.

And googling with objc.lookUpClass turned up someone who was more successful at following the hints than I was.

Objective-C calls Python names (2)

I'm working through an example (from Bill Bumgarner) of calling Python from Objective-C---at least I think that's what we're doing. Who is calling whom is a bit circular, as you will see.

The first class to consider is an Objective-C class, and it's an abstract class. That is, it is not designed to be instantiated as is but should instead be extended or subclassed, and the subclass instantiated. The subclassing stuff will happen in Python.

The class definition appears a little complicated, but it turns out that the complications are not essential to the problem. The big picture is that we declare an instance variable and methods, and these methods will later get new definitions in Python. Just in case we forget, the methods in this (abstract) class call what looks like a very strange function:

void SubclassResponsibility(id classOrInstance, SEL _cmd) {
    NSString *reason = [NSString 
        stringWithFormat: @"Must subclass %s and override the method %s.",
            object_getClassName(classOrInstance), sel_getName(_cmd)];
    @throw [NSException exceptionWithName: @"SubclassResponsibility"
                                   reason: reason
                                 userInfo: nil];
}

This guy is not part of the @implementation, but is at the head of the file. Not sure what that means. I guess it's just a function whose definition is available to us now. Why is it done this way?

SubclassResponsibility is called in each of the "stub" methods, like this:

SubclassResponsibility(self, _cmd);

Hmm... Then I remember, there are not one but two special silent arguments that are available, that the Objective-C runtime sends to a function: self, and the name or more accurately, the selector, which is called _cmd. So we pass this on to SubclassResponsibility in order to let the errant programmer know which function he failed to implement in the subclass. And then throw an exception. And Bill tells us that---way back when---the forerunner of NSObject used to declare this method.

So that's it. An abstract Objective-C class called AbstractClass with methods that will be redefined later, and it will also be compiled into a "bundle."

Oh yes, there is a a class method:

+ namedInstance: (NSString *) a Name

which here at least doesn't seem to return anything...but will return an instance from Python.

Objective-C calls Python names

I have a new (short?) project to work on that combines Python and Objective-C, and I hope it will improve my understanding of the latter.

While cruising Stack Overflow yesterday I came upon this question and answer. The issue is how to "bridge" from code in one language to another. It's easy going from Python to Objective-C, you can even do it from the command line:

>>> from Foundation import *
>>> A = NSArray.arrayWithObjects_('a','b',None)
>>> A
(
    a,
    b
)
>>> A.objectAtIndex_(1)
'b'
>>>

But how to go the other way, back from Objective-C into Python? Obviously, one could launch Python as a separate process, and have it write the results to disk or something (I don't know offhand how to do this from Objective-C, I'll put it on the to-do list). But we're looking for tighter integration than that.

According to bbum it's easy, you just do this:

To call from Objective-C into Python, the easiest way is to:

• declare an abstract superclass in Objective-C that contains the API you want to call
• create stub implementations of the methods in the class's @implementation
• subclass the class in Python and provide concrete implementations
• create a factory method on the abstract superclass that creates concrete subclass instances

Well, 3/4 make sense to me and the fourth sorta does! But where do you get the interpreter from?

My PyObjC Xcode projects might provide a model for this. But they get back to this problem: a lot of what I've done with Xcode, and PyObjC, is magic to me. I'd like to understand it better, but it seems hard. For example, recent projects have a file main.m which appears to start everything off, but I have no idea how it works. The older projects (2 yr) do not have a main.m. How do they work?

So, what I did, I posted a new question at Stack Overflow.

And bbum answered! Actually, Barry Wark did too (thanks). Also a ~~yahoo~~ helpful soul who told me about this thing called Google and PyObjC, but didn't actually read the question. Bill sounded a little exasperated, but he posted with a working solution at around 11 PM on a Friday night. That is special.

So, I have carried down the tablets from the mountaintop, and I'm going to figure it out, and I'll let you know. But just to look ahead a bit: in his solution the way you get the Python interpreter is to (wait for it) start in the interpreter. Well, of course!

And in case you try this at home, when you open the project in Xcode the "Build and Run" toolbar icon is (what's the word?) inactive. But the menu item works.

Tuesday, November 17, 2009

Hidden Markov Models (6)

I've written a Python script to make an HMM for the "occasionally dishonest casino." There is nothing remarkable about the code. Here is a link.

It uses the MarkovHelper module from here, modified a bit. Because long chains yield very low likelihoods, there is a function to convert the probabilities to logs:

def logs(self):
    def log2(x):  return log(x)/log(2)
    fL = [log2(n) for n in self.fairProb]
    lL = [log2(n) for n in self.loadProb]
    D = dict()
    for k in self.tranProb:
        D[k] = log2(self.tranProb[k])
    return fL, lL, D

There are two points where you can make edits to get more verbose output. Here is an example of output from a model with both transition probabilities set to 0.2. The top line in each group is the sequence of states, then the observed data, followed by the results from the Viterbi algorithm.

fraction correct: 0.6692
0 .. 50
LLFFFFFFFFFFFFFFLLLLFFFFFFFFFFFLLLLFFFFLFLLFLLLLLF
65466142352514466666314623515236466525163311362664
LLLLLFFFFFFFFFFLLLLLFFFFFFFFFFFLLLLFFFFFFFFFFLLLLF

50 .. 100
LLLLLLLLLFFFFFFFLLLLFFLLFFFLLLLFFFFLLFFFFLLLLLFFFF
43264622635463146366241654162162245146442626631643
FFFFFFFFFFFFFFFFLLLLFFFFFFFFFFFFFFFFFFFFFLLLLLLLLL

100 .. 150
LLLLLLFFLLLLFFFLLLFFFFFFLLLFLLLFFFLLLLFFFLLLLLLLLF
63421525664134456562136112561166465362562126526622
LFFFFFFFFFFFFFFFFFFFFFFFFFFLLLLLLLLLLLLLLLLLLLLLLL

150 .. 200
LLLLLLLLLLLLLLLLFLLLLLLFLLLLLLLFFFFFFFFFFFFLLLLFFF
66632561666263461461635121666361266546114312626611
LLLLLLLLLLLLLLLLLLLLLFFFFFLLLLLLLLLLLLFFFFFFLLLLLL

At first, this looks promising. But remember, we tell the model what all the probabilities are. So, knowing that both states are equally likely, random guessing would give 50% correct. For transition probabilities of 0.2 and 0.1, the HMM scores 0.72, 6 points higher than random. With transition probabilities of 0.2 and 0.01, the model score 0.952 correct. At least that is no worse than random!

Sunday, November 15, 2009

Python bytes

About ten days ago I posted about looking at the passwords stored in /etc/xgrid/agent/controller-password and similar files.

The code I put up is rather silly. It uses a hard-coded dict to translate bytes (in string rep) to hexadecimal. This is not worth spending too much time on, especially since Python 3 has a whole 'nother attitude about strings and bytes, but I thought I would at least show a simple and more correct (I hope) Python 2 approach to this issue.

So, of course we have bits and bytes on the machine, and strings exist only on-screen or paper. We can represent bits and bytes as integers, or as chars, and vice-versa. I'm sure everyone knows we can go from int to chr and back again:

>>> chr(78)
'N'
>>> ord('P')
80

My understanding is that we should view integers as the natural intermediate form for conversion of bits and bytes from base 2 to other bases.

>>> bin(15)
'0b1111'

In this representation the binary number 1111 is an int (15) or its string representation ('0b1111'). Python also has string reps for hexadecimal and octal:

>>> hex(15)
'0xf'
>>> oct(15)
'017'

We can go from binary or hex back to int, but we need to specify the base:

>>> int('0b1111',2)
15
>>> int('0xf',16)
15

We don't actually need the leading '0x' or '0':

>>> int('f',16)
15
>>> int('17',8)
15

So, the other day I should have just done:

>>> bin(int('0xf',16))
'0b1111'

When reading data from a file:

FH = open('script.py','rb')
data = FH.read(8)
FH.close()

print type(data)
print len(data)
print data

<type 'str'>
8
from bin

Although the file was opened in "binary" mode, the type actually read was <'str'>, and when the data are printed, it looks like a string. Nevertheless, the data do respond well to a function that operates on binary data and converts it to a hexadecimal string representation.

from binascii import *
L = [b2a_hex(b) for b in data]
print L
L = [int(h,16) for h in L]
print L
print [chr(i) for i in L]

['66', '72', '6f', '6d', '20', '62', '69', '6e']
[102, 114, 111, 109, 32, 98, 105, 110]
['f', 'r', 'o', 'm', ' ', 'b', 'i', 'n']

The result is rather different if we use the same function on the data as a whole:

L = b2a_hex(data)
print len(L)
for i in range(0,len(L),2):
   h = L[i:i+2]
   print h,
   print chr(int(h,16))

16
66726f6d2062696e
66 f
72 r
6f o
6d m
20 
62 b
69 i
6e n

In this case, the 8 bytes are converted to 16 hexadecimal characters, and to do the conversion to ints and chars we must read 2 char chunks of the hexadecimal.

Does that make sense?

Saturday, November 14, 2009

Hidden Markov Models (5)

Building on the prrevious post in this series, we have an HMM with two states G (good) and B (bad) weather, that have characteristic emission and transition probabilities. Let's add a begin state to the model, with P = 0.5 for each possible transition.

Now, use the Viterbi algorithm to compute the most probable sequence of hidden states corresponding to some observed data. Suppose the observations are Sun-Rain-Sun. We begin with Sun. Make a table with columns for the observations and rows for the states. For the first column and row, we compute the product of the transition probability times an emission probability. This product is proportional to the probability of reaching the state, but it is not a real probability since the combined values don't add up to 1 (these are likelihoods).

Now, we add the second observation, Rain, in the second column.

Work right to left. For the G state there are two possible paths to get there, either from the previous G or from the previous B. Compute first the transition probability for B -> G times the previous cumulative probability for state B. Also compute the transition probability for G -> G times the previous cumulative probability for state G. Compare the two and take the maximum, that is, G -> G, multiply by the emission probability (Rain from G) and put the result at the bottom of the box.

Draw a horizontal arrow to remember how we got here (the traceback). Continue with the other possible state for this observation, in the second row, second column.

Notice the correspondence with the first row: probabilities on the left of the two terms in parentheses add to 1, while the cumulative probabilities being carried forward are the unchanged. In this case, the maximum value corresponds to G -> B, so we draw a diagonal arrow.

And so on:

Now, look at the right-hand column and pick the state with the highest cumulative likelihood. Follow the traceback to figure out how we got here: GBB is the most probable sequence of hidden states.

[ Two math errors were spotted by a sharp-eyed reader that alter the results in the last part, and change the most probable sequence to GBG. Sorry for the confusion. I don't have the slides anymore to edit them easily so I'll leave it as it is. ]

Hidden Markov Models (4)

In the last post I described a Python class that we will use in the future to explore the dynamic programming algorithms that are important with HMMs. But before we continue with the Python, I want to go through an example of the first of these methods, the Viterbi algorithm, which is named for Andrew Viterbi. It is applied in the case where we have settled on a model and its characteristic probabilities (the parameters), and we want to calculate:

• the most likely sequence of states given some observed output

The example I'm going to use is the weather, but with a twist from the previous example of weather by telephone. Here the weather will be the observed data. The underlying hidden states of the model are two: G (a propensity for good weather) and B (bad weather is likely). G and B can both emit either sun or rain, with characteristic emission probabilities. At each step in the chain, the states can also switch with some characteristic transition probability. (Or, stay the same, with P = 1 - P(T)).

Below is a series of observations.

For a series of n observations, in an HMM with 2 states, there are 2**n possible state sequences, each of which has some probability of emitting the observed data. Given how unbalanced the emission probabilities are for our current model, it is pretty obvious that the first state sequence above is more likely than the second one. Now, if the chain is short enough, we can calculate the probabilities for each possible sequence exactly. For example:

We observe Sun-Rain-Sun. We calculate:


GGG  0.90 * 0.60 * 0.10 * 0.60 * 0.90 = 0.02916
GGB  0.90 * 0.60 * 0.10 * 0.40 * 0.20 = 0.00432
GBG  0.90 * 0.40 * 0.80 * 0.30 * 0.90 = 0.07776
GBB  0.90 * 0.40 * 0.80 * 0.70 * 0.20 = 0.04032
BGG  0.20 * 0.30 * 0.10 * 0.60 * 0.90 = 0.00324
BGB  0.20 * 0.30 * 0.10 * 0.40 * 0.20 = 0.00048
BBG  0.20 * 0.70 * 0.80 * 0.30 * 0.90 = 0.03024
BBB  0.20 * 0.70 * 0.80 * 0.70 * 0.20 = 0.01568

So, as one would expect, GBG is most likely. In a real problem, there will be more states, and the probabilities will not be so skewed. That calls for the Viterbi algorithm. I'll show it (for this problem) in the next post.

Friday, November 13, 2009

Hidden Markov Models (3)

This post is about the "occasionally dishonest casino" from Durbin et al., Biological Sequence Analysis. To save effort, I'll just use my slide from class:

We're going to model this system in Python, starting with two sets of dice with the requisite probabilities, and extending on into HMM analysis. Here is typical output for 100 rolls of the dice. We show the state above, and the output below:

FFFFFFLFLLLFFFFFFFFFFFFLFFFFFFLLLFFFLLLFLLLLLLLLLL
44434462641561516315232623134536312512613662626653

LLLFFFFFFFFFFFLLLFFFFFFFFFFFFFFFFFFFFFFLLLLLLLLLFF
56526642512146625665112531415535611163366666636225

Following, in segments, is the code for the Model class.

import random
outcomes = '123456'
    
class Model:
    def __init__(self, pFL, pLF):
        self.emit = list(outcomes)
        self.fairProb = [1.0/6] * 6
        self.loadProb = [0.1] * 5 + [0.5]
        self.tranProb = { 'F':pFL,'L':pLF }
                                
    def setState(self,which=None):
        if which:  self.state = which
        else:      self.state = random.choice('FL')

The initialization function is kept simple. It requires a value for the transition probability to switch states in either direction. The emission probabilities are hard-coded. The state will be set at the beginning of a run. Rolling the dice (below) is straightforward. We test the result of random.random() against the cumulative probability and emit the corresponding outcome.

    def rollDice(self):
        if self.state == 'F':  L = self.fairProb
        else:                  L = self.loadProb
        f = random.random()
        S = 0
        for i in range(len(L)):
            S += L[i]
            if f < S:  break
        return self.emit[i]

    def transit(self):
        def switch():
            if self.state == 'F':  self.state = 'L'
            else:                  self.state = 'F'
        p = self.tranProb[self.state]
        f = random.random()
        if f < p:  switch()

random.random() is used a second time to decide whether to switch on a particular transition. The object can generate sequences of as many states as needed, with a default of 50.

    def sequence(self,N=50):
        rolls = list()
        states = list()
        for i in range(N):
            rolls.append(self.rollDice())
            states.append(self.state)
            self.transit()
        return ''.join(rolls),''.join(states)
                
if __name__ == '__main__':
    m = Model(pFL=0.1,pLF=0.2)
    m.setState('F')
    L = [m.sequence() for i in range(2)]
    for r,s in L:   print s + '\n' + r + '\n'
    
    import diceStats as St 
    St.checkEStats(m)
    St.checkTStats(m)

It is useful to check that model performs as expected. The code for this is in another module, which we import here. But first, to motivate us, here is typical output for this model:

1   F 0.17    L 0.097
2   F 0.166   L 0.107
3   F 0.164   L 0.099
4   F 0.162   L 0.097
5   F 0.174   L 0.096
6   F 0.164   L 0.505

F 6748 0.098
L 3252 0.204

The different outcomes 1 through 6 are emitted with the correct probabilities. The last two lines list the number of times out of 10000 that the model was in the fair or loaded state. The ratio is determined by ratio of the transition probabilities, which are also determined.

Here is the diceStats module:

outcomes = '123456'

def checkEStats(m):
    N = 10000
    m.state = 'F'
    fairL = [m.rollDice() for i in range(N)]
    m.state = 'L'
    loadL = [m.rollDice() for i in range(N)]
    for o in outcomes:
        print o, '  F', 
        print str(round(fairL.count(o) * 1.0 / N, 3)).ljust(5),
        print    '  L', 
        print str(round(loadL.count(o) * 1.0 / N, 3)).ljust(5)
    print
            
def checkTStats(m):
    N = 10000
    L = list()
    for i in range(N):
        L.append(m.state)
        m.transit()
    totalF = L.count('F')
    totalL = L.count('L')
    D = { 'F':0,'L':0 }
    c1 = L[0]
    for c2 in L[1:]:
        if c1 != c2:  D[c1] += 1
        c1 = c2
        
    print 'F', totalF, 
    print str(round(D['F'] * 1.0 / totalF, 3)).ljust(5)
    print 'L', totalL, 
    print str(round(D['L'] * 1.0 / totalL, 3)).ljust(5)

Hidden Markov Models (2)

A couple of weeks ago I put up an introductory post about HMMs, and promised more. I got distracted for a bit, but now I'm ready to work on it again. Before we start, let's talk a little more generally about what they are about.

Here is Andrey Markov.

Working from back to front, an HMM is

• a model
• that forms a Markov "chain"
• where the model is hidden

The Markov "property" is that we have a chain of "states" of the model and a parallel set of observations. Any particular set of observations is linear, and for each of these there is a corresponding state. The next state to be visited depends only on the current state (or a small number of recent states). And it's hidden---we don't observe the state of the model directly.

This is very cryptic, how about a concrete example?

The weather has a long history in HMM explanations (according to Dave Swofford's slides that I learned so much from, he heard about it from from some people named Rune Lyngsø (I'm guessing: this guy) via Bjarne Knudsen (bio on this page)). To get our feet wet, let's play a game called "weather by telephone."

Imagine that you and your mother live in different places, but you are a dutiful son. You call every day and she tells you about her activities. She does one of three things:

• walk
• shop at the indoor mall next door
• play scrabble.

She hates to talk about the weather, but she tells you what she's done today, and you try to guess the weather. The observed values are the chain of walk-shop-scrabble. The HMM is the weather, which we suppose has two states: sun and rain. You have to guess the state.

The model has probabilities for emission and transition. For example, if it's sunny the probability is very high that your mom will walk, whereas if it's rainy, she's most likely to stay home and play scrabble. Also, one sunny day may be more likely to be followed by another sunny day, rather than change states. We have states with transition probabilities connecting them, and for any given current state, it emits an observation drawn from a number of possible choices governed by emission probabilities that are state-specific.

Now that you're all comfortable with this analogy, I'm going to switch things around in a later post---the weather will be observed and depend on some underlying hidden state. But in the meantime, you can go back to the first post and see whether this makes sense to you in terms of the model for a functional transcription factor binding site, and the observed sequence in a particular candidate site.

Thursday, November 12, 2009

Duly quoted

Cargo cult science

I mentioned this the other day, but it deserves its own post:

During the war they saw airplanes land with lots of good materials, and they want the same thing to happen now. So, they've arranged to make things like runways, to put fires along the sides of the runways, to make a wooden hut for the man to sit in, with two wooden pieces on his head like headphones and bars of bamboo sticking out like antennas---he's the controller---and they wait for the airplanes to land. They're doing everything right. The form is perfect. It looks exactly the way it looked before. But it doesn't work. No airplanes land. So I call these things cargo cult science ...

---Richard Feynman in Surely You're Joking, Mr. Feynman

Xgrid: writeup

I have written an account of what I learned about Xgrid, and how I used it for a "proof of principle" running BLAST on a single machine grid. You can get it from here.

[ UPDATE: I ran into an issue with the links. The document was composed in Text Edit (rtfd format) and then saved as PDF. Many but not all of the links showed up correctly and were active in Text Edit, and showed up but did not work in the PDF. By looking at the source, I found that the ones which worked correctly had

{
  \field
  {
    \*\fldinst
    {
      HYPERLINK "http://telliott99.blogspot.com/2009/11/xgrid-baby-steps.html"
    }
  }
  {
    \fldrslt \cf2 \expnd0\expndtw0\kerning0 \ul \ulc2 
    http://telliott99.blogspot.com/2009/11/xgrid-baby-steps.html
  }
}

(my formatting)

A little testing showed that it's the \ul calls in the second section that make this a link in the pdf, the other stuff makes the text blue. So I added \ul \ulc2 to all links by hand and it seems to work, with the exception of a multi-line link to the Apple docs for Xgrid, which is split into two separate links and is now broken. If you need it, that link is here.]

Wednesday, November 11, 2009

Xgrid: finale

It is just too weird for words. I spent at least 10 hours between yesterday and today, trying everything I could think of to figure out why Xgrid was knackered.

The jobs I try to run in Terminal hang, while those I try to submit give an id which then yields nothing when I request results. They are visible in Xgrid Admin as "pending" but nothing can persuade them to change that status.

• I found out that xgrid logs to /var/log/system.log

Depending on what I've been doing, I get a variety of errors there. At one point I had (in part):

Nov 11 17:04:50 localhost xgridagentd[20]: Notice: agent connected to controller "localhost" address "127.0.0.1" port "4111"
Nov 11 17:04:50 localhost xgridagentd[20]: Warning: agent error opening connection to controller "localhost" (error = Close requested)
Nov 11 17:04:57 localhost xgridcontrollerd[19]: Notice: controller role changed to MASTER

As I say, a huge variety of errors and no consistency. Sometimes complaints about passwords, sometimes about databases. Sometimes things closing on their own. The kitchen sink.

It ~~is not~~ did not seem that the agent needs to be idle:

sudo cat /etc/xgrid/agent/com.apple.xgrid.agent.plist.default
 <key>OnlyWhenIdle</key>
 <false/>

(But note that this value does not track with what is set in System Preferences). And anyway, I activated the screensaver and let it run for 15 minutes. No jobs ran.

And it is not a password issue. I got rid of all password requirements using both Xgrid Sharing and XgridLite, and confirmed it by looking with my own eyes at the plist files.

But miracle of miracles, I sit down to write a final post telling Apple that it's over between us, and leave Xgrid Admin open on my Desktop. I look over from blogger and see this:

It fixed itself! I'm speechless.

Drew McCormack's new Python book

Drew McCormack of MacResearch has a new book (actually, two of them). The one that's available now is called 'Scientific Scripting With Python.' I downloaded the pdf from lulu (about $14). I'm a big fan of his tutorials titled Cocoa For Scientists, and I think it was partly to thank him for those that I bought the pdf.

It's a very nice, short and well-written introduction to Python. It is perfect for someone who has little experience with Python. I am a bit disappointed that there doesn't seem to be anything in it to help me grow (the "Scientific" part is lacking), though perhaps the last 20 pages holds something.

As I've said before, I really like Mark Pilgrim's book(s), including the latest. He's a talented writer (as is Drew) and he is serious about "diving in." He has a lot to say, so if you are comfortable with Python, there is a lot to be learned from Mark.

Other on-line resources I can recommend:
• the tutorial at python.org
• Allen Downey's Think Python

And, not strictly Python, but very useful
• Software Carpentry

Zelle Python Programming remains my favorite programming book.

P.S. If you do use lulu, don't be confused by the signup page. The fourth text field is for a user name, despite the label to the left and the prompt underneath for a "store name." It didn't help that Safari autofill put the http.. for lulu into the box automatically.

Tuesday, November 10, 2009

Xgrid: conclusions

I've been posting about my experiments with Xgrid on a plain Mac OS X 10.6 installation with my one-year old MacBook. I was getting to the point of closing out these experiments, because although I did have BLAST working on Sunday, it was only accomplished by using a security-compromising hack. I've concluded there is no way to use this for anything serious without Server.

The bottom line seems to be that you cannot believe the hype (see this shiny web page), which is entitled Xgrid: High Performance Computing for the Rest of Us. It contains 68 matches for xgrid and 0 matches for Server. They certainly do not advertise this.

To summarize, what I wanted to do is:

• activate Xgrid on my machine (no Server)
• run BLAST via Xgrid on my machine plus other lab machines

There are still are unresolved issues with starting up xgridcontrollerd from the command-line, although the Snow Leopard Server Tools utility Xgrid Admin helps with that, but these are not deal-killers. However, there are very serious issues with authentication and permissions which mean that you cannot use Xgrid as anything but a toy without OS X Server. There doesn't seem to be any reason that this needs to be so, it is just Apple's choice. The reason for this restriction is the sandbox (e.g. see here).

Adding insult to injury, today I started working to re-run the tests for the final posts (about BLAST), but none of the things that I did previously is working. I'm fully into Cargo Cult mode, messing with XgridLite and passwords, or trying to set them with the copy hack.

sudo cp /etc/xgrid/agent/controller-password \
 /etc/xgrid/controller/client-password

sudo cp /etc/xgrid/agent/controller-password \
 /etc/xgrid/controller/agent-password

Jobs submitted using xgrid make it to the controller, they just don't run.

It's like Feynman describes in the last chapter of "Surely You're Joking, Mr. Feynman"

During the war they saw airplanes land with lots of good materials, and they want the same thing to happen now. So, they've arranged to make things like runways, to put fires along the sides of the runways, to make a wooden hut for the man to sit in, with two wooden pieces on his head like headphones and bars of bamboo sticking out like antennas---he's the controller---and they wait for the airplanes to land. They're doing everything right. The form is perfect. It looks exactly the way it looked before. But it doesn't work. No airplanes land. So I call these things cargo cult science ...

The xgrid jobs just hang. It is not an authentication problem, at least not with the controller, because without the correct password, they are rejected with:

BEEPError 535 (Authentication failure)

The jobs are visible in Xgrid Admin, they're listed as pending. And they remain pending even with a stop and restart.

Perhaps it has to do with updating to 10.6.2 an hour ago...

Moving beyond, you may still be interested in playing with xgrid. You can run code installed in various places (even /Users/Shared). But to have that code open another file, you have ~~three~~ ~~two~~ choices:

• give user "nobody" the same access as "somebody"

These aren't options:

• run in what's called the "sandbox" as user "nobody"

• use Kerberos and run as "somebody"

Particularly since agent-to-controller authentication is not working, the first option (the only solution), while it allows testing, is not recommended. Option 2 doesn't work if you need a database, and option 3 requires Server.

What I did, based on the link:

sudo mv /usr/share/sandbox/xgridagentd_task_nobody{,.orig}.sb
sudo cp /usr/share/sandbox/xgridagentd_task_{some,no}body.sb

I was able to run BLAST using a database in /Users/Shared after doing this.

You win, Apple. How about sending me one of these?

Is this also just a toy?

Monday, November 9, 2009

Warren DeLano

I'm pretty sad today at the news that Warren DeLano died early last week. I love Pymol. He was very generous and helpful to me and I wish I'd really known him.

The image above is HemA, an enzyme we've studied. It's brighter in the full-scale image. The other images are ones I made for the Bioinformatics class I taught.

Lambda Repressor and its operator

Ribosomal protein S12 in a forest of 16S rRNA

I forget, but it sure is pretty

Roman numerals

I'm reading Mark Pilgrim's new version of Dive Into Python, and he spends some time with Roman numerals as an example for regular expresssions. (In fact, I like it, and him, so much I bought a dead trees copy). But the example piqued my interest. It's the sort of problem Python does really well. And it has the advantage of having a forward and backward transformation (to standard representation), which allows one to test all valid cases. Not too often you see that in a programming problem. It is a good challenge for a beginning Python coder.

  928     CMXXVIII    928
 1070         MLXX   1070
   30          XXX     30
  581       DLXXXI    581
 1976     MCMLXXVI   1976
  169        CLXIX    169
 1940        MCMXL   1940
  975       CMLXXV    975
  204         CCIV    204
 1044        MXLIV   1044

I know that code is boring, but I've decided to put it up (all 70 lines of it), in four parts. The strategy here is to separate the task in two sub-tasks: conversion between decimal and "verbose" Roman, and "compression."

The first part does what the comment says. No error checking:

def compress(s):
    # convert 'X'*4 to 'XV', etc.
    L = list('IVXLCDM')
    c = s[0]
    i = L.index(c)
    n = len(s)
    if n < 4:      return s
    if n == 4:     return c + L[i+1] 
    if n == 5:     return L[i+1]
    if 5 < n < 9:  return L[i+1] + c*(n-5)
    if n == 9:     return c + L[i+2]

It's a matter of personal taste, but I tend to use single letter variable names when I can. The clutter bothers me more than the chance for confusion. But I'm consistent (L is always a list, s a string, i an index). Actually, you don't need a list here, a string would be fine. But I wanted to use L since s was already claimed :)

The second part does some trivial error checking. Then it breaks down a decimal year into thousands, hundreds, etc, and adds the appropriate number of 'IXCM' to a list, after running them through compress. Rather than check if it's needed, everything goes through.

def decimalToRoman(n):
    assert type(n) == type(1),'not an int'
    assert 0 < n < 10000,'out of bounds %i' % n
    divs = [1000, 100, 10, 1]
    vals = list()
    for i in range(4):
        d = divs[i]
        vals.append(n/d)
        n = n%d
    rL = list()
    rNumerals = 'MCXI'
    for i,v in enumerate(vals):
        c = rNumerals[i]
        s = c * v
        if i and v:  s = compress(s)
        rL.append (s)
    return ''.join(rL)

The third function does the reverse transformation. It has a nested function definition for the reverse of the "compress" algorithm. There is a second function necessitated by the fact that there may be more than one compression in a given Roman numeral. (A bug in the first version!). In "workToDo", we return -1 when there is no more work.

def romanToDecimal(s):
    D = { 'I':1,  'V':5,
          'X':10, 'L':50,  
          'C':100,'D':500, 'M':1000 }
    def resolve(s,i):
        c = s[i]
        first = s[:i]
        last = s[i+2:]
        if s[i+1] in 'VLD': x = 4
        else:               x = 9
        s2 = first + c*x + last
        return s2    
    def workToDo(s):
        for i in range(len(s)-1):
            if D[s[i]] < D[s[i+1]]:
              return i
        return -1
    i = workToDo(s)
    while i != -1:
        s = resolve(s,i)
        i = workToDo(s)          
    n = 0
    for c in s:  n += D[c]
    return n

Finally, the test harness. We print a few random numbers for fun, and then test all the years from 1 to 10,000, going from decimal to Roman and back again.

def test():
    for i in range(10):
        d = random.choice(range(1,2000))
        print str(d).rjust(5),
        r = decimalToRoman(d)
        d = romanToDecimal(r)
        print r.rjust(12),
        print str(d).rjust(6)

    for d in range(1,10000):
        r = decimalToRoman(d)
        s = romanToDecimal(r)
        if not d == s:
            print 'error'
            print str(d).rjust(5), 
            print r.rjust(12), 
            print s.rjust(6)
test()

Xgrid: getting data in and out

My goal is to run BLAST on a small grid of lab machines to process Pyrosequencing data. Actually, I'm trying to get ready, even though we don't have funding to do the project yet. So far, I have a grid with a single machine in all three roles: client, controller and agent.

There are lots of resources available, see the list here. I described in that post how I started up the xgridcontroller daemon, and obtained Xgrid Admin which gives me a view of xgrid jobs. The daemon is visible in Activity Monitor (choose "Other User Processes", or from Terminal with ps -A. It's curious that it comes back after a re-boot. The Xgrid controller can be administered from the command line:

sudo xgridctl controller start
sudo xgridctl controller status
sudo xgridctl c stop
sudo xgridctl c off

or by using XGrid Admin. Note that the earlier examples used sudo for xgrid commands, but this is not necessary. In browsing the manual for xgrid, I discovered that you can do this (for my bash shell):

export XGRID_CONTROLLER_HOSTNAME=localhost
export XGRID_CONTROLLER_PASSWORD=abcd
xgrid -job run /usr/bin/cal

now I can skip the -h 127.0.0.1 -p <password> stuff in my commands.

Before we get to the subject of this post, let's talk a little bit about where Xgrid runs and what you're allowed to do. According to the Apple docs, the jobs run on the agent as user "nobody." The docs discuss users and permissions here, but don't say much about user "nobody" beyond that it "provide(s) minimal permissions."

According to this FAQ from the mailing list:

The second ("nobody") provides only very minimal privileges, since it assumes that the agent doesn't trust the client. This is the most common reason why jobs that attempt to read or write outside of, e.g., /tmp, will get a permission error.

And if you look at this file:
/usr/share/sandbox/xgridagentd_task_nobody.sb

there are a couple of fairly complicated regular expressions

(allow file-read* (regex "^/(bin|dev|(private/)?(etc|tmp|var)|usr|System|Library)(/|$)"))
(allow file-read* file-write* (regex "^/(private/)?(tmp|var)(/|$)"))

which I can't decipher completely but I interpret as restricting reading privileges for "nobody." And according to the mailing list entry, "the best solution to this problem is to enable Kerberos."

But we're getting ahead of ourselves. The question for today is, how do we get data and code to the agent and data back again? According to the Apple docs

You have the option of supplying an input file or a directory of files. If you supply an input directory, it is copied to each agent and becomes the working directory for the executable file.

In an example I used earlier, there is a one-line Python script with:
print 'Hello Python world!'
in a directory temp on my Desktop. Working from the Desktop directory I did:

xgrid -job run -in temp /usr/bin/python temp/script.py

Hello Python world!

As the docs say:

Important: You have the option of providing a relative path or an absolute path when specifying executable files, input files and directories, and output files and directories. When a relative path is used, the executable and the input files or directories are copied to the agents, and the output files or directories are created for every agent and collected by the controller. If you specify an absolute path to the executable, input, or output files or directories, those files are assumed to exist on the agent computers, or to be available to the agents as part of a shared file system, at the path location specified. They are not copied or created.

Executive summary:

• relative path: copied to the agent
• absolute path: assumed to exist on the agent

The version of the above example that I posted the other day and another here are subtly different. They provided a full path /Users/te/Desktop/temp to the temp directory, and which should be "assumed to exist on the agent computers, or..." What we want is a relative path.

According to the man page for xgrid, there are more options:

-si stdin     for submit/run, file to use for standard input
-in indir     for submit/run, working directory to submit with job
-so stdout    for run/results, file to write the standard output stream to
-se stderr    for run/results, file to write the standard error stream to
-out outdir   for run/results, directory to store job results in

The docs say:

Use the -in parameter to pass an input directory. This directory is copied to each agent and becomes the working directory on the agent’s host computer. You can include anything needed in the working directory, such as additonal input files, libraries, and executables. The executable file is run in this directory.

Thinking about this, I realized there is another issue with the xgrid command above. Even though it works, what it really should be is:

xgrid -job run -in temp /usr/bin/python script.py

which also works. Since we're in temp on the agent, we don't need the directory before script.py.

I placed a file with some DNA sequence named seq.txt in the temp directory. The script xgrid.script.py opens the file and prints the data.

fn = 'seq.txt'
FH = open(fn)
data = FH.read()
FH.close()
print data

xgrid -job run -in temp /usr/bin/python xgrid.script.py

>DA19
GGGAGAGTAGCCGTG...

What about getting data back again? Well, I mean, other than the way we've been doing it :)

fn = 'out.txt'
FH = open(fn,'w')
FH.write('got here')
FH.close()

xgrid -in temp -out temp -job run /usr/bin/python out.script.py

It works! We'll save stderror (se) for next time, with BLAST.

Saturday, November 7, 2009

XgridLite and more on passwords

XgridLite is available here. It is a preference pane or plug-in for System Prefs.

We'll use it to change the passwords that the controller requires from the client and agent. The client-controller password is in:

/etc/xgrid/controller/client-password

I changed it to 'abcd' using XgridLite.

Let's see what we have:

sudo cp /etc/xgrid/controller/client-password ~/Desktop/clientpw

use the Finder getInfo to change permissions on the file.

run the pw.script.py from yesterday on it

         6   1   6   2   6   3   6   4
abcd: 01100001011000100110001101100100
 key: 01111101100010010101001000100011
file: 00011100111010110011000101000111

That looks right. What's in the file is an XOR of the key and 'abcd'.

xgrid -h localhost -p abcd -job run /usr/bin/cal

   November 2009
Su Mo Tu We Th Fr Sa
 1  2  3  4  5  6  7
 8  9 10 11 12 13 14
15 16 17 18 19 20 21
22 23 24 25 26 27 28
29 30

It works! So, I guess I just screwed it when I tried it yesterday. Now, go back to XgridLite and change the agent-controller password to 'wxyz'.

sudo cp /etc/xgrid/controller/agent-password ~/Desktop/agentpw

change permissions and examine it:

         7   7   7   8   7   9   7   a
wxyz: 01110111011110000111100101111010
 key: 01111101100010010101001000100011
file: 00001010111100010010101101011001

That looks right.

sudo xgridctl c status

daemon                  state                   pid                     
======                  =====                   ===                     
xgridcontrollerd        running | enabled       1747

Different pid as expected, since we told XgridLite to restart the controller in order to make the password change effective:

xgrid -h localhost -p abcd -job run /usr/bin/cal

And it hangs... Go back to XgridLite and change the agent password back to 'mypw'. It still hangs...

sudo xgridctl c status

daemon                  state                   pid                     
======                  =====                   ===                     
xgridcontrollerd        running | enabled       1935

Grab a text copy of the manual (without repeats of some characters---what is that about?)

man xgrid | col -bx > ~/Desktop/man.txt

xgrid -h localhost -p abcd -grid list

{
    gridList =     (
        0
    );
}

xgrid -h localhost -p abcd -grid attributes -gid 0

{
    gridAttributes =     {
        gridMegahertz = 0;
        isDefault = YES;
        name = Xgrid;
    };
}

That looks like we are hooked up properly...

Try cp:

sudo cp /etc/xgrid/agent/controller-password \
/etc/xgrid/controller/agent-password

And now it works.

xgrid -h localhost -p abcd -job run /usr/bin/cal

   November 2009
Su Mo Tu We Th Fr Sa
 1  2  3  4  5  6  7
 8  9 10 11 12 13 14
15 16 17 18 19 20 21
22 23 24 25 26 27 28
29 30

XgridLite is described as open-source. I'll have to grab a copy of the source and figure out what it's about. In any event, the most likely explanation seems to be that the controller should provide the agent-password to the agent, and it is either not doing that, or the agent is improperly using the password it requires, when it authenticates with the controller. Just a guess.

However, that doesn't explain why the reset using XgridLite failed. But perhaps I screwed up (<sigh> again).

About my Python

As I'm sure you know, Python currently comes in two flavors, the 3 series and the older 2 series. As the Python website states:

Python 3.1.1 was released on August 17th, 2009.
Python 3.1 is a continuation of the work started by Python 3.0, the new backwards-incompatible series of Python.

Apple, on the other hand, includes a Python installation, and it is currently 2.6.1:

localhost:Desktop te$ python
Python 2.6.1 (r261:67515, Jul  7 2009, 23:51:51) 
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>

I run all of my examples using 2.6, though earlier versions should work. The fanciest thing I like to do is list comprehension. According to this PEP, it became part of the language with version 2.0. There are a lot of differences with 3.0 (summarized here), but the one I know will be an issue is the print statement, which has become a function. You'll need to edit my code to change print "x" to print("x"). I think that's probably about it.

There isn't a very good reason to use 2.6 except that I get confused having multiple Pythons around. I even had 3.0 installed to play with, but then got rid of it. This can't go on forever, but it's where I am right now.

Also I don't do the sha-bang #! thing. I just do python myscript.py from the command line.