Thursday, August 4, 2011

Exploring ctypes (3)

Continuing with ctypes---here is an example of scoring a genome for potential sites for a DNA-binding protein according to the methods we talked about here.

We have a single function in a C module that will be compiled into a shared dynamic library. Then we load that library from Python, using ctypes, and call the function. The heavy lifting is done by the function score. The arguments are:

• the DNA sequence as a string (a str on the Python side)
• an array of floats which holds the scoring table
(in order acgt for each position in the site)
• the length of the site (n = 2 in the example)
• an array to hold the result for each position in the sequence

other variables:

• loop counters / index variables i, j, k
• the current character, c, so we can print it
• the current score for this char, d, also so we can print it
• the current score for this position in the sequence, r

The logic is straightforward and elementary. (Although perhaps there are faster approaches that people might suggest?)

mylib.c

#include <stdio.h>
#include <string.h>

void score(const char* dna, double p[], int n, double result[]) {
char c;
int i,j,k,N;
double d,r;

N = strlen(dna)-n+1;
for (i=0; i<N; i++){
r = 0;
for (j=0; j<n; j++){
c = dna[i+j];
switch (c) {
case 'a': { k=0; break; }
case 'c': { k=1; break; }
case 'g': { k=2; break; }
case 't': { k=3; break; }
}
printf("%c ", c);
d = p[j*4 + k];
printf("%3.3f ",d);
r += d;
}
printf("%3.3f\n", r);
result[i] = r;
}
}

We build the shared library using clang (or gcc, a quick test showed no speed difference):

clang -g -Wall -c mylib.c
clang -dynamiclib -current_version 1.0 mylib.o -o mylib.dylib

The Python script is a little more unusual. The array initialization was just adapted from the docs (here), I don't yet understand how it works. (It seems like a pointer would be better).

Here is the second of those:

NFloats = ctypes.c_double * N
result = [0.0] * N
result = NFloats(*result)

And I was surprised that we don't have to worry about the types for arguments to our score function---that's why those lines are commented out at the top.

For this toy example, the DNA sequence is 14 nt, and the scoring table is given in the list L. Here is the script:

script.py

import ctypes
pre = '/Users/telliott/Desktop'
mylib = ctypes.CDLL(pre + '/mylib.dylib', ctypes.RTLD_GLOBAL)
#mylib.score.argtypes = [ctypes.c_char_p]
#ctypes.c_float_p,
#ctypes.c_int]

dna = 'actgtcgactcgag'
L = [ 0.567, -1.603, -0.2245, 0.3605,
-0.1175, -0.4655, -0.5326, 0.6898 ]

EightFloats = ctypes.c_double * len(L)
ff = EightFloats(*L)
n = int(len(L)/4)
N = len(dna) - n + 1

NFloats = ctypes.c_double * N
result = [0.0] * N
result = NFloats(*result)

mylib.score(dna, ff, n, result)
for f in result[:6]: print round(f,3)

Here's the output. Each position is evaluated on a separate line. The last six values printed are the results received back in Python.

> python script.py
a 0.567 c -0.466 0.101
c -1.603 t 0.690 -0.913
t 0.360 g -0.533 -0.172
g -0.225 t 0.690 0.465
t 0.360 c -0.466 -0.105
c -1.603 g -0.533 -2.136
g -0.225 a -0.117 -0.342
a 0.567 c -0.466 0.101
c -1.603 t 0.690 -0.913
t 0.360 c -0.466 -0.105
c -1.603 g -0.533 -2.136
g -0.225 a -0.117 -0.342
a 0.567 g -0.533 0.034
0.101
-0.913
-0.172
0.465
-0.105
-2.136

Not bad for 50 lines of code and a little more than an hour of work! If you want to see it go fast, comment out the printf statements in the C code, and change:

dna = 'actgtcgactcgag'*400000

Now the DNA length is 5.6E6, and it finishes in about 5 seconds. It's impressive.