Thursday, August 18, 2011

Learning to use RPy2 (2)


We're working on RPy2. Before I continue with Bioconductor, we should work through more of the examples in the docs.

I rashly asked a question on Stack Overflow this morning (after being stumped for several hours), then stumbled across the answer. The question is, how to load the data from ALL. In R we would just do:

data('ALL')

and then the objects would be available to us through their names. It turns out that in Python it is almost that simple. I had actually seen the answer, which is a kind of "shortcut" in which we just execute R code directly, and then grab a variable through the name assigned in R. (I used this device several times in the previous post).

(And I should note that there is apparently a more elegant solution in the bioconductor extensions to rpy2.

>>> import rpy2.robjects as robjects
>>> from rpy2.robjects.packages import importr
>>> base = importr('base')
>>> ALL = importr('ALL')
>>> data = robjects.r('data(ALL)')
>>> data.rclass
<rpy2.rinterface.SexpVector - Python:0x258b190 / R:0xdf86c8>
>>> data = robjects.globalenv['ALL']
>>> data
<RS4 - Python:0x2591490 / R:0x29a2134>
>>> data.rclass
<rpy2.rinterface.SexpVector - Python:0x258b3b0 / R:0xdf85c8>
>>> featureNames = robjects.r('featureNames')
>>> featureNames(data)
<StrVector - Python:0x23e4f08 / R:0x304fc00>
['1000..., '1001..., '1002..., ..., 'AFFX..., 'AFFX..., 'AFFX...]
>>> exprs = robjects.r['exprs']
>>> e = exprs(data)
>>> e
<Matrix - Python:0x23b2da0 / R:0x84d8000>
[7.597323, 5.046194, 3.900466, ..., 3.095670, 3.342961, 3.842535]
>>>

Anyway, we want not to make any more silly mistakes. Another elementary thing I happened upon in the docs (in Vectors) is

Access to R-style extracting/subsetting is granted though the two delegators rx and rx2, representing the R functions [ and [[ respectively.

It seems to return a strange little beast:

>>> from rpy2.robjects import r
>>> m = r.matrix(range(4),nrow=2)
>>> value = m.rx(3)
>>> value
<ListVector - Python:0x1023db5a8 / R:0x102cf9c28>
[IntVector]
<no name>:
<IntVector - Python:0x1023db368 / R:0x102cf99e8>
[ 2]
>>> type(value )
<class 'rpy2.robjects.vectors.ListVector'>
>>> value[0]
<IntVector - Python:0x1023db518 / R:0x102cf99e8>
[ 2]
>>> value[0][0]
2

We copy the code to make an R list, directly out of the docs (see ?list):

>>> r('''e1 <- new.env(); e1$a <- 10; e1$b <- 20; L = as.list(e1)''')
<ListVector - Python:0x1023da4d0 / R:0x103306140>
[FloatVector, FloatVector]
b:
<FloatVector - Python:0x1023da680 / R:0x102d1bd58>
[20.000000]
a: <class 'rpy2.robjects.vectors.FloatVector'>
<FloatVector - Python:0x1023da3b0 / R:0x102d1bd88>
[10.000000]
>>> L = r['L']
>>> L.rx2('a')
<FloatVector - Python:0x1023da998 / R:0x102d1bd88>
[10.000000]



Here is a somewhat fancier example involving a linear model (from the Intro). Look in the docs for the R code. In Python:

>>> from rpy2.robjects import FloatVector
>>> from rpy2.robjects.packages import importr
>>> from rpy2.robjects import globalenv as g
>>>
>>> stats = importr('stats')
>>> base = importr('base')
>>>
>>> ctl = FloatVector([4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14])
>>> trt = FloatVector([4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69])
>>> weight = ctl + trt
>>>
>>> group = base.gl(2, 10, 20, labels = ["Ctl","Trt"])
>>> g["weight"] = weight
>>> g["group"] = group
>>> lm_D9 = stats.lm("weight ~ group")
>>> print(stats.anova(lm_D9))
Analysis of Variance Table

Response: weight
Df Sum Sq Mean Sq F value Pr(>F)
group 1 0.6882 0.68820 1.4191 0.249
Residuals 18 8.7292 0.48496
>>>
>>> print(lm_D9.names)
[1] "coefficients" "residuals" "effects" "rank"
[5] "fitted.values" "assign" "qr" "df.residual"
[9] "contrasts" "xlevels" "call" "terms"
[13] "model"

>>> print names(lm_D9)
Traceback (most recent call last):
File "<stdin>", line 1, in
NameError: name 'names' is not defined
>>>


The next example in this section fails with two errors:

ImportError: cannot import name NA_real.

If we just use zeroes, it fails with

in __setitem__
globalenv_ri)
TypeError: All keywords must be strings (or None).

I'm just going to leave it for now.


PCA example

>>> from rpy2.robjects.packages import importr
>>> from rpy2.robjects import globalenv as g
>>> from rpy2.robjects import r
>>>
>>> base = importr('base')
>>> stats = importr('stats')
>>> graphics = importr('graphics')
>>>
>>> # each of these x.y could be replaced by `r`
... m = base.matrix(r.rnorm(100), ncol=5)
>>> pca = stats.princomp(m)
>>> graphics.plot(pca, main="Eigen values")
rpy2.rinterface.NULL

The graphic is at the top of the post. I didn't both to print this properly. See the first post on the docs here. But the idea is that although these functions come from the imported modules they are present in the global R namespace (if that's what they call it). In general, it's better to document your code by using the qualified name. (I know, I know, I used r here extensively! Do as I say, not as I do).


kwargs example

>>> from rpy2.robjects import r
>>> r('''f <- function(x='foo',y='blue') {
... if (x=='foo') print(x)
... else print (y) }''')
<SignatureTranslatedFunction - Python:0x1023d76c8 / R:0x100ef6b58>
>>> f = r['f']
>>> kwargs = {'x':"bar", 'y':"red"}
>>> f(**kwargs)
[1] "red"
<StrVector - Python:0x1023db170 / R:0x102d0fc28>
['red']

I guess that's enough for one post.