David Mertz, Ph.D.
Generic Programmer, Gnosis Software
March, 2005
R is rich statistical environment, released as Free Software. It includes a programming language, and interactive shell, and extensive graphing capability. This article follows up David's two prior installments (those written with Brad Huntting) and looks at the object-orientation in R, along with some additional general programming concepts in R.
The first two installments of this series looked at R in fairly "real world" usage; we explored various statistical analyses and graphing capabilities around a large series of temperature data that the coauthor of those installments had collected. As mentioned in those prior articles, what we actually explored barely scratched the surface of the deep and rich statistical libraries in R.
For this installment, I want to leave aside consideration of further statistical analysis per se. In large part, I do not personally have the requisite knowledge of statistics to decide the most relevant techniques; both my earlier coauthor, Brad Huntting, and many readers know far more in these areas. But even more significant is that I would like to provide readers with some pointers to the underlying language facilities of R. The prior installments made some gestures towards the functional programming orientation of R; my hunch is that many readers will be more familiar with imperative and object oriented languages. As well, we have previously only looked at using R in a rather ad hoc fashion. For this installment, I want to discuss creating reusable and modular components for R development.
Before getting to R's notion of object orientation let me review and clarify a bit about R's data and functions. The main thing to keep in mind about R data is that, to a first brush, "everything is a vector." Even objects that look superficially distinct from vectors--matrices, arrays, data.frames, etc.--are really just vectors with extra (mutable) attributes that tell R to treat them in special ways.
The most important attribute that (some) R vectors have is its
dimension, spelled dim
. The functions, matrix()
, array()
and
dim()
are simply convenience functions for setting the dimensions of
a vector. R's OOP system similarly boils down to what (if anything)
is in an objects class
attribute.
Let us review dimensioning:
> v = 1:1000 > typeof(v) [1] "integer" > attributes(v) NULL > dim(v) = c(10,10,10) # (Re)dimension > attributes(v) $dim [1] 10 10 10 > v2 = matrix(1:1000, nrow=100, ncol=10) > typeof(v2) [1] "integer" > attributes(v2) $dim [1] 100 10 > attr(v2,'dim') = c(10,10,10) # Redimension > attributes(v2) $dim [1] 10 10 10
In short, their are several syntax conveniences for attaching a dim
attribute to a vector, but at heart that is all the syntax sugar does.
One thing that can be confusing about R's "everything is a vector" approach is that row-wise and column-wise operations may not follow your first intuition. For example, it's simple enough to create a 2-D array (a matrix), and even to operate on a single column or row:
> m = matrix(1:12, nrow=3, ncol=4) > m [,1] [,2] [,3] [,4] [1,] 1 4 7 10 [2,] 2 5 8 11 [3,] 3 6 9 12 > sum(m) # sum of all elements of m [1] 78 > sum(m[1,]) # sum of first row [1] 22
But if you wanted to create a vector to sum each row, you might be inclined to try:
> sum(m[c(1,2,3),]) # NOT sum of each row [1] 78
You could construct a loop here, but that feels at odds with R's
functional and vector oriented operations. Instead, the trick is to
use the function apply()
:
> apply(m, 1, sum) # by row [1] 22 26 30 > apply(m, 2, sum) # by column [1] 6 15 24 33 > apply(m, c(1,2), sum) # by column AND row (sum of each single cell) [,1] [,2] [,3] [,4] [1,] 1 4 7 10 [2,] 2 5 8 11 [3,] 3 6 9 12 # Kinda worthless to sum each single cell, but what about this: > a = array(1:24,c(3,4,2)) > apply(a, c(1,2), sum) # sum by depth in 3-D array [,1] [,2] [,3] [,4] [1,] 14 20 26 32 [2,] 16 22 28 34 [3,] 18 24 30 36 > apply(a, c(3), sum) # sum of each depth slice [1] 78 222
A construct that is sometimes useful to have, for perfectly practical reasons, is an infinite sequence of numbers. For example, the coauthor of the prior installments was doing some analysis of Monte Carlo integration techniques, and for his purposes, it was useful to have an infinitely long sequence of random numbers. Understand here that the infinite sequence that is needed is not simply the ability to generate a new number as needed; it is also necessary to be able to refer back to a specific previously referenced element, and have it refer to the same value it had before.
Obviously, no computer language and no computer, can store an infinite
sequence--what they can do is store a lazy and unbounded sequence.
More elements can be added to a realized list, when and only when,
access is required. In Python, for example, you might accomplish this
by creating a list-like object with a custom .__getitem__()
method
that extends the internal list as needed. In Haskell, laziness is
built deeply into the language--in effect, everything is lazy. In my
Haskell tutorial (see Resources) I used the example of creating a list
of all the prime numbers:
primes :: [Int] primes = sieve [2 .. ] sieve (x:xs) = x : sieve [y | y <- xs, (y `rem` x)/=0]
In respect to the infinite, R is closer to Python than it is to Haskell. You need to explicitly construct more list elements as needed. We need to wait for the OOP section to let vector indexing itself launch the behind-the-scenes mechanism; but there is still not much scaffolding involved.
inf_vector = rnorm(10, 0, 1) # arbritrarily start w/ init 10 items assure <- function(index) { extend_by = max(index-length(inf_vector), 0) extras = rnorm(extend_by, 0, 1) v <- c(inf_vector, extras) assign("inf_vector", v, env=.GlobalEnv) return(index) } getRand <- function(index) { assure(index) return(inf_vector[index]) }
Probably the preferred usage is to access values through the
getRand()
wrapper function. Notice that you are perfectly free to
use slices or computed values, as well as single indices:
> getRand(3) # Single index [1] 0.5557101 > getRand(1:5) # Range [1] -0.05472011 -0.30419695 0.55571013 0.91667175 -0.40644081 > getRand(sqrt(c(4,16))) # Computed index collection [1] -0.3041970 0.9166717 > getRand(100) # Force background vector extension [1] 0.6577079
If you prefer, you can simply assure()
the vector is large enough
before using elements:
> assure(2000) [1] 2000 > inf_vector[1500] [1] 1.267652
R is capable of fully general object-oriented programming. But to understand this, you need to step back a bit in your thinking about what OOP is. Users of languages like Java and C++, or even of Python, Ruby, or Smalltalk, might have a relatively circumscribed picture of object-orientation. Not wrong, but limited to one model.
R's approach to OOP is based around generic functions rather than around class hierarchies. This concept will be familiar to readers who have used Lisp's CLOS, or who have read my discussion of multiple dispatch using Python (see Resources). Unfortunately, R's approach is still single dispatch, in that respect equivalent to the "traditional" languages mentioned above (C++, Java, etc).
I should note--though I will not discuss it in any specificity in this
article--that recent version of R come with a package called methods
that define and manipulate so-called "formal methods." In many ways,
using these formal methods imposes much of the discipline (and
limitations) you find in "traditional" OOP languages. In any case,
formal OOP in R is built on top of the "informal OOP" that I discuss.
The methods
package is still somewhat tentative, from what I can
tell; but some moderately tweaked version of it seems certain to
continue in later R versions. See Resources for more background.
The thing to keep in mind in understanding OOP sui generis is that
OOP is not really a matter of inheritance specifically, but of
dispatch decisions more generally. That is, a call to obj.method()
in a "traditional" OOP language will look through the method
resolution order (MRO) of an object to find the first ancestor
class of obj
that has a method .method()
. What "first" means is a
more subtle question than is obvious at first (see Resources for a
good discussion of the evolving MRO design in Python). R makes the
same decision, but it turns the idea of inheritance "inside out."
Rather than have a bunch of classes which may define and override
various methods in their bodies, R creates a family of generic
functions that carry a tag describing what type of object they want
to operate on.
As a simple example, let us create a generic function called
whoami()
, and some tagged methods to dispatch to:
#------------- Create a generic method > whoami <- function(x, ...) UseMethod("whoami") > whoami.foo <- function(x) print("I am a foo") > whoami.bar <- function(x) print("I am a bar") > whoami.default <- function(x) print("I don't know who I am")
The key here is that every object in R may belong to zero, one, or
more classes. Specifically, the MRO of any given object (relative to
a particular method) is simply the vector of named classes, if any, in
its class
attribute. E.g.:
> a = 1:10 > b = 2:20 > whoami(a) # No class assigned [1] "I don't know who I am" > attr(a,'class') <- 'foo' > attr(b,'class') <- c('baz','bam','bar') > whoami(a) [1] "I am a foo" > whoami(b) # Search MRO for defined method [1] "I am a bar" > attr(a,'class') <- 'bar' # Change the class of 'a' > whoami(a) [1] "I am a bar"
As with traditional inheritance languages, an object need not utilize
the same class for every method it calls. Traditionally, if Child
inherits from Mom
and Dad
an object of type Child
might utilize
.meth1()
from Mom
and .meth2()
from Dad
. You can do this in
R, naturally, but Mom
and Dad
are nothing substantial, just names:
> meth1 <- function(x) UseMethod("meth1") > meth1.Mom <- function(x) print("Mom's meth1") > meth1.Dad <- function(x) print("Dad's meth1") > meth2 <- function(x) UseMethod("meth2") > meth2.Dad <- function(x) print("Dad's meth2") > attr(a,'class') <- c('Mom','Dad') > meth1(a) # Even though meth1.Dad exists, Mom comes first for a [1] "Mom's meth1" > meth2(a) [1] "Dad's meth2"
It might seem limiting to need to explicitly specify the MRO of an object rather than rely on its implicit resolution through inheritance syntax. In point of fact, you can perfectly easily implement inheritance-based MROs with a minimal wrapper functions. The MRO I use below is probably not the best one possible (see Simionato's essay in Resources), but it shows the idea:
char0 = character(0) makeMRO <- function(classes=char0, parents=char0) { # Create a method resolution order from an optional # explicit list and an optional list of parents mro <- c(classes) for (name in parents) { mro <- c(mro, name) ancestors <- attr(get(name),'class') mro <- c(mro, ancestors[ancestors != name]) } return(mro) } NewInstance <- function(value=0, classes=char0, parents=char0) { # Create a new object based on initial value, # explicit classes and parents (all optional) obj <- value attr(obj,'class') <- makeMRO(classes, parents) return(obj) } MaternalGrandma <- NewInstance() PaternalGrandma <- NewInstance() Mom <- NewInstance(classes='Mom', parents='MaternalGrandma') Dad <- NewInstance(0, classes=c('Dad','Uncle'), 'PaternalGrandma') Me <- NewInstance(value='Hello World', 'Me', c('Mom','Dad'))
In action:
> print(Me) [1] "Hello World" attr(,"class") [1] "Me" "Mom" "MaternalGrandma" "Dad" [5] "Uncle" "PaternalGrandma"
If you wish to follow a traditional approach to class/instance
relationship, you want to include the name of the class you create
(e.g. Mom
in its classes
argument). Actually, given that each
"class" is itself a perfectly good object, the above system is closer
to prototype-based OOP systems than class-based ones. Then again, the
whole system is flexible enough to include all variations. You are
perfectly free to segregate class objects from instance objects if you
wish--you might distinguish classes with a naming convention (e.g.
Mom
versus mom
), by attaching another attribute (e.g. type
could
be class
or instance
; and utility functions would check that), or
by other means.
Now that we have some OOP scaffolding to work with, we can actually do a better job with the infinite vector that was presented above. The first solution is perfectly workable, but it might be nice to have an even more seamless and invisible infinite vector.
Operators in R are just shorthand ways to make function calls, and you are just as free to specialize operator behavior on a class basis as you are any other function call. While we are at it, we can fix a few other weaknesses in the first system: (1) We would like to be able to generate as many distinct infinite vectors as need; (2) We would like to be able to configure the random distribution used; (3) We would like to have the option of initializing an infinite random vector with the values in another vector. So let us do all that:
"[.infinite_random" <- function(v, index) { name <- attr(v, 'name') rfunc <- attr(v, 'rfunc') extend_by = max(index-length(v), 0) extras = rfunc(extend_by) new <- c(v, extras) makeInfiniteRandomVector(name, v=new, rfunc) return(new[index]) } unitnorm <- function(n) return(rnorm(n,0,1)) empty <- vector('numeric', 0) makeInfiniteRandomVector <- function(name, v=empty, rfunc=unitnorm) { # Create an infinite vector # optionally extend existing vector, configurable rand func attr(v,'class') <- 'infinite_random' attr(v,'name') <- name attr(v,'rfunc') <- rfunc assign(name, v, env=.GlobalEnv) } makeInfiniteRandomVector('v') # makeInfiniteRandomVector('inf_poisson', rfunc=my_poisson) # Usage is just, e.g.: v[1]; v[10]; v[9:12]; etc.
Indexing is already defined by R as a generic function, so there is
no need to call UseMethod()
to set it up; you just define as many
new specializations as you wish. Likewise, the built-in print()
function is also generic. We could specialize that with, e.g.:
print.infinite_random <- function(v) { a_few = 5 len = length(v) end_range = (len-a_few)+1 cat('* Infinite Random Vector *\n') cat('[1] ', v[1:a_few], '...\n') cat('[') cat(end_range) cat('] ', v[end_range:len], '...\n') }
In action:
> v[1000] [1] -1.341881 > print(v) * Infinite Random Vector * [1] -0.6247392 1.308057 1.654919 1.691754 -2.251065 ... [996] 1.027440 0.8376 -0.7066545 -0.7778386 -1.341881 ...
Programming general purpose functions, objects and classes in R requires taking a step back from the ways of thinking that traditional procedural and object-oriented programmers are used to. The prior two installments showed you some examples of ad hoc statistical exploration, and did not really need such a rethinking (which is fine). But once you wish to reuse your code, it is worth understanding the concepts of generic functions and the "inside out" OOP you can write with them (but this inside out form is actually more general). The trick here, in my mind, is to think of OOP entirely as a question of "what code gets called, and how is the decision made". Do not get hung up on the specific syntax familiar languages, whether C++, Objective C, Java, Ruby or Python, use to express that; focus on the concept of dispatch itself.
The two earlier installments of this R series can be found at:
Part 1. Dabbling with a wealth of statistical facilities: http://www-106.ibm.com/developerworks/linux/library/l-r1/
Part 2. Functional programming and data exploration: http://www-128.ibm.com/developerworks/linux/library/l-r2/
A good description of Monte Carlo integration can be found in a Wikipedia entry:
http://en.wikipedia.org/wiki/Monte_Carlo_method
My discussion of multiple dispatch in a Charming Python column provide good background for the generalization of OOP to generic functions. The framework presented there is generic, but is also, in fact, a superset of R's capabilities:
http://www-106.ibm.com/developerworks/library/l-pydisp.html
Michele Simionato has a good discussion of the concepts of method resolution order (MRO), and the merits of different MRO algorithms:
http://www.python.org/2.3/mro.html
R formal methods package:
http://developer.r-project.org/methodsPackage.html
Chapter 4 of the full reference manual discusses formal methods and
the methods
package:
http://cran.r-project.org/doc/manuals/fullrefman.pdf
My introductory tutorial on Haskell can be found at:
http://gnosis.cx/publish/programming/Haskell.pdf
To David Mertz, all the world is a stage; and his career is devoted to providing marginal staging instructions. David may be reached at [email protected]; his life pored over athttp://gnosis.cx/publish/. Suggestions and recommendations on this, past, or future, columns are welcomed. Check out David's book Text Processing in Python at http//gnosis.cx/TPiP/.