LINUX ZONE FEATURE: The R Programming Language, Part 3 -- Reusable and Object Oriented Programming --

The first two installments of this series looked at R in fairly "real world" usage; we explored various statistical analyses and graphing capabilities around a large series of temperature data that the coauthor of those installments had collected. As mentioned in those prior articles, what we actually explored barely scratched the surface of the deep and rich statistical libraries in R.

For this installment, I want to leave aside consideration of further statistical analysis per se. In large part, I do not personally have the requisite knowledge of statistics to decide the most relevant techniques; both my earlier coauthor, Brad Huntting, and many readers know far more in these areas. But even more significant is that I would like to provide readers with some pointers to the underlying language facilities of R. The prior installments made some gestures towards the functional programming orientation of R; my hunch is that many readers will be more familiar with imperative and object oriented languages. As well, we have previously only looked at using R in a rather ad hoc fashion. For this installment, I want to discuss creating reusable and modular components for R development.

Back To Basics

Before getting to R's notion of object orientation let me review and clarify a bit about R's data and functions. The main thing to keep in mind about R data is that, to a first brush, "everything is a vector." Even objects that look superficially distinct from vectors--matrices, arrays, data.frames, etc.--are really just vectors with extra (mutable) attributes that tell R to treat them in special ways.

The most important attribute that (some) R vectors have is its dimension, spelled dim. The functions, matrix(), array() and dim() are simply convenience functions for setting the dimensions of a vector. R's OOP system similarly boils down to what (if anything) is in an objects class attribute.

Creating vectors and assigning dimensions

> v = 1:1000
> typeof(v)
[1] "integer"
> attributes(v)
NULL
> dim(v) = c(10,10,10)  # (Re)dimension
> attributes(v)
$dim
[1] 10 10 10
> v2 = matrix(1:1000, nrow=100, ncol=10)
> typeof(v2)
[1] "integer"
> attributes(v2)
$dim
[1] 100  10
> attr(v2,'dim') = c(10,10,10)  # Redimension
> attributes(v2)
$dim
[1] 10 10 10

In short, their are several syntax conveniences for attaching a dim attribute to a vector, but at heart that is all the syntax sugar does.

One thing that can be confusing about R's "everything is a vector" approach is that row-wise and column-wise operations may not follow your first intuition. For example, it's simple enough to create a 2-D array (a matrix), and even to operate on a single column or row:

Operating on matrix vector, and row-wise

> m = matrix(1:12, nrow=3, ncol=4)
> m
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
> sum(m)  # sum of all elements of m
[1] 78
> sum(m[1,])  # sum of first row
[1] 22

But if you wanted to create a vector to sum each row, you might be inclined to try:

Wrong way to perform multiple row-wise operations

> sum(m[c(1,2,3),])  # NOT sum of each row
[1] 78

You could construct a loop here, but that feels at odds with R's functional and vector oriented operations. Instead, the trick is to use the function apply():

apply() function for row-wise operations

> apply(m, 1, sum) # by row
[1] 22 26 30
> apply(m, 2, sum) # by column
[1]  6 15 24 33
> apply(m, c(1,2), sum) # by column AND row (sum of each single cell)
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
# Kinda worthless to sum each single cell, but what about this:
> a = array(1:24,c(3,4,2))
> apply(a, c(1,2), sum)  # sum by depth in 3-D array
     [,1] [,2] [,3] [,4]
[1,]   14   20   26   32
[2,]   16   22   28   34
[3,]   18   24   30   36
> apply(a, c(3), sum)    # sum of each depth slice
[1]  78 222

An Infinite Sequence

A construct that is sometimes useful to have, for perfectly practical reasons, is an infinite sequence of numbers. For example, the coauthor of the prior installments was doing some analysis of Monte Carlo integration techniques, and for his purposes, it was useful to have an infinitely long sequence of random numbers. Understand here that the infinite sequence that is needed is not simply the ability to generate a new number as needed; it is also necessary to be able to refer back to a specific previously referenced element, and have it refer to the same value it had before.

Obviously, no computer language and no computer, can store an infinite sequence--what they can do is store a lazy and unbounded sequence. More elements can be added to a realized list, when and only when, access is required. In Python, for example, you might accomplish this by creating a list-like object with a custom .__getitem__() method that extends the internal list as needed. In Haskell, laziness is built deeply into the language--in effect, everything is lazy. In my Haskell tutorial (see Resources) I used the example of creating a list of all the prime numbers:

Haskell list of ALL primes w/ Sieve of Eratosthenes

primes :: [Int]
primes = sieve [2 .. ]
sieve (x:xs) = x : sieve [y | y <- xs, (y `rem` x)/=0]

In respect to the infinite, R is closer to Python than it is to Haskell. You need to explicitly construct more list elements as needed. We need to wait for the OOP section to let vector indexing itself launch the behind-the-scenes mechanism; but there is still not much scaffolding involved.

Define a vector and a means to dynamically expand it

inf_vector = rnorm(10, 0, 1)   # arbritrarily start w/ init 10 items
assure <- function(index) {
  extend_by = max(index-length(inf_vector), 0)
  extras = rnorm(extend_by, 0, 1)
  v <- c(inf_vector, extras)
  assign("inf_vector", v, env=.GlobalEnv)
  return(index)
}
getRand <- function(index) {
  assure(index)
  return(inf_vector[index])
}

Probably the preferred usage is to access values through the getRand() wrapper function. Notice that you are perfectly free to use slices or computed values, as well as single indices:

Using wrapper func as proxy to infinite virtual vector

> getRand(3)                # Single index
[1] 0.5557101
> getRand(1:5)              # Range
[1] -0.05472011 -0.30419695  0.55571013  0.91667175 -0.40644081
> getRand(sqrt(c(4,16)))    # Computed index collection
[1] -0.3041970  0.9166717
> getRand(100)              # Force background vector extension
[1] 0.6577079

If you prefer, you can simply assure() the vector is large enough before using elements:

Extending vector (if needed) before access

> assure(2000)
[1] 2000
> inf_vector[1500]
[1] 1.267652

Object Oriented R

R is capable of fully general object-oriented programming. But to understand this, you need to step back a bit in your thinking about what OOP is. Users of languages like Java and C++, or even of Python, Ruby, or Smalltalk, might have a relatively circumscribed picture of object-orientation. Not wrong, but limited to one model.

R's approach to OOP is based around generic functions rather than around class hierarchies. This concept will be familiar to readers who have used Lisp's CLOS, or who have read my discussion of multiple dispatch using Python (see Resources). Unfortunately, R's approach is still single dispatch, in that respect equivalent to the "traditional" languages mentioned above (C++, Java, etc).

I should note--though I will not discuss it in any specificity in this article--that recent version of R come with a package called methods that define and manipulate so-called "formal methods." In many ways, using these formal methods imposes much of the discipline (and limitations) you find in "traditional" OOP languages. In any case, formal OOP in R is built on top of the "informal OOP" that I discuss. The methods package is still somewhat tentative, from what I can tell; but some moderately tweaked version of it seems certain to continue in later R versions. See Resources for more background.

The thing to keep in mind in understanding OOP sui generis is that OOP is not really a matter of inheritance specifically, but of dispatch decisions more generally. That is, a call to obj.method() in a "traditional" OOP language will look through the method resolution order (MRO) of an object to find the first ancestor class of obj that has a method .method(). What "first" means is a more subtle question than is obvious at first (see Resources for a good discussion of the evolving MRO design in Python). R makes the same decision, but it turns the idea of inheritance "inside out." Rather than have a bunch of classes which may define and override various methods in their bodies, R creates a family of generic functions that carry a tag describing what type of object they want to operate on.

Generic Functions

As a simple example, let us create a generic function called whoami(), and some tagged methods to dispatch to:

#------------- Create a generic method
> whoami <- function(x, ...) UseMethod("whoami")
> whoami.foo <- function(x) print("I am a foo")
> whoami.bar <- function(x) print("I am a bar")
> whoami.default <- function(x) print("I don't know who I am")

The key here is that every object in R may belong to zero, one, or more classes. Specifically, the MRO of any given object (relative to a particular method) is simply the vector of named classes, if any, in its class attribute. E.g.:

Tagging objects with class memberships

> a = 1:10
> b = 2:20
> whoami(a)                 # No class assigned
[1] "I don't know who I am"
> attr(a,'class') <- 'foo'
> attr(b,'class') <- c('baz','bam','bar')
> whoami(a)
[1] "I am a foo"
> whoami(b)                 # Search MRO for defined method
[1] "I am a bar"
> attr(a,'class') <- 'bar'  # Change the class of 'a'
> whoami(a)
[1] "I am a bar"

As with traditional inheritance languages, an object need not utilize the same class for every method it calls. Traditionally, if Child inherits from Mom and Dad an object of type Child might utilize .meth1() from Mom and .meth2() from Dad. You can do this in R, naturally, but Mom and Dad are nothing substantial, just names:

Dispatch Resolution Per Method

> meth1 <- function(x) UseMethod("meth1")
> meth1.Mom <- function(x) print("Mom's meth1")
> meth1.Dad <- function(x) print("Dad's meth1")
> meth2 <- function(x) UseMethod("meth2")
> meth2.Dad <- function(x) print("Dad's meth2")
> attr(a,'class') <- c('Mom','Dad')
> meth1(a)   # Even though meth1.Dad exists, Mom comes first for a
[1] "Mom's meth1"
> meth2(a)
[1] "Dad's meth2"

Including Ancestors

It might seem limiting to need to explicitly specify the MRO of an object rather than rely on its implicit resolution through inheritance syntax. In point of fact, you can perfectly easily implement inheritance-based MROs with a minimal wrapper functions. The MRO I use below is probably not the best one possible (see Simionato's essay in Resources), but it shows the idea:

char0 = character(0)
makeMRO <- function(classes=char0, parents=char0) {
    # Create a method resolution order from an optional
    # explicit list and an optional list of parents
    mro <- c(classes)
    for (name in parents) {
        mro <- c(mro, name)
        ancestors <- attr(get(name),'class')
        mro <- c(mro, ancestors[ancestors != name])
    }
    return(mro)
}
NewInstance <- function(value=0, classes=char0, parents=char0) {
    # Create a new object based on initial value,
    # explicit classes and parents (all optional)
    obj <- value
    attr(obj,'class') <- makeMRO(classes, parents)
    return(obj)
}
MaternalGrandma <- NewInstance()
PaternalGrandma <- NewInstance()
Mom <- NewInstance(classes='Mom', parents='MaternalGrandma')
Dad <- NewInstance(0, classes=c('Dad','Uncle'), 'PaternalGrandma')
Me <- NewInstance(value='Hello World', 'Me', c('Mom','Dad'))

An object with an inheritance-generated MRO

> print(Me)
[1] "Hello World"
attr(,"class")
[1] "Me"              "Mom"             "MaternalGrandma" "Dad"
[5] "Uncle"           "PaternalGrandma"

If you wish to follow a traditional approach to class/instance relationship, you want to include the name of the class you create (e.g. Mom in its classes argument). Actually, given that each "class" is itself a perfectly good object, the above system is closer to prototype-based OOP systems than class-based ones. Then again, the whole system is flexible enough to include all variations. You are perfectly free to segregate class objects from instance objects if you wish--you might distinguish classes with a naming convention (e.g. Mom versus mom), by attaching another attribute (e.g. type could be class or instance; and utility functions would check that), or by other means.

Revisiting an Infinite Vector

Now that we have some OOP scaffolding to work with, we can actually do a better job with the infinite vector that was presented above. The first solution is perfectly workable, but it might be nice to have an even more seamless and invisible infinite vector.

Operators in R are just shorthand ways to make function calls, and you are just as free to specialize operator behavior on a class basis as you are any other function call. While we are at it, we can fix a few other weaknesses in the first system: (1) We would like to be able to generate as many distinct infinite vectors as need; (2) We would like to be able to configure the random distribution used; (3) We would like to have the option of initializing an infinite random vector with the values in another vector. So let us do all that:

Define an indexable infinite random vector

"[.infinite_random" <- function(v, index) {
    name <- attr(v, 'name')
    rfunc <- attr(v, 'rfunc')
    extend_by = max(index-length(v), 0)
    extras = rfunc(extend_by)
    new <- c(v, extras)
    makeInfiniteRandomVector(name, v=new, rfunc)
    return(new[index])
}
unitnorm <- function(n) return(rnorm(n,0,1))
empty <- vector('numeric', 0)
makeInfiniteRandomVector <- function(name, v=empty, rfunc=unitnorm) {
    # Create an infinite vector
    # optionally extend existing vector, configurable rand func
    attr(v,'class') <- 'infinite_random'
    attr(v,'name') <- name
    attr(v,'rfunc') <- rfunc
    assign(name, v, env=.GlobalEnv)
}
makeInfiniteRandomVector('v')
# makeInfiniteRandomVector('inf_poisson', rfunc=my_poisson)
# Usage is just, e.g.: v[1]; v[10]; v[9:12]; etc.

Indexing is already defined by R as a generic function, so there is no need to call UseMethod() to set it up; you just define as many new specializations as you wish. Likewise, the built-in print() function is also generic. We could specialize that with, e.g.:

Printing an infinite vector

print.infinite_random <- function(v) {
    a_few = 5
    len = length(v)
    end_range = (len-a_few)+1
    cat('* Infinite Random Vector *\n')
    cat('[1] ', v[1:a_few], '...\n')
    cat('[')
    cat(end_range)
    cat('] ', v[end_range:len], '...\n')
}

Example of printing infinite vector

> v[1000]
[1] -1.341881
> print(v)
* Infinite Random Vector *
[1]  -0.6247392 1.308057 1.654919 1.691754 -2.251065 ...
[996]  1.027440 0.8376 -0.7066545 -0.7778386 -1.341881 ...

Conclusion

Programming general purpose functions, objects and classes in R requires taking a step back from the ways of thinking that traditional procedural and object-oriented programmers are used to. The prior two installments showed you some examples of ad hoc statistical exploration, and did not really need such a rethinking (which is fine). But once you wish to reuse your code, it is worth understanding the concepts of generic functions and the "inside out" OOP you can write with them (but this inside out form is actually more general). The trick here, in my mind, is to think of OOP entirely as a question of "what code gets called, and how is the decision made". Do not get hung up on the specific syntax familiar languages, whether C++, Objective C, Java, Ruby or Python, use to express that; focus on the concept of dispatch itself.

Resources

A good description of Monte Carlo integration can be found in a Wikipedia entry:

My discussion of multiple dispatch in a Charming Python column provide good background for the generalization of OOP to generic functions. The framework presented there is generic, but is also, in fact, a superset of R's capabilities:

Michele Simionato has a good discussion of the concepts of method resolution order (MRO), and the merits of different MRO algorithms:

Chapter 4 of the full reference manual discusses formal methods and the methods package:

About The Author

To David Mertz, all the world is a stage; and his career is devoted to providing marginal staging instructions. David may be reached at [email protected]; his life pored over athttp://gnosis.cx/publish/. Suggestions and recommendations on this, past, or future, columns are welcomed. Check out David's book Text Processing in Python at http//gnosis.cx/TPiP/.

Linux Zone Feature: The R Programming Language, Part 3

Reusable and Object Oriented Programming

Introduction