numeric
and numarray
packagesDavid Mertz, Ph.D.
Idempotentate, Gnosis Software, Inc.
September, 2003
Numerical Python (often called NumPy) is a widely used extension library for fast operations on fixed-type arrays, of any dimensionality, in Python. Since the underlying code is well-optimized C, any speed limitations of Python's interpreter usually go away when major operations are performed in NumPy calls. As successful as NumPy has been, its developers have decided to supercede NumPy with a new module called Numarray that is mostly, but not quite entirely, compatible with NumPy. This installment looks both at the general features of NumPy, and at the specific improvements forthcoming with Numarray.
The first thing to know about the Numerical Python package is that it
doesn't let you do anything you cannot already do with standard
Python. What it does, is let you do many of the same things a heck
of a lot faster. Actually, that's not quite all; there are a number
of operations on arrays that are much more elegant to express in
numeric
or numarray
than they are with standard Python datatypes
and syntax. But mostly it is the impressive speed that draws users to
Numerical Python.
In essence, Numerical Python just gives you a new datatype, the array. Numerical arrays contain only elements of a homogeneous datatype, in contrast to to lists, tuples, and dictionaries which may contain heterogeneous elements. The other useful thing about Numerical arrays is that they may be multi-dimensional--but the dimensionality of arrays is a bit different that the simple nestability of lists. Numerical Python draws on the experience of programmers--particularly those from scientific backgrounds--who have abstracted the best features of arrays languages like APL, FORTRAN, MATLAB, and S, and created arrays whose shape and dimensionality is easily changeable. Back to this soon.
Operations on arrays in Numerical Python are performed elementwise. Even though two dimensional arrays are similar to matrices from linear algebra, operations (like multiply) have nothing to do with the operations in linear algebra (like matrix multiplication).
Let us look at a quick example of all these points. In plain Python, you could create a "2-D list" with, e.g.:
>>> pyarr = [[1,2,3], ... [4,5,6], ... [7,8,9]] >>> print pyarr [[1, 2, 3], [4, 5, 6], [7, 8, 9]] >>> pyarr[1][1] = 0 >>> print pyarr [[1, 2, 3], [4, 0, 6], [7, 8, 9]]
Good enough, but about all you can do with this structure is set and retrieve elements by individual (multi-dimesional) index. In contrast, a Numerical array is a bit more flexible:
>>> from numarray import * >>> numarr = array(pyarr) >>> print numarr [[1 2 3] [4 0 6] [7 8 9]]
Not much has changed, but what about operations?
>>> numarr2 = numarr * 2 >>> print numarr2 [[ 2 4 6] [ 8 0 12] [14 16 18]] >>> print numarr2 + numarr [[ 3 6 9] [12 0 18] [21 24 27]]
And changing the shape:
>>> numarr2.shape = (9,) >>> print numarr2 [ 2 4 6 8 0 12 14 16 18]
For the most part, the newer package numarray
is API compatible
with the earlier numeric
. However, experience using numeric
by
many users prompted developers to make a few incompatible
improvements. Instead of breaking the any application that relies on
numeric
, developers launched a new project called numarray
. As of
this writing, numarray
lacks a few features in numeric
, but
implementation of most is planned.
Some of the improvements for numarray
are:
(1) Element types are arranged in a hierarchical class structure to
support isinstance() checks. numeric
used only character type codes
when specifying data types (but initializers in numarray
can still
accept the old character codes).
(2) The type coercion rules have changed to preserve types in arrays (more often), rather than converting to the type of a Python scalar.
(3) Added array attributes (not just getters and setters).
(4) More flexible exception handling.
New users probably do not to worry about any of these changes; it is
best to start out with numarray
at this point, in most cases.
Let us try to get a sense of just how much faster operations in Numerical Python can be than standard Python. As a toy task, we will create a sequence of numbers, then double each one. First, a few variants on standard Python approaches:
def timer(fun, n, comment=""): from time import clock start = clock() print comment, len(fun(n)), "elements", print "in %.2f seconds" % (clock()-start) def double1(n): return map(lambda n: 2*n, xrange(n)) timer(double1, 5000000, "Running map() on xrange iterator:") def double2(n): return [2*n for n in xrange(n)] timer(double2, 5000000, "Running listcomp on xrange iter: ") def double3(n): double = [] for n in xrange(n): double.append(2*n) return double timer(double3, 5000000, "Building new list from iterator: ")
We can check whether any speed differences exist between map()
, list
comprehensions, and traditional loops. What about standard module
array
, which also requires homogeneous element types, perhaps it
will run faster:
import array def double4(n): return [2*n for n in array.array('i',range(n))] timer(double4, 5000000, "Running listcomp on array.array: ")
Finally, let us look at how numarray
does. As an extra check, we
should see whether we still benefit if it is necessary to convert an
array back into a standard list:
from numarray import * def double5(n): return 2*arange(n) timer(double5, 5000000, "Numarray scalar multiplication: ") def double6(n): return (2*arange(n)).tolist() timer(double6, 5000000, "Numarray mult, returning list: ")
Now running it:
$ python2.3 timing.py Running map() on xrange iterator: 5000000 elements in 13.61 seconds Running listcomp on xrange iter: 5000000 elements in 16.46 seconds Building new list from iterator: 5000000 elements in 20.13 seconds Running listcomp on array.array: 5000000 elements in 25.58 seconds Numarray scalar multiplication: 5000000 elements in 0.61 seconds Numarray mult, returning list: 5000000 elements in 3.70 seconds
The minor differences between techniques for working with lists is
perhaps worth noticing, as is the misstep in trying standard module
array
. But the general moral is that numarray
performs the
operation in less than 1/20th the time. Coercing back to a standard
list loses a lot of that gain.
Do not put too much weight in such a simple comparison, but the speedup is probably about typical. For big scientific calculations, moving from months to days--or days to hours--become quite worthwhile.
The typical use case for Numerical Python is in scientific modeling,
or perhaps in related areas like graphic manipulations and
convolutions or signal processing. To illustrate a number of features
of numarray
, I will use a moderately realistic problem. Suppose you
have a 3D physical space over which a quantity varies. Abstractly, any
parametric space, with any number of dimensions, is equally applicable
to numarray
. But it is familiar and easy to imagine, for example, a
room in which temperature varies from point to point. As winter comes
to my New England home, the example starts to seem germane.
While this point is perhaps obvious, it is worth mentioning
explicitly. My below examples work with smallish arrays because it is
easier to illustrate with them. But numarray
remains fast for
arrays with millions of elements, rather than merely dozens; the
former are probably more common for real scientific models.
As a start, let us create the "room". There are a number of ways to
do this--the most general is with the callable array()
. Using that,
we can generate Numerical arrays with a variety of initialization
parameters, including initial data from any Python sequences. But for
our room, the convenience function zeros()
produces a nice--evenly
chilly--room:
>>> from numarray import * >>> room = zeros((4,3,5),Float) >>> print room [[[ 0. 0. 0. 0. 0.] [ 0. 0. 0. 0. 0.] [ 0. 0. 0. 0. 0.]] [[ 0. 0. 0. 0. 0.] [ 0. 0. 0. 0. 0.] [ 0. 0. 0. 0. 0.]] [[ 0. 0. 0. 0. 0.] [ 0. 0. 0. 0. 0.] [ 0. 0. 0. 0. 0.]] [[ 0. 0. 0. 0. 0.] [ 0. 0. 0. 0. 0.] [ 0. 0. 0. 0. 0.]]]
Each 2-D "matrix" printed represents on horizontal slice of the 3-D room--we will call it top-to-bottom. First thing, let us raise the whole room temperature to a comfortable 70 degree Farenheit:
>>> room += 70 >>> print room [[[ 70. 70. 70. 70. 70.] [ 70. 70. 70. 70. 70.] [ 70. 70. 70. 70. 70.]] [[ 70. 70. 70. 70. 70.] [ 70. 70. 70. 70. 70.] [ 70. 70. 70. 70. 70.]] [[ 70. 70. 70. 70. 70.] [ 70. 70. 70. 70. 70.] [ 70. 70. 70. 70. 70.]] [[ 70. 70. 70. 70. 70.] [ 70. 70. 70. 70. 70.] [ 70. 70. 70. 70. 70.]]]
There is a very important difference between Numerical arrays and Python lists to keep in mind for the rest of our operations. When you take slices of arrays--and we will see that slicing is quite flexible and powerful for multiple dimensions--what you get is not a copy, but a "view." There are multiple ways to point to the same data. Let us take a look. Suppose our room has a draft that comes in and cools the floor by four degrees:
>>> floor = room[3] >>> floor -= 4 >>> print room [[[ 70. 70. 70. 70. 70.] [ 70. 70. 70. 70. 70.] [ 70. 70. 70. 70. 70.]] [[ 70. 70. 70. 70. 70.] [ 70. 70. 70. 70. 70.] [ 70. 70. 70. 70. 70.]] [[ 70. 70. 70. 70. 70.] [ 70. 70. 70. 70. 70.] [ 70. 70. 70. 70. 70.]] [[ 66. 66. 66. 66. 66.] [ 66. 66. 66. 66. 66.] [ 66. 66. 66. 66. 66.]]]
In contrast, the fireplace on the north wall heats each position next to it by 8 degrees, while itself being an ambient 90 degrees:
>>> north = room[:,0] >>> near_fireplace = north[2:4,2:5] >>> near_fireplace += 8 >>> north[3,2] = 90 # the fireplace cell itself >>> print room [[[ 70. 70. 70. 70. 70.] [ 70. 70. 70. 70. 70.] [ 70. 70. 70. 70. 70.]] [[ 70. 70. 70. 70. 70.] [ 70. 70. 70. 70. 70.] [ 70. 70. 70. 70. 70.]] [[ 70. 78. 78. 78. 70.] [ 70. 70. 70. 70. 70.] [ 70. 70. 70. 70. 70.]] [[ 66. 74. 90. 74. 66.] [ 66. 66. 66. 66. 66.] [ 66. 66. 66. 66. 66.]]]
Here we get into some tricker indexing, where slices can be indicated along multiple dimensions. These views are nice to keep around. For example, you might wonder what the current temperatures are over the north wall:
>>> print north [[ 70. 70. 70. 70. 70.] [ 70. 70. 70. 70. 70.] [ 70. 78. 78. 78. 70.] [ 66. 74. 90. 74. 66.]]
This introductory look will only a small part of all the convenience
functions and array methods/attributes that are available in
numarray
. My hope is only to give reader a general feel; the
numarray
documentation is quite excellent.
Now that our room has been subject to various temperature fluctuation, we might want to judge its overall state. For example, what is the current average temperature in the room:
>>> add.reduce(room.flat)/len(room.flat) 70.066666666666663
This merits a bit of explanation. All the operations you can perform
on arrays have corresponding Universal Functions (ufuncs). So where
we had done something like floor-= 4
in the above code, we could
have equally well used subtract(floor,4,floor)
. With three
arguments, to substract()
, the operation is performed in-place. You
could also create a copy of floor
using floor=subtract(floor,4)
;
but this is probably not what you want, since the change will be in a
new array, not a subsection of room
.
However, ufuncs are not merely functions. They are callable objects
that also have some methods of their own: .reduce()
is probably the
most useful of these. It works just like the Python built-in function
reduce()
, with each operation being the underlying ufunc (the
methods are much faster though when applied to Numerical arrays). In
other words add.reduce()
means sum()
and multiply.reduce()
means
product()
.(those shortcut names are defined too).
Before you can sum the room cell temperatures, you need to take a flat, 1-D, view of the data. Otherwise, you add along the primary dimension, producing an new array of reduced dimension, e.g.:
>>> add.reduce(room) array([[ 276., 292., 308., 292., 276.], [ 276., 276., 276., 276., 276.], [ 276., 276., 276., 276., 276.]])
Such dimensional summary can be useful, but it is not what we want in this case.
Since we are modeling a physical system, let us add a bit more realism. Rooms have air micro-currents in them that mix up temperature. We might model that by supposing that every once in a while, a cell adjusts to the temperatures around it:
>>> def equalize(room): ... z,y,x = map(randint, (1,1,1), room.shape) ... zmin,ymin,xmin = maximum([z-2,y-2,x-2],[0,0,0]).tolist() ... zmax,ymax,xmax = [z+1,y+1,x+1] ... region = room[zmin:zmax,ymin:ymax,xmin:xmax].copy() ... room[z-1,y-1,x-1] = sum(region.flat)/len(region.flat) ... return room
This model is a bit unrealistic, of course: cells pick up on the
temperature around them, but do not impart a differential back to
their neighbors. Still, let us look at what it does. First we pick a
random cell--or actually we pick one more than the cell itself in each
dimension, since .shape
gives us the length rather than the maximum
index. The value zmin
and friends make sure we only index down to
zero, not in negative ranges; zmax
and friends are not really
needed, since indexing past the size of an array dimension is just
treated as the maximum value (much as with Python lists).
Next thing, we need to define the region of adjacent cells. Given our
small room dimensions, picking a cell at a room surface, edge, or
corner is rather common--the region
of a cell is likely to be
smaller than the maximum 27-elements of a 3x3x3 subsection. That is
not a problem, we just need to be sure to use the right denominator
for our average. This new average temperature is assigned to the
randomly chosen cell.
You can perform as many equalization events as you wish to in your
model. Each call just adjusts one cell. Asymptotically, a lot of
calls will lead to equalization of the temperature in parts of the
room. Even though arrays are mutable, the function equalize()
returns its array. This is useful if you only want to equalize a
copy of the model:
>>> print equalize(room.copy()) [[[ 70. 70. 70. 70. 70. ] [ 70. 70. 70. 70. 70. ] [ 70. 70. 70. 70. 70. ]] [[ 70. 70. 71.333333 70. 70. ] [ 70. 70. 70. 70. 70. ] [ 70. 70. 70. 70. 70. ]] [[ 70. 78. 78. 78. 70. ] [ 70. 70. 70. 70. 70. ] [ 70. 70. 70. 70. 70. ]] [[ 66. 74. 90. 74. 66. ] [ 66. 66. 66. 66. 66. ] [ 66. 66. 66. 68. 66. ]]]
This introduction touched on only a bit of what numarry
can do. The
full capabilities are quite fancy. For example, you can populate
arrays based on populating functions, which is likely to be useful for
physical models. You can specify subsections of arrays not only with
slices, but also with index arrays--this not only lets you operate on
non-contiguous segments of an array, it also lets you--via the
take()
function, redimension and reform arrays in interesting ways.
Moreover, for the most part, I have presented only operations between
scalars and arrays; you can also perform operations between arrays,
including ones of different dimensionality. There is a lot to it, but
the API manages to remain fairly intuitive throughout.
I encourage readers to install numarray
and/or numeric
on their
own systems. It is not hard to get started, and the fast
manipulations provided have a surprisingly large area of
application--often where you would not, at first, expect it.
The Numerical Python homepage is at the below URL. This has good
links to documentation on both numarray
and numeric
.
http://www.pfdubois.com/numpy/
The project itself is hosted by Sourceforge, and you can find the Numerical Python project page at:
http://sourceforge.net/projects/numpy
While David Mertz also likes laziness and impatience, this installment is about hubris. David may be reached at [email protected]; his life pored over athttp://gnosis.cx/publish/. Suggestions and recommendations on this, past, or future, columns are welcomed. Check out his book, Text Processing in Python at <http://gnosis.cx/TPiP>.