CHARMING 2 PYTHON #B14: Numerical Python
Working with the [numeric] and [numarray] packages
David Mertz, Ph.D.
Idempotentate, Gnosis Software, Inc.
September, 2003
Numerical Python (often called NumPy) is a widely used extension
library for fast operations on fixed-type arrays, of any
dimensionality, in Python. Since the underlying code is
well-optimized C, any speed limitations of Python's interpreter
usually go away when major operations are performed in NumPy calls.
As successful as NumPy has been, its developers have decided to
supercede NumPy with a new module called Numarray that is mostly,
but not quite entirely, compatible with NumPy. This installment
looks both at the general features of NumPy, and at the specific
improvements forthcoming with Numarray.
INTRODUCTION
------------------------------------------------------------------------
The first thing to know about the Numerical Python package is that it
doesn't let you do anything you cannot already do with standard
Python. What it does, is let you do many of the same things -a heck
of a lot faster-. Actually, that's not quite all; there are a number
of operations on arrays that are much more elegant to express in
[numeric] or [numarray] than they are with standard Python datatypes
and syntax. But mostly it is the impressive speed that draws users to
Numerical Python.
In essence, Numerical Python just gives you a new datatype, the array.
Numerical arrays contain only elements of a homogeneous datatype, in
contrast to to lists, tuples, and dictionaries which may contain
heterogeneous elements. The other useful thing about Numerical arrays
is that they may be multi-dimensional--but the dimensionality of
arrays is a bit different that the simple nestability of lists.
Numerical Python draws on the experience of programmers--particularly
those from scientific backgrounds--who have abstracted the best
features of arrays languages like APL, FORTRAN, MATLAB, and S, and
created arrays whose shape and dimensionality is easily changeable.
Back to this soon.
Operations on arrays in Numerical Python are performed -elementwise-.
Even though two dimensional arrays are similar to matrices from linear
algebra, operations (like multiply) have nothing to do with the
operations in linear algebra (like matrix multiplication).
Let us look at a quick example of all these points. In plain Python,
you could create a "2-D list" with, e.g.:
>>> pyarr = [[1,2,3],
... [4,5,6],
... [7,8,9]]
>>> print pyarr
[[1, 2, 3], [4, 5, 6], [7, 8, 9]]
>>> pyarr[1][1] = 0
>>> print pyarr
[[1, 2, 3], [4, 0, 6], [7, 8, 9]]
Good enough, but about all you can do with this structure is set and
retrieve elements by individual (multi-dimesional) index. In
contrast, a Numerical array is a bit more flexible:
>>> from numarray import *
>>> numarr = array(pyarr)
>>> print numarr
[[1 2 3]
[4 0 6]
[7 8 9]]
Not much has changed, but what about operations?
>>> numarr2 = numarr * 2
>>> print numarr2
[[ 2 4 6]
[ 8 0 12]
[14 16 18]]
>>> print numarr2 + numarr
[[ 3 6 9]
[12 0 18]
[21 24 27]]
And changing the shape:
>>> numarr2.shape = (9,)
>>> print numarr2
[ 2 4 6 8 0 12 14 16 18]
CHANGES BETWEEN NUMERIC AND NUMARRAY:
------------------------------------------------------------------------
For the most part, the newer package [numarray] is API compatible
with the earlier [numeric]. However, experience using [numeric] by
many users prompted developers to make a few incompatible
improvements. Instead of breaking the any application that relies on
[numeric], developers launched a new project called [numarray]. As of
this writing, [numarray] lacks a few features in [numeric], but
implementation of most is planned.
Some of the improvements for [numarray] are:
(1) Element types are arranged in a hierarchical class structure to
support isinstance() checks. [numeric] used only character type codes
when specifying data types (but initializers in [numarray] can still
accept the old character codes).
(2) The type coercion rules have changed to preserve types in arrays
(more often), rather than converting to the type of a Python scalar.
(3) Added array attributes (not just getters and setters).
(4) More flexible exception handling.
New users probably do not to worry about any of these changes; it is
best to start out with [numarray] at this point, in most cases.
TIMING EXAMPLE
------------------------------------------------------------------------
Let us try to get a sense of just how much faster operations in
Numerical Python can be than standard Python. As a toy task, we will
create a sequence of numbers, then double each one. First, a few
variants on standard Python approaches:
def timer(fun, n, comment=""):
from time import clock
start = clock()
print comment, len(fun(n)), "elements",
print "in %.2f seconds" % (clock()-start)
def double1(n): return map(lambda n: 2*n, xrange(n))
timer(double1, 5000000, "Running map() on xrange iterator:")
def double2(n): return [2*n for n in xrange(n)]
timer(double2, 5000000, "Running listcomp on xrange iter: ")
def double3(n):
double = []
for n in xrange(n):
double.append(2*n)
return double
timer(double3, 5000000, "Building new list from iterator: ")
We can check whether any speed differences exist between 'map()', list
comprehensions, and traditional loops. What about standard module
[array], which also requires homogeneous element types, perhaps it
will run faster:
import array
def double4(n): return [2*n for n in array.array('i',range(n))]
timer(double4, 5000000, "Running listcomp on array.array: ")
Finally, let us look at how [numarray] does. As an extra check, we
should see whether we still benefit if it is necessary to convert an
array back into a standard list:
from numarray import *
def double5(n): return 2*arange(n)
timer(double5, 5000000, "Numarray scalar multiplication: ")
def double6(n): return (2*arange(n)).tolist()
timer(double6, 5000000, "Numarray mult, returning list: ")
Now running it:
$ python2.3 timing.py
Running map() on xrange iterator: 5000000 elements in 13.61 seconds
Running listcomp on xrange iter: 5000000 elements in 16.46 seconds
Building new list from iterator: 5000000 elements in 20.13 seconds
Running listcomp on array.array: 5000000 elements in 25.58 seconds
Numarray scalar multiplication: 5000000 elements in 0.61 seconds
Numarray mult, returning list: 5000000 elements in 3.70 seconds
The minor differences between techniques for working with lists is
perhaps worth noticing, as is the misstep in trying standard module
[array]. But the general moral is that [numarray] performs the
operation in less than 1/20th the time. Coercing back to a standard
list loses a lot of that gain.
Do not put too much weight in such a simple comparison, but the
speedup is probably about typical. For big scientific calculations,
moving from months to days--or days to hours--become quite worthwhile.
MODELING SYSTEMS
------------------------------------------------------------------------
The typical use case for Numerical Python is in scientific modeling,
or perhaps in related areas like graphic manipulations and
convolutions or signal processing. To illustrate a number of features
of [numarray], I will use a moderately realistic problem. Suppose you
have a 3D physical space over which a quantity varies. Abstractly, any
parametric space, with any number of dimensions, is equally applicable
to [numarray]. But it is familiar and easy to imagine, for example, a
room in which temperature varies from point to point. As winter comes
to my New England home, the example starts to seem germane.
While this point is perhaps obvious, it is worth mentioning
explicitly. My below examples work with smallish arrays because it is
easier to illustrate with them. But [numarray] remains fast for
arrays with millions of elements, rather than merely dozens; the
former are probably more common for real scientific models.
As a start, let us create the "room". There are a number of ways to
do this--the most general is with the callable 'array()'. Using that,
we can generate Numerical arrays with a variety of initialization
parameters, including initial data from any Python sequences. But for
our room, the convenience function 'zeros()' produces a nice--evenly
chilly--room:
>>> from numarray import *
>>> room = zeros((4,3,5),Float)
>>> print room
[[[ 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0.]]
[[ 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0.]]
[[ 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0.]]
[[ 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0.]]]
Each 2-D "matrix" printed represents on horizontal slice of the
3-D room--we will call it top-to-bottom. First thing, let us raise
the whole room temperature to a comfortable 70 degree Farenheit:
>>> room += 70
>>> print room
[[[ 70. 70. 70. 70. 70.]
[ 70. 70. 70. 70. 70.]
[ 70. 70. 70. 70. 70.]]
[[ 70. 70. 70. 70. 70.]
[ 70. 70. 70. 70. 70.]
[ 70. 70. 70. 70. 70.]]
[[ 70. 70. 70. 70. 70.]
[ 70. 70. 70. 70. 70.]
[ 70. 70. 70. 70. 70.]]
[[ 70. 70. 70. 70. 70.]
[ 70. 70. 70. 70. 70.]
[ 70. 70. 70. 70. 70.]]]
There is a very important difference between Numerical arrays and
Python lists to keep in mind for the rest of our operations. When you
take slices of arrays--and we will see that slicing is quite flexible
and powerful for multiple dimensions--what you get is not a copy, but
a "view." There are multiple ways to point to the same data. Let us
take a look. Suppose our room has a draft that comes in and cools the
floor by four degrees:
>>> floor = room[3]
>>> floor -= 4
>>> print room
[[[ 70. 70. 70. 70. 70.]
[ 70. 70. 70. 70. 70.]
[ 70. 70. 70. 70. 70.]]
[[ 70. 70. 70. 70. 70.]
[ 70. 70. 70. 70. 70.]
[ 70. 70. 70. 70. 70.]]
[[ 70. 70. 70. 70. 70.]
[ 70. 70. 70. 70. 70.]
[ 70. 70. 70. 70. 70.]]
[[ 66. 66. 66. 66. 66.]
[ 66. 66. 66. 66. 66.]
[ 66. 66. 66. 66. 66.]]]
In contrast, the fireplace on the north wall heats each position next
to it by 8 degrees, while itself being an ambient 90 degrees:
>>> north = room[:,0]
>>> near_fireplace = north[2:4,2:5]
>>> near_fireplace += 8
>>> north[3,2] = 90 # the fireplace cell itself
>>> print room
[[[ 70. 70. 70. 70. 70.]
[ 70. 70. 70. 70. 70.]
[ 70. 70. 70. 70. 70.]]
[[ 70. 70. 70. 70. 70.]
[ 70. 70. 70. 70. 70.]
[ 70. 70. 70. 70. 70.]]
[[ 70. 78. 78. 78. 70.]
[ 70. 70. 70. 70. 70.]
[ 70. 70. 70. 70. 70.]]
[[ 66. 74. 90. 74. 66.]
[ 66. 66. 66. 66. 66.]
[ 66. 66. 66. 66. 66.]]]
Here we get into some tricker indexing, where slices can be indicated
along multiple dimensions. These views are nice to keep around. For
example, you might wonder what the current temperatures are over the
north wall:
>>> print north
[[ 70. 70. 70. 70. 70.]
[ 70. 70. 70. 70. 70.]
[ 70. 78. 78. 78. 70.]
[ 66. 74. 90. 74. 66.]]
MORE OPERATIONS
------------------------------------------------------------------------
This introductory look will only a small part of all the convenience
functions and array methods/attributes that are available in
[numarray]. My hope is only to give reader a general feel; the
[numarray] documentation is quite excellent.
Now that our room has been subject to various temperature fluctuation,
we might want to judge its overall state. For example, what is the
current average temperature in the room:
>>> add.reduce(room.flat)/len(room.flat)
70.066666666666663
This merits a bit of explanation. All the operations you can perform
on arrays have corresponding -Universal Functions- (ufuncs). So where
we had done something like 'floor-= 4' in the above code, we could
have equally well used 'subtract(floor,4,floor)'. With three
arguments, to 'substract()', the operation is performed in-place. You
could also create a copy of 'floor' using 'floor=subtract(floor,4)';
but this is probably not what you want, since the change will be in a
new array, not a subsection of 'room'.
However, ufuncs are not merely functions. They are callable objects
that also have some methods of their own: '.reduce()' is probably the
most useful of these. It works just like the Python built-in function
'reduce()', with each operation being the underlying ufunc (the
methods are much faster though when applied to Numerical arrays). In
other words 'add.reduce()' means 'sum()' and 'multiply.reduce()' means
'product()'.(those shortcut names are defined too).
Before you can sum the room cell temperatures, you need to take a
flat, 1-D, view of the data. Otherwise, you add along the primary
dimension, producing an new array of reduced dimension, e.g.:
>>> add.reduce(room)
array([[ 276., 292., 308., 292., 276.],
[ 276., 276., 276., 276., 276.],
[ 276., 276., 276., 276., 276.]])
Such dimensional summary can be useful, but it is not what we want in
this case.
Since we are modeling a physical system, let us add a bit more
realism. Rooms have air micro-currents in them that mix up
temperature. We might model that by supposing that every once in a
while, a cell adjusts to the temperatures around it:
>>> def equalize(room):
... z,y,x = map(randint, (1,1,1), room.shape)
... zmin,ymin,xmin = maximum([z-2,y-2,x-2],[0,0,0]).tolist()
... zmax,ymax,xmax = [z+1,y+1,x+1]
... region = room[zmin:zmax,ymin:ymax,xmin:xmax].copy()
... room[z-1,y-1,x-1] = sum(region.flat)/len(region.flat)
... return room
This model is a bit unrealistic, of course: cells pick up on the
temperature around them, but do not impart a differential back to
their neighbors. Still, let us look at what it does. First we pick a
random cell--or actually we pick one more than the cell itself in each
dimension, since '.shape' gives us the length rather than the maximum
index. The value 'zmin' and friends make sure we only index down to
zero, not in negative ranges; 'zmax' and friends are not really
needed, since indexing past the size of an array dimension is just
treated as the maximum value (much as with Python lists).
Next thing, we need to define the region of adjacent cells. Given our
small room dimensions, picking a cell at a room surface, edge, or
corner is rather common--the 'region' of a cell is likely to be
smaller than the maximum 27-elements of a 3x3x3 subsection. That is
not a problem, we just need to be sure to use the right denominator
for our average. This new average temperature is assigned to the
randomly chosen cell.
You can perform as many equalization events as you wish to in your
model. Each call just adjusts one cell. Asymptotically, a lot of
calls will lead to equalization of the temperature in parts of the
room. Even though arrays are mutable, the function 'equalize()'
returns its array. This is useful if you only want to equalize a
-copy- of the model:
>>> print equalize(room.copy())
[[[ 70. 70. 70. 70. 70. ]
[ 70. 70. 70. 70. 70. ]
[ 70. 70. 70. 70. 70. ]]
[[ 70. 70. 71.333333 70. 70. ]
[ 70. 70. 70. 70. 70. ]
[ 70. 70. 70. 70. 70. ]]
[[ 70. 78. 78. 78. 70. ]
[ 70. 70. 70. 70. 70. ]
[ 70. 70. 70. 70. 70. ]]
[[ 66. 74. 90. 74. 66. ]
[ 66. 66. 66. 66. 66. ]
[ 66. 66. 66. 68. 66. ]]]
CONCLUSION
------------------------------------------------------------------------
This introduction touched on only a bit of what [numarry] can do. The
full capabilities are quite fancy. For example, you can populate
arrays based on populating functions, which is likely to be useful for
physical models. You can specify subsections of arrays not only with
slices, but also with index arrays--this not only lets you operate on
non-contiguous segments of an array, it also lets you--via the
'take()' function, redimension and reform arrays in interesting ways.
Moreover, for the most part, I have presented only operations between
scalars and arrays; you can also perform operations between arrays,
including ones of different dimensionality. There is a lot to it, but
the API manages to remain fairly intuitive throughout.
I encourage readers to install [numarray] and/or [numeric] on their
own systems. It is not hard to get started, and the fast
manipulations provided have a surprisingly large area of
application--often where you would not, at first, expect it.
RESOURCES
------------------------------------------------------------------------
The Numerical Python homepage is at the below URL. This has good
links to documentation on both [numarray] and [numeric].
http://www.pfdubois.com/numpy/
The project itself is hosted by Sourceforge, and you can find the
Numerical Python project page at:
http://sourceforge.net/projects/numpy
ABOUT THE AUTHOR
------------------------------------------------------------------------
{Picture of Author: http://gnosis.cx/cgi-bin/img_dqm.cgi}
While David Mertz also likes laziness and impatience, this
installment is about hubris. David may be reached at
mertz@gnosis.cx; his life pored over at http://gnosis.cx/publish/.
Suggestions and recommendations on this, past, or future, columns
are welcomed. Check out his book, _Text Processing in Python_ at
.