CHARMING 2 PYTHON #B14: Numerical Python Working with the [numeric] and [numarray] packages David Mertz, Ph.D. Idempotentate, Gnosis Software, Inc. September, 2003 Numerical Python (often called NumPy) is a widely used extension library for fast operations on fixed-type arrays, of any dimensionality, in Python. Since the underlying code is well-optimized C, any speed limitations of Python's interpreter usually go away when major operations are performed in NumPy calls. As successful as NumPy has been, its developers have decided to supercede NumPy with a new module called Numarray that is mostly, but not quite entirely, compatible with NumPy. This installment looks both at the general features of NumPy, and at the specific improvements forthcoming with Numarray. INTRODUCTION ------------------------------------------------------------------------ The first thing to know about the Numerical Python package is that it doesn't let you do anything you cannot already do with standard Python. What it does, is let you do many of the same things -a heck of a lot faster-. Actually, that's not quite all; there are a number of operations on arrays that are much more elegant to express in [numeric] or [numarray] than they are with standard Python datatypes and syntax. But mostly it is the impressive speed that draws users to Numerical Python. In essence, Numerical Python just gives you a new datatype, the array. Numerical arrays contain only elements of a homogeneous datatype, in contrast to to lists, tuples, and dictionaries which may contain heterogeneous elements. The other useful thing about Numerical arrays is that they may be multi-dimensional--but the dimensionality of arrays is a bit different that the simple nestability of lists. Numerical Python draws on the experience of programmers--particularly those from scientific backgrounds--who have abstracted the best features of arrays languages like APL, FORTRAN, MATLAB, and S, and created arrays whose shape and dimensionality is easily changeable. Back to this soon. Operations on arrays in Numerical Python are performed -elementwise-. Even though two dimensional arrays are similar to matrices from linear algebra, operations (like multiply) have nothing to do with the operations in linear algebra (like matrix multiplication). Let us look at a quick example of all these points. In plain Python, you could create a "2-D list" with, e.g.: >>> pyarr = [[1,2,3], ... [4,5,6], ... [7,8,9]] >>> print pyarr [[1, 2, 3], [4, 5, 6], [7, 8, 9]] >>> pyarr[1][1] = 0 >>> print pyarr [[1, 2, 3], [4, 0, 6], [7, 8, 9]] Good enough, but about all you can do with this structure is set and retrieve elements by individual (multi-dimesional) index. In contrast, a Numerical array is a bit more flexible: >>> from numarray import * >>> numarr = array(pyarr) >>> print numarr [[1 2 3] [4 0 6] [7 8 9]] Not much has changed, but what about operations? >>> numarr2 = numarr * 2 >>> print numarr2 [[ 2 4 6] [ 8 0 12] [14 16 18]] >>> print numarr2 + numarr [[ 3 6 9] [12 0 18] [21 24 27]] And changing the shape: >>> numarr2.shape = (9,) >>> print numarr2 [ 2 4 6 8 0 12 14 16 18] CHANGES BETWEEN NUMERIC AND NUMARRAY: ------------------------------------------------------------------------ For the most part, the newer package [numarray] is API compatible with the earlier [numeric]. However, experience using [numeric] by many users prompted developers to make a few incompatible improvements. Instead of breaking the any application that relies on [numeric], developers launched a new project called [numarray]. As of this writing, [numarray] lacks a few features in [numeric], but implementation of most is planned. Some of the improvements for [numarray] are: (1) Element types are arranged in a hierarchical class structure to support isinstance() checks. [numeric] used only character type codes when specifying data types (but initializers in [numarray] can still accept the old character codes). (2) The type coercion rules have changed to preserve types in arrays (more often), rather than converting to the type of a Python scalar. (3) Added array attributes (not just getters and setters). (4) More flexible exception handling. New users probably do not to worry about any of these changes; it is best to start out with [numarray] at this point, in most cases. TIMING EXAMPLE ------------------------------------------------------------------------ Let us try to get a sense of just how much faster operations in Numerical Python can be than standard Python. As a toy task, we will create a sequence of numbers, then double each one. First, a few variants on standard Python approaches: def timer(fun, n, comment=""): from time import clock start = clock() print comment, len(fun(n)), "elements", print "in %.2f seconds" % (clock()-start) def double1(n): return map(lambda n: 2*n, xrange(n)) timer(double1, 5000000, "Running map() on xrange iterator:") def double2(n): return [2*n for n in xrange(n)] timer(double2, 5000000, "Running listcomp on xrange iter: ") def double3(n): double = [] for n in xrange(n): double.append(2*n) return double timer(double3, 5000000, "Building new list from iterator: ") We can check whether any speed differences exist between 'map()', list comprehensions, and traditional loops. What about standard module [array], which also requires homogeneous element types, perhaps it will run faster: import array def double4(n): return [2*n for n in array.array('i',range(n))] timer(double4, 5000000, "Running listcomp on array.array: ") Finally, let us look at how [numarray] does. As an extra check, we should see whether we still benefit if it is necessary to convert an array back into a standard list: from numarray import * def double5(n): return 2*arange(n) timer(double5, 5000000, "Numarray scalar multiplication: ") def double6(n): return (2*arange(n)).tolist() timer(double6, 5000000, "Numarray mult, returning list: ") Now running it: $ python2.3 timing.py Running map() on xrange iterator: 5000000 elements in 13.61 seconds Running listcomp on xrange iter: 5000000 elements in 16.46 seconds Building new list from iterator: 5000000 elements in 20.13 seconds Running listcomp on array.array: 5000000 elements in 25.58 seconds Numarray scalar multiplication: 5000000 elements in 0.61 seconds Numarray mult, returning list: 5000000 elements in 3.70 seconds The minor differences between techniques for working with lists is perhaps worth noticing, as is the misstep in trying standard module [array]. But the general moral is that [numarray] performs the operation in less than 1/20th the time. Coercing back to a standard list loses a lot of that gain. Do not put too much weight in such a simple comparison, but the speedup is probably about typical. For big scientific calculations, moving from months to days--or days to hours--become quite worthwhile. MODELING SYSTEMS ------------------------------------------------------------------------ The typical use case for Numerical Python is in scientific modeling, or perhaps in related areas like graphic manipulations and convolutions or signal processing. To illustrate a number of features of [numarray], I will use a moderately realistic problem. Suppose you have a 3D physical space over which a quantity varies. Abstractly, any parametric space, with any number of dimensions, is equally applicable to [numarray]. But it is familiar and easy to imagine, for example, a room in which temperature varies from point to point. As winter comes to my New England home, the example starts to seem germane. While this point is perhaps obvious, it is worth mentioning explicitly. My below examples work with smallish arrays because it is easier to illustrate with them. But [numarray] remains fast for arrays with millions of elements, rather than merely dozens; the former are probably more common for real scientific models. As a start, let us create the "room". There are a number of ways to do this--the most general is with the callable 'array()'. Using that, we can generate Numerical arrays with a variety of initialization parameters, including initial data from any Python sequences. But for our room, the convenience function 'zeros()' produces a nice--evenly chilly--room: >>> from numarray import * >>> room = zeros((4,3,5),Float) >>> print room [[[ 0. 0. 0. 0. 0.] [ 0. 0. 0. 0. 0.] [ 0. 0. 0. 0. 0.]] [[ 0. 0. 0. 0. 0.] [ 0. 0. 0. 0. 0.] [ 0. 0. 0. 0. 0.]] [[ 0. 0. 0. 0. 0.] [ 0. 0. 0. 0. 0.] [ 0. 0. 0. 0. 0.]] [[ 0. 0. 0. 0. 0.] [ 0. 0. 0. 0. 0.] [ 0. 0. 0. 0. 0.]]] Each 2-D "matrix" printed represents on horizontal slice of the 3-D room--we will call it top-to-bottom. First thing, let us raise the whole room temperature to a comfortable 70 degree Farenheit: >>> room += 70 >>> print room [[[ 70. 70. 70. 70. 70.] [ 70. 70. 70. 70. 70.] [ 70. 70. 70. 70. 70.]] [[ 70. 70. 70. 70. 70.] [ 70. 70. 70. 70. 70.] [ 70. 70. 70. 70. 70.]] [[ 70. 70. 70. 70. 70.] [ 70. 70. 70. 70. 70.] [ 70. 70. 70. 70. 70.]] [[ 70. 70. 70. 70. 70.] [ 70. 70. 70. 70. 70.] [ 70. 70. 70. 70. 70.]]] There is a very important difference between Numerical arrays and Python lists to keep in mind for the rest of our operations. When you take slices of arrays--and we will see that slicing is quite flexible and powerful for multiple dimensions--what you get is not a copy, but a "view." There are multiple ways to point to the same data. Let us take a look. Suppose our room has a draft that comes in and cools the floor by four degrees: >>> floor = room[3] >>> floor -= 4 >>> print room [[[ 70. 70. 70. 70. 70.] [ 70. 70. 70. 70. 70.] [ 70. 70. 70. 70. 70.]] [[ 70. 70. 70. 70. 70.] [ 70. 70. 70. 70. 70.] [ 70. 70. 70. 70. 70.]] [[ 70. 70. 70. 70. 70.] [ 70. 70. 70. 70. 70.] [ 70. 70. 70. 70. 70.]] [[ 66. 66. 66. 66. 66.] [ 66. 66. 66. 66. 66.] [ 66. 66. 66. 66. 66.]]] In contrast, the fireplace on the north wall heats each position next to it by 8 degrees, while itself being an ambient 90 degrees: >>> north = room[:,0] >>> near_fireplace = north[2:4,2:5] >>> near_fireplace += 8 >>> north[3,2] = 90 # the fireplace cell itself >>> print room [[[ 70. 70. 70. 70. 70.] [ 70. 70. 70. 70. 70.] [ 70. 70. 70. 70. 70.]] [[ 70. 70. 70. 70. 70.] [ 70. 70. 70. 70. 70.] [ 70. 70. 70. 70. 70.]] [[ 70. 78. 78. 78. 70.] [ 70. 70. 70. 70. 70.] [ 70. 70. 70. 70. 70.]] [[ 66. 74. 90. 74. 66.] [ 66. 66. 66. 66. 66.] [ 66. 66. 66. 66. 66.]]] Here we get into some tricker indexing, where slices can be indicated along multiple dimensions. These views are nice to keep around. For example, you might wonder what the current temperatures are over the north wall: >>> print north [[ 70. 70. 70. 70. 70.] [ 70. 70. 70. 70. 70.] [ 70. 78. 78. 78. 70.] [ 66. 74. 90. 74. 66.]] MORE OPERATIONS ------------------------------------------------------------------------ This introductory look will only a small part of all the convenience functions and array methods/attributes that are available in [numarray]. My hope is only to give reader a general feel; the [numarray] documentation is quite excellent. Now that our room has been subject to various temperature fluctuation, we might want to judge its overall state. For example, what is the current average temperature in the room: >>> add.reduce(room.flat)/len(room.flat) 70.066666666666663 This merits a bit of explanation. All the operations you can perform on arrays have corresponding -Universal Functions- (ufuncs). So where we had done something like 'floor-= 4' in the above code, we could have equally well used 'subtract(floor,4,floor)'. With three arguments, to 'substract()', the operation is performed in-place. You could also create a copy of 'floor' using 'floor=subtract(floor,4)'; but this is probably not what you want, since the change will be in a new array, not a subsection of 'room'. However, ufuncs are not merely functions. They are callable objects that also have some methods of their own: '.reduce()' is probably the most useful of these. It works just like the Python built-in function 'reduce()', with each operation being the underlying ufunc (the methods are much faster though when applied to Numerical arrays). In other words 'add.reduce()' means 'sum()' and 'multiply.reduce()' means 'product()'.(those shortcut names are defined too). Before you can sum the room cell temperatures, you need to take a flat, 1-D, view of the data. Otherwise, you add along the primary dimension, producing an new array of reduced dimension, e.g.: >>> add.reduce(room) array([[ 276., 292., 308., 292., 276.], [ 276., 276., 276., 276., 276.], [ 276., 276., 276., 276., 276.]]) Such dimensional summary can be useful, but it is not what we want in this case. Since we are modeling a physical system, let us add a bit more realism. Rooms have air micro-currents in them that mix up temperature. We might model that by supposing that every once in a while, a cell adjusts to the temperatures around it: >>> def equalize(room): ... z,y,x = map(randint, (1,1,1), room.shape) ... zmin,ymin,xmin = maximum([z-2,y-2,x-2],[0,0,0]).tolist() ... zmax,ymax,xmax = [z+1,y+1,x+1] ... region = room[zmin:zmax,ymin:ymax,xmin:xmax].copy() ... room[z-1,y-1,x-1] = sum(region.flat)/len(region.flat) ... return room This model is a bit unrealistic, of course: cells pick up on the temperature around them, but do not impart a differential back to their neighbors. Still, let us look at what it does. First we pick a random cell--or actually we pick one more than the cell itself in each dimension, since '.shape' gives us the length rather than the maximum index. The value 'zmin' and friends make sure we only index down to zero, not in negative ranges; 'zmax' and friends are not really needed, since indexing past the size of an array dimension is just treated as the maximum value (much as with Python lists). Next thing, we need to define the region of adjacent cells. Given our small room dimensions, picking a cell at a room surface, edge, or corner is rather common--the 'region' of a cell is likely to be smaller than the maximum 27-elements of a 3x3x3 subsection. That is not a problem, we just need to be sure to use the right denominator for our average. This new average temperature is assigned to the randomly chosen cell. You can perform as many equalization events as you wish to in your model. Each call just adjusts one cell. Asymptotically, a lot of calls will lead to equalization of the temperature in parts of the room. Even though arrays are mutable, the function 'equalize()' returns its array. This is useful if you only want to equalize a -copy- of the model: >>> print equalize(room.copy()) [[[ 70. 70. 70. 70. 70. ] [ 70. 70. 70. 70. 70. ] [ 70. 70. 70. 70. 70. ]] [[ 70. 70. 71.333333 70. 70. ] [ 70. 70. 70. 70. 70. ] [ 70. 70. 70. 70. 70. ]] [[ 70. 78. 78. 78. 70. ] [ 70. 70. 70. 70. 70. ] [ 70. 70. 70. 70. 70. ]] [[ 66. 74. 90. 74. 66. ] [ 66. 66. 66. 66. 66. ] [ 66. 66. 66. 68. 66. ]]] CONCLUSION ------------------------------------------------------------------------ This introduction touched on only a bit of what [numarry] can do. The full capabilities are quite fancy. For example, you can populate arrays based on populating functions, which is likely to be useful for physical models. You can specify subsections of arrays not only with slices, but also with index arrays--this not only lets you operate on non-contiguous segments of an array, it also lets you--via the 'take()' function, redimension and reform arrays in interesting ways. Moreover, for the most part, I have presented only operations between scalars and arrays; you can also perform operations between arrays, including ones of different dimensionality. There is a lot to it, but the API manages to remain fairly intuitive throughout. I encourage readers to install [numarray] and/or [numeric] on their own systems. It is not hard to get started, and the fast manipulations provided have a surprisingly large area of application--often where you would not, at first, expect it. RESOURCES ------------------------------------------------------------------------ The Numerical Python homepage is at the below URL. This has good links to documentation on both [numarray] and [numeric]. http://www.pfdubois.com/numpy/ The project itself is hosted by Sourceforge, and you can find the Numerical Python project page at: http://sourceforge.net/projects/numpy ABOUT THE AUTHOR ------------------------------------------------------------------------ {Picture of Author: http://gnosis.cx/cgi-bin/img_dqm.cgi} While David Mertz also likes laziness and impatience, this installment is about hubris. David may be reached at mertz@gnosis.cx; his life pored over at http://gnosis.cx/publish/. Suggestions and recommendations on this, past, or future, columns are welcomed. Check out his book, _Text Processing in Python_ at .