David Mertz, Ph.D.
Your Obedient Writer, Gnosis Software, Inc.
October, 2004
Pyrex is a language specially designed for writing Python extension modules. It's designed to bridge the gap between the nice, high-level, easy-to-use world of Python and the messy, low-level world of C. Almost any piece of Python code is also valid Pyrex code; but you can add optional static type declarations to Pyrex code, making the declared objects run at C speed. David contrasts writing code in Pyrex--generally for use with larger Python applications--with speeding up Python applications using the Psyco compiler (that he has written about previously).
In some sense, Pyrex is just part of a growing family of Python-like
languages: Jython, IronPython, Prothon, Boo, formerly Vyper, in a way
Stackless, or in another way the Parrot runtime. In language terms,
Pyrex is essentially just Python with type declarations added in. The
few other variations on the language do not amount to that much
(though the extension to the for
loop has an elegance to it). The
reason you actually want to use Pyrex, however, is to write modules
that run faster--maybe a lot faster--than pure Python can possibly
run.
What Pyrex does internally is generate a C program out of a Pyrex one.
The intermediate module.c
file remains available for hand tweaking,
in the unlikely event you have reason to do that. For "normal" Pyrex
users, there is no reason to modify the generated C module--Pyrex
itself gives you access to all the C-level constructs that are most
important for speed, while saving you all the C gotchas of memory
allocation and deallocation, pointer arithmetic, prototypes, and so
on. Pyrex also seamlessly handles all the interfacing with
Python-level objects; mostly it does this by declaring variables as
PyObject
structures where necessary, and using Python C-API calls
for memory handling and type conversions.
For the most part, Pyrex runs faster than Python by avoiding the need
to continuously box and unbox variables of simple data types. An
int
in Python, for example, is an object with a bunch of methods;
with a tree of ancestors, itself having a computed "method resolution
order" (MRO); with allocators and deallocators for memory handling; and
that knows when to promote itself to a long, and how to enter into
numeric operations with values of other types. All those extra
capabilities mean many levels of indirection and many more conditional
checks when you do things with int
objects. On the other hand, a C
or Pyrex int
variable is just a region in memory with some bits set
to ones or zeros. Doing something with a C/Pyrex int does not
involve any indirection or any conditional checks. You just, for
example, perform a CPU "add" operation in silicon.
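A quick pure-Python illustration of what lies behind a humble int (a
hypothetical snippet, not from the article's code; Python 2.3):

    print int.__mro__        # an int carries a whole type hierarchy (MRO)
    print (2**31 - 1) + 1    # silently promotes to a long on a 32-bit build
    print (1).__add__(2.5)   # mixed-type arithmetic is negotiated;
                             # this prints NotImplemented, and float takes over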
In well-chosen cases, a Pyrex module can run 40-50 times faster than a Python version of the same module. But in contrast to writing the module in C, per se, the Pyrex version will be hardly any longer than the Python version, and the code will look much more like Python than like C.
Of course, when you start talking about making Python(like) modules
fast, Pyrex is not the only tool that comes to mind. Psyco also lives
at the back of every Python developer's mind. Psyco is--in very short
version--a just-in-time (JIT) compiler of Python code into (x86)
machine code. Unlike Pyrex, Psyco does not exactly type variables, but
rather creates several speculative machine code versions of each
Python block, based on hypotheses about what the data types
might be. If the data turns out to be of a simple type like int
for the entire run of a given block, that block (especially if it
loops) can run very quickly. So, for example, x
can be an int
within a loop that runs a million times, but still be assigned a
float
value at the end of the loop. Psyco happily speeds up the
million iterations, essentially by exactly the same (speculative)
unboxing that you can explicitly specify with Pyrex.
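As a (hypothetical) illustration, this is exactly the kind of function
Psyco handles well:

    def accumulate():
        # x holds plain ints for a million iterations; Psyco can specialize
        # this block to machine code that works on unboxed ints ...
        x = 0
        for i in xrange(1000000):
            x = x + i
        # ... and rebinding x to a float afterward is still perfectly legal.
        x = x / 3.0
        return x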
While Pyrex is not difficult, Psyco is childishly simple to use. Using Psyco amounts to nothing more than putting a few lines at the end of your module; in fact if you use the right lines, the module will run identically (except more slowly) even when Psyco is not available:
    # Import Psyco if available
    try:
        import psyco
        psyco.full()
    except ImportError:
        pass
To use Pyrex, you need to change a bit more in your code (just a bit), and you also need to have a C compiler available and properly configured on the system where you generate the Pyrex module. You can distribute binary Pyrex modules, but they must match the Python version, architecture, and optimization flags of each end user's system in order to work elsewhere.
I recently created a pure-Python implementation of hashcash while
writing a developerWorks article about hashcash. Read that article
for details, but basically, hashcash is a technique for proving CPU
work using SHA-1 challenges. Python has a standard module sha,
which makes coding hashcash relatively simple.
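To give a flavor of how simple, here is a toy minting function--a
sketch only, not the actual hashcash.py code--built on nothing but sha:

    import sha

    def toy_mint(challenge, hex_zeros=5):    # 5 hex zeros ~ 20 bits
        # Try successive counter suffixes until the SHA-1 digest of
        # challenge+suffix begins with enough zero hex digits.
        counter = 0
        while 1:
            stamp = challenge + hex(counter)[2:]
            if sha.sha(stamp).hexdigest()[:hex_zeros] == '0' * hex_zeros:
                return stamp
            counter = counter + 1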
Unlike 95% of the Python programs I write, the slow speed of my
hashcash
module bothers me, at least a little bit. By design, the
protocol is meant to eat CPU cycles, so runtime efficiency really does
matter. The ANSI C binary of hashcash.c
runs about 10-times as
quickly as my hashcash.py
script. Moreover, the brilliantly
optimized PPC/Altivec-enabled binary of hashcash.c
runs another
4-times as fast as the generic ANSI C version (a G4/Altivec at 1 GHz
easily outpaces a Pentium4/MMX at 3 GHz in hashcash/SHA speed; a G5 is
that much faster still). So testing on my TiPowerbook shows my module
running an embarrassing 40-times slower than the optimized C version. The
gap is far less on x86 though.
The module runs slowly, so maybe Pyrex is a good candidate for speeding
it up. At least that was my thought. The first thing I did in
Pyrex-izing hashcash.py
(after installing Pyrex, of course) was
simply copy it to hashcash_pyx.pyx
and try processing it with:
$ pyrexc hashcash_pyx.pyx
Running this command happily generates a file hashcash_pyx.c
(once a few
minor changes are made to the source file). Unfortunately, getting the
gcc
switches just right for my platform was a bit trickier, so I
decided to take the recommended shortcut of letting distutils
figure
it out for me. A standard Python installation knows how to work with
the local C compiler during module installations, and using
distutils
makes sharing a Pyrex module easier. I created a
setup_hashcash.py
script as follows:
    from distutils.core import setup
    from distutils.extension import Extension
    from Pyrex.Distutils import build_ext
    setup(
        name = "hashcash_pyx",
        ext_modules=[Extension("hashcash_pyx", ["hashcash_pyx.pyx"],
                               libraries = [])],
        cmdclass = {'build_ext': build_ext}
    )
Now running the below line fully builds a C-based extension module
hashcash_pyx:
python2.3 setup_hashcash.py build_ext --inplace
I very slightly overstated the ease I experienced in generating a
C-based module out of hashcash_pyx.pyx. Actually, I had to make two
changes to the source code; I found the locations by looking at where
pyrexc
complained. I used one unsupported list comprehension in my
code, which I had to unroll into a regular for
loop. Simple enough.
I also had to change one augmented assignment from counter+=1
to
counter=counter+1
.
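For illustration, the sort of rewrite involved looks like this
(hypothetical lines, not the actual hashcash.py code):

    s = 'abcdef123456'
    # Unsupported in Pyrex at the time:
    #     chunks = [s[i:i+2] for i in range(0, len(s), 2)]
    # Unrolled into a plain loop:
    chunks = []
    for i in range(0, len(s), 2):
        chunks.append(s[i:i+2])
    # Likewise, counter += 1 becomes:
    counter = 0
    counter = counter + 1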
That's it. I had written my first Pyrex module.
In order to easily test the speed of the incrementally improving modules I planned to develop, I wrote a small test harness to run the different module versions:
    #!/usr/bin/env python2.3
    import time, sys, optparse
    hashcash = __import__(sys.argv[1])
    start = time.time()
    print hashcash.mint('[email protected]', bits=20)
    timer = time.time()-start
    sys.stderr.write("%0.4f seconds (%d hashes per second)\n"
                     % (timer, hashcash.tries[0]/timer))
Excitedly, I decided to see just how much speed improvement I got simply by compiling via Pyrex. Note in all the below samples that the actual stamp generation time varies widely and randomly. The figure to look at is the "hashes per second", which pretty reliably measures speed. So comparing native Python with Pyrex:
    $ ./hashcash_test.py hashcash
    1:20:041003:[email protected]::I+lyNUpV:167dca
    13.7879 seconds (106904 hashes per second)
    $ ./hashcash_test.py hashcash_pyx > /dev/null
    6.0695 seconds (89239 hashes per second)
Ooops! We just got almost 20% slower by using Pyrex. Not really what I was hoping for. It is time to start analyzing the code for speedup possibilities. Here is the short function that takes substantially all the time:
    def _mint(challenge, bits):
        "Answer a 'generalized hashcash' challenge"
        counter = 0
        hex_digits = int(ceil(bits/4.))
        zeros = '0'*hex_digits
        hash = sha
        while 1:
            digest = hash(challenge+hex(counter)[2:]).hexdigest()
            if digest[:hex_digits] == zeros:
                tries[0] = counter
                return hex(counter)[2:]
            counter += 1
We need to take some actual advantage of Pyrex variable declarations
to get a speedup. Some variables in there are obviously integers, and
some others obviously strings--let's specify that. And while we are
at it, let's use Pyrex's enhanced for
loop:
    cdef _mint(challenge, int bits):
        # Answer a 'generalized hashcash' challenge
        cdef int counter, hex_digits, i
        cdef char *digest
        hex_digits = int(ceil(bits/4.))
        hash = sha
        for counter from 0 <= counter < sys.maxint:
            py_digest = hash(challenge+hex(counter)[2:]).hexdigest()
            digest = py_digest
            for i from 0 <= i < hex_digits:
                if digest[i] != c'0':
                    break
            else:
                tries[0] = counter
                return hex(counter)[2:]
Pretty easy so far. We have just declared some variable types that we
clearly know, and used the cleanest Pyrex counter loop. A little
trick is assigning py_digest
(a Python string) to digest
(a
C/Pyrex string) in order to type it. Experimentally, I also found
that a looping string comparison is faster than taking a slice. How
do we do?
    $ ./hashcash_test.py hashcash_pyx2 >/dev/null
    20.3749 seconds (116636 hashes per second)
We are doing better. We have managed a slight improvement from native Python, and a pretty good improvement over our initial Pyrex module. Still nothing very impressive though--a few percent gain.
Something seems wrong here. Gaining a few percent in speed is nothing
like the 40-times gains that the Pyrex homepage--and many Pyrex
users--boast about. I figured it was time to see where my Python
_mint()
function was actually spending its time. A quick script (not
shown) breaks out just what is going on in the complex compound
operation sha(challenge+hex(counter)[2:]).hexdigest()
:
    1000000 empty loops:     0.559
    ------------------------------
    1000000 sha()s:          2.332
    1000000 hex()[2:]s:      3.151
        just hex()s:        <2.471>
    1000000 concatenations:  0.855
    1000000 hexdigest()s:    3.742
    ------------------------------
    Total:                  10.079
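That quick script is not reproduced here, but a timing harness along
these lines (a sketch, not the actual script used) produces that style
of breakdown:

    import sha, time

    def clock(label, func, n=1000000):
        # Time n calls of func and report in the style above.
        start = time.time()
        for i in xrange(n):
            func(i)
        print "%d %s: %.3f" % (n, label, time.time() - start)

    clock("empty loops", lambda i: None)      # approximates bare loop overhead
    clock("sha()s", lambda i: sha.sha('challenge'))
    clock("hex()[2:]s", lambda i: hex(i)[2:])
    clock("concatenations", lambda i: 'challenge' + 'suffix')
    clock("hexdigest()s", lambda i: sha.sha('challenge').hexdigest())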
Clearly, we cannot eliminate the loop itself from our _mint()
function. Pyrex's enhanced for
probably speeds it up slightly, but
the whole function is mainly the loop. And we cannot get rid of the
call to sha()
--at least not unless we are willing to reimplement
SHA-1 in Pyrex (I am far from confident that I could do better than the
writers of the Python standard sha
module even if I did this).
Moreover, if we hope to get an actual hash out of the sha
object, we have to call .hexdigest()
or .digest()
--the former is
slightly faster too.
All that is really left for us to address is the hex()
conversion
on the counter variable, and perhaps the slice taken from the result.
We might be able to squeeze a little bit out of concatenating Pyrex/C
strings, rather than Python string objects, too. The only way I see
to avoid the hex()
conversion, however, is to manually build a
suffix out of nested loops. Doing this can avoid any int->char
conversion, but also makes for more code:
    cdef _mint(char *challenge, int bits):
        cdef int hex_digits, i, i0, i1, i2, i3, i4, i5
        cdef char *ab, *digest, *trial, *suffix
        suffix = '******'
        ab = alphabet
        hex_digits = int(ceil(bits/4.))
        hash = sha
        for i0 from 0 <= i0 < 55:
            suffix[0] = ab[i0]
            for i1 from 0 <= i1 < 55:
                suffix[1] = ab[i1]
                for i2 from 0 <= i2 < 55:
                    suffix[2] = ab[i2]
                    for i3 from 0 <= i3 < 55:
                        suffix[3] = ab[i3]
                        for i4 from 0 <= i4 < 55:
                            suffix[4] = ab[i4]
                            for i5 from 0 <= i5 < 55:
                                suffix[5] = ab[i5]
                                py_digest = hash(challenge+suffix).hexdigest()
                                digest = py_digest
                                for i from 0 <= i < hex_digits:
                                    if digest[i] != c'0':
                                        break
                                else:
                                    return suffix
This Pyrex function still looks quite a bit easier to read than the corresponding C function would; but it is certainly more complicated than was my naively elegant pure-Python version. By the way, unrolling the suffix generation in pure-Python has a slightly negative effect on overall speed versus the initial version. In Pyrex, as we would expect, these nested loops are pretty much free, and we save the cost of conversion and slicing:
    $ ./hashcash_test.py hashcash_pyx3 >/dev/null
    13.2270 seconds (166125 hashes per second)
Better than we started with, certainly. But still well under a doubling of speed. The problem is that most of the time was, and is, spent in calls to the Python library, and nothing we code around those calls will speed them up.
Getting a speedup of 50-60% still seems worthwhile. And we have not
done that much coding to get it. But if you think about adding the
two statements import psyco;psyco.bind(_mint)
to the initial
Python version, this speedup does not seem so impressive:
    $ ./hashcash_test.py hashcash_psyco >/dev/null
    15.2300 seconds (157550 hashes per second)
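Spelled out, those two statements amount to something like the
following at the end of hashcash.py (a sketch, guarded the same way as
the earlier Psyco snippet):

    # Bind just the hot function; fall back silently if Psyco is absent
    try:
        import psyco
        psyco.bind(_mint)
    except ImportError:
        pass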
In other words, Psyco does almost as much with just two generic lines
of code. Of course, Psyco only works on x86 platforms, while Pyrex
will work anywhere that has a C compiler. But for the particular
example at issue, os.popen('hashcash -m '+options)
will still be
many times faster than either Pyrex or Psyco will get you (assuming
the C utility hashcash
is available, of course).
In the best case, Pyrex can indeed produce quite fast code. For example, the Pyrex homepage prominently features a numerically intensive function for calculating a list of initial prime numbers. All operations involved are performed on integers, so type declarations can speed up the algorithm quite substantially. This Pyrex function just barely differs from pure Python--just a few declarations added:
    def primes(int kmax):          # Calculate initial prime numbers
        cdef int n, k, i
        cdef int p[100000]
        result = []
        k = 0
        n = 2
        while k < kmax:
            i = 0
            while i < k and n % p[i] <> 0:
                i = i + 1
            if i == k:
                p[k] = n
                k = k + 1
                result.append(n)
            n = n + 1
        return result
I also wrote the same function as actual Python, basically just by taking the declarations back out (the sketch after these timings shows what that looks like). Running the Pyrex and Python versions, and also a Psyco-ized Python version, gives these times:
    $ ./prime_test.py
    Pure Python version    First 10000 primes   60.30 seconds
    Psyco-ized Python      First 10000 primes    7.62 seconds
    Pyrex Version          First 10000 primes    1.31 seconds
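For reference, the pure-Python version is just the same algorithm with
the declarations taken back out; a sketch of it:

    def primes(kmax):              # Pure Python: same algorithm, no declarations
        p = [0] * 100000
        result = []
        k = 0
        n = 2
        while k < kmax:
            i = 0
            while i < k and n % p[i] != 0:
                i = i + 1
            if i == k:
                p[k] = n
                k = k + 1
                result.append(n)
            n = n + 1
        return result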
So in the best case, Pyrex does a lot better than Python, and still quite significantly better than the Psyco speedup. I have a hunch, however, that I might be able to improve Psyco's speed by fiddling with some algorithm details. Even so, Pyrex almost certainly represents the best you can do for this type of problem. The generated C code looks almost exactly the same as what you'd write if you simply started with C.
There are a few things that Pyrex does quite well. For code that works with simple numeric or character data in tight loops, Pyrex can produce significant speedups; maybe 50 times the speed in the best cases. If a Python application encounters a significant bottleneck in numeric functions, creating Pyrex versions of those functions is very sensible. But the cases where you find significant gains are relatively constrained. Code that spends most of its time making library calls just is not going to benefit that much from Pyrex speeding up incidental loop overhead. Moreover, in many cases, two generic lines enabling Psyco can produce an improvement similar to what you get through a moderate degree of rewriting from Python into Pyrex. Pyrex code is easy to write, but you have to write it, unlike with Psyco.
The efforts with hashcash in this article are not the best you might
do, I will note. I am confident that (with much more work) it would be
possible to modify the Python sha
module a bit to enable direct
calls to the C-level interface, thereby avoiding the Python-level
calling overheads. Or just to find some other optimized SHA-1
implementation in C. Pyrex code is perfectly able to utilize external
C code, and calling an sha()
function written in C will be faster
than boxing and unboxing it in Python objects and namespaces. But
then, it is not clear why this is worthwhile, given a quite good
existing C implementation of hashcash.
Another option to think about before writing specialized numeric
functions in Pyrex, however, is whether Numerical Python might be a
suitable tool for your work. The numeric
package is fairly general, and quite
fast for what it does. Using numeric
does not involve any
non-Python code for its end user, just calls to appropriate library
functions. The coverage of numeric
is certainly not identical to
those functions that can benefit from Pyrex; but there is certainly
some overlap.
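As a tiny illustration of the style of code involved (assuming the
classic Numeric package is installed):

    import Numeric
    a = Numeric.arange(10)      # ten integers held in one C array
    b = a * a + 3               # element-wise arithmetic in C, no Python loop
    print b                     # [ 3  4  7 12 19 28 39 52 67 84]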
The Pyrex website includes manuals and a tutorial, as well as the module itself. It is at:
http://nz.cosc.canterbury.ac.nz/~greg/python/Pyrex/
Charming Python installment "Make Python run as fast as C with Psyco":
http://www-106.ibm.com/developerworks/library/l-psyco.html
My article on hashcash, "Hashcash: Arbitrary CPU Costs as a Mechanism of Access Control" can be found at:
IBM URL pending
The Python module that I wrote, hashcash.py, can be found at:
http://gnosis.cx/download/gnosis/util/hashcash.py
I wrote a Charming Python installment on Numarray and Numerical Python at:
http://www-106.ibm.com/developerworks/linux/library/l-cpnum.html
David Mertz has a slow brain, and most of his programs still run slowly. David may be reached at [email protected]; his life pored over at http://gnosis.cx/publish/. Check out David's book Text Processing in Python (http://gnosis.cx/TPiP/).