CHARMING PYTHON #B21: Pyrex
Writing C code in Python syntax

David Mertz, Ph.D.
Your Obedient Writer, Gnosis Software, Inc.
October, 2004

Pyrex is a language specially designed for writing Python extension
modules. It's designed to bridge the gap between the nice, high-level,
easy-to-use world of Python and the messy, low-level world of C. Almost
any piece of Python code is also valid Pyrex code; but you can add
optional static type declarations to Pyrex code, making the declared
objects run at C speed. David contrasts writing code in Pyrex--generally
for use with larger Python applications--with speeding up Python
applications using the Psyco compiler (which he has written about
previously).

MAKING PYTHON FAST
------------------------------------------------------------------------
In some sense, Pyrex is just part of a growing family of Python-like
languages: Jython, IronPython, Prothon, Boo, formerly Vyper, in a way
Stackless, or in another way the Parrot runtime. In language terms,
Pyrex is essentially just Python with type declarations added in. The
few other variations on the language do not amount to that much (though
the extension to the 'for' loop has an elegance to it). The reason you
actually want to use Pyrex, however, is to write modules that run
faster--maybe a lot faster--than pure Python possibly can.

What Pyrex does internally is generate a C program out of a Pyrex one.
The intermediate 'module.c' file remains available for hand tweaking, in
the unlikely event you have reason to do that. For "normal" Pyrex users,
there is no reason to modify the generated C module--Pyrex itself gives
you access to all the C-level constructs that are most important for
speed, while saving you all the C gotchas of memory allocation and
deallocation, pointer arithmetic, prototypes, and so on. Pyrex also
seamlessly handles all the interfacing with Python-level objects; mostly
it does this by declaring variables as 'PyObject' structures where
necessary, and using Python C-API calls for memory handling and type
conversions.

For the most part, Pyrex runs faster than Python by avoiding the need to
continuously -box- and -unbox- variables of simple data types. An 'int'
in Python, for example, is an object with a bunch of methods; with a
tree of ancestors, itself having a computed "method resolution order"
(MRO); with allocators and deallocators for memory handling; and that
knows when to promote itself to a long, and how to enter into numeric
operations with values of other types. All those extra capabilities mean
many levels of indirection and many more conditional checks whenever you
-do- things with 'int' objects. A C or Pyrex 'int' variable, on the
other hand, is just a region of memory with some bits set to ones or
zeros. Doing something with a C/Pyrex 'int' involves no indirection and
no conditional checks: you just, e.g., perform a CPU "add" operation in
silicon.

In well-chosen cases, a Pyrex module can run 40-50 times faster than a
Python version of the same module. But in contrast to writing the module
in C per se, the Pyrex version will be hardly any longer than the Python
version, and the code will look much more like Python than like C.
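To make the contrast concrete, here is a minimal sketch--with
hypothetical function names, not code from any module discussed later in
this article--of what adding Pyrex declarations to an ordinary integer
loop looks like:

#--------- Boxed versus unboxed loop counters in Pyrex ----------#
def boxed_sum(n):
    # Ordinary Python semantics: 'i' and 'total' are full Python
    # integer objects, boxed and unboxed on every iteration
    total = 0
    for i in range(n):
        total = total + i
    return total

def unboxed_sum(int n):
    # With 'cdef' declarations the same loop runs on raw C ints;
    # Pyrex's 'for ... from' loop also avoids building a range list
    cdef int i, total
    total = 0
    for i from 0 <= i < n:
        total = total + i
    return total

Both functions are callable from ordinary Python code; only the second
gives the C compiler something it can turn into a tight machine-code
loop.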
Of course, when you start talking about making Python(-like) modules
fast, Pyrex is not the only tool that comes to mind. Psyco also lives at
the back of every Python developer's mind. Psyco is--in very short
version--a just-in-time (JIT) compiler of Python code into (x86) machine
code. Unlike Pyrex, Psyco does not exactly type variables, but rather
creates several speculative machine-code versions of each Python block,
based on hypotheses about what the data types -might- be. If the data
turns out to be of a simple type like 'int' for the entire run of a
given block, that block (especially if it loops) can run very quickly.
So, for example, 'x' can be an 'int' within a loop that runs a million
times, but still be assigned a 'float' value at the end of the loop.
Psyco happily speeds up the million iterations, essentially by exactly
the same (speculative) unboxing that you can explicitly specify with
Pyrex.

While Pyrex is not difficult, Psyco is childishly simple to use. Using
Psyco amounts to nothing more than putting a few lines at the end of
your module; in fact, if you use the right lines, the module will run
identically (except more slowly) even when Psyco is not available:

# Import Psyco if available
try:
    import psyco
    psyco.full()
except ImportError:
    pass

To use Pyrex, you need to change a bit more in your code (just a bit),
and you also need to have a C compiler available and properly configured
on the system where you generate the Pyrex module. You -can- distribute
binary Pyrex modules, but the build must match the Python version,
architecture, and optimization flags of the end user's system for your
module to work elsewhere.

A (NAIVE) FIRST TRY FOR SPEED
------------------------------------------------------------------------
I recently created a pure-Python implementation of hashcash while
writing a developerWorks article about hashcash. Read that article for
details, but basically, hashcash is a technique for proving CPU work
using SHA-1 challenges. Python has a standard module [sha], which makes
coding hashcash relatively simple.

Unlike 95% of the Python programs I write, the slow speed of my
[hashcash] module bothers me, at least a little bit. By design, the
protocol is meant to eat CPU cycles, so runtime efficiency really does
matter. The ANSI C binary of 'hashcash.c' runs about 10 times as quickly
as my 'hashcash.py' script. Moreover, the brilliantly optimized
PPC/Altivec-enabled binary of 'hashcash.c' runs another 4 times as fast
as the generic ANSI C version (a G4/Altivec at 1 GHz easily outpaces a
Pentium4/MMX at 3 GHz in hashcash/SHA speed; a G5 is faster still). So
testing on my TiPowerbook shows my module an embarrassing 40 times
slower than the optimized C version; the gap is far less on x86, though.

Since the module runs slowly, maybe Pyrex is a good candidate for
speeding it up. At least that was my thought. The first thing I did in
Pyrex-izing 'hashcash.py' (after installing Pyrex, of course) was simply
to copy it to 'hashcash_pyx.pyx' and try processing it with:

#----------- Generating a C source from Pyrex source ------------#
$ pyrexc hashcash_pyx.pyx

Creating a binary module

Running this command happily generates a file 'hashcash_pyx.c' (once a
few minor changes are made to the source file). Unfortunately, getting
the 'gcc' switches just right for my platform was a bit trickier, so I
decided to take the recommended shortcut of letting [distutils] figure
it out for me. A standard Python installation knows how to work with the
local C compiler during module installations, and using [distutils]
makes sharing a Pyrex module easier.
I created a 'hashcash_setup.py' script as follows:

#--------------------- hashcash_setup.py ------------------------#
from distutils.core import setup
from distutils.extension import Extension
from Pyrex.Distutils import build_ext
setup(
  name = "hashcash_pyx",
  ext_modules=[
    Extension("hashcash_pyx", ["hashcash_pyx.pyx"], libraries = [])
    ],
  cmdclass = {'build_ext': build_ext}
)

Now running the line below fully builds a C-based extension module
[hashcash_pyx]:

#------ Generate and compile a module from Pyrex source ---------#
python2.3 hashcash_setup.py build_ext --inplace

Code changes

I very slightly overstated the ease I experienced in generating a
C-based module out of 'hashcash_pyx.pyx'. Actually, I had to make two
changes to the source code; I found the locations by looking at where
'pyrexc' complained. I used one unsupported list comprehension in my
code, which I had to unroll into a regular 'for' loop. Simple enough. I
also had to change one augmented assignment from 'counter+=1' to
'counter=counter+1'. That's it. I had written my first Pyrex module.

TESTING SPEEDS
------------------------------------------------------------------------
In order to easily test the speed of the incrementally improving modules
I planned to develop, I wrote a small test harness to run different
module versions:

#--------------------- hashcash_test.py -------------------------#
#!/usr/bin/env python2.3
import time, sys, optparse
hashcash = __import__(sys.argv[1])
start = time.time()
print hashcash.mint('mertz@gnosis.cx', bits=20)
timer = time.time()-start
sys.stderr.write("%0.4f seconds (%d hashes per second)\n"
                 % (timer, hashcash.tries[0]/timer))

Excitedly, I decided to see just how much speed improvement I would get
simply by compiling via Pyrex. Note in all the samples below that the
actual stamp generation time varies widely and randomly; the figure to
look at is the "hashes per second", which measures speed fairly
reliably. So comparing native Python with Pyrex:

#----------- Native Python versus "barely Pyrex" ----------------#
$ ./hashcash_test.py hashcash
1:20:041003:mertz@gnosis.cx::I+lyNUpV:167dca
13.7879 seconds (106904 hashes per second)
$ ./hashcash_test.py hashcash_pyx > /dev/null
6.0695 seconds (89239 hashes per second)

Oops! We just got almost 20% slower by using Pyrex. Not really what I
was hoping for. It is time to start analyzing the code for speedup
possibilities. Here is the short function that takes substantially all
the time:

#---------------- Worker function in 'hashcash.py' --------------#
def _mint(challenge, bits):
    "Answer a 'generalized hashcash' challenge"
    counter = 0
    hex_digits = int(ceil(bits/4.))
    zeros = '0'*hex_digits
    hash = sha
    while 1:
        digest = hash(challenge+hex(counter)[2:]).hexdigest()
        if digest[:hex_digits] == zeros:
            tries[0] = counter
            return hex(counter)[2:]
        counter += 1

We need to take some actual advantage of Pyrex variable declarations to
get a speedup. Some variables in there are obviously integers, and some
others obviously strings--let's specify that.
And while we are at it, let's use Pyrex's enhanced 'for' loop:

#--------------- Minimally Pyrex-enhanced minter ----------------#
cdef _mint(challenge, int bits):
    # Answer a 'generalized hashcash' challenge
    cdef int counter, hex_digits, i
    cdef char *digest
    hex_digits = int(ceil(bits/4.))
    hash = sha
    for counter from 0 <= counter < sys.maxint:
        py_digest = hash(challenge+hex(counter)[2:]).hexdigest()
        digest = py_digest
        for i from 0 <= i < hex_digits:
            if digest[i] != c'0':
                break
        else:
            tries[0] = counter
            return hex(counter)[2:]

Pretty easy so far. We have just declared some variable types that we
clearly know, and used the cleanest Pyrex counter loop. A little trick
is assigning 'py_digest' (a Python string) to 'digest' (a C/Pyrex
string) in order to type it. Experimentally, I also found that a looping
string comparison is faster than taking a slice. How do we do?

#----------- Pyrex-ized speed results for minting ---------------#
$ ./hashcash_test.py hashcash_pyx2 >/dev/null
20.3749 seconds (116636 hashes per second)

We are doing better. We have managed a slight improvement over native
Python, and a pretty good improvement over our initial Pyrex module.
Still nothing very impressive, though--a few percent gain.

Profiling

Something seems wrong here. Gaining a few percent in speed is nothing
like gaining the 40-times speedup that the Pyrex homepage--and many
Pyrex users--boast about. I figured it was time to see -where- my Python
'_mint()' function was actually spending its time. A quick script (not
shown) breaks out just what is going on in the complex compound
operation 'sha(challenge+hex(counter)[2:]).hexdigest()':

#---------- Timing aspects of hashcash minting ------------------#
1000000 empty loops:       0.559
------------------------------
1000000 sha()s:            2.332
1000000 hex()[2:]s:        3.151
        just hex()s:      <2.471>
1000000 concatenations:    0.855
1000000 hexdigest()s:      3.742
------------------------------
Total:                    10.079

Clearly, we cannot eliminate the loop itself from our '_mint()'
function. Pyrex's enhanced 'for' probably speeds it up slightly, but the
whole function is mainly the loop. And we cannot get rid of the call to
'sha()'--at least not unless we are willing to reimplement SHA-1 in
Pyrex (I am far from confident that I could do better than the writers
of the Python standard [sha] module even if I did this). Moreover, if we
hope to get an actual hash out of the 'sha.SHA' object, we have to call
'.hexdigest()' or '.digest()'--the former is slightly faster, too.
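The quick script amounts to little more than a bare loop around each
sub-operation. A minimal sketch along those lines (the helper and labels
here are illustrative, not the original script; your absolute numbers
will differ by machine) looks like this:

#-------- Sketch: timing the pieces of the minting loop ---------#
import sha, time

challenge = '1:20:041003:mertz@gnosis.cx::I+lyNUpV'
suffix = 'abcdef'
N = 1000000
hashed = sha.new(challenge)   # pre-built SHA object, to time .hexdigest() alone

def clock(label, func):
    # Run func(counter) N times in a bare loop and report elapsed time
    start = time.time()
    for counter in xrange(N):
        func(counter)
    print "%-24s %7.3f" % (label, time.time()-start)

clock('empty loops:',    lambda counter: None)
clock('sha()s:',         lambda counter: sha.new(challenge))
clock('hex()[2:]s:',     lambda counter: hex(counter)[2:])
clock('concatenations:', lambda counter: challenge+suffix)
clock('hexdigest()s:',   lambda counter: hashed.hexdigest())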
All that is really left for us to address is the 'hex()' conversion on
the counter variable, and perhaps the slice taken from the result. We
might be able to squeeze a little bit out of concatenating Pyrex/C
strings, rather than Python string objects, too. The only way I see to
avoid the 'hex()' conversion, however, is to manually build a suffix out
of nested loops. Doing this avoids any int->char conversion, but also
makes for more code:

#----------------- Fully Pyrex-optimized minter -----------------#
cdef _mint(char *challenge, int bits):
    cdef int hex_digits, i, i0, i1, i2, i3, i4, i5
    cdef char *ab, *digest, *trial, *suffix
    suffix = '******'
    ab = alphabet
    hex_digits = int(ceil(bits/4.))
    hash = sha
    for i0 from 0 <= i0 < 55:
        suffix[0] = ab[i0]
        for i1 from 0 <= i1 < 55:
            suffix[1] = ab[i1]
            for i2 from 0 <= i2 < 55:
                suffix[2] = ab[i2]
                for i3 from 0 <= i3 < 55:
                    suffix[3] = ab[i3]
                    for i4 from 0 <= i4 < 55:
                        suffix[4] = ab[i4]
                        for i5 from 0 <= i5 < 55:
                            suffix[5] = ab[i5]
                            py_digest = hash(challenge+suffix).hexdigest()
                            digest = py_digest
                            for i from 0 <= i < hex_digits:
                                if digest[i] != c'0':
                                    break
                            else:
                                return suffix

This Pyrex function still looks quite a bit easier to read than the
corresponding C function would; but it is certainly more complicated
than my naively elegant pure-Python version was. By the way, unrolling
the suffix generation in pure Python has a slightly negative effect on
overall speed versus the initial version. In Pyrex, as we would expect,
these nested loops are pretty much free, and we save the cost of
conversion and slicing:

#------ Optimized Pyrex-ized speed results for minting ----------#
$ ./hashcash_test.py hashcash_pyx3 >/dev/null
13.2270 seconds (166125 hashes per second)

Better than we started with, certainly. But still well under a doubling
of speed. The problem is that most of the time was, and is, spent in
calls to the Python library, and nothing we code around them can prevent
or speed up those calls.

A disappointing comparison

Getting a speedup of 50-60% still seems worthwhile, and we have not done
-that- much coding to get it. But if you think about adding the -two-
statements 'import psyco; psyco.bind(_mint)' to the initial Python
version, this speedup does not seem so impressive:

#------------ Psyco-ized speed results for minting --------------#
$ ./hashcash_test.py hashcash_psyco >/dev/null
15.2300 seconds (157550 hashes per second)

In other words, Psyco does almost as much with just two generic lines of
code. Of course, Psyco only works on x86 platforms, while Pyrex will
work anywhere that has a C compiler. But for the particular example at
issue, 'os.popen('hashcash -m '+options)' will still be many times
faster than either Pyrex or Psyco will get you (assuming the C utility
'hashcash' is available, of course).

WHERE PYREX SHINES
------------------------------------------------------------------------
In the best case, Pyrex can indeed produce quite fast code. For example,
the Pyrex homepage prominently features a numerically intensive function
for calculating a list of initial prime numbers. All operations involved
are performed on integers, so type declarations can speed up the
algorithm quite substantially. This Pyrex function just barely differs
from pure Python--just a few declarations added:

#------------- Pyrex function for finding primes ----------------#
def primes(int kmax):
    # Calculate initial prime numbers
    cdef int n, k, i
    cdef int p[100000]
    result = []
    k = 0
    n = 2
    while k < kmax:
        i = 0
        while i < k and n % p[i] <> 0:
            i = i + 1
        if i == k:
            p[k] = n
            k = k + 1
            result.append(n)
        n = n + 1
    return result

I also wrote the same function as actual Python, basically just by
taking the declarations back out.
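For reference, the pure-Python version amounts to the listing above with
the declarations stripped and a plain list standing in for the C array.
A sketch to that effect (my reconstruction, rather than the exact test
file) looks like this:

#-------- Pure-Python equivalent of the primes function ---------#
def primes(kmax):
    # Calculate initial prime numbers, with no Pyrex declarations
    p = [0] * 100000      # plain list in place of 'cdef int p[100000]'
    result = []
    k = 0
    n = 2
    while k < kmax:
        i = 0
        while i < k and n % p[i] != 0:
            i = i + 1
        if i == k:
            p[k] = n
            k = k + 1
            result.append(n)
        n = n + 1
    return result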
Running the Pyrex and Python versions, and also a Psyco-ized Python
version, gives these times:

#----- Times for finding primes in Python, Psyco and Pyrex ------#
$ ./prime_test.py
Pure Python version    First 10000 primes    60.30 seconds
Psyco-ized Python      First 10000 primes     7.62 seconds
Pyrex version          First 10000 primes     1.31 seconds

So in the best case, Pyrex does -a lot- better than Python, and still
quite significantly better than the Psyco speedup. I have a hunch,
however, that I might be able to improve Psyco's speed by fiddling with
some algorithm details. Even so, Pyrex almost certainly represents the
best you can do for this type of problem. The generated C code looks
almost exactly like what you would write if you simply started with C.

CONCLUSION
------------------------------------------------------------------------
There are a few things that Pyrex does quite well. For code that works
with simple numeric or character data in tight loops, Pyrex can produce
significant speedups; maybe 50 times the speed in the best cases. If a
Python application encounters a significant bottleneck in numeric
functions, creating Pyrex versions of those functions is very sensible.

But the cases where you find significant gains are relatively
constrained. Code that spends most of its time making library calls just
is not going to benefit much from Pyrex speeding up incidental loop
overhead. Moreover, in many cases, a generic two lines enabling Psyco
can produce a similar improvement to what you get through a moderate
degree of rewriting from Python into Pyrex. Pyrex code is easy to write,
but you have to -write- it, unlike with Psyco.

The efforts with hashcash in this article are not the best you might do,
I will note. I am confident that (with much more work) it would be
possible to modify the Python [sha] module a bit to enable direct calls
to its C-level interface, thereby avoiding the Python-level calling
overhead. Or just to find some other optimized SHA-1 implementation in
C. Pyrex code is perfectly able to utilize external C code, and calling
an 'sha()' function written in C would be faster than boxing and
unboxing values in Python objects and namespaces. But then, it is not
clear why this would be worthwhile, given the quite good existing C
implementation of hashcash.

Another option to think about before writing specialized numeric
functions in Pyrex, however, is whether Numerical Python might be a
suitable tool for your work. The [numeric] package is fairly general,
and quite fast for what it does. Using [numeric] does not involve -any-
non-Python code for its end user, just calls to appropriate library
functions. The coverage of [numeric] is certainly not identical to the
set of functions that can benefit from Pyrex; but there is certainly
-some- overlap.

RESOURCES
------------------------------------------------------------------------
The Pyrex website includes manuals and a tutorial, as well as the module
itself.
It is at:
    http://nz.cosc.canterbury.ac.nz/~greg/python/Pyrex/

The _Charming Python_ installment "Make Python run as fast as C with
Psyco" is at:
    http://www-106.ibm.com/developerworks/library/l-psyco.html

My article on hashcash, "Hashcash: Arbitrary CPU Costs as a Mechanism of
Access Control", can be found at:
    [IBM URL pending]

The Python module that I wrote, 'hashcash.py', can be found at:
    http://gnosis.cx/download/gnosis/util/hashcash.py

I wrote a _Charming Python_ installment on Numarray and Numerical Python
at:
    http://www-106.ibm.com/developerworks/linux/library/l-cpnum.html

ABOUT THE AUTHOR
------------------------------------------------------------------------
{Picture of Author: http://gnosis.cx/cgi-bin/img_dqm.cgi}

David Mertz has a slow brain, and most of his programs still run slowly.
David may be reached at mertz@gnosis.cx; his life pored over at
http://gnosis.cx/publish/. Check out David's book _Text Processing in
Python_ (http://gnosis.cx/TPiP/).