CHARMING PYTHON #B21: Pyrex
Writing C code in Python syntax

David Mertz, Ph.D.
Your Obedient Writer, Gnosis Software, Inc.
October, 2004

Pyrex is a language specially designed for writing Python extension
modules. It's designed to bridge the gap between the nice, high-level,
easy-to-use world of Python and the messy, low-level world of C. Almost
any piece of Python code is also valid Pyrex code; but you can add
optional static type declarations to Pyrex code, making the declared
objects run at C speed. David contrasts writing code in Pyrex--generally
for use with larger Python applications--with speeding up Python
applications using the Psyco compiler (which he has written about
previously).

MAKING PYTHON FAST
------------------------------------------------------------------------
In some sense, Pyrex is just part of a growing family of Python-like
languages: Jython, IronPython, Prothon, Boo, formerly Vyper, in a way
Stackless, or in another way the Parrot runtime. In language terms,
Pyrex is essentially just Python with type declarations added in. The
few other variations on the language do not amount to that much (though
the extension to the 'for' loop has an elegance to it). The reason you
actually want to use Pyrex, however, is to write modules that run
faster--maybe a lot faster--than pure Python possibly can.

What Pyrex does internally is generate a C program out of a Pyrex one.
The intermediate 'module.c' file remains available for hand tweaking, in
the unlikely event you have reason to do that. For "normal" Pyrex users,
there is no reason to modify the generated C module--Pyrex itself gives
you access to all the C-level constructs that are most important for
speed, while saving you all the C gotchas of memory allocation and
deallocation, pointer arithmetic, prototypes, and so on. Pyrex also
seamlessly handles all the interfacing with Python-level objects; mostly
it does this by declaring variables as 'PyObject' structures where
necessary, and using Python C-API calls for memory handling and type
conversions.

For the most part, Pyrex runs faster than Python by avoiding the need to
continuously -box- and -unbox- variables of simple data types. An 'int'
in Python, for example, is an object with a bunch of methods; with a
tree of ancestors, itself having a computed "method resolution order"
(MRO); with allocators and deallocators for memory handling; and that
knows when to promote itself to a long, and how to enter into numeric
operations with values of other types. All those extra capabilities mean
many levels of indirection and many more conditional checks whenever you
-do- things with 'int' objects. A C or Pyrex 'int' variable, on the
other hand, is just a region of memory with some bits set to ones or
zeros. Doing something with a C/Pyrex 'int' involves no indirection and
no conditional checks: you just, e.g., perform a CPU "add" operation in
silicon.

In well-chosen cases, a Pyrex module can run 40-50 times faster than a
Python version of the same module. But in contrast to writing the module
in C per se, the Pyrex version will be hardly any longer than the Python
version, and the code will look much more like Python than like C.
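To make the contrast concrete, here is a minimal sketch--with
hypothetical function names, not code from any module discussed later in
this article--of what adding Pyrex declarations to an ordinary integer
loop looks like:

#--------- Boxed versus unboxed loop counters in Pyrex ----------#
def boxed_sum(n):
    # Ordinary Python semantics: 'i' and 'total' are full Python
    # integer objects, boxed and unboxed on every iteration
    total = 0
    for i in range(n):
        total = total + i
    return total

def unboxed_sum(int n):
    # With 'cdef' declarations the same loop runs on raw C ints;
    # Pyrex's 'for ... from' loop also avoids building a range list
    cdef int i, total
    total = 0
    for i from 0 <= i < n:
        total = total + i
    return total

Both functions are callable from ordinary Python code; only the second
gives the C compiler something it can turn into a tight machine-code
loop.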
Of course, when you start talking about making Python(-like) modules
fast, Pyrex is not the only tool that comes to mind. Psyco also lives at
the back of every Python developer's mind. Psyco is--in very short
version--a just-in-time (JIT) compiler of Python code into (x86) machine
code. Unlike Pyrex, Psyco does not exactly type variables, but rather
creates several speculative machine-code versions of each Python block,
based on hypotheses about what the data types -might- be. If the data
turns out to be of a simple type like 'int' for the entire run of a
given block, that block (especially if it loops) can run very quickly.
So, for example, 'x' can be an 'int' within a loop that runs a million
times, but still be assigned a 'float' value at the end of the loop.
Psyco happily speeds up the million iterations, essentially by exactly
the same (speculative) unboxing that you can explicitly specify with
Pyrex.

While Pyrex is not difficult, Psyco is childishly simple to use. Using
Psyco amounts to nothing more than putting a few lines at the end of
your module; in fact, if you use the right lines, the module will run
identically (except more slowly) even when Psyco is not available:

# Import Psyco if available
try:
    import psyco
    psyco.full()
except ImportError:
    pass

To use Pyrex, you need to change a bit more in your code (just a bit),
and you also need to have a C compiler available and properly configured
on the system where you generate the Pyrex module. You -can- distribute
binary Pyrex modules, but the build must match the Python version,
architecture, and optimization flags of the end user's system for your
module to work elsewhere.

A (NAIVE) FIRST TRY FOR SPEED
------------------------------------------------------------------------
I recently created a pure-Python implementation of hashcash while
writing a developerWorks article about hashcash. Read that article for
details, but basically, hashcash is a technique for proving CPU work
using SHA-1 challenges. Python has a standard module [sha], which makes
coding hashcash relatively simple.

Unlike 95% of the Python programs I write, the slow speed of my
[hashcash] module bothers me, at least a little bit. By design, the
protocol is meant to eat CPU cycles, so runtime efficiency really does
matter. The ANSI C binary of 'hashcash.c' runs about 10 times as quickly
as my 'hashcash.py' script. Moreover, the brilliantly optimized
PPC/Altivec-enabled binary of 'hashcash.c' runs another 4 times as fast
as the generic ANSI C version (a G4/Altivec at 1 GHz easily outpaces a
Pentium4/MMX at 3 GHz in hashcash/SHA speed; a G5 is faster still). So
testing on my TiPowerbook shows my module an embarrassing 40 times
slower than the optimized C version; the gap is far less on x86, though.

Since the module runs slowly, maybe Pyrex is a good candidate for
speeding it up. At least that was my thought. The first thing I did in
Pyrex-izing 'hashcash.py' (after installing Pyrex, of course) was simply
to copy it to 'hashcash_pyx.pyx' and try processing it with:

#----------- Generating a C source from Pyrex source ------------#
$ pyrexc hashcash_pyx.pyx

Creating a binary module

Running this command happily generates a file 'hashcash_pyx.c' (once a
few minor changes are made to the source file). Unfortunately, getting
the 'gcc' switches just right for my platform was a bit trickier, so I
decided to take the recommended shortcut of letting [distutils] figure
it out for me. A standard Python installation knows how to work with the
local C compiler during module installations, and using [distutils]
makes sharing a Pyrex module easier.
I created a 'hashcash_setup.py' script as follows:

#--------------------- hashcash_setup.py ------------------------#
from distutils.core import setup
from distutils.extension import Extension
from Pyrex.Distutils import build_ext
setup(
  name = "hashcash_pyx",
  ext_modules=[
    Extension("hashcash_pyx", ["hashcash_pyx.pyx"], libraries = [])
    ],
  cmdclass = {'build_ext': build_ext}
)

Now running the line below fully builds a C-based extension module
[hashcash_pyx]:

#------ Generate and compile a module from Pyrex source ---------#
python2.3 hashcash_setup.py build_ext --inplace

Code changes

I very slightly overstated the ease I experienced in generating a
C-based module out of 'hashcash_pyx.pyx'. Actually, I had to make two
changes to the source code; I found the locations by looking at where
'pyrexc' complained. I used one unsupported list comprehension in my
code, which I had to unroll into a regular 'for' loop. Simple enough. I
also had to change one augmented assignment from 'counter+=1' to
'counter=counter+1'. That's it. I had written my first Pyrex module.

TESTING SPEEDS
------------------------------------------------------------------------
In order to easily test the speed of the incrementally improving modules
I planned to develop, I wrote a small test harness to run different
module versions:

#--------------------- hashcash_test.py -------------------------#
#!/usr/bin/env python2.3
import time, sys, optparse
hashcash = __import__(sys.argv[1])
start = time.time()
print hashcash.mint('mertz@gnosis.cx', bits=20)
timer = time.time()-start
sys.stderr.write("%0.4f seconds (%d hashes per second)\n"
                 % (timer, hashcash.tries[0]/timer))

Excitedly, I decided to see just how much speed improvement I would get
simply by compiling via Pyrex. Note in all the samples below that the
actual stamp generation time varies widely and randomly; the figure to
look at is the "hashes per second", which measures speed fairly
reliably. So comparing native Python with Pyrex:

#----------- Native Python versus "barely Pyrex" ----------------#
$ ./hashcash_test.py hashcash
1:20:041003:mertz@gnosis.cx::I+lyNUpV:167dca
13.7879 seconds (106904 hashes per second)
$ ./hashcash_test.py hashcash_pyx > /dev/null
6.0695 seconds (89239 hashes per second)

Oops! We just got almost 20% slower by using Pyrex. Not really what I
was hoping for. It is time to start analyzing the code for speedup
possibilities. Here is the short function that takes substantially all
the time:

#---------------- Worker function in 'hashcash.py' --------------#
def _mint(challenge, bits):
    "Answer a 'generalized hashcash' challenge"
    counter = 0
    hex_digits = int(ceil(bits/4.))
    zeros = '0'*hex_digits
    hash = sha
    while 1:
        digest = hash(challenge+hex(counter)[2:]).hexdigest()
        if digest[:hex_digits] == zeros:
            tries[0] = counter
            return hex(counter)[2:]
        counter += 1

We need to take some actual advantage of Pyrex variable declarations to
get a speedup. Some variables in there are obviously integers, and some
others obviously strings--let's specify that.
And while we are at it, let's use Pyrex's enhanced 'for' loop:

#--------------- Minimally Pyrex-enhanced minter ----------------#
cdef _mint(challenge, int bits):
    # Answer a 'generalized hashcash' challenge
    cdef int counter, hex_digits, i
    cdef char *digest
    hex_digits = int(ceil(bits/4.))
    hash = sha
    for counter from 0 <= counter < sys.maxint:
        py_digest = hash(challenge+hex(counter)[2:]).hexdigest()
        digest = py_digest
        for i from 0 <= i < hex_digits:
            if digest[i] != c'0':
                break
        else:
            tries[0] = counter
            return hex(counter)[2:]

Pretty easy so far. We have just declared some variable types that we
clearly know, and used the cleanest Pyrex counter loop. A little trick
is assigning 'py_digest' (a Python string) to 'digest' (a C/Pyrex
string) in order to type it. Experimentally, I also found that a looping
string comparison is faster than taking a slice. How do we do?

#----------- Pyrex-ized speed results for minting ---------------#
$ ./hashcash_test.py hashcash_pyx2 >/dev/null
20.3749 seconds (116636 hashes per second)

We are doing better. We have managed a slight improvement over native
Python, and a pretty good improvement over our initial Pyrex module.
Still nothing very impressive, though--a few percent gain.

Profiling

Something seems wrong here. Gaining a few percent in speed is nothing
like gaining the 40-times speedup that the Pyrex homepage--and many
Pyrex users--boast about. I figured it was time to see -where- my Python
'_mint()' function was actually spending its time. A quick script (not
shown) breaks out just what is going on in the complex compound
operation 'sha(challenge+hex(counter)[2:]).hexdigest()':

#---------- Timing aspects of hashcash minting ------------------#
1000000 empty loops:       0.559
------------------------------
1000000 sha()s:            2.332
1000000 hex()[2:]s:        3.151
        just hex()s:      <2.471>
1000000 concatenations:    0.855
1000000 hexdigest()s:      3.742
------------------------------
Total:                    10.079

Clearly, we cannot eliminate the loop itself from our '_mint()'
function. Pyrex's enhanced 'for' probably speeds it up slightly, but the
whole function is mainly the loop. And we cannot get rid of the call to
'sha()'--at least not unless we are willing to reimplement SHA-1 in
Pyrex (I am far from confident that I could do better than the writers
of the Python standard [sha] module even if I did this). Moreover, if we
hope to get an actual hash out of the 'sha.SHA' object, we have to call
'.hexdigest()' or '.digest()'--the former is slightly faster, too.
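The quick script amounts to little more than a bare loop around each
sub-operation. A minimal sketch along those lines (the helper and labels
here are illustrative, not the original script; your absolute numbers
will differ by machine) looks like this:

#-------- Sketch: timing the pieces of the minting loop ---------#
import sha, time

challenge = '1:20:041003:mertz@gnosis.cx::I+lyNUpV'
suffix = 'abcdef'
N = 1000000
hashed = sha.new(challenge)   # pre-built SHA object, to time .hexdigest() alone

def clock(label, func):
    # Run func(counter) N times in a bare loop and report elapsed time
    start = time.time()
    for counter in xrange(N):
        func(counter)
    print "%-24s %7.3f" % (label, time.time()-start)

clock('empty loops:',    lambda counter: None)
clock('sha()s:',         lambda counter: sha.new(challenge))
clock('hex()[2:]s:',     lambda counter: hex(counter)[2:])
clock('concatenations:', lambda counter: challenge+suffix)
clock('hexdigest()s:',   lambda counter: hashed.hexdigest())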
All that is really left for us to address is the 'hex()' conversion on
the counter variable, and perhaps the slice taken from the result. We
might be able to squeeze a little bit out of concatenating Pyrex/C
strings, rather than Python string objects, too. The only way I see to
avoid the 'hex()' conversion, however, is to manually build a suffix out
of nested loops. Doing this avoids any int->char conversion, but also
makes for more code:

#----------------- Fully Pyrex-optimized minter -----------------#
cdef _mint(char *challenge, int bits):
    cdef int hex_digits, i, i0, i1, i2, i3, i4, i5
    cdef char *ab, *digest, *trial, *suffix
    suffix = '******'
    ab = alphabet
    hex_digits = int(ceil(bits/4.))
    hash = sha
    for i0 from 0 <= i0 < 55:
        suffix[0] = ab[i0]
        for i1 from 0 <= i1 < 55:
            suffix[1] = ab[i1]
            for i2 from 0 <= i2 < 55:
                suffix[2] = ab[i2]
                for i3 from 0 <= i3 < 55:
                    suffix[3] = ab[i3]
                    for i4 from 0 <= i4 < 55:
                        suffix[4] = ab[i4]
                        for i5 from 0 <= i5 < 55:
                            suffix[5] = ab[i5]
                            py_digest = hash(challenge+suffix).hexdigest()
                            digest = py_digest
                            for i from 0 <= i < hex_digits:
                                if digest[i] != c'0':
                                    break
                            else:
                                return suffix

This Pyrex function still looks quite a bit easier to read than the
corresponding C function would; but it is certainly more complicated
than my naively elegant pure-Python version was. By the way, unrolling
the suffix generation in pure Python has a slightly negative effect on
overall speed versus the initial version. In Pyrex, as we would expect,
these nested loops are pretty much free, and we save the cost of
conversion and slicing:

#------ Optimized Pyrex-ized speed results for minting ----------#
$ ./hashcash_test.py hashcash_pyx3 >/dev/null
13.2270 seconds (166125 hashes per second)

Better than we started with, certainly. But still well under a doubling
of speed. The problem is that most of the time was, and is, spent in
calls to the Python library, and nothing we code around them can prevent
or speed up those calls.

A disappointing comparison

Getting a speedup of 50-60% still seems worthwhile, and we have not done
-that- much coding to get it. But if you think about adding the -two-
statements 'import psyco; psyco.bind(_mint)' to the initial Python
version, this speedup does not seem so impressive:

#------------ Psyco-ized speed results for minting --------------#
$ ./hashcash_test.py hashcash_psyco >/dev/null
15.2300 seconds (157550 hashes per second)

In other words, Psyco does almost as much with just two generic lines of
code. Of course, Psyco only works on x86 platforms, while Pyrex will
work anywhere that has a C compiler. But for the particular example at
issue, 'os.popen('hashcash -m '+options)' will still be many times
faster than either Pyrex or Psyco will get you (assuming the C utility
'hashcash' is available, of course).

WHERE PYREX SHINES
------------------------------------------------------------------------
In the best case, Pyrex can indeed produce quite fast code. For example,
the Pyrex homepage prominently features a numerically intensive function
for calculating a list of initial prime numbers. All operations involved
are performed on integers, so type declarations can speed up the
algorithm quite substantially. This Pyrex function just barely differs
from pure Python--just a few declarations added:

#------------- Pyrex function for finding primes ----------------#
def primes(int kmax):
    # Calculate initial prime numbers
    cdef int n, k, i
    cdef int p[100000]
    result = []
    k = 0
    n = 2
    while k < kmax:
        i = 0
        while i < k and n % p[i] <> 0:
            i = i + 1
        if i == k:
            p[k] = n
            k = k + 1
            result.append(n)
        n = n + 1
    return result

I also wrote the same function as actual Python, basically just by
taking the declarations back out.
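For reference, the pure-Python version amounts to the listing above with
the declarations stripped and a plain list standing in for the C array.
A sketch to that effect (my reconstruction, rather than the exact test
file) looks like this:

#-------- Pure-Python equivalent of the primes function ---------#
def primes(kmax):
    # Calculate initial prime numbers, with no Pyrex declarations
    p = [0] * 100000      # plain list in place of 'cdef int p[100000]'
    result = []
    k = 0
    n = 2
    while k < kmax:
        i = 0
        while i < k and n % p[i] != 0:
            i = i + 1
        if i == k:
            p[k] = n
            k = k + 1
            result.append(n)
        n = n + 1
    return result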
Running the Pyrex and Python versions, and also a Psyco-ized Python
version, gives these times:

#----- Times for finding primes in Python, Psyco and Pyrex ------#
$ ./prime_test.py
Pure Python version    First 10000 primes    60.30 seconds
Psyco-ized Python      First 10000 primes     7.62 seconds
Pyrex version          First 10000 primes     1.31 seconds

So in the best case, Pyrex does -a lot- better than Python, and still
quite significantly better than the Psyco speedup. I have a hunch,
however, that I might be able to improve Psyco's speed by fiddling with
some algorithm details. Even so, Pyrex almost certainly represents the
best you can do for this type of problem. The generated C code looks
almost exactly like what you would write if you simply started with C.

CONCLUSION
------------------------------------------------------------------------
There are a few things that Pyrex does quite well. For code that works
with simple numeric or character data in tight loops, Pyrex can produce
significant speedups; maybe 50 times the speed in the best cases. If a
Python application encounters a significant bottleneck in numeric
functions, creating Pyrex versions of those functions is very sensible.

But the cases where you find significant gains are relatively
constrained. Code that spends most of its time making library calls just
is not going to benefit much from Pyrex speeding up incidental loop
overhead. Moreover, in many cases, a generic two lines enabling Psyco
can produce a similar improvement to what you get through a moderate
degree of rewriting from Python into Pyrex. Pyrex code is easy to write,
but you have to -write- it, unlike with Psyco.

The efforts with hashcash in this article are not the best you might do,
I will note. I am confident that (with much more work) it would be
possible to modify the Python [sha] module a bit to enable direct calls
to its C-level interface, thereby avoiding the Python-level calling
overhead. Or just to find some other optimized SHA-1 implementation in
C. Pyrex code is perfectly able to utilize external C code, and calling
an 'sha()' function written in C would be faster than boxing and
unboxing values in Python objects and namespaces. But then, it is not
clear why this would be worthwhile, given the quite good existing C
implementation of hashcash.

Another option to think about before writing specialized numeric
functions in Pyrex, however, is whether Numerical Python might be a
suitable tool for your work. The [numeric] package is fairly general,
and quite fast for what it does. Using [numeric] does not involve -any-
non-Python code for its end user, just calls to appropriate library
functions. The coverage of [numeric] is certainly not identical to the
set of functions that can benefit from Pyrex; but there is certainly
-some- overlap.

RESOURCES
------------------------------------------------------------------------
The Pyrex website includes manuals and a tutorial, as well as the module
itself.
It is at:
    http://nz.cosc.canterbury.ac.nz/~greg/python/Pyrex/

The _Charming Python_ installment "Make Python run as fast as C with
Psyco" is at:
    http://www-106.ibm.com/developerworks/library/l-psyco.html

My article on hashcash, "Hashcash: Arbitrary CPU Costs as a Mechanism of
Access Control", can be found at:
    [IBM URL pending]

The Python module that I wrote, 'hashcash.py', can be found at:
    http://gnosis.cx/download/gnosis/util/hashcash.py

I wrote a _Charming Python_ installment on Numarray and Numerical Python
at:
    http://www-106.ibm.com/developerworks/linux/library/l-cpnum.html

ABOUT THE AUTHOR
------------------------------------------------------------------------
{Picture of Author: http://gnosis.cx/cgi-bin/img_dqm.cgi}

David Mertz has a slow brain, and most of his programs still run slowly.
David may be reached at mertz@gnosis.cx; his life pored over at
http://gnosis.cx/publish/. Check out David's book _Text Processing in
Python_ (http://gnosis.cx/TPiP/).