CHARMING PYTHON #14 (20000061)
Text processing in Python with mxTextTools: Advanced Tips

David Mertz, Ph.D.
Whirling dervish, Gnosis Software, Inc.
January, 2001

    [mxTextTools] is a powerful extension module that lets Python
    programmers flexibly process text files at a lightning fast
    speed.  Speed comes at the price of some work, however.  The
    tips and techniques in this column will guide a reader
    towards developing text processing programs utilizing
    [mxTextTools].


WHAT IS PYTHON?
------------------------------------------------------------------------

  Python is a freely available, very-high-level, interpreted
  language developed by Guido van Rossum.  It combines a clear
  syntax with powerful (but optional) object-oriented semantics.
  Python is available for almost every computer platform you
  might find yourself working on, and has strong portability
  between platforms.


INTRODUCTION
------------------------------------------------------------------------

  One of the strengths of Python is a good set of text processing
  tools.  The inherent tools are powerful, flexible, and easy to
  work with.  One thing Python's built-in text processing is
  -not- is particularly fast.  Mind you, for a lot of thing,
  Python by itself is as fast as you need.  But for a certain
  class of problems, Python has a problem.

  Readers of this column will remember the 'Txt2Html' tool that
  has been discussed and enhanced to demonstrate various
  techniques and extension modules.  The purpose of this tool, at
  heart, is to let me write articles like this one in a "smart
  ASCII" format that uses email-like conventions to lightly mark
  features like word emphasis, source code, and URL links.  By
  obeying just a few conventions (that are almost the same as I
  would use on Usenet or in email), I can write without much
  clutter, then convert to the HTML you are probably reading.  On
  my own website, "smart ASCII" content I have placed there is
  dynamically delivered as converted HTML also.

  The 'Txt2Html' utility uses a generous collection of regular
  expressions to identify and modify markup patterns in source
  text.  Even though Python's regular expression engine is
  moderately slow (Perl is faster, but Python 2 also adds
  improvements), converting an article like this one takes only a
  couple seconds.  In practice, 'Txt2Html' is more than adequate
  for my own 20k documents, and for the few hits my personal
  website gets.  However, it is easy to imagine a
  not-so-different situation where one was converting
  multi-megabyte documents and/or delivering such dynamicly
  converted content on a high-volume web site.  In such a case,
  Python's string operations, and especially regular expressions,
  would simply be too slow.

  Fortunately, Marc-Andre Lemburg has provided the Python world
  with his blazingly fast and extremely powerful [mxTextTools]
  extension module (written itself in C).  On the minus side, it
  is, frankly, a lot of work to adjust to the conceptual space of
  using [mxTextTools]; and it is a bit laborious to get the kinks
  out of a complex processing task.  In this column, we will take
  a look at working with [mxTextTools], and implement a speedup
  for 'Txt2Html'.


BENCHMARKS
------------------------------------------------------------------------

  A familiar computer-industry paraphrase of Mark Twain dictates
  that there are "Lies, Damn Lies, and Benchmarks."  I will not
  argue with that; and certainly do not want readers to put too
  great an import on the timings below.  Nonetheless, in writing
  this article, I wanted to get some sense of just how fast
  [mxTextTools] might be.  So here is a rough idea.

  Although 'Txt2Html' does a variety of transformation tricks at
  both block and inline levels, the feature I expect to encounter
  the most in "smart ASCII" is inline markup of words and phrases
  as '<code>', '<em>', '<strong>' and the like.  This markup is
  converted by the function 'Typographify()' (not all at once,
  however, but in paragraph blocks).  Therefore, I decided to
  make conversion of this function my first [mxTextTools] task.
  In retrospect, some later efforts at timing show that not
  nearly as much time is spent in 'Typographify()' as I had
  expected; therefore, the speedup in 'Txt2Html' as a whole is
  not huge.  I believe, however, that converting other elements
  of the program to take advantage of [mxTextTools] would provide
  additional speed gains.

  In order to get a timable test case, I concatenated 110 copies
  of a recent article to get a file a bit over 2mb. 'wc' reports
  that it has 41k lines and 300k words.  The self-test in
  [mxTypographify] processes an entire input as one text block
  (unlike 'Txt2Html'), first using an [mxTextTools] version of
  'Typographify()', then using the old [re] version.  The time
  spent in each function call is reported using 'time.time()'
  calls that surround only the calls themselves (to eliminate any
  issues of file opening, program startup, etc).

  I reduced the processing time of the same test file from about
  34 seconds to about 12 seconds on one Linux test machine
  (running Python 1.5.2).  In other words, [mxTextTools] has
  given me about a 3x speedup over what I get with the [re]
  module, at least for my requirement.  My results,
  naturally, are specific to the task I am addressing.  However,
  I have been careful enough with both versions of
  'Typographify()' that I doubt large improvements are possible
  while still using the same extension module for each (but
  readers' improvements are welcomed).  Based on such a 3x
  speedup, [mxTextTools] definitely looks like a useful and
  promising module for text processing.


HOW DOES [mxTextTools] WORK?
------------------------------------------------------------------------

  For the computer scientists, [mxTextTools] looks an awful lot
  like a Turing Machine.  You have a set of states (tag tuples)
  together with a tape (read buffer).  Each state can do one of
  two things, jump on failure to one next state or jump on success
  to another next state (they might be the same).  On failure,
  the read head is restored to the same position it was at before
  the state evaluation.  However, unlike a standard definition of
  a Turing Machine, one state can evaluate multiple tape symbols
  to decide on success or failure.  And also unlike a standard
  Turing Machine, little state machines are typically nested
  inside one another, with success or failure propogating upward.

  All the Turing/state machine talk is a bit formalistic.  The
  short characterization is that [mxTextTools] can do everything
  regular expressions can, plus some things regular expressions
  cannot.  Readers can get a better idea by diving into some
  code.  Let's take a bit from this column's project:

      #----------- Detector for isolated punctuation ----------#
      stray_punct = \
       ( emit_misc, Table+CallTag,           # Pickup any cases where multiple (1)
         ( (None, IsInSet, punct_set),       # punctuation character occur     (2)
           (None, AllInSet, punct_set),      # alone (followed by whitespace)  (3)
           (None, IsInSet, whitespace_set),                                    (4)
           (None, Skip, -1)                  # Give back the whitespace        (5)
         )
       )

  The [mxTypographify] module's overal purpose is to read through
  a string, and determine if the "next thing" matches one of the
  markup patterns used in "smart ASCII".  Or rather, it better
  match some pattern or it just will not know what action to
  take for the next bit of text.  The "tag tuple" bound to the
  name "stray_punct" is mostly a fallback condition.  It detects
  the condition where the "next thing" is some punctuation
  symbols standing alone without abutting any words.  In most
  cases, I don't -want- "smart ASCII" to contain such a pattern,
  but [mxTypographify] has to do -something- with them if they
  are encountered.  Let's walk through the lines above, this
  example is a good illustration of what [mxTextTools] does
  generally.

    (1) Declare what action to perform and what type of pattern
        to look for.  In this example, we are using a callback
        function 'emit_misc' in the case of a match.  The
        function gets passed arguments that tell it about the
        details of the match, if found.  We might also append the
        match to a "taglist," or just do nothing if "None" is
        specified.  The second tuple element expresses two things
        in the example.  It tries to match a table (the tuple
        that follows), and if it does match it calls the callback
        function.

    (2) Lines 2-5 are all part of the table/tuple indicated in
        line (1).  In other words the success or failure of the
        whole tuple in 2-5 determines the correspondng success or
        failure of "stray_punct" as a whole.  Line (2) itself
        uses the 'IsInSet' command to try to match the current
        head position against a collection of values (in this
        case "punct_set" which is defined elsewhere in the
        program).  Commands that start with "Is" match exactly
        one character, or fail.  Should line (2) fail, it will
        pass the failure back up to "stray_punct" which then
        fails as a whole.  However, had we specified another
        tuple element in line (2), we could have jumped to
        another line in case of failure.  By default, on success
        we jump to the next line; however, by specifying the
        fifth tuple position we could make the jump go somewhere
        other than the next state (i.e. default is +1).

    (3) Operates much like line (2) does.  However, the
        'AllInSet' command matches as many characters as it can,
        not just one like 'Is*' commands do.  In order to
        succeed, 'All*' commands must match at least one
        character.  The read head is advanced with each matching
        character encountered.  As above, failure is propogated
        up to "stray_punct".  The 'All*' commands are very similar
        to the "+" operator in regular expressions.

    (4) This is basically just like line (2).  The only differnce
        is that a different set of characters is matched
        (whitespace characters, in the example).

    (5) The 'Skip' command is a little subtle, and very
        important.  'Skip' -always- successfully matches the
        current read head position, then causes the read head to
        move by the amount specified in the next element.  The
        match that is propogated up to "stray_punct" includes
        only the characters between the initial and current read
        head position.  In the example, this means that the
        whitespace must be matched for "stray_punct" to succeed,
        but that whitespace is not included in the actual match.
        In terms of regular expressions, this resembles the
        "(?:\s)" pattern.

  A few example of matches and failures might help here (assume
  the read head is at the beginning of each line):

      #---------- stray_punct successes and failures ----------#
      -- spam      # matches "--"
      & spam       # fails at "AllInSet" since (2) advanced head
      #@$ %% spam  # matches "#@$"
      **spam       # fails (whitespace isn't encountered before 's')

   After each success, the read head is at the space right before
   the next word "spam" or "%%".  After a failure, the read head
   remains where it started out (at the beginning of the line).

ASSEMBLING A TEXT PROCESSOR
------------------------------------------------------------------------

  Putting together a complete processing application (or
  function) is largely a matter of defining all the needed match
  components (like "stray_punct").  The module [mxTypographify]
  contains a more fleshed out processor, but we can look at the
  essential structure here:

      #----------------- "smart ASCII" tagger -----------------#
      # Tag all the (possibly marked-up) words
      tag_words = \
       ( stray_punct+ (+1,),
         emphs   + (+1,),                    # -Emphasized- words
         funcs   + (+1,),                    # 'Functions()'
         strongs + (+1,),                    # *Strong* emphasis
         plain_words + (+1,),                # Unadorned words
         (emit_misc, AllInSet+CallTag,
                leadout_set,+1,+1),          # Lead-out eater
         (jump_count, Skip+CallTag, 0),      # Check for infinite loop
         (None, EOF, Here, -7)               # Check for EOF
       )

      tag(txt, tag_words, 0, len(txt), None)

  This code deserves a bit of further explanation.

  Like "stray_punct," "emphs," "funcs," "strongs," and
  "plain_words" contain tuples with tag tables.  They each have
  their appropriate callback functions (all "emitters" of various
  names, because they "emit" the match along with surrounding
  HTML codes if needed).  The first five lines each have a "+1"
  appended to their tuple; what this does is specify where to
  jump in case of a match failure.  That is, even if these
  patterns fail to match, we continue on--with the read head in
  the same position--to try matching against the other patterns.

  After the basic word patterns each attempt a match, we get to
  the "Lead-out eater" line.  In my terminology a "leadout" is
  the opposite of a "leadin".  That is, the latter are things
  that might precede a word pattern, and the former are things that
  might follow a word pattern.  The "leadout_set" includes
  whitespace characters, but it also includes things like comma,
  period, question-mark, which might end a word.  The "Lead-out
  eater" uses a callback function too.  As designed, I preserve
  exactly the whitespace the input has.  However, it would be
  easy to normalize whitespace here by emitting something other
  than the actual match (e.g. a single space always).

  The 'jump_count' is extremely important, but let us come back
  to it.  For now, it is enough to say that we -hope- the line
  never does anything.

  The 'EOF' line is our flow control, in a way.  The call made by
  this line is to 'None', which is to say that nothing is
  actually -done- with any match.  The command 'EOF' is the
  important thing ('Here' is just a filler value that occupies
  the tuple position).  It succeeds if the read head is past the
  end of the read buffer.  On success, the whole tag table
  "tag_words" succeeds; and having succeeded, processing stops.
  'EOF' failure is more interesting.  Assuming we haven't reached
  the end of our string, we jump -7 states (to "stray_punct").
  From there, the whole process starts over; hopefully with the
  read head advanced to the next word.  By looping back to the
  start of the list tag tuples, we continue eating successive
  word patterns until the read buffer is exhausted (calling
  callbacks along the way).

  The 'tag()' call simply launches processing of the tag table we
  pass to it (against the read buffer contained in 'txt').  In
  our case, we do not care about the return value of 'tag()'
  since everything is handled in callbacks.  However, in cases
  where the tag table does not loop itself, part of the
  return value can be used to determine if there is reason to
  call 'tag()' again with a tail of the read buffer.


DEBUGGING A TAG TABLE
------------------------------------------------------------------------

  Describing it is easy, but I spent a large number of hours
  finding the exact collection of tag tables that would match
  every pattern I was interested in without mismatching any
  pattern as something it wasn't.  While "smart ASCII" markup
  seems pretty simple, there are actually quite a few
  complications (e.g. markup characters being used in non-markup
  contexts, or markup characters and other punctuation appearing
  in various sequences).  Any structured document format that is
  complicated enough to warrant using [mxTextTools] instead of
  [string] is likely to have similar complications.

  Without question, the worst thing that can go wrong in a
  looping state pattern like the above is that -none- of the
  listed states match from the current read head position.  If
  that happens, your program winds up in a tight infinite loop
  (entirely inside the extension module, so you cannot really get
  at it with Python code directly).  I wound up forcing a manual
  kill of the process -countless- times during development.

  Only near the end of my project with [mxTextTools] did I hit on
  a good way to handle the infinite loop problem.  This is the
  'jump_count' callback.  Let me present my function as a
  reference:

      #------- mxTypographify.pypinfinite loop catcher --------#
      def jump_count(taglist,txt,l,r,subtag):
          global head_pos
          if head_pos is None: head_pos = r
          elif head_pos == r:
              raise "InfiniteLoopError", \
                    txt[l-20:l]+'{'+txt[l]+'}'+txt[l+1:r+15]
          else: head_pos = r

  The basic purpose of 'jump_count' is simple:  we want to catch
  the situation where our tag table has been run through multiple
  times without matching anything.  The simplest way to do this
  is to check whether the last read head position is the same as
  the current.  If it is, more loops cannot get anywhere, since
  we have reached the exact same state twice, and the same thing
  is fated to happen forever.  In my example, I simply raise an
  error to stop the program (I also report a little bit of buffer
  context in case I can eyeball what is going on).  However, it
  would also be possible to manually move the read head, and try
  again from a different starting position.

  Not as bad as an infinite loop, but still undesirable, is
  having patterns within a tag table match when they are not
  supposed to or not match when they are suppose to (but
  something else has to match, or we would have an infinite loop
  issue).  Using callbacks everywhere makes examining this
  situation much easier.  During development, I frequently
  created temporary changes to my 'emit_*' callbacks to print or
  log when certain emitters get called.  By looking at output
  from these temporary 'print' statements, most times one can
  tell where the problem lies.


CONCLUSION
------------------------------------------------------------------------

  If text processing speed is crucial to your project, you can
  probably still use Python by adding in [mxTextTools].  Getting
  up to speed with the extension module involves adjusting your
  thinking from usual [string] and [re] techniques.  But with
  some planning and effort, [mxTextTools] can match or beat the
  performance you would find in a low level language like C (or a
  high level one with tuned regular expressions, like Perl).


RESOURCES
------------------------------------------------------------------------

  Obtaining [TextTools] is unfortunately a bit harder than it
  should be (depending on platform).  Your starting point should
  be the project homepage at:

    http://www.lemburg.com/files/python/mxTextTools.html

  On that homepage, you can download the module archive at:

    http://www.lemburg.com/files/python/mxTextTools-1.1.1.zip

  However, if you are installing on a Windows platform, the
  required DLL is not included in the package (despite the
  documentation claiming otherwise).  You can compile it yourself
  if you happen to have VC++ by following the instructions on the
  project homepage (so I presume, I have not done it).
  Hopefully, the packaging will be improved with time (or will
  even become part of "standard" Python distributions).

  An introductory discussion of text processing techniques in
  Python is contained in my earlier column:

    _Charming Python #5: Text Processing in Python_
    http://gnosis.cx/publish/programming/charming_python_5.html

  The public-domain 'Txt2Html' utility, including the
  [mxTypographify] module discussed in this column, can be
  downloaded from:

    http://gnosis.cx/download/txt2html.zip


ABOUT THE AUTHOR
------------------------------------------------------------------------

  {Picture of Author: http://gnosis.cx/cgi-bin/img_dqm.cgi}
  David Mertz wonders if he could pass the Turing test (and
  whether this article has).  David may be reached at
  mertz@gnosis.cx; his life pored over at
  http://gnosis.cx/publish/.  Suggestions and recommendations on
  this, past, or future, columns are welcomed.