Charming Python #14 (20000061)

Text processing in Python with mxTextTools: Advanced Tips


David Mertz, Ph.D.
Whirling dervish, Gnosis Software, Inc.
January, 2001

mxTextTools is a powerful extension module that lets Python programmers flexibly process text files at lightning speed. The speed comes at the price of some work, however. The tips and techniques in this column will guide a reader toward developing text processing programs using mxTextTools.

What Is Python?

Python is a freely available, very-high-level, interpreted language developed by Guido van Rossum. It combines a clear syntax with powerful (but optional) object-oriented semantics. Python is available for almost every computer platform you might find yourself working on, and has strong portability between platforms.

Introduction

One of the strengths of Python is its good set of text processing tools. The inherent tools are powerful, flexible, and easy to work with. One thing Python's built-in text processing is not, however, is particularly fast. Mind you, for a lot of things, Python by itself is as fast as you need. But for a certain class of problems, Python has a problem.

Readers of this column will remember the Txt2Html tool that has been discussed and enhanced to demonstrate various techniques and extension modules. The purpose of this tool, at heart, is to let me write articles like this one in a "smart ASCII" format that uses email-like conventions to lightly mark features like word emphasis, source code, and URL links. By obeying just a few conventions (that are almost the same as I would use on Usenet or in email), I can write without much clutter, then convert to the HTML you are probably reading. On my own website, the "smart ASCII" content I have placed there is also dynamically delivered as converted HTML.
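For instance, using the conventions this column's own code handles, a sentence written as:

  A *strong* point, an -emphasized- word, and a little 'code()'.

comes out the other end approximately as:

  A <strong>strong</strong> point, an <em>emphasized</em> word, and a
  little <code>code()</code>.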

The Txt2Html utility uses a generous collection of regular expressions to identify and modify markup patterns in source text. Even though Python's regular expression engine is moderately slow (Perl is faster, but Python 2 also adds improvements), converting an article like this one takes only a couple seconds. In practice, Txt2Html is more than adequate for my own 20k documents, and for the few hits my personal website gets. However, it is easy to imagine a not-so-different situation where one was converting multi-megabyte documents and/or delivering such dynamically converted content on a high-volume web site. In such a case, Python's string operations, and especially regular expressions, would simply be too slow.
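To give an idea of the re style being compared, a single inline transformation might look something like the sketch below. This is a simplification; the actual Txt2Html patterns are considerably more careful about surrounding context (so that, for example, a lone asterisk is not treated as markup):

import re

# Simplified stand-in for one Txt2Html inline substitution
strong_pat = re.compile(r'\*([^*]+)\*')
def strongify(para):
    return strong_pat.sub(r'<strong>\1</strong>', para)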

Fortunately, Marc-Andre Lemburg has provided the Python world with his blazingly fast and extremely powerful mxTextTools extension module (written itself in C). On the minus side, it is, frankly, a lot of work to adjust to the conceptual space of using mxTextTools; and it is a bit laborious to get the kinks out of a complex processing task. In this column, we will take a look at working with mxTextTools, and implement a speedup for Txt2Html.

Benchmarks

A familiar computer-industry paraphrase of Mark Twain dictates that there are "Lies, Damn Lies, and Benchmarks." I will not argue with that; and certainly do not want readers to place too much weight on the timings below. Nonetheless, in writing this article, I wanted to get some sense of just how fast mxTextTools might be. So here is a rough idea.

Although Txt2Html does a variety of transformation tricks at both block and inline levels, the feature I expect to encounter the most in "smart ASCII" is inline markup of words and phrases as <code>, <em>, <strong> and the like. This markup is converted by the function Typographify() (not all at once, however, but in paragraph blocks). Therefore, I decided to make conversion of this function my first mxTextTools task. In retrospect, some later efforts at timing show that not nearly as much time is spent in Typographify() as I had expected; therefore, the speedup in Txt2Html as a whole is not huge. I believe, however, that converting other elements of the program to take advantage of mxTextTools would provide additional speed gains.

In order to get a test case worth timing, I concatenated 110 copies of a recent article to get a file of a bit over 2 megabytes. wc reports that it has 41k lines and 300k words. The self-test in mxTypographify processes an entire input as one text block (unlike Txt2Html), first using an mxTextTools version of Typographify(), then using the old re version. The time spent in each function call is reported using time.time() calls that surround only the calls themselves (to eliminate any issues of file opening, program startup, etc).
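The timing pattern itself is nothing fancier than this sketch (where Typographify() stands in for whichever version is being measured, and is assumed to take the whole input string):

import time

start = time.time()
converted = Typographify(txt)    # the only call being timed
print "Typographify: %.2f seconds" % (time.time() - start)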

I reduced the processing time of the same test file from about 34 seconds to about 12 seconds on one Linux test machine (running Python 1.5.2). In other words, mxTextTools has given me about a 3x speedup over what I get with the re module, at least for my requirements. My results, naturally, are specific to the task I am addressing. However, I have been careful enough with both versions of Typographify() that I doubt large improvements are possible while still using the same extension module for each (but readers' improvements are welcomed). Based on such a 3x speedup, mxTextTools definitely looks like a useful and promising module for text processing.

How Does mxTextTools Work?

For the computer scientists, mxTextTools looks an awful lot like a Turing Machine. You have a set of states (tag tuples) together with a tape (the read buffer). Each state can do one of two things: jump on failure to one next state, or jump on success to another next state (they might be the same). On failure, the read head is restored to the position it occupied before the state was evaluated. However, unlike in a standard definition of a Turing Machine, one state can evaluate multiple tape symbols to decide on success or failure. And also unlike a standard Turing Machine, little state machines are typically nested inside one another, with success or failure propagating upward.

All the Turing/state machine talk is a bit formalistic. The short characterization is that mxTextTools can do everything regular expressions can, plus some things regular expressions cannot. Readers can get a better idea by diving into some code. Let's take a bit from this column's project:

Detector for isolated punctuation

stray_punct = \
 ( emit_misc, Table+CallTag,           # Pick up any cases where multiple (1)
   ( (None, IsInSet, punct_set),       # punctuation characters occur    (2)
     (None, AllInSet, punct_set),      # alone (followed by whitespace)  (3)
     (None, IsInSet, whitespace_set),  # Needs one whitespace character  (4)
     (None, Skip, -1)                  # Give back the whitespace        (5)
   )
 )

The mxTypographify module's overall purpose is to read through a string and determine whether the "next thing" matches one of the markup patterns used in "smart ASCII". Or rather, the next thing had better match some pattern, or the module will not know what action to take for that bit of text. The tag tuple bound to the name "stray_punct" is mostly a fallback condition. It detects the condition where the "next thing" is some punctuation symbols standing alone, without abutting any words. In most cases, I don't want "smart ASCII" to contain such a pattern, but mxTypographify has to do something with them if they are encountered. Let's walk through the lines above; this example is a good illustration of what mxTextTools does generally.

(1) Declare what action to perform and what type of pattern to look for. In this example, we use the callback function emit_misc in the case of a match. The function is passed arguments describing the details of the match, if one is found (a minimal sketch of such a callback appears after this list). We might instead simply append the match to a "taglist", or do nothing at all if None is specified. The second tuple element expresses two things in the example: it tries to match a table (the tuple that follows), and if the table matches, it calls the callback function.
(2) Lines 2-5 are all part of the table/tuple indicated in line (1). In other words, the success or failure of the whole tuple in lines 2-5 determines the corresponding success or failure of "stray_punct" as a whole. Line (2) itself uses the IsInSet command to try to match the current head position against a collection of values (in this case "punct_set", which is defined elsewhere in the program). Commands that start with "Is" match exactly one character, or fail. Should line (2) fail, it passes the failure back up to "stray_punct", which then fails as a whole. However, had we specified another tuple element in line (2), we could have jumped to another line in case of failure. By default, on success we jump to the next line; but by specifying the fifth tuple position, we could make the jump go somewhere other than the next state (i.e. the default is +1).
(3) Operates much like line (2). However, the AllInSet command matches as many characters as it can, not just one as the Is* commands do. In order to succeed, All* commands must match at least one character. The read head advances with each matching character encountered. As above, failure is propagated up to "stray_punct". The All* commands are very similar to the "+" operator in regular expressions.
(4) This is basically just like line (2). The only difference is that a different set of characters is matched (whitespace characters, in this example).
(5) The Skip command is a little subtle, and very important. Skip always successfully matches at the current read head position, then moves the read head by the amount specified in the next element. The match that is propagated up to "stray_punct" includes only the characters between the initial and current read head positions. In the example, this means that whitespace must be matched for "stray_punct" to succeed, but that the whitespace is not included in the actual match. In terms of regular expressions, this resembles a lookahead pattern like "(?=\s)": the whitespace must be there, but it is not consumed.
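As promised in item (1), here is what a minimal emitter callback might look like. This is only a sketch: the real emit_misc in mxTypographify does somewhat more, and the assumption here is that output is accumulated on the taglist passed in. Every callback, however, receives the same five arguments (the taglist, the text buffer, the left and right match positions, and any nested matches):

def emit_misc(taglist, txt, l, r, subtags):
    # Append the matched slice to the output, unchanged
    taglist.append(txt[l:r])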

A few examples of matches and failures might help here (assume the read head is at the beginning of each line):

stray_punct successes and failures

-- spam      # matches "--"
& spam       # fails at (3): only one punctuation character
#@$ %% spam  # matches "#@$"
**spam       # fails (whitespace isn't encountered before 's')

After each success, the read head is at the space right before the next word "spam" or "%%". After a failure, the read head remains where it started out (at the beginning of the line).
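For comparison, the whole of "stray_punct" behaves roughly like the regular expression below. The character class is only illustrative (punct_set is defined elsewhere in mxTypographify), the "{2,}" reflects the IsInSet-plus-AllInSet pair, and the lookahead mirrors the Skip give-back:

import re

punct = re.escape("!@#$%^&*-+=|")   # illustrative stand-in for punct_set
stray_punct_re = re.compile(r'[%s]{2,}(?=\s)' % punct)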

Assembling A Text Processor

Putting together a complete processing application (or function) is largely a matter of defining all the needed match components (like "stray_punct"). The module mxTypographify contains a more fleshed out processor, but we can look at the essential structure here:

"smart ASCII" tagger

# Tag all the (possibly marked-up) words
tag_words = \
 ( stray_punct + (+1,),                # Isolated punctuation
   emphs   + (+1,),                    # -Emphasized- words
   funcs   + (+1,),                    # 'Functions()'
   strongs + (+1,),                    # *Strong* emphasis
   plain_words + (+1,),                # Unadorned words
   (emit_misc, AllInSet+CallTag,
          leadout_set,+1,+1),          # Lead-out eater
   (jump_count, Skip+CallTag, 0),      # Check for infinite loop
   (None, EOF, Here, -7)               # Check for EOF
 )

tag(txt, tag_words, 0, len(txt), None)

This code deserves a bit of further explanation.

Like "stray_punct," "emphs," "funcs," "strongs," and "plain_words" contain tuples with tag tables. They each have their appropriate callback functions (all "emitters" of various names, because they "emit" the match along with surrounding HTML codes if needed). The first five lines each have a "+1" appended to their tuple; what this does is specify where to jump in case of a match failure. That is, even if these patterns fail to match, we continue on--with the read head in the same position--to try matching against the other patterns.

After the basic word patterns each attempt a match, we get to the "Lead-out eater" line. In my terminology, a "leadout" is the opposite of a "leadin": the latter are things that might precede a word pattern, and the former are things that might follow one. The "leadout_set" includes whitespace characters, but also characters like commas, periods, and question marks, which might end a word. The "Lead-out eater" uses a callback function too. As designed, I preserve exactly the whitespace the input has. However, it would be easy to normalize whitespace here by emitting something other than the actual match (e.g. a single space always).
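Following that suggestion, a normalizing lead-out eater need be no more than this sketch (again assuming output is accumulated on the taglist):

def emit_single_space(taglist, txt, l, r, subtags):
    # Emit one space regardless of the actual lead-out matched
    # (per the suggestion above; note this also flattens punctuation
    # lead-outs like commas, so a real version would be more selective)
    taglist.append(' ')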

The jump_count line is extremely important, but let us come back to it. For now, it is enough to say that we hope the line never does anything.

The EOF line is our flow control, in a way. The callback given in this line is None, which is to say that nothing is actually done with any match. The command EOF is the important thing (Here is just a filler value that occupies the tuple position). It succeeds if the read head is past the end of the read buffer. On success, the whole tag table "tag_words" succeeds; and having succeeded, processing stops. EOF failure is more interesting. Assuming we haven't reached the end of our string, we jump -7 states (back to "stray_punct"). From there, the whole process starts over, hopefully with the read head advanced to the next word. By looping back to the start of the list of tag tuples, we continue eating successive word patterns until the read buffer is exhausted (calling callbacks along the way).

The tag() call simply launches processing of the tag table we pass to it (against the read buffer contained in txt). In our case, we do not care about the return value of tag() since everything is handled in callbacks. However, in cases where the tag table does not loop itself, part of the return value can be used to determine if there is reason to call tag() again with a tail of the read buffer.
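The return value, for reference, is a tuple of a success flag, the taglist, and the next read position. Driving a non-looping table from Python would look roughly like this sketch (word_table is an assumed name for such a table):

pos, stop = 0, len(txt)
while pos < stop:
    ok, matches, next_pos = tag(txt, word_table, pos, stop)
    if not ok:
        break                # nothing matched at this position
    pos = next_pos           # continue with the tail of the buffer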

Debugging A Tag Table

Describing it is easy, but I spent a large number of hours finding the exact collection of tag tables that would match every pattern I was interested in, without mismatching any pattern as something it wasn't. While "smart ASCII" markup seems pretty simple, there are actually quite a few complications (e.g. markup characters being used in non-markup contexts, or markup characters and other punctuation appearing in various sequences). Any structured document format complicated enough to warrant using mxTextTools instead of the string module is likely to have similar complications.

Without question, the worst thing that can go wrong in a looping state pattern like the above is that none of the listed states match from the current read head position. If that happens, your program winds up in a tight infinite loop (entirely inside the extension module, so you cannot really get at it with Python code directly). I wound up forcing a manual kill of the process countless times during development.

Only near the end of my project with mxTextTools did I hit on a good way to handle the infinite loop problem. This is the jump_count callback. Let me present my function as a reference:

mxTypographify.py infinite loop catcher

head_pos = None   # module-level record of the last read head position

def jump_count(taglist, txt, l, r, subtag):
    global head_pos
    if head_pos is None: head_pos = r
    elif head_pos == r:
        # Read head did not advance since the last pass: report a bit
        # of context around the stuck position and bail out
        raise "InfiniteLoopError", \
              txt[l-20:l]+'{'+txt[l]+'}'+txt[l+1:r+15]
    else: head_pos = r

The basic purpose of jump_count is simple: we want to catch the situation where our tag table has been run through multiple times without matching anything. The simplest way to do this is to check whether the last read head position is the same as the current. If it is, more loops cannot get anywhere, since we have reached the exact same state twice, and the same thing is fated to happen forever. In my example, I simply raise an error to stop the program (I also report a little bit of buffer context in case I can eyeball what is going on). However, it would also be possible to manually move the read head, and try again from a different starting position.

Not as bad as an infinite loop, but still undesirable, is having patterns within a tag table match when they are not supposed to, or fail to match when they are supposed to (but something else has to match, or we would have an infinite loop issue). Using callbacks everywhere makes examining this situation much easier. During development, I frequently created temporary changes to my emit_* callbacks to print or log when certain emitters get called. By looking at output from these temporary print statements, most times one can tell where the problem lies.
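Such a temporary change need be nothing fancier than a print at the top of a suspect emitter; something like this sketch (the taglist.append() line is only a guess at what a real emitter body does):

def emit_emphs(taglist, txt, l, r, subtags):
    print 'emit_emphs:', repr(txt[l:r])     # temporary debugging line
    taglist.append('<em>%s</em>' % txt[l+1:r-1])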

Conclusion

If text processing speed is crucial to your project, you can probably still use Python by adding in mxTextTools. Getting up to speed with the extension module involves adjusting your thinking away from the usual string and re techniques. But with some planning and effort, mxTextTools can match or beat the performance you would find in a low-level language like C (or a high-level one with tuned regular expressions, like Perl).

Resources

Obtaining mxTextTools is unfortunately a bit harder than it should be (depending on platform). Your starting point should be the project homepage at:

http://www.lemburg.com/files/python/mxTextTools.html

On that homepage, you can download the module archive at:

http://www.lemburg.com/files/python/mxTextTools-1.1.1.zip

However, if you are installing on a Windows platform, the required DLL is not included in the package (despite the documentation claiming otherwise). You can compile it yourself if you happen to have VC++ by following the instructions on the project homepage (or so I presume; I have not done it myself). Hopefully, the packaging will be improved with time (or the module will even become part of "standard" Python distributions).

An introductory discussion of text processing techniques in Python is contained in my earlier column:

Charming Python #5: Text Processing in Python http://gnosis.cx/publish/programming/charming_python_5.html

The public-domain Txt2Html utility, including the mxTypographify module discussed in this column, can be downloaded from:

http://gnosis.cx/download/txt2html.zip

About The Author

David Mertz wonders if he could pass the Turing test (and whether this article has). David may be reached at [email protected]; his life pored over at http://gnosis.cx/publish/. Suggestions and recommendations on this, past, or future columns are welcome.