CHARMING PYTHON #14 (20000061) Text processing in Python with mxTextTools: Advanced Tips David Mertz, Ph.D. Whirling dervish, Gnosis Software, Inc. January, 2001 [mxTextTools] is a powerful extension module that lets Python programmers flexibly process text files at a lightning fast speed. Speed comes at the price of some work, however. The tips and techniques in this column will guide a reader towards developing text processing programs utilizing [mxTextTools]. WHAT IS PYTHON? ------------------------------------------------------------------------ Python is a freely available, very-high-level, interpreted language developed by Guido van Rossum. It combines a clear syntax with powerful (but optional) object-oriented semantics. Python is available for almost every computer platform you might find yourself working on, and has strong portability between platforms. INTRODUCTION ------------------------------------------------------------------------ One of the strengths of Python is a good set of text processing tools. The inherent tools are powerful, flexible, and easy to work with. One thing Python's built-in text processing is -not- is particularly fast. Mind you, for a lot of thing, Python by itself is as fast as you need. But for a certain class of problems, Python has a problem. Readers of this column will remember the 'Txt2Html' tool that has been discussed and enhanced to demonstrate various techniques and extension modules. The purpose of this tool, at heart, is to let me write articles like this one in a "smart ASCII" format that uses email-like conventions to lightly mark features like word emphasis, source code, and URL links. By obeying just a few conventions (that are almost the same as I would use on Usenet or in email), I can write without much clutter, then convert to the HTML you are probably reading. On my own website, "smart ASCII" content I have placed there is dynamically delivered as converted HTML also. The 'Txt2Html' utility uses a generous collection of regular expressions to identify and modify markup patterns in source text. Even though Python's regular expression engine is moderately slow (Perl is faster, but Python 2 also adds improvements), converting an article like this one takes only a couple seconds. In practice, 'Txt2Html' is more than adequate for my own 20k documents, and for the few hits my personal website gets. However, it is easy to imagine a not-so-different situation where one was converting multi-megabyte documents and/or delivering such dynamicly converted content on a high-volume web site. In such a case, Python's string operations, and especially regular expressions, would simply be too slow. Fortunately, Marc-Andre Lemburg has provided the Python world with his blazingly fast and extremely powerful [mxTextTools] extension module (written itself in C). On the minus side, it is, frankly, a lot of work to adjust to the conceptual space of using [mxTextTools]; and it is a bit laborious to get the kinks out of a complex processing task. In this column, we will take a look at working with [mxTextTools], and implement a speedup for 'Txt2Html'. BENCHMARKS ------------------------------------------------------------------------ A familiar computer-industry paraphrase of Mark Twain dictates that there are "Lies, Damn Lies, and Benchmarks." I will not argue with that; and certainly do not want readers to put too great an import on the timings below. Nonetheless, in writing this article, I wanted to get some sense of just how fast [mxTextTools] might be. So here is a rough idea. Although 'Txt2Html' does a variety of transformation tricks at both block and inline levels, the feature I expect to encounter the most in "smart ASCII" is inline markup of words and phrases as '', '', '' and the like. This markup is converted by the function 'Typographify()' (not all at once, however, but in paragraph blocks). Therefore, I decided to make conversion of this function my first [mxTextTools] task. In retrospect, some later efforts at timing show that not nearly as much time is spent in 'Typographify()' as I had expected; therefore, the speedup in 'Txt2Html' as a whole is not huge. I believe, however, that converting other elements of the program to take advantage of [mxTextTools] would provide additional speed gains. In order to get a timable test case, I concatenated 110 copies of a recent article to get a file a bit over 2mb. 'wc' reports that it has 41k lines and 300k words. The self-test in [mxTypographify] processes an entire input as one text block (unlike 'Txt2Html'), first using an [mxTextTools] version of 'Typographify()', then using the old [re] version. The time spent in each function call is reported using 'time.time()' calls that surround only the calls themselves (to eliminate any issues of file opening, program startup, etc). I reduced the processing time of the same test file from about 34 seconds to about 12 seconds on one Linux test machine (running Python 1.5.2). In other words, [mxTextTools] has given me about a 3x speedup over what I get with the [re] module, at least for my requirement. My results, naturally, are specific to the task I am addressing. However, I have been careful enough with both versions of 'Typographify()' that I doubt large improvements are possible while still using the same extension module for each (but readers' improvements are welcomed). Based on such a 3x speedup, [mxTextTools] definitely looks like a useful and promising module for text processing. HOW DOES [mxTextTools] WORK? ------------------------------------------------------------------------ For the computer scientists, [mxTextTools] looks an awful lot like a Turing Machine. You have a set of states (tag tuples) together with a tape (read buffer). Each state can do one of two things, jump on failure to one next state or jump on success to another next state (they might be the same). On failure, the read head is restored to the same position it was at before the state evaluation. However, unlike a standard definition of a Turing Machine, one state can evaluate multiple tape symbols to decide on success or failure. And also unlike a standard Turing Machine, little state machines are typically nested inside one another, with success or failure propogating upward. All the Turing/state machine talk is a bit formalistic. The short characterization is that [mxTextTools] can do everything regular expressions can, plus some things regular expressions cannot. Readers can get a better idea by diving into some code. Let's take a bit from this column's project: #----------- Detector for isolated punctuation ----------# stray_punct = \ ( emit_misc, Table+CallTag, # Pickup any cases where multiple (1) ( (None, IsInSet, punct_set), # punctuation character occur (2) (None, AllInSet, punct_set), # alone (followed by whitespace) (3) (None, IsInSet, whitespace_set), (4) (None, Skip, -1) # Give back the whitespace (5) ) ) The [mxTypographify] module's overal purpose is to read through a string, and determine if the "next thing" matches one of the markup patterns used in "smart ASCII". Or rather, it better match some pattern or it just will not know what action to take for the next bit of text. The "tag tuple" bound to the name "stray_punct" is mostly a fallback condition. It detects the condition where the "next thing" is some punctuation symbols standing alone without abutting any words. In most cases, I don't -want- "smart ASCII" to contain such a pattern, but [mxTypographify] has to do -something- with them if they are encountered. Let's walk through the lines above, this example is a good illustration of what [mxTextTools] does generally. (1) Declare what action to perform and what type of pattern to look for. In this example, we are using a callback function 'emit_misc' in the case of a match. The function gets passed arguments that tell it about the details of the match, if found. We might also append the match to a "taglist," or just do nothing if "None" is specified. The second tuple element expresses two things in the example. It tries to match a table (the tuple that follows), and if it does match it calls the callback function. (2) Lines 2-5 are all part of the table/tuple indicated in line (1). In other words the success or failure of the whole tuple in 2-5 determines the correspondng success or failure of "stray_punct" as a whole. Line (2) itself uses the 'IsInSet' command to try to match the current head position against a collection of values (in this case "punct_set" which is defined elsewhere in the program). Commands that start with "Is" match exactly one character, or fail. Should line (2) fail, it will pass the failure back up to "stray_punct" which then fails as a whole. However, had we specified another tuple element in line (2), we could have jumped to another line in case of failure. By default, on success we jump to the next line; however, by specifying the fifth tuple position we could make the jump go somewhere other than the next state (i.e. default is +1). (3) Operates much like line (2) does. However, the 'AllInSet' command matches as many characters as it can, not just one like 'Is*' commands do. In order to succeed, 'All*' commands must match at least one character. The read head is advanced with each matching character encountered. As above, failure is propogated up to "stray_punct". The 'All*' commands are very similar to the "+" operator in regular expressions. (4) This is basically just like line (2). The only differnce is that a different set of characters is matched (whitespace characters, in the example). (5) The 'Skip' command is a little subtle, and very important. 'Skip' -always- successfully matches the current read head position, then causes the read head to move by the amount specified in the next element. The match that is propogated up to "stray_punct" includes only the characters between the initial and current read head position. In the example, this means that the whitespace must be matched for "stray_punct" to succeed, but that whitespace is not included in the actual match. In terms of regular expressions, this resembles the "(?:\s)" pattern. A few example of matches and failures might help here (assume the read head is at the beginning of each line): #---------- stray_punct successes and failures ----------# -- spam # matches "--" & spam # fails at "AllInSet" since (2) advanced head #@$ %% spam # matches "#@$" **spam # fails (whitespace isn't encountered before 's') After each success, the read head is at the space right before the next word "spam" or "%%". After a failure, the read head remains where it started out (at the beginning of the line). ASSEMBLING A TEXT PROCESSOR ------------------------------------------------------------------------ Putting together a complete processing application (or function) is largely a matter of defining all the needed match components (like "stray_punct"). The module [mxTypographify] contains a more fleshed out processor, but we can look at the essential structure here: #----------------- "smart ASCII" tagger -----------------# # Tag all the (possibly marked-up) words tag_words = \ ( stray_punct+ (+1,), emphs + (+1,), # -Emphasized- words funcs + (+1,), # 'Functions()' strongs + (+1,), # *Strong* emphasis plain_words + (+1,), # Unadorned words (emit_misc, AllInSet+CallTag, leadout_set,+1,+1), # Lead-out eater (jump_count, Skip+CallTag, 0), # Check for infinite loop (None, EOF, Here, -7) # Check for EOF ) tag(txt, tag_words, 0, len(txt), None) This code deserves a bit of further explanation. Like "stray_punct," "emphs," "funcs," "strongs," and "plain_words" contain tuples with tag tables. They each have their appropriate callback functions (all "emitters" of various names, because they "emit" the match along with surrounding HTML codes if needed). The first five lines each have a "+1" appended to their tuple; what this does is specify where to jump in case of a match failure. That is, even if these patterns fail to match, we continue on--with the read head in the same position--to try matching against the other patterns. After the basic word patterns each attempt a match, we get to the "Lead-out eater" line. In my terminology a "leadout" is the opposite of a "leadin". That is, the latter are things that might precede a word pattern, and the former are things that might follow a word pattern. The "leadout_set" includes whitespace characters, but it also includes things like comma, period, question-mark, which might end a word. The "Lead-out eater" uses a callback function too. As designed, I preserve exactly the whitespace the input has. However, it would be easy to normalize whitespace here by emitting something other than the actual match (e.g. a single space always). The 'jump_count' is extremely important, but let us come back to it. For now, it is enough to say that we -hope- the line never does anything. The 'EOF' line is our flow control, in a way. The call made by this line is to 'None', which is to say that nothing is actually -done- with any match. The command 'EOF' is the important thing ('Here' is just a filler value that occupies the tuple position). It succeeds if the read head is past the end of the read buffer. On success, the whole tag table "tag_words" succeeds; and having succeeded, processing stops. 'EOF' failure is more interesting. Assuming we haven't reached the end of our string, we jump -7 states (to "stray_punct"). From there, the whole process starts over; hopefully with the read head advanced to the next word. By looping back to the start of the list tag tuples, we continue eating successive word patterns until the read buffer is exhausted (calling callbacks along the way). The 'tag()' call simply launches processing of the tag table we pass to it (against the read buffer contained in 'txt'). In our case, we do not care about the return value of 'tag()' since everything is handled in callbacks. However, in cases where the tag table does not loop itself, part of the return value can be used to determine if there is reason to call 'tag()' again with a tail of the read buffer. DEBUGGING A TAG TABLE ------------------------------------------------------------------------ Describing it is easy, but I spent a large number of hours finding the exact collection of tag tables that would match every pattern I was interested in without mismatching any pattern as something it wasn't. While "smart ASCII" markup seems pretty simple, there are actually quite a few complications (e.g. markup characters being used in non-markup contexts, or markup characters and other punctuation appearing in various sequences). Any structured document format that is complicated enough to warrant using [mxTextTools] instead of [string] is likely to have similar complications. Without question, the worst thing that can go wrong in a looping state pattern like the above is that -none- of the listed states match from the current read head position. If that happens, your program winds up in a tight infinite loop (entirely inside the extension module, so you cannot really get at it with Python code directly). I wound up forcing a manual kill of the process -countless- times during development. Only near the end of my project with [mxTextTools] did I hit on a good way to handle the infinite loop problem. This is the 'jump_count' callback. Let me present my function as a reference: #------- mxTypographify.pypinfinite loop catcher --------# def jump_count(taglist,txt,l,r,subtag): global head_pos if head_pos is None: head_pos = r elif head_pos == r: raise "InfiniteLoopError", \ txt[l-20:l]+'{'+txt[l]+'}'+txt[l+1:r+15] else: head_pos = r The basic purpose of 'jump_count' is simple: we want to catch the situation where our tag table has been run through multiple times without matching anything. The simplest way to do this is to check whether the last read head position is the same as the current. If it is, more loops cannot get anywhere, since we have reached the exact same state twice, and the same thing is fated to happen forever. In my example, I simply raise an error to stop the program (I also report a little bit of buffer context in case I can eyeball what is going on). However, it would also be possible to manually move the read head, and try again from a different starting position. Not as bad as an infinite loop, but still undesirable, is having patterns within a tag table match when they are not supposed to or not match when they are suppose to (but something else has to match, or we would have an infinite loop issue). Using callbacks everywhere makes examining this situation much easier. During development, I frequently created temporary changes to my 'emit_*' callbacks to print or log when certain emitters get called. By looking at output from these temporary 'print' statements, most times one can tell where the problem lies. CONCLUSION ------------------------------------------------------------------------ If text processing speed is crucial to your project, you can probably still use Python by adding in [mxTextTools]. Getting up to speed with the extension module involves adjusting your thinking from usual [string] and [re] techniques. But with some planning and effort, [mxTextTools] can match or beat the performance you would find in a low level language like C (or a high level one with tuned regular expressions, like Perl). RESOURCES ------------------------------------------------------------------------ Obtaining [TextTools] is unfortunately a bit harder than it should be (depending on platform). Your starting point should be the project homepage at: http://www.lemburg.com/files/python/mxTextTools.html On that homepage, you can download the module archive at: http://www.lemburg.com/files/python/mxTextTools-1.1.1.zip However, if you are installing on a Windows platform, the required DLL is not included in the package (despite the documentation claiming otherwise). You can compile it yourself if you happen to have VC++ by following the instructions on the project homepage (so I presume, I have not done it). Hopefully, the packaging will be improved with time (or will even become part of "standard" Python distributions). An introductory discussion of text processing techniques in Python is contained in my earlier column: _Charming Python #5: Text Processing in Python_ http://gnosis.cx/publish/programming/charming_python_5.html The public-domain 'Txt2Html' utility, including the [mxTypographify] module discussed in this column, can be downloaded from: http://gnosis.cx/download/txt2html.zip ABOUT THE AUTHOR ------------------------------------------------------------------------ {Picture of Author: http://gnosis.cx/cgi-bin/img_dqm.cgi} David Mertz wonders if he could pass the Turing test (and whether this article has). David may be reached at mertz@gnosis.cx; his life pored over at http://gnosis.cx/publish/. Suggestions and recommendations on this, past, or future, columns are welcomed.