SimpleParse
.David Mertz, Ph.D.
Analyzer, Gnosis Software, Inc.
December, 2001
A number of parsing tools have been written for Python. An earlier column dealt with the low-level state-machine (and therefore, parser)mxTextTools
. This column discusses one high level parsing language built on top of Python.SimpleParse
provides an EBNF-style syntax on top ofmxTextTools
that can greatly clarify the expression of grammars.
Python is a freely available, very-high-level, interpreted language developed by Guido van Rossum. It combines a clear syntax with powerful (but optional) object-oriented semantics. Python is available for almost every computer platform you might find yourself working on, and has strong portability between platforms.
Formal parsers are a bit new to me, as perhaps to a number of readers. In this article, I introduce some basic concepts in parsing, and discuss a Python tool for peforming parsing. Hopefully, both readers and I will benefit from the exercise.
Naturally, like most any programmer, I have frequently needed to identify parts and structures that exist inside textual documents: log files, configuration files, delimited data, and more free-form (but still semi-structured) report formats. All of these documents have their own "little languages" for what can occur within them. In fact, a number of installments of this column have dealt with these very matters.
The way I have programmed these informal parsing tasks has always been somewhat of a hodgepodge of custom state-machines, regular expressions, and context driven string tests. The pattern in these programs was always, roughly, "read a bit of text, figure out if we can make something of it, maybe read a bit more text afterwards, keep trying."
Parsers of the formal sort distill descriptions the parts and structures in documents into concise, clear, and declarative rules for how to identify what makes up a document. The declarative aspect is particularly interesting here. All my old ad hoc parsers were imperative in flavor: read some characters, make some decisions, accumulate some variables, rinse, repeat. As this column's installments on functional programming have observed, the recipe style of program flow is comparatively error-prone and difficult to maintain.
Formal parsers almost always use variants on Extended Backus-Naur Form (EBNF) to describe the "grammars" of the languages they describe. Those tools we look at here do so, as does the popular compiler development tool YACC (and its variants). Basically, an EBNF grammar gives names to the parts one might find in a document; additionally, larger parts are frequently composed of smaller parts. The frequency and order in which small parts may occur in larger parts is specified by operators--mostly the same symbols one sees in regular expressions. In parser-talk, each named part in a grammar is called a "production."
Possibly without even knowing it, readers have already seen EBNF descriptions at work. For example, the familiar Python Language Reference defines what a floating point number looks like in Python as:
floatnumber: pointfloat | exponentfloat pointfloat: [intpart] fraction | intpart "." exponentfloat: (nonzerodigit digit* | pointfloat) exponent intpart: nonzerodigit digit* | "0" fraction: "." digit+ exponent: ("e"|"E") ["+"|"-"] digit+
Or one might have seen an XML DTD element defined in an EBNF
style. For example, the <body>
of a developerWorks tutorial
looks like:
<!ELEMENT body ((example-column | image-column)?, text-column) >
Spellings vary slightly, but the general notions of quantification, alternation and sequencing exist in in all EBNF-style language grammars.
SimpleParse
SimpleParse
is an interesting tool. To use this module, you
need the underlying module mxTextTools
, which implements a
"tagging engine" in C. An earlier installment discussed
mxTextTools
, which is powerful, but rather difficult to use.
Once SimpleParse
is layered on top of mxTextTools
, the work
becomes a lot easier.
What one does to use SimpleParse
is really quite simple, and
removes the need to think about most of the complexity of
mxTextTools
. The first thing to do is create an EBNF-style
grammar that describes the language one wants to handle. The
second step is to call mxTextTools
to create a tag list
that describes all the successful productions when the grammar
is applied to the document. Finally, one actually does
something with the tagtable returned by mxTextTools
.
For this article, the "language" we will parse is the set of
markup codes used by "smart ASCII" to indicate things like
boldface, module names and book titles. This is the very same
language mxTextTools
was earlier used to identify, and
regular expressions and state-machines before that, in earlier
installments. The language is far simpler than a full
programming language would be, but complicated enough to be
representative.
We probably need to back up for one moment here. What the heck
is a "tag list" that mxTextTools
gives us? Basically, this
is a nested structure that simply gives the character offsets
where every production was matched in the source text.
mxTextTools
traverses a source text quickly, but it does not
do anything to the source text itself (at least not when
using the SimpleParse
grammars). Let us look at an abridged
tag list to illustrate:
(1, [('plain', 0, 15, [('word', 0, 4, [('alphanums', 0, 4, [])]), ('whitespace', 4, 5, []), ('word', 5, 10, [('alphanums', 5, 10, [])]), ('whitespace', 10, 11, []), ('word', 11, 14, [('alphanums', 11, 14, [])]), ('whitespace', 14, 15, [])]), ('markup', 15, 27, ... 289)
The elipses in the middle contain a bunch more matches. But the part we see says the following. The root production ("para") succeeds and ends at offset 289 (the length of the source text). The child production "plain" matches offsets 0 through 15. This "plain" child is itself composed of smaller productions. After the "plain" production, the "markup" production matches offsets 15 through 27. The details are left out, but this first "markup" is made of components, and additional productions succeed later in the source.
We have seen a glance at the tag list that SimpleParse
+
mxTextTools
can give us. But what we really need to look at
is the grammar that was used to generate this tag list. The
grammar is where the real work happens. EBNF grammars are
almost self-explanatory to read (although designing one does
require a bit of thought and testing):
para := (plain / markup)+ plain := (word / whitespace / punctuation)+ whitespace := [ \t\r\n]+ alphanums := [a-zA-Z0-9]+ word := alphanums, (wordpunct, alphanums)*, contraction? wordpunct := [-_] contraction := "'", ('am'/'clock'/'d'/'ll'/'m'/'re'/'s'/'t'/'ve') markup := emph / strong / module / code / title emph := '-', plain, '-' strong := '*', plain, '*' module := '[', plain, ']' code := "'", plain, "'" title := '_', plain, '_' punctuation := (safepunct / mdash) mdash := '--' safepunct := [!@#$%^&()+=|\{}:;<>,.?/"]
This grammar is almost exactly the way you would describe the "smart ASCII" language verbally, which is a nice sort of clarity. A paragraph consist of some plain text and some marked-up text. Plain text consists of some collection of words, whitespace, and punctuation. Marked up text might be emphasized, or strongly emphasized, or module names, etc. Strongly emphasized text is surrounded by asterisks. And so on. A couple features like just what a "word" really is, or just what a contraction can end with, take a bit of thought, but the syntax of EBNF doesn't get in the way.
In contrast, the same sort of rules can be described even more
tersely using regular expressions. This is what the first
version of the "smart ASCII" markup program did. But this
terseness is much harder to write, and harder still to tweak
later. The below re
code expresses largely (but not
precisely) the same set of rules:
# [module] names re_mods = r"""([\(\s'/">]|^)\[(.*?)\]([<\s\.\),:;'"?!/-])""" # *strongly emphasize* words re_strong = r"""([\(\s'/"]|^)\*(.*?)\*([\s\.\),:;'"?!/-])""" # -emphasize- words re_emph = r"""([\(\s'/"]|^)-(.*?)-([\s\.\),:;'"?!/])""" # _Book Title_ citations re_title = r"""([\(\s'/"]|^)_(.*?)_([\s\.\),:;'"?!/-])""" # 'Function()' names re_funcs = r"""([\(\s/"]|^)'(.*?)'([\s\.\),:;"?!/-])"""
If you discover or invent some slightly new variant of the
language, it is a lot easier to play with the EBNF grammar
than with those regular expressions. Moreover, using
mxTextTools
will generally even be faster in performing the
manipulations of the patterns
For our sample program, we put the actual grammar in a separate
file. For most purposes, this is a good organization to use.
Changing the grammar is usually a different sort of task than
changing the application logic; and the files reflect this.
But the whole of what we do with the grammar is pass it as a
string to a SimpleParse
function, so in principle we could
include it in the main application (or even dynamically
generate it in some way).
Let us look at our entire--compact--tagging application:
import os from sys import stdin, stdout, stderr from simpleparse import generator from mx.TextTools import TextTools input = stdin.read() decl = open('typographify.def').read() from typo_html import codes parser = generator.buildParser(decl).parserbyname('para') taglist = TextTools.tag(input, parser) for tag, beg, end, parts in taglist[1]: if tag == 'plain': stdout.write(input[beg:end]) elif tag == 'markup': markup = parts[0] mtag, mbeg, mend = markup[:3] start, stop = codes.get(mtag, ('<!-- unknown -->','<!-- / -->')) stdout.write(start + input[mbeg+1:mend-1] + stop) stderr.write('parsed %s chars of %s\n' % (taglist[-1], len(input)))
Here is what it does. First read in the grammar, and create an
mxTextTools
parser from the grammar. The generated parser is
similar to the tag-table that is found in the hand-written
mxTypographify
module discussed in an earlier installment
(but without the comments in the earlier, of course). Next we
apply the tag-table/parser to the input source to create a tag
list. Finally, we loop through the tag list, and emit some new
marked-up text. The loop could, of course, do anything else
desired with each production encountered.
For the particular grammar used for smart ASCII, everything in the source text is expected to fall into either a "plain" production or a "markup" production. Therefore, it suffices to loop across a single level in the tag list (except when we look exactly one level lower for the specific markup production, such as "title") But a more free-form grammar--such as occurs for most programming languages--could easily recursively descend into the tag list, and look for production names at every level. For example, if the grammar were to allow nested markup codes, this recursive style would probably be used. Readers might enjoy the exercise of figuring out how to adjust the grammar (hint: remember that productions are allowed to be mutually recursive).
The particular markup codes that go to the output live in yet
another file, for organizational not essential reasons. A
little trick of using a dictionary as a switch
statment is
used here (although the otherwise
case remains too narrow in
the example). The idea is just that we might in the future
want to create multiple "output format" files for, say, HTML,
DocBook, LaTeX, or others. The particular markup file used for
the example just looks like:
codes = \ { 'emph' : ('<em>', '</em>'), 'strong' : ('<strong>', '</strong>'), 'module' : ('<em><code>', '</code></em>'), 'code' : ('<code>', '</code>'), 'title' : ('<cite>', '</cite>'), }
Extending this to other output formats is straightforward.
SimpleParse
provides a concise and very readable EBNF-style
wrapper to the underlying power and speed of the cryptic
mxTextTools
C module. Moreover, EBNF grammars are already
familiar to many programmers, even if only in passing. I
cannot prove anything about what is easier to
understand--intuitions differ--but I can comment quantitatively
on source length. The mxTypographify
module that was
manually developed earlier is the following size:
199 776 7041 mxTypographify.py
Of these 199 lines, a fair number are comments. And 18 of
those lines are an included regular expression version of the
markup function that is included for timing comparisons. But
what the program does is essentially identical to what
typographify.py
--listed above--does. In contrast, our
SimpleParse
program, including its support files comes to:
19 79 645 typographify.def 20 79 721 typographify.py 6 25 205 typo_html.py 45 183 1571 total
In other words, about one fourth as many lines. This version
has fewer comments, but that is mostly because the EBNF grammar
is fairly self-documenting. I would not want to emphasize LOC
too strongly--obviously, one can play games with minimizing or
maximizing code length. But in a general way, one of the few
real empirical results of work studies on programmers is that
kLOC/programmer-month is fairly close to constant across
languages and libraries. Of course, the regular expression
version is, in turn, one third as long as the SimpleParse
version--but I think the density of its expression makes it
fragile to maintain and harder to write. And we saw in the
previous installment that mxTextTools
is considerably faster
at runtime. I think, on balance, SimpleParse
wins of the
approaches considered.
An earlier installment of this column (#14) called Text
Processing in Python with mxTextTools introduces the
mxTextTools
library that SimpleParse
is built on top of:
http://gnosis.cx/publish/programming/charming_python_14.html
The reference module mxTypographify
was built using
mxTextTools
directly. We see in this article how much more
readable the SimpleParse
version becomes:
http://gnosis.cx/download/mxTypographify.py
mxTextTools
is now part of the larger eGenix package of
extensions. Information can be found at:
http://www.lemburg.com/files/python/mxTextTools.html
Mike Fletcher's SimpleParse
can be found, along with a brief
introduction to its usage, at:
http://members.rogers.com/mcfletch/programming/simpleparse/simpleparse.html
John Aycock's Spark
module is in many ways a more
sophisticated parsing framework than is SimpleParse
. A
number of Python developers have recommended Spark
to me,
which has the additional virtue of being pure-Python (with a
corresponding natural disadvantage in terms of speed) Spark
has a homepage at:
http://pages.cpsc.ucalgary.ca/~aycock/spark/
Information on the ISO 14977 standard for EBNF syntax can be found at:
http://www.cl.cam.ac.uk/~mgk25/iso-ebnf.html
David Mertz would like to write, with Nietzsche, that these are the musings of an old philologist, but that untruth would unmask itself. But perhaps his (right here gratuitously plugged) forthcoming book, Text Processing in Python, will someday be mistaken for a cybernetic variant of philology. David may be reached at [email protected]; his life pored over at http://gnosis.cx/publish/. Suggestions and recommendations on this, past, or future, columns are welcomed.