CHAPTER IV -- PARSERS AND STATE MACHINES
-------------------------------------------------------------------

All the techniques presented in the prior chapters of this book
have something in common, but something that is easy to overlook.
In a sense, every basic string and regular expression operation
treats strings as -homogeneous-. Put another way: String and regex
techniques operate on -flat- texts. While said techniques are
largely in keeping with the "Zen of Python" maxim that "Flat is
better than nested," sometimes the maxim (and homogeneous
operations) cannot solve a problem. Sometimes the data in a text
has a deeper -structure- than the linear sequence of bytes that
make up strings.

It is not entirely true that the prior chapters have eschewed data
structures. From time to time, the examples presented broke flat
texts into lists of lines, or of fields, or of segments matched by
patterns. But the structures used have been quite simple and quite
regular. Perhaps a text was treated as a list of substrings, with
each substring manipulated in some manner--or maybe even a list of
lists of such substrings, or a list of tuples of data fields. But
overall, the data structures have had limited (and mostly fixed)
nesting depth and have consisted of sequences of items that are
themselves treated similarly. What this chapter introduces is the
notion of thinking about texts as -trees- of nodes, or even still
more generally as graphs.

Before jumping too far into the world of nonflat texts, I should
repeat a warning this book has issued from time to time. If you do
not -need- to use the techniques in this chapter, you are better
off sticking with the simpler and more maintainable techniques
discussed in the prior chapters. Solving too general a problem too
soon is a pitfall for application development--it is almost always
better to do less than to do more. Full-scale parsers and state
machines fall to the "more" side of such a choice. As we have seen
already, the class of problems you can solve using regular
expressions--or even only string operations--is quite broad.

There is another warning that can be mentioned at this point. This
book does not attempt to explain parsing theory or the design of
parseable languages. There are a lot of intricacies to these
matters, about which a reader can consult a specialized text like
the so-called "Dragon Book"--Aho, Sethi, and Ullman's _Compilers:
Principles, Techniques, and Tools_--or Levine, Mason, and Brown's
_Lex & Yacc_. When Extended Backus-Naur Form (EBNF) grammars or
other parsing descriptions are discussed below, it is in a general
fashion that does not delve into algorithmic resolution of
ambiguities or big-O efficiencies (at least not in much detail).
In practice, everyday Python programmers who are processing
texts--but who are not designing new programming languages--need
not worry about those parsing subtleties omitted from this book.

SECTION 1 -- An Introduction to Parsers
------------------------------------------------------------------------

TOPIC -- When Data Becomes Deep and Texts Become Stateful
--------------------------------------------------------------------

Regular expressions can match quite complicated patterns, but they
fall short when it comes to matching arbitrarily nested
subpatterns. Such nested subpatterns occur quite often in
programming languages and textual markup languages (and other
places sometimes). For example, in HTML documents, you can find
lists or tables nested inside each other.
For that matter, character-level markup is also allowed to nest
arbitrarily--the following defines a valid HTML fragment:

>>> s = '''<p>Plain text, <i>italicized phrase,
... <i>italicized subphrase</i>, <b>bold subphrase</b></i>,
... <i>other italic phrase</i></p>'''

The problem with this fragment is that most any regular expression
will match either less or more than a desired '<i>' element body.
For example:

>>> import re
>>> ital = r'''(?sx)<i>.+</i>'''
>>> for phrs in re.findall(ital, s):
...     print phrs, '\n-----'
...
<i>italicized phrase,
<i>italicized subphrase</i>, <b>bold subphrase</b></i>,
<i>other italic phrase</i>
-----
>>> ital2 = r'''(?sx)<i>.+?</i>'''
>>> for phrs in re.findall(ital2, s):
...     print phrs, '\n-----'
...
<i>italicized phrase,
<i>italicized subphrase</i>
-----
<i>other italic phrase</i>
-----

What is missing in the proposed regular expressions is a concept
of -state-. If you imagine reading through a string
character-by-character (which a regular expression match must do
within the underlying regex engine), it would be useful to keep
track of "How many layers of italics tags am I in?" With such a
count of nesting depth, it would be possible to figure out which
opening tag '<i>' a given closing tag '</i>' was meant to match.
But regular expressions are not stateful in the right way to do
this.

You encounter a similar nesting in most programming languages. For
example, suppose we have a hypothetical (somewhat BASIC-like)
language with an IF/THEN/END structure. To simplify, suppose that
every condition is spelled to match the regex 'cond\d+', and every
action matches 'act\d+'. But the wrinkle is that IF/THEN/END
structures can nest within each other also. So for example, let us
define the following three top-level structures:

>>> s = '''
... IF cond1 THEN act1 END
... -----
... IF cond2 THEN
...   IF cond3 THEN act3 END
... END
... -----
... IF cond4 THEN act4 END
... '''

As with the markup example, you might first try to identify the
three structures using a regular expression like:

>>> pat = r'''(?sx)
...     IF \s+ cond\d+ \s+
...     THEN \s+ act\d+ \s+
...     END'''
>>> for stmt in re.findall(pat, s):
...     print stmt, '\n-----'
...
IF cond1 THEN act1 END
-----
IF cond3 THEN act3 END
-----
IF cond4 THEN act4 END
-----

This indeed finds three structures, but the wrong three. The
second top-level structure should be the compound statement that
used 'cond2', not its child using 'cond3'. It is not too difficult
to allow a nested IF/THEN/END structure to optionally substitute
for a simple action; for example:

>>> pat2 = '''(?sx)(
...     IF \s+ cond\d+ \s+ THEN \s+
...     ( (IF \s+ cond\d+ \s+ THEN \s+ act\d+ \s+ END) | (act\d+) ) \s+
...     END )'''
>>> for stmt in re.findall(pat2, s):
...     print stmt[0], '\n-----'
...
IF cond1 THEN act1 END
-----
IF cond2 THEN
  IF cond3 THEN act3 END
END
-----
IF cond4 THEN act4 END
-----

By manually nesting a "first order" IF/THEN/END structure as an
alternative to a simple action, we can indeed match the example in
the desired fashion. But we have assumed that nesting of
IF/THEN/END structures goes only one level deep. What if a "second
order" structure is nested inside a "third order" structure--and
so on, ad infinitum? What we would like is a means of describing
arbitrarily nested structures in a text, in a manner similar to,
but more general than, what regular expressions can describe.

TOPIC -- What is a Grammar?
--------------------------------------------------------------------

In order to parse nested structures in a text, you usually use
something called a "grammar." A grammar is a specification of a
set of "nodes" (also called "productions") arranged into a
strictly hierarchical "tree" data structure. A node can have a
name--and perhaps some other properties--and it can also have an
ordered collection of child nodes.
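For instance, the IF/THEN/END language above might be described by
a grammar along the following lines (the notation is an
illustrative EBNF sketch, not the exact dialect of any particular
tool):

#*----- Illustrative EBNF for the IF/THEN/END language ----#
stmt     := if_block / action
if_block := 'IF', ws, cond, ws, 'THEN', ws, stmt, ws, 'END'
cond     := 'cond', digits
action   := 'act', digits
ws       := [ \t\n]+
digits   := [0-9]+

Because 'stmt' can itself occur inside 'if_block', the description
covers any depth of nesting--exactly the capability the regular
expressions above lack.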
When a document is parsed under a grammar, no resultant node can
ever be a descendant of itself; this is another way of saying that
a grammar produces a tree rather than a graph.

In many actual implementations, such as the famous C-based tools
'lex' and 'yacc', a grammar is expressed at two layers. At the
first layer, a "lexer" (or "tokenizer") produces a stream of
"tokens" for a "parser" to operate on. Such tokens are frequently
what you might think of as words or fields, but in principle they
can split the text differently than does our normal idea of a
"word." In any case, tokens are nonoverlapping subsequences of the
original text. Depending on the specific tool and specification
used, some subsequences may be dropped from the token stream. A
"zero-case" lexer is one that simply treats the actual input bytes
as the tokens a parser operates on (some modules discussed do
this, without losing generality).

The second layer of a grammar is the actual parser. A parser reads
a stream or sequence of tokens and generates a "parse tree" out of
it. Or rather, a tree is generated under the assumption that the
underlying input text is "well-formed" according to the
grammar--that is, there is a way to consume the tokens within the
grammar specification.

With most parser tools, a grammar is specified using a variant on
EBNF. An EBNF grammar consists of a set of rule declarations,
where each rule allows similar quantification and alternation as
that in regular expressions. Different tools use slightly
different syntax for specifying grammars, and different tools also
differ in expressivity and available quantifiers. But almost all
tools have a fairly similar feel in their grammar specifications.
Even the DTDs used in XML dialect specifications (see Chapter 5)
have a very similar syntax to other grammar languages--which makes
sense since an XML dialect is a particular grammar. A DTD entry
looks like:

#*----- EBNF-style description in a developerWorks DTD -----#
<!ELEMENT body  ((example-column | image-column)?, text-column) >

In brief, under the sample DTD, a '<body>' element may contain
either one or zero occurrences of a "first thing"--that first
thing being -either- an '<example-column>' -or- an
'<image-column>'. Following the optional first component, exactly
one '<text-column>' must occur.

#------------------- mxTypographify.py -------------------#
import sys
from mx.TextTools import *
import string
#-- List of all words with markup, head position, loop count
ws, head_pos, loops = [], None, 0
#-- Define "emitter" callbacks for each output format
def emit_misc(tl,txt,l,r,s):
    ws.append(txt[l:r])
def emit_func(tl,txt,l,r,s):
    ws.append('<code>'+txt[l+1:r-1]+'</code>')
def emit_modl(tl,txt,l,r,s):
    ws.append('<em><code>'+txt[l+1:r-1]+'</code></em>')
def emit_emph(tl,txt,l,r,s):
    ws.append('<em>'+txt[l+1:r-1]+'</em>')
def emit_strg(tl,txt,l,r,s):
    ws.append('<strong>'+txt[l+1:r-1]+'</strong>')
def emit_titl(tl,txt,l,r,s):
    ws.append('<cite>'+txt[l+1:r-1]+'</cite>')
def jump_count(tl,txt,l,r,s):
    global head_pos, loops
    loops = loops+1
    if head_pos is None: head_pos = r
    elif head_pos == r:
        raise "InfiniteLoopError", \
              txt[l-20:l]+'{'+txt[l]+'}'+txt[l+1:r+15]
    else: head_pos = r
#-- What can appear inside, and what can be, markups?
punct_set = set("`!@#$%^&*()_-+=|\{}[]:;'<>,.?/"+'"')
markable = alphanumeric+whitespace+"`!@#$%^&()+=|\{}:;<>,.?/"+'"'
markable_func = set(markable+"*-_[]")
markable_modl = set(markable+"*-_'")
markable_emph = set(markable+"*_'[]")
markable_strg = set(markable+"-_'[]")
markable_titl = set(markable+"*-'[]")
markup_set = set("-*'[]_")
#-- What can precede and follow markup phrases?
darkins = '(/"'
leadins = whitespace+darkins # might add from "-*'[]_"
darkouts = '/.),:;?!"'
darkout_set = set(darkouts)
leadouts = whitespace+darkouts # for non-conflicting markup
leadout_set = set(leadouts)
#-- What can appear inside plain words?
word_set = set(alphanumeric+'{}/@#$%^&-_+=|\><'+darkouts)
wordinit_set = set(alphanumeric+"$#+\<.&{"+darkins)
#-- Define the word patterns (global so as to do it only at import)
# Special markup
def markup_struct(lmark, rmark, callback, markables, x_post="-"):
    struct = \
      ( callback, Table+CallTag,
        ( (None, Is, lmark),                 # Starts with left marker
          (None, AllInSet, markables),       # Stuff marked
          (None, Is, rmark),                 # Ends with right marker
          (None, IsInSet, leadout_set,+2,+1),# EITHR: postfix w/ leadout
          (None, Skip, -1,+1, MatchOk),      # ..give back trailng ldout
          (None, IsIn, x_post, MatchFail),   # OR: special case postfix
          (None, Skip, -1,+1, MatchOk)       # ..give back trailing char
        )
      )
    return struct
funcs = markup_struct("'", "'", emit_func, markable_func)
modules = markup_struct("[", "]", emit_modl, markable_modl)
emphs = markup_struct("-", "-", emit_emph, markable_emph, x_post="")
strongs = markup_struct("*", "*", emit_strg, markable_strg)
titles = markup_struct("_", "_", emit_titl, markable_titl)
# All the stuff not specially marked
plain_words = \
  ( ws, Table+AppendMatch,            # AppendMatch only -slightly-
    ( (None, IsInSet,                 # faster than emit_misc callback
         wordinit_set, MatchFail),    # Must start with word-initial
      (None, Is, "'",+1),             # May have apostrophe next
      (None, AllInSet, word_set,+1),  # May have more word-internal
      (None, Is, "'", +2),            # May have trailing apostrophe
      (None, IsIn, "st",+1),          # May have [ts] after apostrophe
      (None, IsInSet,
         darkout_set,+1, MatchOk),    # Postfixed with dark lead-out
      (None, IsInSet,
         whitespace_set, MatchFail),  # Give back trailing whitespace
      (None, Skip, -1)
    ) )
# Catch some special cases
bullet_point = \
  ( ws, Table+AppendMatch,
    ( (None, Word+CallTag, "* "),       # Asterisk bullet is a word
    ) )
horiz_rule = \
  ( None, Table,
    ( (None, Word, "-"*50),             # 50 dashes in a row
      (None, AllIn, "-"),               # More dashes
    ) )
into_mark = \
  ( ws, Table+AppendMatch,              # Special case where dark leadin
    ( (None, IsInSet, set(darkins)),    # is followed by markup char
      (None, IsInSet, markup_set),
      (None, Skip, -1)                  # Give back the markup char
    ) )
stray_punct = \
  ( ws, Table+AppendMatch,              # Pickup any cases where multiple
    ( (None, IsInSet, punct_set),       # punctuation character occur
      (None, AllInSet, punct_set),      # alone (followed by whitespace)
      (None, IsInSet, whitespace_set),
      (None, Skip, -1)                  # Give back the whitespace
    ) )
leadout_eater = (ws, AllInSet+AppendMatch, leadout_set)
#-- Tag all the (possibly marked-up) words
tag_words = \
  ( bullet_point+(+1,),
    horiz_rule + (+1,),
    into_mark + (+1,),
    stray_punct+ (+1,),
    emphs + (+1,),
    funcs + (+1,),
    strongs + (+1,),
    modules + (+1,),
    titles + (+1,),
    into_mark+(+1,),
    plain_words +(+1,),            # Since file is mstly plain wrds, can
    leadout_eater+(+1,-1),         # shortcut by tight looping (w/ esc)
    (jump_count, Skip+CallTag, 0), # Check for infinite loop
    (None, EOF, Here, -13)         # Check for EOF
  )
def Typography(txt):
    global ws
    ws = []     # clear the list before we proceed
    tag(txt, tag_words, 0, len(txt), ws)
    return string.join(ws, '')

if __name__ == '__main__':
    print Typography(open(sys.argv[1]).read())
+++
'mxTypographify.py' reads through a string and determines if the
next bit of text matches one of the markup patterns in
'tag_words'. Or rather, it better match some pattern or the
application just will not know what action to take for the next
bit of text. Whenever a named subtable matches, a callback
function is called, which leads to a properly annotated string
being appended to the global list 'ws'. In the end, all such
appended strings are concatenated.
Several of the patterns given are mostly fallback conditions. For
example, the 'stray_punct' tag table detects the condition where
the next bit of text is some punctuation symbols standing alone
without abutting any words. In most cases, you don't -want- smart
ASCII to contain such a pattern, but [mxTypographify] has to do
-something- with them if they are encountered.
Making sure that every subsequence is matched by some subtable or
another is tricky. Here are a few examples of matches and
failures for the 'stray_punct' subtable. Everything that does not
match this subtable needs to match some other subtable instead:
#*--------- stray_punct successes and failures ----------#
-- spam # matches "--"
& spam # fails at "AllInSet" since '&' advanced head
#@$ %% spam # matches "#@$"
**spam # fails (whitespace isn't encountered before 's')
After each success, the read-head is at the space right before
the next word "spam" or "%%". After a failure, the read-head
remains where it started out (at the beginning of the line).
Like 'stray_punct', the entries 'emphs', 'funcs', 'strongs',
'plain_words', and so on each contain tag tables. Each entry in
'tag_words' has its appropriate callback functions (all "emitters"
of various names, because they "emit" the match, along with
surrounding markup if needed). Most entries also have a "+1"
appended to their tuple;
what this does is specify where to jump in case of a match
failure. That is, even if these patterns fail to match, we
continue on--with the read-head in the same position--to try
matching against the other patterns.
After the basic word patterns each attempt a match, we get to the
"leadout eater" line. For 'mxTypographify.py', a "leadout" is the
opposite of a "leadin." That is, the latter are things that might
precede a word pattern, and the former are things that might
follow a word pattern. The 'leadout_set' includes whitespace
characters, but it also includes things like a comma, period,
and question mark, which might end a word. The "leadout eater" uses
a callback function, too. As designed, it preserves exactly the
whitespace the input has. However, it would be easy to normalize
whitespace here by emitting something other than the actual match
(e.g., a single space always).
The 'jump_count' is extremely important; we will come back to it
momentarily. For now, it is enough to say that we -hope- the line
never does anything.
The 'EOF' line is our flow control, in a way. The call made by
this line is to 'None', which is to say that nothing is actually
-done- with any match. The command 'EOF' is the important thing
('Here' is just a filler value that occupies the tuple position).
It succeeds if the read-head is past the end of the read buffer.
On success, the whole tag table 'tag_words' succeeds, and having
succeeded, processing stops. 'EOF' failure is more interesting.
Assuming we haven't reached the end of our string, we jump -13
states (to 'bullet_point'). From there, the whole process starts
over, hopefully with the read-head advanced to the next word. By
looping back to the start of the list of tuples, we continue
eating successive word patterns until the read buffer is
exhausted (calling callbacks along the way).
The 'tag()' call simply launches processing of the tag table we
pass to it (against the read buffer contained in 'txt'). In our
case, we do not care about the return value of 'tag()' since
everything is handled in callbacks. However, in cases where the
tag table does not loop itself, the returned tuple can be
used to determine if there is reason to call 'tag()' again with a
tail of the read buffer.
DEBUGGING A TAG TABLE:
Describing it is easy, but I spent a large number of hours
finding the exact collection of tag tables that would match every
pattern I was interested in without mismatching any pattern as
something it wasn't. While smart ASCII markup seems pretty
simple, there are actually quite a few complications (e.g.,
markup characters being used in nonmarkup contexts, or markup
characters and other punctuation appearing in various sequences).
Any structured document format that is complicated enough to
warrant using [mx.TextTools] instead of [string] is likely to
have similar complications.
Without question, the worst thing that can go wrong in a looping
state pattern like the one above is that -none- of the listed
states match from the current read-head position. If that
happens, your program winds up in a tight infinite loop (entirely
inside the extension module, so you cannot get at it with Python
code directly). I wound up forcing a manual kill of the process
-countless- times during my first brush at [mx.TextTools]
development.
Fortunately, there is a solution to the infinite loop problem.
This is to use a callback like 'jump_count'.
#------- mxTypographify.py infinite loop catcher ---------#
def jump_count(taglist,txt,l,r,subtag):
    global head_pos
    if head_pos is None: head_pos = r
    elif head_pos == r:
        raise "InfiniteLoopError", \
              txt[l-20:l]+'{'+txt[l]+'}'+txt[l+1:r+15]
    else: head_pos = r
The basic purpose of 'jump_count' is simple: We want to catch the
situation where our tag table has been run through multiple times
without matching anything. The simplest way to do this is to
check whether the last read-head position is the same as the
current. If it is, more loops cannot get anywhere, since we have
reached the exact same state twice, and the same thing is fated
to happen forever. 'mxTypographify.py' simply raises an error to
stop the program (and reports a little bit of buffer context to
see what is going on).
It is also possible to move the read-head manually and try
again from a different starting position. To manipulate the read
head in this fashion, you could use the 'Call' command in tag
table items. But a better approach is to create a nonlooping tag
table that is called repeatedly from a Python loop. This
Python loop can look at a returned tuple and use adjusted
offsets in the next call if no match occurred. Either way,
since much more time is spent in Python this way than with the
loop tag table approach, less speed would be gained from
[mx.TextTools].
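A rough sketch of that approach--assuming a nonlooping tag table,
and using nothing beyond the `mx.TextTools.tag()` call documented
later in this section (the helper name is hypothetical)--might
look like:

#*--- Sketch of driving a nonlooping table from Python ---#
from mx.TextTools import tag

def tag_repeatedly(txt, table):
    # accumulate matches into one taglist, nudging the offset by
    # hand whenever the table fails to advance the read-head
    taglist, pos, end = [], 0, len(txt)
    while pos < end:
        success, _, nxt = tag(txt, table, pos, end, taglist)
        if not success or nxt == pos:
            pos = pos+1       # no progress: skip a character, retry
        else:
            pos = nxt         # pick up where the table left off
    return taglist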
Not as bad as an infinite loop, but still undesirable, is
having patterns within a tag table match when they are not
supposed to or not match when they are supposed to (but
something else has to match, or we would have an infinite loop
issue). Using callbacks everywhere makes examining this
situation much easier. During development, I frequently
create temporary changes to my 'emit_*' callbacks to print or
log when certain emitters get called. By looking at output
from these temporary 'print' statements, most times you can
tell where the problem lies.
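One low-impact way to arrange such logging without editing the
emitters themselves is to wrap them temporarily; a hypothetical
helper along these lines does the job:

#*--------- Sketch of a temporary emitter tracer ---------#
def traced(emitter, name):
    # report each firing of 'emitter' to stderr, then call it
    def wrapper(tl, txt, l, r, s):
        sys.stderr.write("%s: %r at %d..%d\n" % (name, txt[l:r], l, r))
        emitter(tl, txt, l, r, s)
    return wrapper

# e.g., while debugging:
#   emphs = markup_struct("-", "-", traced(emit_emph, 'emph'),
#                         markable_emph, x_post="")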
+++
-*-
CONSTANTS:
The [mx.TextTools] module contains constants for a number of
frequently used collections of characters. Many of these
character classes are the same as ones in the [string] module.
Each of these constants also has a 'set' version predefined; a
set is an efficient representation of a character class that may
be used in tag tables and other [mx.TextTools] functions. You may
also obtain a character set from a (custom) character class using
the `mx.TextTools.set()` function:
>>> from mx.TextTools import a2z, set
>>> varname_chars = a2z + '_'
>>> varname_set = set(varname_chars)
mx.TextTools.a2z
mx.TextTools.a2z_set
English lowercase letters ("abcdefghijklmnopqrstuvwxyz").
mx.TextTools.A2Z
mx.TextTools.A2Z_set
English uppercase letters ("ABCDEFGHIJKLMNOPQRSTUVWXYZ").
mx.TextTools.umlaute
mx.TextTools.umlaute_set
Extra German lowercase hi-bit characters.
mx.TextTools.Umlaute
mx.TextTools.Umlaute_set
Extra German uppercase hi-bit characters.
mx.TextTools.alpha
mx.TextTools.alpha_set
English letters (A2Z + a2z).
mx.TextTools.german_alpha
mx.TextTools.german_alpha_set
German letters (A2Z + a2z + umlaute + Umlaute).
mx.TextTools.number
mx.TextTools.number_set
The decimal numerals ("0123456789").
mx.TextTools.alphanumeric
mx.TextTools.alphanumeric_set
English numbers and letters (alpha + number).
mx.TextTools.white
mx.TextTools.white_set
Spaces and tabs (" \t\v"). Notice that this is more
restricted than `string.whitespace`.
mx.TextTools.newline
mx.TextTools.newline_set
Line break characters for various platforms ("\n\r").
mx.TextTools.formfeed
mx.TextTools.formfeed_set
Formfeed character ("\f").
mx.TextTools.whitespace
mx.TextTools.whitespace_set
Same as `string.whitespace` (white + newline + formfeed).
mx.TextTools.any
mx.TextTools.any_set
All characters (0x00-0xFF).
SEE ALSO, `string.digits`, `string.hexdigits`,
`string.octdigits`, `string.lowercase`, `string.uppercase`,
`string.letters`, `string.punctuation`, `string.whitespace`,
`string.printable`
COMMANDS:
Programming in [mx.TextTools] amounts mostly to correctly
configuring tag tables. Utilizing a tag table requires just one
call to the `mx.TextTools.tag()`, but inside a tag table is a
kind of mini-language--something close to a specialized
Assembly language, in many ways.
Each tuple within a tag table contains several elements, of the
form:
#*----------- tag table entry format --------------------#
(tagobj, command[+modifiers], argument
[,jump_no_match=MatchFail [,jump_match=+1]])
The "tag object" may be 'None', a callable object, or a string.
If 'tagobj' is 'None', the indicated pattern may match, but
nothing is added to a taglist data structure if so, nor is a
callback invoked. If a callable object (usually a function) is
given, it acts as a callback for a match. If a string is used,
it is used to name a part of the taglist data structure
returned by a call to `mx.TextTools.tag()`.
A command indicates a type of pattern to match, and a modifier
can change the behavior that occurs in case of such a match.
Some commands succeed or fail unconditionally, but allow you to
specify behaviors to take if they are reached. An argument is
required, but the specific values that are allowed and how they
are interpreted depends on the command used.
Two jump conditions may optionally be specified. If no values
are given, 'jump_no_match' defaults to 'MatchFail'--that is,
unless otherwise specified, failing to match a tuple in a tag
table causes the tag table as a whole to fail. If a value -is-
given, 'jump_no_match' branches to a tuple the specified number
of states forward or backward. For clarity, an explicit
leading "+" is used in forward branches; backward branches begin
with a minus sign. For example:
#*------ Tuple in tag table with explicit branches ------#
# Branch forward one state if next character -is not- an X
# ... branch backward three states if it is an X
tupX = (None, Is, 'X', +1, -3)
# assume all the tups are defined somewhere...
tagtable = (tupA, tupB, tupV, tupW, tupX, tupY, tupZ)
If no value is given for 'jump_match', branching is one state
forward in the case of a match.
Version 2.1.0 of [mx.TextTools] adds named jump targets, which
are often easier to read (and maintain) than numeric offsets.
An example is given in the [mx.TextTools] documentation:
#*------ Tuple in tag table with named branches ----------#
tag_table = ('start',
('lowercase',AllIn,a2z,+1,'skip'),
('upper',AllIn,A2Z,'skip'),
'skip',
(None,AllIn,white+newline,+1),
(None,AllNotIn,alpha+white+newline,+1),
(None,EOF,Here,'start') )
It is easy to see that if you were to add or remove a tuple, it
is less error prone to retain a jump to, for example, 'skip' than
to change every necessary '+2' to a '+3' or the like.
UNCONDITIONAL COMMANDS:
mx.TextTools.Fail
mx.TextTools.Jump
Nonmatch at this tuple. Used mostly for documentary
purposes in a tag table, usually with the 'Here' or 'To'
placeholder. The tag tables below are equivalent:
#*--------- Fail command and MatchFail branch -----------#
table1 = ( ('foo', Is, 'X', MatchFail, MatchOk), )
table2 = ( ('foo', Is, 'X', +1, +2),
('Not_X', Fail, Here) )
The 'Fail' command may be preferred if several other states
branch to the same failure, or if the condition needs to be
documented explicitly.
'Jump' is equivalent to 'Fail', but it is often better
self-documenting to use one rather than the other; for
example:
#*---------- Jump versus Fail for readability -----------#
tup1 = (None, Fail, Here, +3)
tup2 = (None, Jump, To, +3)
mx.TextTools.Skip
mx.TextTools.Move
Match at this tuple, and change the read-head position.
'Skip' moves the read-head by a relative amount, 'Move' to
an absolute offset (within the slice the tag table is
operating on). For example:
#*---------------- Skip and Move examples ---------------#
# read-head forward 20 chars, jump to next state
tup1 = (None, Skip, 20)
# read-head to position 10, and jump back 4 states
tup2 = (None, Move, 10, 0, -4)
Negative offsets are allowed, as in Python list indexing.
MATCHING PARTICULAR CHARACTERS:
mx.TextTools.AllIn
mx.TextTools.AllInSet
mx.TextTools.AllInCharSet
Match all characters up to the first that is not included
in 'argument'. 'AllIn' uses a character string while
'AllInSet' uses a set as 'argument'. For version 2.1.0,
you may also use 'AllInCharSet' to match CharSet objects.
In general, the set or CharSet form will be faster and is
preferable. The following are functionally the same:
#*-------- Equivalent AllIn and AllInSet tuples ---------#
tup1 = ('xyz', AllIn, 'XYZxyz')
tup2 = ('xyz', AllInSet, set('XYZxyz'))
tup3 = ('xyz', AllInSet, CharSet('XYZxyz'))
At least one character must match for the tuple to match.
mx.TextTools.AllNotIn
Match all characters up to the first that -is- included in
'argument'. As of version 2.1.0, [mx.TextTools] does not
include an 'AllNotInSet' command. However, the following
tuples are functionally the same (the second usually
faster):
#*------ Equivalent AllNotIn and AllInSet tuples --------#
from mx.TextTools import AllNotIn, AllInSet, invset
tup1 = ('xyz', AllNotIn, 'XYZxyz')
tup2 = ('xyz', AllInSet, invset('xyzXYZ'))
At least one character must match for the tuple to match.
mx.TextTools.Is
Match specified character. For example:
#*------------ Match the character 'X' ------------------#
tup = ('X', Is, 'X')
mx.TextTools.IsNot
Match any one character except the specified character.
#*------------ Match non-'X' character ------------------#
tup = ('X', IsNot, 'X')
mx.TextTools.IsIn
mx.TextTools.IsInSet
mx.TextTools.IsInCharSet
Match exactly one character if it is in 'argument'.
'IsIn' uses a character string while 'IsInSet' use a set as
'argument'. For version 2.1.0, you may also use
'IsInCharSet' to match CharSet objects. In general, the
set or CharSet form will be faster and is preferable.
The following are functionally the same:
#*-------- Equivalent IsIn and IsInSet tuples ---------#
tup1 = ('xyz', IsIn, 'XYZxyz')
tup2 = ('xyz', IsInSet, set('XYZxyz'))
tup3 = ('xyz', IsInSet, CharSet('XYZxyz'))
mx.TextTools.IsNotIn
Match exactly one character if it is -not- in 'argument'.
As of version 2.1.0, [mx.TextTools] does not include an
'IsNotInSet' command. However, the following tuples are
functionally the same (the second usually faster):
#*------ Equivalent IsNotIn and IsInSet tuples --------#
from mx.TextTools import IsNotIn, IsInSet, invset
tup1 = ('xyz', IsNotIn, 'XYZxyz')
tup2 = ('xyz', IsInSet, invset('xyzXYZ'))
MATCHING SEQUENCES:
mx.TextTools.Word
Match a word at the current read-head position. For example:
#*-------------------- Word example ---------------------#
tup = ('spam', Word, 'spam')
mx.TextTools.WordStart
mx.TextTools.sWordStart
mx.TextTools.WordEnd
mx.TextTools.sWordEnd
Search for a word, and match up to the point of the match.
Searches performed in this manner are extremely fast, and
this is one of the most powerful elements of tag tables.
The commands 'sWordStart' and 'sWordEnd' use "search
objects" rather than plaintexts (and are significantly
faster).
'WordStart' and 'sWordStart' leave the read-head
immediately prior to the matched word, if a match succeeds.
'WordEnd' and 'sWordEnd' leave the read-head immediately
after the matched word. On failure, the read-head is not
moved for any of these commands.
>>> from mx.TextTools import *
>>> s = 'spam and eggs taste good'
>>> tab1 = ( ('toeggs', WordStart, 'eggs'), )
>>> tag(s, tab1)
(1, [('toeggs', 0, 9, None)], 9)
>>> s[0:9]
'spam and '
>>> tab2 = ( ('pasteggs', sWordEnd, BMS('eggs')), )
>>> tag(s, tab2)
(1, [('pasteggs', 0, 13, None)], 13)
>>> s[0:13]
'spam and eggs'
SEE ALSO, `mx.TextTools.BMS()`, `mx.TextTools.sFindWord`
mx.TextTools.sFindWord
Search for a word, and match only that word. Any
characters leading up to the match are ignored. This
command accepts a search object as an argument. In case of
a match, the read-head is positioned immediately after the
matched word.
>>> from mx.TextTools import *
>>> s = 'spam and eggs taste good'
>>> tab3 = ( ('justeggs', sFindWord, BMS('eggs')), )
>>> tag(s, tab3)
(1, [('justeggs', 9, 13, None)], 13)
>>> s[9:13]
'eggs'
SEE ALSO, `mx.TextTools.sWordEnd`
mx.TextTools.EOF
Match if the read-head is past the end of the string slice.
Normally used with placeholder argument 'Here', for
example:
#*-------------------- EOF example ----------------------#
tup = (None, EOF, Here)
COMPOUND MATCHES:
mx.TextTools.Table
mx.TextTools.SubTable
Match if the table given as 'argument' matches at the
current read-head position. The difference between the
'Table' and the 'SubTable' commands is in where matches get
inserted. When the 'Table' command is used, any matches in
the indicated table are nested in the data structure
associated with the tuple. When 'SubTable' is used,
matches are written into the current level taglist. For
example:
>>> from mx.TextTools import *
>>> from pprint import pprint
>>> caps = ('Caps', AllIn, A2Z)
>>> lower = ('Lower', AllIn, a2z)
>>> words = ( ('Word', Table, (caps, lower)),
... (None, AllIn, whitespace, MatchFail, -1) )
>>> from pprint import pprint
>>> pprint(tag(s, words))
(0,
[('Word', 0, 4, [('Caps', 0, 1, None), ('Lower', 1, 4, None)]),
('Word', 5, 19, [('Caps', 5, 6, None), ('Lower', 6, 19, None)]),
('Word', 20, 29, [('Caps', 20, 24, None), ('Lower', 24, 29, None)]),
('Word', 30, 35, [('Caps', 30, 32, None), ('Lower', 32, 35, None)])
],
35)
>>> flatwords = ( (None, SubTable, (caps, lower)),
... (None, AllIn, whitespace, MatchFail, -1) )
>>> pprint(tag(s, flatwords))
(0,
[('Caps', 0, 1, None),
('Lower', 1, 4, None),
('Caps', 5, 6, None),
('Lower', 6, 19, None),
('Caps', 20, 24, None),
('Lower', 24, 29, None),
('Caps', 30, 32, None),
('Lower', 32, 35, None)],
35)
For either command, if a match occurs, the read-head is
moved to immediately after the match.
The special constant 'ThisTable' can be used instead of a
tag table to call the current table recursively.
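For instance, a sketch of a table that matches balanced
parentheses (an illustration of the idea, not a table used
elsewhere in this chapter) could call itself through 'ThisTable':

#*-------------- ThisTable recursion sketch --------------#
from mx.TextTools import *
parens = ( (None, Is, '('),                     # opening paren
           (None, AllIn, a2z, +1),              # optional letters
           ('inner', Table, ThisTable, +1, -1), # optional nested group
           (None, Is, ')') )                    # closing paren

tag('(ab(cd)ef)', parens)    # succeeds, consuming the whole string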
mx.TextTools.TableInList
mx.TextTools.SubTableInList
Similar to 'Table' and 'SubTable' except that
the 'argument' is a tuple of the form
'(list_of_tables,index)'. The advantage (and the danger)
of this is that a list is mutable and may have tables
added after the tuple is defined--in particular, the
containing tag table may be added to 'list_of_tables' to
allow recursion. Note, however, that the special value
'ThisTable' can be used with the 'Table' or 'SubTable'
commands and is usually more clear.
SEE ALSO, `mx.TextTools.Table`, `mx.TextTools.SubTable`
mx.TextTools.Call
Match on any computable basis. Essentially, when the
'Call' command is used, control over parsing/matching is
turned over to Python rather than staying in the
[mx.TextTools] engine. The function that is called must
accept arguments 's', 'pos', and 'end'--where 's' is the
underlying string, 'pos' is the current read-head position,
and 'end' is ending of the slice being processed. The
called function must return an integer for the new
read-head position; if the return is different from 'pos',
the match is a success.
As an example, suppose you want to match at a certain
point only if the next N characters make up a dictionary
word. Perhaps an efficient stemmed data structure is
used to represent the dictionary word list. You might
check dictionary membership with a tuple like:
#*-------------------- Call example ---------------------#
tup = ('DictWord', Call, inDict)
Since the function 'inDict' is written in Python, it will
generally not operate as quickly as does an [mx.TextTools]
pattern tuple.
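The protocol such a function must follow is easy to see in a
sketch (the word list and lookup below are stand-ins, not a real
dictionary module):

#*---------------- Call protocol sketch ------------------#
WORDS = ('spam', 'eggs', 'bacon')          # stand-in dictionary

def inDict(s, pos, end):
    # return a new read-head position on a match, 'pos' otherwise
    for word in WORDS:
        if pos+len(word) <= end and s[pos:pos+len(word)] == word:
            return pos+len(word)
    return pos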
mx.TextTools.CallArg
Same as 'Call', except 'CallArg' allows passing additional
arguments. For example, suppose the dictionary example
given in the discussion of 'Call' also allows you to
specify language and maximum word length for a match:
#*------------------ CallArg example ---------------------#
tup = ('DictWord', CallArg, (inDict,['English',10]))
SEE ALSO, `mx.TextTools.Call`
MODIFIERS:
mx.TextTools.CallTag
Instead of appending '(tagobj,l,r,subtags)' to the taglist
upon a successful match, call the function indicated as the
tag object (which must be a function rather than 'None' or
a string). The function called must accept the arguments
'taglist', 's', 'start', 'end', and 'subtags'--where
'taglist' is the present taglist, 's' is the underlying
string, 'start' and 'end' are the slice indices of the
match, and 'subtags' is the nested taglist. The function
called -may-, but need not, append to or modify 'taglist' or
'subtags' as part of its action. For example, a code
parsing application might include:
#*----------------- CallTag example ---------------------#
>>> def todo_flag(taglist, s, start, end, subtags):
... sys.stderr.write("Fix issue at offset %d\n" % start)
...
>>> tup = (todo_flag, Word+CallTag, 'XXX')
>>> tag('XXX more stuff', (tup,))
Fix issue at offset 0
(1, [], 3)
mx.TextTools.AppendMatch
Instead of appending '(tagobj,start,end,subtags)' to the
taglist upon successful matching, append the match found as
a string. The produced taglist is "flattened" and cannot be
used in the same manner as "normal" taglist data
structures. The flat data structure is often useful for
joining or for list processing styles.
>>> from mx.TextTools import *
>>> words = (('Word', AllIn+AppendMatch, alpha),
... (None, AllIn, whitespace, MatchFail, -1))
>>> tag('this and that', words)
(0, ['this', 'and', 'that'], 13)
>>> join(tag('this and that', words)[1], '-')
'this-and-that'
SEE ALSO, `string.split()`
mx.TextTools.AppendToTagobj
Instead of appending '(tagobj,start,end,subtags)' to the
taglist upon successful matching, call the '.append()'
method of the tag object. The tag object must be
a list (or a descendent of 'list' in Python 2.2+).
>>> from mx.TextTools import *
>>> ws = []
>>> words = ((ws, AllIn+AppendToTagobj, alpha),
... (None, AllIn, whitespace, MatchFail, -1))
>>> tag('this and that', words)
(0, [], 13)
>>> ws
[(None, 0, 4, None), (None, 5, 8, None), (None, 9, 13, None)]
SEE ALSO, `mx.TextTools.CallTag`
mx.TextTools.AppendTagobj
Instead of appending '(tagobj,start,end,subtags)' to the
taglist upon successful matching, append the tag object.
The produced taglist is usually nonstandard and cannot be
used in the same manner as "normal" taglist data
structures. A flat data structure is often useful for
joining or for list processing styles.
>>> from mx.TextTools import *
>>> words = (('word', AllIn+AppendTagobj, alpha),
... (None, AllIn, whitespace, MatchFail, -1))
>>> tag('this and that', words)
(0, ['word', 'word', 'word'], 13)
mx.TextTools.LookAhead
If this modifier is used, the read-head position is not
changed when a match occurs. As the name suggests, this
modifier allows you to create patterns similar to regular
expression lookaheads.
>>> from mx.TextTools import *
>>> from pprint import pprint
>>> xwords = ((None, IsIn+LookAhead, 'Xx', +2),
... ('xword', AllIn, alpha, MatchFail, +2),
... ('other', AllIn, alpha),
... (None, AllIn, whitespace, MatchFail, -3))
>>> pprint(tag('Xylophone trumpet xray camera', xwords))
(0,
[('xword', 0, 9, None),
('other', 10, 17, None),
('xword', 18, 22, None),
('other', 23, 29, None)],
29)
CLASSES:
mx.TextTools.BMS(word [,translate])
mx.TextTools.FS(word [,translate])
mx.TextTools.TextSearch(word [,translate [,algorithm=BOYERMOORE]])
Create a search object for the string 'word'. This is
similar in concept to a compiled regular expression. A
search object has several methods to locate its encoded
string within another string. The 'BMS' name is short for
"Boyer-Moore," which is a particular search algorithm. The
name 'FS' is reserved for accessing the "Fast Search"
algorithm in future versions, but currently both classes
use Boyer-Moore. For [mx.TextTools] 2.1.0+, you are
urged to use the '.TextSearch()' constructor.
If a 'translate' argument is given, the searched string is
translated during the search. This is equivalent to
transforming the string with `string.translate()` prior to
searching it.
SEE ALSO, `string.translate()`
mx.TextTools.CharSet(definition)
Version 2.1.0 of [mx.TextTools] adds the Unicode-compatible
'CharSet' object. 'CharSet' objects may be initialized
to support character ranges, as in regular expressions;
for example 'definition="a-mXYZ"'. In most respects,
CharSet objects are similar to older sets.
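A brief sketch of its use (arbitrary example strings):

#*----------------- CharSet usage sketch -----------------#
from mx.TextTools import CharSet
vowels = CharSet('aeiou')
vowels.contains('e')             # true
vowels.split('spam and eggs')    # -> ['sp', 'm ', 'nd ', 'ggs']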
METHODS AND ATTRIBUTES:
mx.TextTools.BMS.search(s [,start [,end]])
mx.TextTools.FS.search(s [,start [,end]])
mx.TextTools.TextSearch.search(s [,start [,end]])
Locate as a slice the first match of the search object
against 's'. If optional arguments 'start' and 'end' are
used, only the slice 's[start:end]' is considered. Note:
As of version 2.1.0, the documentation that accompanies
[mx.TextTools] inaccurately describes the 'end' parameter
of search object methods as indicating the length of the
slice rather than its ending offset.
mx.TextTools.BMS.find(s [,start [,end]])
mx.TextTools.FS.find(s [,start [,end]])
mx.TextTools.TextSearch.find(s [,start [,end]])
Similar to `mx.TextTools.BMS.search()`, except return only
the starting position of the match. The behavior is
similar to that of `string.find()`.
SEE ALSO, `string.find()`, `mx.TextTools.find()`
mx.TextTools.BMS.findall(s [,start [,end]])
mx.TextTools.FS.findall(s [,start [,end]])
mx.TextTools.TextSearch.findall(s [,start [,end]])
Locate as slices -every- match of the search object
against 's'. If the optional arguments 'start' and 'end'
are used, only the slice 's[start:end]' is considered.
>>> from mx.TextTools import BMS, any, upper
>>> foosrch = BMS('FOO', upper(any))
>>> foosrch.search('foo and bar and FOO and BAR')
(0, 3)
>>> foosrch.find('foo and bar and FOO and BAR')
0
>>> foosrch.findall('foo and bar and FOO and BAR')
[(0, 3), (16, 19)]
>>> foosrch.search('foo and bar and FOO and BAR', 10, 20)
(16, 19)
SEE ALSO, `re.findall`, `mx.TextTools.findall()`
mx.TextTools.BMS.match
mx.TextTools.FS.match
mx.TextTools.TextSearch.match
The string that the search object will look for in the
search text (read-only).
mx.TextTools.BMS.translate
mx.TextTools.FS.translate
mx.TextTools.TextSearch.translate
The translation string used by the object, or 'None' if no
'translate' string was specified.
mx.TextTools.CharSet.contains(c)
Return a true value if character 'c' is in the CharSet.
mx.TextTools.CharSet.search(s [,direction [,start=0 [,stop=len(s)]]])
Return the position of the first CharSet character that
occurs in 's[start:end]'. Return 'None' if there is no
match. You may specify a negative 'direction' to search
backwards.
SEE ALSO, `re.search()`
mx.TextTools.CharSet.match(s [,direction [,start=0 [,stop=len(s)]]])
Return the length of the longest contiguous match of the
CharSet object against substrings of 's[start:end]'.
mx.TextTools.CharSet.split(s [,start=0 [,stop=len(text)]])
Return a list of substrings of 's[start:end]' divided by
occurrences of characters in the CharSet.
SEE ALSO, `re.search()`
mx.TextTools.CharSet.splitx(s [,start=0 [,stop=len(text)]])
Like `mx.TextTools.CharSet.split()` except retain
characters from CharSet in interspersed list elements.
mx.TextTools.CharSet.strip(s [,where=0 [,start=0 [,stop=len(s)]]])
Strip all characters in 's[start:stop]' appearing in the
character set.
FUNCTIONS:
Many of the functions in [mx.TextTools] are used by the tagging
engine. A number of others are higher-level utility functions
that do not require custom development of tag tables. The
latter are listed under a separate heading and generally
resemble faster versions of functions in the [string] module.
mx.TextTools.cmp(t1, t2)
Compare two valid taglist tuples on their slice
positions. Taglists generated with multiple passes of
`mx.TextTools.tag()`, or combined by other means, may not
have tuples sorted in string order. This custom comparison
function is coded in C and is very fast.
>>> import mx.TextTools
>>> from pprint import pprint
>>> tl = [('other', 10, 17, None),
... ('other', 23, 29, None),
... ('xword', 0, 9, None),
... ('xword', 18, 22, None)]
>>> tl.sort(mx.TextTools.cmp)
>>> pprint(tl)
[('xword', 0, 9, None),
('other', 10, 17, None),
('xword', 18, 22, None),
('other', 23, 29, None)]
mx.TextTools.invset(s)
Identical to 'mx.TextTools.set(s, 0)'.
SEE ALSO, `mx.TextTools.set()`
mx.TextTools.set(s [,includechars=1])
Return a bit-position encoded character set. Bit-position
encoding makes tag table commands like 'IsInSet' and
'AllInSet' operate more quickly than their character-string
equivalents (e.g., 'IsIn', 'AllIn').
If 'includechars' is set to 0, invert the character set.
SEE ALSO, `mx.TextTools.invset()`
mx.TextTools.tag(s, table [,start [,end [,taglist]]])
Apply a tag table to a string. The return value is a
tuple of the form '(success, taglist, next)'. 'success' is
a binary value indicating whether the table matched.
'next' is the read-head position after the match attempt.
Even on a nonmatch of the table, the read-head might have
been advanced to some degree by member tuples matching.
The 'taglist' return value contains the data structure
generated by application. Modifiers and commands within
the tag table can alter the composition of 'taglist'; but
in the normal case, 'taglist' is composed of zero or more
tuples of the form '(tagname, start, end, subtaglist)'.
Assuming a "normal" taglist is created, 'tagname' is a
string value that was given as a tag object in a tuple
within the tag table. 'start' and 'end' are the slice ends of
that particular match. 'subtaglist' is either 'None' or a
taglist for a subtable match.
If 'start' or 'end' are given as arguments to
`mx.TextTools.tag()`, application is restricted to the
slice 's[start:end]' (or 's[start:]' if only 'start' is
used). If a 'taglist' argument is passed, that list object
is used instead of a new list. This allows extending a
previously generated taglist, for example. If 'None' is
passed as 'taglist', no taglist is generated.
See the application examples and command illustrations for
a number of concrete uses of `mx.TextTools.tag()`.
UTILITY FUNCTIONS:
mx.TextTools.charsplit(s, char [,start [,end]])
Return a list split around each 'char'. Similar to
`string.split()`, but faster. If the optional arguments
'start' and 'end' are used, only the slice 's[start:end]'
is operated on.
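An illustrative call, analogous to `string.split()` with a
one-character delimiter:

#*------------------ charsplit sketch --------------------#
from mx.TextTools import charsplit
charsplit('spam:and:eggs', ':')   # -> ['spam', 'and', 'eggs']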
SEE ALSO, `string.split()`, `mx.TextTools.setsplit()`
mx.TextTools.collapse(s, sep=' ')
Return a string with normalized whitespace. This is
equivalent to 'string.join(string.split(s),sep)', but
faster.
>>> from mx.TextTools import collapse
>>> collapse('this and that','-')
'this-and-that'
SEE ALSO, `string.join()`, `string.split()`
mx.TextTools.countlines(s)
Returns the number of lines in 's' in a platform-portable
way. Lines may end with CR (Mac-style), LF (Unix-style),
or CRLF (DOS-style), including a mixture of these.
SEE ALSO, `FILE.readlines()`, `mx.TextTools.splitlines()`
mx.TextTools.find(s, search_obj [,start [,end]])
Return the position of the first match of 'search_obj'
against 's'. If the optional arguments 'start' and 'end'
are used, only the slice 's[start:end]' is considered.
This function is identical to the search object method of
the same name; the syntax is just slightly different. The
following are synonyms:
#*------ Function and method versions of find() ---------#
from mx.TextTools import BMS, find
s = 'some string with a pattern in it'
pos1 = find(s, BMS('pat'))
pos2 = BMS('pat').find(s)
SEE ALSO, `string.find()`, `mx.TextTools.BMS.find()`
mx.TextTools.findall(s, search_obj [,start [,end]])
Return as slices -every- match of 'search_obj' against
's'. If the optional arguments 'start' and 'end' are used,
only the slice 's[start:end]' is considered. This function
is identical to the search object method of the same name;
the syntax is just slightly different. The following are
synonyms:
#*------ Function and method versions of findall() ------#
from mx.TextTools import BMS, findall
s = 'some string with a pattern in it'
pos1 = findall(s, BMS('pat'))
pos2 = BMS('pat').findall(s)
SEE ALSO, `mx.TextTools.find()`, `mx.TextTools.BMS.findall()`
mx.TextTools.hex2str(hexstr)
Returns a string based on the hex-encoded string 'hexstr'.
>>> from mx.TextTools import hex2str, str2hex
>>> str2hex('abc')
'616263'
>>> hex2str('616263')
'abc'
SEE ALSO, `mx.TextTools.str2hex()`
mx.TextTools.is_whitespace(s [,start [,end]])
Returns a Boolean value indicating whether 's[start:end]'
contains only whitespace characters. 'start' and 'end' are
optional, and will default to '0' and 'len(s)',
respectively.
mx.TextTools.isascii(s)
Returns a Boolean value indicating whether 's' contains
only ASCII characters.
mx.TextTools.join(joinlist [,sep='' [,start [,end]]])
Return a string composed of slices from other strings.
'joinlist' is a sequence of tuples of the form
'(s, start, end, ...)' each indicating the source string
and offsets for the utilized slice. Negative offsets do
not behave like Python slice offsets and should not be
used. If a 'joinlist' item tuple contains extra entries,
they are ignored, but are permissible.
If the optional argument 'sep' is specified, a delimiter
between each joined slice is added. If 'start' and 'end'
are specified, only 'joinlist[start:end]' is utilized in
the joining.
>>> from mx.TextTools import join
>>> s = 'Spam and eggs for breakfast'
>>> t = 'This and that for lunch'
>>> jl = [(s, 0, 4), (s, 9, 13), (t, 0, 4), (t, 9, 13)]
>>> join(jl, '/', 1, 4)
'/eggs/This/that'
SEE ALSO, `string.join()`
mx.TextTools.lower(s)
Return a string with any uppercase letters converted to
lowercase. Functionally identical to `string.lower()`, but
much faster.
SEE ALSO, `string.lower()`, `mx.TextTools.upper()`
mx.TextTools.prefix(s, prefixes [,start [,stop [,translate]]])
Return the first prefix in the tuple 'prefixes' that
matches the beginning of 's'. If 'start' and 'end'
are specified, only operate on the slice 's[start:end]'.
Return 'None' if no prefix matches.
If a 'translate' argument is given, the searched string is
translated during the search. This is equivalent to
transforming the string with `string.translate()` prior to
searching it.
>>> from mx.TextTools import prefix
>>> prefix('spam and eggs', ('spam','and','eggs'))
'spam'
SEE ALSO, `mx.TextTools.suffix()`
mx.TextTools.multireplace(s ,replacements [,start [,stop]])
Replace multiple nonoverlapping slices in 's' with string
values. 'replacements' must be a list of tuples of the form
'(new, left, right)'. Indexing is always relative to 's',
even if an earlier replacement changes the length of the
result. If 'start' and 'end' are specified, only operate
on the slice 's[start:end]'.
>>> from mx.TextTools import findall, multireplace
>>> s = 'spam, bacon, sausage, and spam'
>>> repls = [('X',l,r) for l,r in findall(s, 'spam')]
>>> multireplace(s, repls)
'X, bacon, sausage, and X'
>>> repls
[('X', 0, 4), ('X', 26, 30)]
mx.TextTools.replace(s, old, new [,start [,stop]])
Return a string where the pattern matched by search object
'old' is replaced by string 'new'. If 'start' and 'end'
are specified, only operate on the slice 's[start:end]'.
This function is much faster than `string.replace()`, since
a search object is used in the search aspect.
>>> from mx.TextTools import replace, BMS
>>> s = 'spam, bacon, sausage, and spam'
>>> spam = BMS('spam')
>>> replace(s, spam, 'eggs')
'eggs, bacon, sausage, and eggs'
>>> replace(s, spam, 'eggs', 5)
' bacon, sausage, and eggs'
SEE ALSO, `string.replace()`, `mx.TextTools.BMS`
mx.TextTools.setfind(s, set [,start [,end]])
Find the first occurrence of any character in 'set'. If
'start' is specified, look only in 's[start:]'; if 'end'
is specified, look only in 's[start:end]'. The argument
'set' must be a set.
>>> from mx.TextTools import *
>>> s = 'spam and eggs'
>>> vowel = set('aeiou')
>>> setfind(s, vowel)
2
>>> setfind(s, vowel, 7, 10)
9
SEE ALSO, `mx.TextTools.set()`
mx.TextTools.setsplit(s, set [,start [,stop]])
Split 's' into substrings divided at any characters in
'set'. If 'start' is specified, create a list of
substrings of 's[start:]'; if 'end' is specified, use
's[start:end]'. The argument 'set' must be a set.
SEE ALSO, `string.split()`, `mx.TextTools.set()`,
`mx.TextTools.setsplitx()`
mx.TextTools.setsplitx(text,set[,start=0,stop=len(text)])
Split 's' into substrings divided at any characters in
'set'. Include the split characters in the returned list.
Adjacent characters in 'set' are returned in the same list
element. If 'start' is specified, create a list of
substrings of 's[start:]'; if 'end' is specified, use
's[start:end]'. The argument 'set' must be a set.
>>> s = 'do you like spam'
>>> setsplit(s, vowel)
['d', ' y', ' l', 'k', ' sp', 'm']
>>> setsplitx(s, vowel)
['d', 'o', ' y', 'ou', ' l', 'i', 'k', 'e', ' sp', 'a', 'm']
SEE ALSO, `string.split()`, `mx.TextTools.set()`,
`mx.TextTools.setsplit()`
mx.TextTools.splitat(s, char [,n=1 [,start [,end]]])
Return a 2-element tuple that divides 's' around the 'n'th
occurrence of 'char'. If 'start' and 'end' are specified,
only operate on the slice 's[start:end]'.
>>> from mx.TextTools import splitat
>>> s = 'spam, bacon, sausage, and spam'
>>> splitat(s, 'a', 3)
('spam, bacon, s', 'usage, and spam')
>>> splitat(s, 'a', 3, 5, 20)
(' bacon, saus', 'ge')
mx.TextTools.splitlines(s)
Return a list of lines in 's'. Line-ending combinations
for Mac, PC, and Unix platforms are recognized in any
combination, which makes this function more portable than
is 'string.split(s,"\n")' or `FILE.readlines()`.
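An illustrative call with mixed line endings:

#*----------------- splitlines sketch --------------------#
from mx.TextTools import splitlines
splitlines('mac\rdos\r\nunix\nlast')  # -> ['mac', 'dos', 'unix', 'last']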
SEE ALSO, `string.split()`, `FILE.readlines()`,
`mx.TextTools.setsplit()`,
`mx.TextTools.countlines()`
mx.TextTools.splitwords(s)
Return a list of whitespace-separated words in 's'.
Equivalent to 'string.split(s)'.
SEE ALSO, `string.split()`
mx.TextTools.str2hex(s)
Returns a hexadecimal representation of a string. For
Python 2.0+, this is equivalent to 's.encode("hex")'.
SEE ALSO, `"".encode()`, `mx.TextTools.hex2str()`
mx.TextTools.suffix(s, suffixes [,start [,stop [,translate]]])
Return the first suffix in the tuple 'suffixes' that
matches the end of 's'. If 'start' and 'end'
are specified, only operate on the slice 's[start:end]'.
Return 'None' if no suffix matches.
If a 'translate' argument is given, the searched string is
translated during the search. This is equivalent to
transforming the string with `string.translate()` prior to
searching it.
>>> from mx.TextTools import suffix
>>> suffix('spam and eggs', ('spam','and','eggs'))
'eggs'
SEE ALSO, `mx.TextTools.prefix()`
mx.TextTools.upper(s)
Return a string with any lowercase letters converted to
uppercase. Functionally identical to `string.upper()`, but
much faster.
SEE ALSO, `string.upper()`, `mx.TextTools.lower()`
TOPIC -- High-Level EBNF Parsing
--------------------------------------------------------------------
=================================================================
MODULE -- SimpleParse : A Parser Generator for mx.TextTools
=================================================================
[SimpleParse] is an interesting tool. To use this module, you
need to have the [mx.TextTools] module installed. While there
is nothing you can do with [SimpleParse] that cannot be done
with [mx.TextTools] by itself, [SimpleParse] is often much
easier to work with. There exist other modules to provide
higher-level APIs for [mx.TextTools]; I find [SimpleParse]
to be the most useful of these, and the only one that this
book will present. The examples in this section were written
against [SimpleParse] version 1.0, but the documentation is
updated to include new features of 2.0. Version 2.0 is fully
backward compatible with existing [SimpleParse] code.
[SimpleParse] substitutes an EBNF-style grammar for the
low-level state matching language of [mx.TextTools] tag tables.
Or more accurately, [SimpleParse] is a tool for generating tag
tables based on friendlier and higher-level EBNF grammars. In
principle, [SimpleParse] lets you access and modify tag tables
before passing them to `mx.TextTools.tag()`. But in practice,
you usually want to stick wholly with [SimpleParse]'s EBNF
variant when your processing is amenable to a grammatical
description of the text format.
An application based on [SimpleParse] has two main aspects.
The first aspect is the grammar that defines the structure of a
processed text. The second aspect is the traversal and use of
a generated [mx.TextTools] taglist. [SimpleParse] 2.0 adds
facilities for the traversal aspect, but taglists present a
data structure that is quite easy to work with in any case.
The tree-walking tools in [SimpleParse] 2.0 are not covered
here, but the examples given in the discussion of
[mx.TextTools] illustrate such traversal.
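For reference, a generic walker over a '(name, start, end,
children)' taglist (a sketch, not a function from either library)
is only a few lines:

#*------------- Sketch of walking a taglist --------------#
def show_taglist(taglist, txt, depth=0):
    # print each reported production with the text slice it covers
    for name, beg, end, children in taglist:
        print '  '*depth + '%s: %r' % (name, txt[beg:end])
        if children:
            show_taglist(children, txt, depth+1)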
EXAMPLE: Marking up smart ASCII (Redux)
--------------------------------------------------------------------
Elsewhere in this book, applications to process the smart ASCII
format are also presented. Appendix D lists the 'Txt2Html'
utility, which uses a combination of a state machine for parsing
paragraphs and regular expressions for identifying inline markup.
A functionally similar example was given in the discussion of
[mx.TextTools], where a complex and compound tag table was
developed to recognize inline markup elements. Using
[SimpleParse] and an EBNF grammar is yet another way to perform
the same sort of processing. Comparing the several styles will
highlight a number of advantages that [SimpleParse] has--its
grammars are clear and concise, and applications built around it
can be extremely fast.
The application 'simpleTypography.py' is quite simple; most of
the work of programming it lies in creating a grammar to
describe smart ASCII. EBNF grammars are almost self-explanatory
to read, but designing one -does- require a bit of thought and
testing:
#------------------ typography.def ----------------------#
para          := (plain / markup)+
plain         := (word / whitespace / punctuation)+
<whitespace>  := [ \t\r\n]+
<alphanums>   := [a-zA-Z0-9]+
<wordpunct>   := [-_]
<word>        := alphanums, (wordpunct, alphanums)*, contraction?
<contraction> := "'", ('am'/'clock'/'d'/'ll'/'m'/'re'/'s'/'t'/'ve')
markup        := emph / strong / module / code / title
emph          := '-', plain, '-'
strong        := '*', plain, '*'
module        := '[', plain, ']'
code          := "'", plain, "'"
title         := '_', plain, '_'
<punctuation> := (safepunct / mdash)
<mdash>       := '--'
<safepunct>   := [!@#$%^&()+=|\{}:;<>,.?/"]

The script 'simpleTypography.py' applies this grammar to its input
and walks the resulting taglist, wrapping the text matched by each
markup production in a pair of HTML tags:

#*-------- HTML codes for the markup productions ---------#
{ 'emph'   : ('<em>', '</em>'),
  'strong' : ('<strong>', '</strong>'),
  'module' : ('<em><code>', '</code></em>'),
  'code'   : ('<code>', '</code>'),
  'title'  : ('<cite>', '</cite>'),
}
Extending this to other output formats is straightforward.
THE TAGLIST AND THE OUTPUT:
The -tag table- generated from the grammar in 'typography.def'
is surprisingly complicated and includes numerous recursions.
Only the exceptionally brave of heart will want to attempt
manual--let alone automated--modification of tag tables created
by [SimpleParse]. Fortunately, an average user need not even
look at these tags, but simply -use- them, as is done with
'parser' in 'simpleTypography.py'.
The -taglist- produced by applying a grammar, in contrast, can be
remarkably simple. Here is a run of 'simpleTypography.py'
against a small input file:
#*----------- Test run of simpleTypography.py -----------#
% python simpleTypography.py < p.txt > p.html
(1,
[('plain', 0, 15, []),
('markup', 15, 27, [('emph', 15, 27, [('plain', 16, 26, [])])]),
('plain', 27, 42, []),
('markup', 42, 51, [('module', 42, 51, [('plain', 43, 50, [])])]),
('plain', 51, 55, []),
('markup', 55, 70, [('code', 55, 70, [('plain', 56, 69, [])])]),
('plain', 70, 90, []),
('markup', 90, 96, [('strong', 90, 96, [('plain', 91, 95, [])])]),
('plain', 96, 132, []),
('markup', 132, 145, [('title', 132, 145, [('plain',133,144,[])])]),
('plain', 145, 174, [])],
174)
Most productions that were satisfied are not written into the
taglist, because they are not needed for the application. You
can control this aspect simply by defining productions with or
without angle brackets on the left side of their declaration.
The output looks like you would expect:
#*-------- Input and output of simpleTypography run -----#
% cat p.txt
Some words are -in italics-, others
name [modules] or 'command lines'.
Still others are *bold* -- that's how
it goes. Maybe some _book titles_.
And some in-fixed dashes.
% cat p.html
Some words are <em>in italics</em>, others
name <em><code>modules</code></em> or <code>command lines</code>.
Still others are <strong>bold</strong> -- that's how
it goes. Maybe some <cite>book titles</cite>.
And some in-fixed dashes.
-*-
GRAMMAR:
The language of [SimpleParse] grammars is itself defined using a
[SimpleParse] EBNF-style grammar. In principle, you could
refine the language [SimpleParse] uses by changing the
variable 'declaration' in 'bootstrap.py', or
'simpleparsegrammar.py' in recent versions. For example,
extended regular expressions, W3C XML Schemas, and some
EBNF variants allow integer occurrence quantification. To
specify that three to seven 'foo' tokens occur, you could use
the following declaration in [SimpleParse]:
#*---------------- 3-to-7 quantification ----------------#
foos := foo, foo, foo, foo?, foo?, foo?, foo?
Hypothetically, it might be more elegant to write something
like:
#*------------- Improved 3-to-7 quantification ----------#
foos := foo{3,7}
In practice, only someone developing a custom/enhanced parsing
module would have any reason to fiddle quite so deeply; "normal"
programmers should use the particular EBNF variant defined by
default. Nonetheless, taking a look at 'simpleparse/bootstrap.py'
can be illustrative in understanding the module.
DECLARATION PATTERNS:
A [SimpleParse] grammar consists of a set of one or more
declarations. Each declaration generally occurs on a line by
itself; within a line, horizontal whitespace may be used as
desired to improve readability. A common strategy is to align
the right sides of declarations, but any other use of internal
whitespace is acceptable. A declaration contains a term,
followed by the assignment symbol ":=", followed by a
definition. An end-of-line comment may be added to a
declaration, following an unquoted "#" (just as in Python).
In contrast to most imperative-style programming, the
declarations within a grammar may occur in any order. When a
parser generator's '.parserbyname()' method is called, the "top
level" of the grammar is given as an argument. The documented
API for [SimpleParse] uses a call of the form:
#*--------- How to create a SimpleParse parser ----------#
from simpleparse import generator
parser = generator.buildParser(decl).parserbyname('toplevel')
from mx.TextTools import TextTools
taglist = TextTools.tag(src, parser)
Under [SimpleParse] 2.0, you may simplify this to:
#*------- How to create a SimpleParse v2 parser ---------#
from simpleparse.parser import Parser
parser = Parser(decl,'toplevel')
taglist = parser.parse(src)
A left side term may be surrounded by angle brackets ("'<'",
"'>'") to prevent that production from being written into a
taglist produced by `mx.TextTools.tag()`. This is called an
"unreported" production. Other than in relation to the final
taglist, an unreported production acts just like a reported one.
Either type of term may be used on the right sides of other
productions in the same manner (without angle brackets when
occurring on the right side).
In [SimpleParse] 2.0 you may also use reversed angle brackets
to report the children of a production, but not the production
itself. As with the standard angle brackets, the production
functions normally in matching inputs; it differs only in
produced taglist. For example:
#*--------- Reported and Unreported productions ---------#
PRODUCTIONS          TAGLIST
--------------------------------------------------------
a   := (b,c)         ('a', l, r, [
b   := (d,e)              ('b', l, r, [...]),
c   := (f,g)              ('c', l, r, [...]) ] )
--------------------------------------------------------
a   := (b,c)         ('a', l, r, [
<b> := (d,e)              # no b, and no children
c   := (f,g)              ('c', l, r, [...]) ] )
--------------------------------------------------------
# Only in 2.0+       ('a', l, r, [
a   := (b,c)              # no b, but raise children
>b< := (d,e)              ('d', l, r, [...]),
c   := (f,g)              ('e', l, r, [...]),
                          ('c', l, r, [...]) ] )
--------------------------------------------------------
The remainder of the documentation of the [SimpleParse] module
covers elements that may occur on the right sides of
declarations. In addition to the elements listed, a term from
another production may occur anywhere any element may. Terms
may thus stand in mutually recursive relations to one another.
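For example, a pair of productions that refer to each other can
describe arbitrarily nested structure; the following sketch (an
illustration, not drawn from the [SimpleParse] documentation)
matches parenthesized word lists nested to any depth:
#*------- Mutually recursive productions (sketch) --------#
group        := '(', contents, ')'
contents     := (word / whitespace / group)*
word         := [a-zA-Z]+
<whitespace> := [ \t\n]+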
LITERALS:
Literal string
A string enclosed in quotation marks matches the exact string
quoted. Python escaping may be used for the characters
'\a', '\b', '\f', '\n', '\r', '\t', and '\v', and octal
escapes of one to three digits may be used. To include a
literal backslash, it should be escaped as '\\'.
#*------------- A string literal declaration -------------#
foo := "bar"
Character class: "[", "]"
Specify a set of characters that may occur at a position.
The list of allowable characters may be enumerated with no
delimiter. A range of characters may be indicated with a
dash ("-"). Multiple ranges are allowed within a class.
To include a "]" character in a character class, make it
the first character. Similarly, a literal "-" character
must be either the first (after the optional "]"
character) or the last character.
#*----------- A character class declaration -------------#
varchar := [a-zA-Z_0-9]
QUANTIFIERS:
Universal quantifier: "*"
Match zero or more occurrences of the preceding expression.
Quantification has a higher precedence than alternation or
sequencing; grouping may be used to clarify quantification
scope as well.
#*---------- Universal quantifier declaration -----------#
any_Xs := "X"*
any_digits := [0-9]*
Existential quantifier: "+"
Match one or more occurrences of the preceding expression.
Quantification has a higher precedence than alternation or
sequencing; grouping may be used to clarify quantification
scope as well.
#*---------- Existential quantifier declaration ---------#
some_Xs := "X"+
some_digits := [0-9]+
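For instance, a quick interactive check of the 'some_digits'
declaration above (a sketch, using the version 1.0 API shown
earlier) claims only the leading digits of a string:
>>> from mx.TextTools import tag
>>> from simpleparse import generator
>>> decl = '''some_digits := [0-9]+'''
>>> p = generator.buildParser(decl).parserbyname('some_digits')
>>> tag('1138 dollars', p)   # claims the four leading digits
(1, [], 4)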
Potentiality quantifier: "?"
Match at most one occurrence of the preceding expression.
Quantification has a higher precedence than alternation or
sequencing; grouping may be used to clarify quantification
scope as well.
#*---------- Potentiality quantifier declaration --------#
maybe_Xs := "X"?
maybe_digits := [0-9]?
Lookahead quantifier: "?"
In [SimpleParse] 2.0+, you may place a question mark
-before- a pattern to assert that it occurs, but should not
actually claim the pattern. As with regular expressions,
you can create either positive or negative lookahead
assertions.
#*---------- Lookahead quantifier declaration -----------#
next_is_Xs := ?"X"
next_is_not_digits := ?-[0-9]
Error on Failure: "!"
In [SimpleParse] 2.0+, you may cause a descriptive
exception to be raised when a production does not match,
rather than merely stopping parsing at that point.
#*------------ Error on failure declaration ------------#
require_Xs := "X"!
require_code := ([A-Z]+, [0-9])!
contraction := "'", ('clock'/'d'/'ll'/'m'/'re'/'s'/'t'/'ve')!
For example, modifying the 'contraction' production from
the prior discussion in this way requires that every apostrophe
be followed by a contraction ending. Since that does not hold
for the sample text, you might see an exception like:
#*--------- Example error-on-failure exception -----------#
% python typo2.py < p.txt
Traceback (most recent call last):
[...]
simpleparse.error.ParserSyntaxError: ParserSyntaxError:
Failed parsing production "contraction" @pos 84 (~line 1:29).
Expected syntax: ('clock'/'d'/'ll'/'m'/'re'/'s'/'t'/'ve')
Got text: 'command lines'. Still others are *bold*
STRUCTURES:
Alternation operator: "/"
Match the first pattern possible from several alternatives.
This operator allows any of a list of patterns to match.
Some EBNF-style parsers will match the -longest- possible
pattern, but [SimpleParse] more simply matches the -first-
possible pattern. For example:
>>> from mx.TextTools import tag
>>> from simpleparse import generator
>>> decl = '''
... short := "foo", " "*
... long := "foobar", " "*
... sl := (short / long)*
... ls := (long / short)*
... '''
>>> parser = generator.buildParser(decl).parserbyname('sl')
>>> tag('foo foobar foo bar', parser)[1]
[('short', 0, 4, []), ('short', 4, 7, [])]
>>> parser = generator.buildParser(decl).parserbyname('ls')
>>> tag('foo foobar foo bar', parser)[1]
[('short', 0, 4, []), ('long', 4, 11, []), ('short', 11, 15, [])]
Sequence operator: ","
Match the first pattern followed by the second pattern
(followed by the third pattern, if present, ...). Whenever
a definition needs several elements in a specific order, the
comma sequence operator is used.
#*-------------- Sequence declaration --------------------#
term := someterm, [0-9]*, "X"+, (otherterm, stillother)?
Negation operator: "-"
Match anything that the next pattern -does not- match. The
pattern negated can be either a simple term or a compound
expression.
#*------------ Declarations with negation ---------------#
nonletters := -[a-zA-Z]
nonfoo := -foo
notfoobarbaz := -(foo, bar, baz)
An expression modified by the negation operator is very
similar conceptually to a regular expression with a
negative lookahead assertion. For example:
>>> from mx.TextTools import tag
>>> from simpleparse import generator
>>> decl = '''not_initfoo := [ \t]*, -"foo", [a-zA-Z ]+'''
>>> p = generator.buildParser(decl).parserbyname('not_initfoo')
>>> tag(' foobar and baz', p) # no match
(0, [], 0)
>>> tag(' bar, foo and baz', p) # match on part
(1, [], 5)
>>> tag(' bar foo and baz', p) # match on all
(1, [], 17)
Grouping operators: "(", ")"
Parentheses surrounding any pattern turn that pattern into
an expression (possibly within a larger expression).
Quantifiers and operators refer to the immediately
adjacent expression, if one is defined, otherwise to the
adjacent literal string, character class, or term.
>>> from mx.TextTools import tag
>>> from simpleparse import generator
>>> decl = '''
... foo := "foo"
... bar := "bar"
... foo_bars := foo, bar+
... foobars := (foo, bar)+
... '''
>>> p1 = generator.buildParser(decl).parserbyname('foobars')
>>> p2 = generator.buildParser(decl).parserbyname('foo_bars')
>>> tag('foobarfoobar', p1)
(1, [('foo', 0, 3, []), ('bar', 3, 6, []),
('foo', 6, 9, []), ('bar', 9, 12, [])], 12)
>>> tag('foobarfoobar', p2)
(1, [('foo', 0, 3, []), ('bar', 3, 6, [])], 6)
>>> tag('foobarbarbar', p1)
(1, [('foo', 0, 3, []), ('bar', 3, 6, [])], 6)
>>> tag('foobarbarbar', p2)
(1, [('foo', 0, 3, []), ('bar', 3, 6, []),
('bar', 6, 9, []), ('bar', 9, 12, [])], 12)
USEFUL PRODUCTIONS:
In version 2.0+, [SimpleParse] includes a number of useful
productions that may be included in your grammars. See the
examples and documentation that accompany [SimpleParse] for
details on the many included productions and their usage.
The included productions, at the time of this writing, fall
into the categories below:
simpleparse.common.calendar_names
Locale specific names of months, and days of week, including
abbreviated forms.
simpleparse.common.chartypes
Locale-specific categories of characters, such as digits,
uppercase, octdigits, punctuation, locale_decimal_point,
and so on.
simpleparse.common.comments
Productions to match comments in a variety of programming
languages, such as hash ('#') end-of-line comments (Python,
Bash, Perl, etc.); C paired comments ('/* comment */'); and
others.
simpleparse.common.iso_date
Productions for strictly conformant ISO date and time
formats.
simpleparse.common.iso_date_loose
Productions for ISO date and time formats with some leeway
as to common variants in formatting.
simpleparse.common.numbers
Productions for common numeric formats, such as integers,
floats, hex numbers, binary numbers, and so on.
simpleparse.common.phonetics
Productions to match phonetically spelled words.
Currently, the US military style of "alpha, bravo, charlie,
..." spelling is the only style supported (with some leeway
in word spellings).
simpleparse.common.strings
Productions to match quoted strings as used in various
programming languages.
simpleparse.common.timezone_names
Productions to match descriptions of timezones, as you
might find in email headers or other date/time fields.
GOTCHAS:
There are a couple of problems that can easily arise in
constructed [SimpleParse] grammars. If you are having problems
in your application, keep a careful eye out for these issues:
1. Bad recursion. You might fairly naturally construct a
pattern of the form:
#*----------- Bad recursion in SimpleParse --------------#
a := b, a?
Unfortunately, if a long string of 'b' rules are matched,
the repeated recognition can either exceed the C-stack's
recursion limit, or consume inordinate amounts of memory to
construct nested tuples. Use an alternate pattern like:
#*------- Good quantification in SimpleParse ------------#
a := b+
This will grab all the 'b' productions in one tuple
instead (you could separately parse out each 'b' if
necessary).
2. Quantified potentiality. That is a mouthful; consider
patterns like:
#*------ Quantified potentiality in SimpleParse ---------#
a := (b? / c)*
x := (y?, z?)+
The first alternate, 'b?', in the first pattern--and both
'y?' and 'z?' in the second--are happy to match zero
characters (if a 'b' or 'y' or 'z' does not occur at the
current position). When you match "as many as possible" of a
zero-width pattern, you get into an infinite loop.
Unfortunately, the problem is not always this obvious: The
element quantified as potential might not be 'b' itself, but
rather something inside the 'b' production (or inside
productions used -by- 'b', and so on).
3. No backtracking. Based on working with regular expressions,
you might expect [SimpleParse] productions to use
backtracking. They do not. For example:
#*---------------- No backtracking ----------------------#
a := ((b/c)*, b)
If this were a regular expression, it would match a string
of 'b' productions, then back up one to match the final
'b'. As a [SimpleParse] production, this definition can
never match. If any 'b' productions occur, they will be
claimed by '(b/c)*', leaving nothing for the final 'b' to
grab.
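You can see the problem concretely with a pair of trivial 'b'
and 'c' productions (a sketch; compare the failed-match result
shown under the negation operator above):
>>> from mx.TextTools import tag
>>> from simpleparse import generator
>>> decl = '''
... a := ((b/c)*, b)
... b := "b"
... c := "c"
... '''
>>> p = generator.buildParser(decl).parserbyname('a')
>>> tag('bcb', p)    # the final b was already claimed by (b/c)*
(0, [], 0)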
TOPIC -- High-Level Programmatic Parsing
--------------------------------------------------------------------
=================================================================
MODULE -- PLY : Python Lex-Yacc
=================================================================
One module that I considered covering to round out this chapter
is John Aycock's [Spark] module. This module is both widely used
in the Python community and extremely powerful. However, I
believe that the audience of this book is better served by
working with David Beazley's [PLY] module than with the older
[Spark] module.
In the documentation accompanying [PLY], Beazley consciously
acknowledges the influence of [Spark] on his design and
development. While the [PLY] module is far from being a clone
of [Spark]--the APIs are significantly different--there is a
very similar -feeling- to working with each module. Both
modules require a very different style of programming and style
of thinking than do [mx.TextTools], [SimpleParse], or the state
machines discussed earlier in this chapter. In particular,
both [PLY] and [Spark] make heavy use of Python introspection
to create underlying state machines out of specially named
variables and functions.
Within an overall similarity, [PLY] has two main advantages over
[Spark] in a text processing context. The first, and probably
greatest, advantage [PLY] has is its far greater speed. Although
[PLY] has implemented some rather clever optimizations--such as
preconstruction of state tables for repeated runs--the main
speed difference lies in the fact that [PLY] uses a far faster,
albeit slightly less powerful, parsing algorithm. For text
processing applications (as opposed to compiler development),
[PLY]'s LR parsing is plenty powerful for almost any requirement.
A second advantage [PLY] has over every other Python parsing
library that I am aware of is a flexible and fine-grained error
reporting and error correction facility. Again, in a text
processing context, this is particularly important. For compiling
a programming language, it is generally reasonable to allow
compilation to fail in the case of even small errors. But for
processing a text file full of data fields and structures, you
usually want to be somewhat tolerant of minor formatting errors;
getting as much data as possible from a text automatically is
frequently the preferred approach. [PLY] does an excellent job of
handling "allowable" error conditions gracefully.
[PLY] consists of two modules: a lexer/tokenizer named 'lex.py',
and a parser named 'yacc.py'. The choice of names is taken from
the popular C-oriented tools 'lex' and 'yacc', and the behavior
is correspondingly similar. Parsing with [PLY] usually consists
of the two steps that were discussed at the beginning of this
chapter: (1) Divide the input string into a set of
nonoverlapping tokens using 'lex.py'. (2) Generate a parse tree
from the series of tokens using 'yacc.py'.
When processing text with [PLY], it is possible to attach "action
code" to any lexing or parsing event. Depending on application
requirements, this is potentially much more powerful than
[SimpleParse]. For example, each time a specific token is
encountered during lexing, you can modify the stored token
according to whatever rule you wish, or even trigger an entirely
different application action. Likewise, during parsing, each time
a node of a parse tree is constructed, the node can be modified
and/or other actions can be taken. In contrast, [SimpleParse]
simply delivers a completed parse tree (called a "taglist") that
must be traversed separately. However, while [SimpleParse]
does not provide the fine-tunable event control that [PLY]
does, [SimpleParse] offers a higher-level and cleaner grammar
language--the choice between the two modules is full of pros and
cons.
EXAMPLE: Marking up smart ASCII (yet again)
--------------------------------------------------------------------
This chapter has returned several times to applications for
processing smart ASCII: a state machine in Appendix D; a
functionally similar example using [mx.TextTools]; an EBNF
grammar with [SimpleParse]. This email-like markup format is not
in itself all that important, but it presents just enough
complications to make for a good comparison between programming
techniques and libraries. In many ways, an application using
[PLY] is similar to the [SimpleParse] version above--both use
grammars and parsing strategies.
GENERATING A TOKEN LIST:
The first step in most [PLY] applications is the creation of a
token stream. Tokens are identified by a series of regular
expressions attached to special pattern names of the form
't_RULENAME'. By convention, the [PLY] token types are in all
caps. In the simple case, a regular expression string is merely
assigned to a variable. If action code is desired when a token is
recognized, the rule name is defined as a function, with the
regular expression string as its docstring; passed to the
function is a 'LexToken' object (with attributes '.value',
'.type', and '.lineno'), which may be modified and returned. The
pattern is clear in practice:
#------------------- wordscanner.py ---------------------#
# List of token names. This is always required.
tokens = [ 'ALPHANUMS','SAFEPUNCT','BRACKET','ASTERISK',
           'UNDERSCORE','APOSTROPHE','DASH' ]
# Regular expression rules for simple tokens
t_ALPHANUMS = r"[a-zA-Z0-9]+"
t_SAFEPUNCT = r'[!@#$%^&()+=|\{}:;<>,.?/"]+'
t_BRACKET = r'[][]'
t_ASTERISK = r'[*]'
t_UNDERSCORE = r'_'
t_APOSTROPHE = r"'"
t_DASH = r'-'
# Regular expression rules with action code
def t_newline(t):
    r"\n+"
    t.lineno += len(t.value)
# Special case (faster) ignored characters
t_ignore = " \t\r"
# Error handling rule
def t_error(t):
    sys.stderr.write("Illegal character '%s' (%s)\n"
                     % (t.value[0], t.lineno))
    t.skip(1)
import lex, sys
def stdin2tokens():
    lex.input(sys.stdin.read())     # Give the lexer some input
    toklst = []                     # Tokenize
    while 1:
        t = lex.token()
        if not t: break             # No more input
        toklst.append(t)
    return toklst
if __name__=='__main__':
    lex.lex()                       # Build the lexer
    for t in stdin2tokens():
        print '%s<%s>' % (t.value.ljust(15), t.type)
You are required to list the token types you wish to recognize,
using the 'tokens' variable. Each such token, and any special
patterns that are not returned as tokens, is defined either as
a variable or as a function. After that, you just initialize
the lexer, read a string, and pull tokens off sequentially.
Let us look at some results:
#*--------- Tokenization of smart ASCII text ------------#
% cat p.txt
-Itals-, [modname]--let's add ~ underscored var_name.
% python wordscanner.py < p.txt
Illegal character '~' (1)
-              <DASH>
Itals          <ALPHANUMS>
-              <DASH>
,              <SAFEPUNCT>
[              <BRACKET>
modname        <ALPHANUMS>
]              <BRACKET>
-              <DASH>
-              <DASH>
let            <ALPHANUMS>
'              <APOSTROPHE>
s              <ALPHANUMS>
add            <ALPHANUMS>
underscored    <ALPHANUMS>
var            <ALPHANUMS>
_              <UNDERSCORE>
name           <ALPHANUMS>
.              <SAFEPUNCT>
GENERATING A PARSE TREE:
The second step is to combine the token stream into a parse
structure. The script 'markupbuilder.py' imports the lexer rules
from 'wordscanner.py', then defines a 'p_*()' rule, with action
code, for each element of the markup. Each action builds up a
list of output strings, with the markup rules wrapping their
content in HTML tags. The closing rules and the driver code give
the flavor of the script:
#*------------- markupbuilder.py (excerpt) ---------------#
def p_title(t):
    '''title : UNDERSCORE plain UNDERSCORE'''
    t[0] = ['<cite>'] + t[2] + ['</cite>']
def p_error(t):
    sys.stderr.write('Syntax error at "%s" (%s)\n'
                     % (t.value,t.lineno))
if __name__=='__main__':
    lex.lex()    # Build the lexer
    yacc.yacc()  # Build the parser
    result = yacc.parse(sys.stdin.read())
    print result
The output of this script, using the same input as above, is:
#*----------- Parse list of smart ASCII text ------------#
% python markupbuilder.py < p.txt
Illegal character '~' (1)
['<em>', 'Itals', '</em>', ',', '<em><code>', 'modname',
 '</code></em>', '&mdash;', 'let', "'s", 'add', 'underscored',
 'var', '_', 'name', '.']
One thing that is less than ideal in the [PLY] grammar is that
it has no quantifiers. In [SimpleParse] or another EBNF
library, we might give, for example, a 'plain' declaration
as:
#*------------ EBNF style plain declaration -------------#
plain := (ALPHANUMS | CONTRACTION | SAFEPUNCT | MDASH | WORDPUNCT)+
Quantification can make declarations more direct. But you can
achieve the same effect by using self-referential rules whose
left-hand terms also occur on the right-hand side. This style
is similar to recursive definitions, for example:
#*----------- Recursive plain declaration ----------------#
plain : plain plain
      | OTHERSTUFF
For example, 'markupbuilder.py', above, uses this technique.
If a tree structure were generated in this parser, a 'plain'
node might wind up being a subtree containing lower 'plain'
nodes (and terminal leaves of 'ALPHANUMS', 'CONTRACTION',
etc.). Traversal would need to account for this possibility.
The flat list structure used simplifies the issue, in this
case. A particular 'plain' object might result from the
concatenation of several smaller lists, but either way it is a
list by the time another rule includes the object.
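In code, such a self-referential rule looks something like the
following sketch ('WORD' is an assumed token name, used only for
illustration):
#*-------- Simulating a quantifier with recursion --------#
# A sketch: simulate an EBNF-style (WORD)+ with a self-referential
# rule; each action concatenates lists, so the value stays flat.
def p_words_many(t):
    '''words : words WORD'''
    t[0] = t[1] + [t[2]]
def p_words_one(t):
    '''words : WORD'''
    t[0] = [t[1]]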
LEX:
A [PLY] lexing module that is intended as support for a parsing
application must do four things. A lexing module that
constitutes a stand-alone application must do two additional
things:
1. Import the [lex] module:
#*-------------- Importing lex.py -----------------------#
import lex
2. Define a list or tuple variable 'tokens' that contains
the name of every token type the lexer is allowed to
produce. A list may be modified in-place should you wish
to specialize the lexer in an importing module; for
example:
#*-------------- Defining a 'tokens' list ---------------#
tokens = ['FOO', 'BAR', 'BAZ', 'FLAM']
3. Define one or more regular expression patterns matching
tokens. Each token type listed in 'tokens' should have a
corresponding pattern; other patterns may be defined also,
but the corresponding substrings will not be included in
the token stream.
Token patterns may be defined in one of two ways: (1) By
assigning a regular expression string to a specially named
variable. (2) By defining a specially named function whose
docstring is a regular expression string. In the latter
case, "action code" is run when the token is matched.
In both styles, the token name is preceded by the prefix
't_'. If a function is used, it should return the
'LexToken' object passed to it, possibly after some
modification, unless you do not wish to include the token
in the token stream. For example:
#*-------------- Defining token patterns ----------------#
t_FOO = r"[Ff][Oo]{1,2}"
t_BAR = r"[Bb][Aa][Rr]"
def t_BAZ(t):
    r"([Bb][Aa][Zz])+"
    t.value = 'BAZ'          # canonical caps BAZ
    return t
def t_FLAM(t):
    r"(FLAM|flam)+"
    # flam's are discarded (no return)
Tokens passed into a pattern function have three
attributes: '.type', '.value', and '.lineno'. '.lineno'
contains the current line number within the string being
processed and may be modified to change the reported
position, even if the token is not returned. The attribute
'.value' is normally the string matched by the regular
expression, but a new string, or a compound value like a
tuple or instance, may be assigned instead. The '.type' of
a 'LexToken', by default, is a string naming the token (the
same as the part of the function name after the 't_'
prefix).
There is a special order in which various token patterns
will be considered. Depending on the patterns used,
several patterns could grab the same substring--so it is
important to allow the desired pattern first claim on a
substring. Each pattern defined with a function is
considered in the order it is defined in the lexer file;
all patterns defined by assignment to a variable are
considered -after- every function-defined pattern.
Patterns defined by variable assignment, however, are not
considered in the order they are defined, but rather by
decreasing length. The purpose of this ordering is to let
longer patterns match before their subsequences (e.g., "=="
would be claimed before "=", allowing the former comparison
operator to match correctly, rather than as sequential
assignments).
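For instance, with the following two variable-defined patterns
(the token names are illustrative, and both would also appear in
the 'tokens' list), the longer "==" pattern is considered first
no matter which assignment occurs first in the module:
#*--------- Pattern ordering by length (sketch) ----------#
t_ASSIGN = r'='      # shorter pattern string, considered second
t_EQUALS = r'=='     # longer pattern string, considered first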
The special variable 't_ignore' may contain a string of
characters to skip during pattern matching. These
characters are skipped more efficiently than is a token
function that has no return value. The token name 'ignore'
is, therefore, reserved and may not be used as a regular
token (if the all-cap token name convention is followed, it
assures no such conflict).
The special function 't_error()' may be used to process
illegal characters. The '.value' attribute of the
passed-in 'LexToken' will contain the remainder of the
string being processed (after the last match). If you
want to skip past a problem area (perhaps after taking some
corrective action in the body of 't_error()'), use the
'.skip()' method of the passed-in 'LexToken'.
4. Build the lexer. The [lex] module performs a bit of
namespace magic so that you normally do not need to name
the built lexer. Most applications can use just one
default lexer. However, if you wish to--or if you need
multiple lexers in the same application--you may bind a
built lexer to a name. For example:
#*----------------- Building the lexer ------------------#
mylexer = lex.lex() # named lexer
lex.lex() # default lexer
mylexer.input(mytext) # set input for named lexer
lex.input(othertext) # set input for default lexer
5. Give the lexer a string to process. This step is handled
by the parser when [yacc] is used in conjunction with
[lex], and nothing need be done explicitly. For stand-alone
tokenizers, set the input string using 'lex.input()' (or
similarly with the '.input()' method of named lexers).
6. Read the token stream (for stand-alone tokenizers) using
repeated invocation of the default 'lex.token()' function
or the '.token()' method of a named lexer. Unfortunately,
as of version 1.1, [PLY] does not treat the token stream as
a Python 2.2 iterator/generator. You can create an
iterator wrapper with:
#*------------ Wrap token stream in iterator ------------#
from __future__ import generators
# ...define the lexer rules, etc...
def tokeniterator(lexer=lex):
    while 1:
        t = lexer.token()
        if t is None:
            raise StopIteration
        yield t
# Loop through the tokens
for t in tokeniterator():
    # ...do something with each token...
Without this wrapper, or generally in earlier versions of
Python, you should use a 'while 1' loop with a break
condition:
#*--------- While loop with break token stream ----------#
# ...define the lexer rules, etc...
while 1:
    t = lex.token()
    if t is None:            # No more input
        break
    # ...do something with each token...
YACC:
A [PLY] parsing module must do five things:
1. Import the 'yacc' module:
#*-------------- Importing yacc.py ----------------------#
import yacc
2. Get a token map from a lexer. Suppose a lexer module named
'mylexer.py' includes requirements 1 through 4 in the
above LEX description. You would get the token map with:
#*-------------- Importing the lexer --------------------#
from mylexer import *
Given the special naming convention 't_*' used for token
patterns, the risk of namespace pollution from 'import *'
is minimal.
You could also, of course, simply include the necessary
lexer setup code in the parsing module itself.
3. Define a collection of grammar rules. Grammar rules
are defined in a similar fashion to token functions.
Specially named functions having a 'p_' prefix contain one
or more productions and corresponding action code.
Whenever a production contained in the docstring of a
'p_*()' function matches, the body of that function runs.
Productions in [PLY] are described with a simplified EBNF
notation. In particular, no quantifiers are available in
rules; only sequencing and alternation are used (the rest
must be simulated with recursion and component
productions).
The left side of each rule contains a single rule name.
Following the rule name is one or more spaces, a colon, and
an additional one or more spaces. The right side of a rule
is everything following this. The right side of a rule can
occupy one or more lines; if alternative patterns are
allowed to fulfill a rule name, each such pattern occurs on
a new line, following a pipe symbol ("|"). Within each
right side line, a production is defined by a
space-separated sequence of terms--which may be either
tokens generated by the lexer or parser productions. More
than one production may be included in the same 'p_*()'
function, but it is generally more clear to limit each
function to one production (you are free to create more
functions). For example:
#*------ Parser function with multiple productions ------#
def p_rulename(t):
    '''rulename : foo SPACE bar
                | foo bar baz
                | bar SPACE baz
       otherrule : this that other
                 | this SPACE that '''
    #...action code...
The argument to each 'p_*()' function is a 'YaccSlice'
object, which assigns each component of the rule to an
indexed position. The left side rule name is index
position 0, and each term/token on the right side is listed
thereafter, left to right. The list-like 'YaccSlice' is
sized just large enough to contain every term needed; this
might vary depending on which alternative production is
fulfilled on a particular call.
Empty productions are allowed by [yacc] (matching
zero-width); you never need more than one empty production
in a grammar, but this empty production might be a
component of multiple higher-level productions. An empty
production is basically a way of getting around the absence
of (potentiality) quantification in [PLY]; for example:
#*------------- Empty productions in yacc.py ------------#
def p_empty(t):
    '''empty : '''
    pass
def p_maybefoo(t):
    '''foo : FOOTOKEN
           | empty '''
    t[0] = t[1]
def p_maybebar(t):
    '''bar : BARTOKEN
           | empty '''
    t[0] = t[1]
If a fulfilled production is used in other productions
(including itself recursively), the action code should
assign a meaningful value to index position 0. This
position -is- the value of the production. Moreover, what
is returned by the parse as a whole is position 0 of the
top-level production. For example:
#*------------ Assigning YaccSlice indices --------------#
# Sum N different numbers: "1.0 + 3 + 3.14 + 17"
def p_sum(t):
    '''sum : number PLUS number'''
    #  ^       ^      ^     ^
    # t[0]    t[1]   t[2]  t[3]
    t[0] = t[1] + t[3]
def p_number(t):
    '''number : BASICNUMBER
              | sum '''
    #   ^          ^
    #  t[0]       t[1]
    t[0] = float(t[1])
# Create the parser and parse some strings
yacc.yacc()
print yacc.parse('1.0 + 3 + 3.14 + 17')
The example simply assigns a numeric value with every
production, but it could also assign to position 0 of the
'YaccSlice' a list, 'Node' object, or some other data
structure that was useful to higher-level productions.
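For instance, a rule might attach a small tree node rather than a
computed number; the 'Node' class in this sketch is an assumed
helper, not something provided by [PLY]:
#*------------- Building tree nodes (sketch) -------------#
# 'Node' is an assumed helper class for an abstract syntax tree.
class Node:
    def __init__(self, nodetype, children):
        self.nodetype = nodetype
        self.children = children   # child Nodes and/or token values

def p_sum(t):
    '''sum : number PLUS number'''
    t[0] = Node('sum', [t[1], t[3]])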
4. Build the parser. The [yacc] module performs a bit of
namespace magic so that you normally do not need to name
the built parser. Most applications can use just one
default parser. However, if you wish to--or if you need
multiple parsers in the same application--you may bind a
built parser to a name. For example:
#*---------------- Building the parser ------------------#
myparser = yacc.yacc()        # named parser
yacc.yacc()                   # default parser
r1 = myparser.parse(mytext)   # parse with named parser
r0 = yacc.parse(othertext)    # parse with default parser
When parsers are built, [yacc] will produce diagnostic
messages if any errors are encountered in the grammar.
5. Parse an input string. The lexer is implicitly called
to get tokens as needed by the grammar rules. The return
value of a parsing action can be whatever thing invocation
of matched rules builds. It might be an abstract syntax
tree, if a 'Node' object is used with each parse rule; it
might be a simple list as in the smart ASCII example; it
might be a modified string based on concatenations and
modifications during parsing; or the return value could
simply be 'None' if parsing was done wholly to trigger side
effects in parse rules. In any case, what is returned is
index position 0 of the root rule's 'YaccSlice'.
MORE ON PLY PARSERS:
Some of the finer points of [PLY] parsers will not be covered
in this book. The documentation accompanying [PLY] contains
some additional implementational discussion, and a book devoted
more systematically to parsing theory will address theoretical
issues. But a few aspects can at least be touched on.
+++
*Error Recovery*
A [PLY] grammar may contain a special 'p_error()' function to
catch tokens that cannot be matched (at the current position)
by any other rule. The first time 'p_error()' is invoked,
[PLY] enters an "error-recovery" mode. If the parser cannot
process the next three tokens successfully, a traceback is
generated. You may include the production 'error' in other
rules to catch errors that occur at specific points in the
input.
To implement recovery within the 'p_error()' function, you may
use the functions/methods 'yacc.token()', 'yacc.restart()', and
'yacc.errok()'. The first grabs the next token from the lexer;
if this token--or some sequence of tokens--meets some recovery
criteria, you may call 'yacc.restart()' or 'yacc.errok()'. The
first of these, 'yacc.restart()', returns the parser to its
initial state--basically, only the final substring of the input
is used in this case (however, a separate data structure you have
built will remain as it was). Calling 'yacc.errok()' tells the
parser to stay in its last state and just ignore any bad tokens
pulled from the lexer (either via the call to 'p_error()' itself,
or via calls to 'yacc.token()' in the body).
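A sketch of such recovery--assuming a token type named 'SEMI' at
which it is safe to resynchronize--might look like:
#*------------- Error recovery sketch --------------------#
# A sketch: skip input until an assumed 'SEMI' token, then resume.
import sys
import yacc

def p_error(t):
    if t is None:                  # error at end-of-input
        return
    sys.stderr.write('Skipping past %r\n' % (t.value,))
    while 1:
        tok = yacc.token()         # pull tokens directly from the lexer
        if tok is None or tok.type == 'SEMI':
            break
    yacc.errok()                   # stay in the last good state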
+++
*The Parser State Machine*
When a parser is first compiled, the files 'parsetab.py' and
'parser.out' are generated. The first, 'parsetab.py', contains
more or less unreadable compact data structures that are used
by subsequent parser invocations. These structures are used
even during later invocation of the applications; timestamps
and signatures are compared to determine if the grammar has
been changed. Pregenerating state tables speeds up later
operations.
The file 'parser.out' contains a fairly readable description of
the actual state machine generated by [yacc]. Although you
cannot manually modify this state machine, examination of
'parser.out' can help you in understanding error messages and
undesirable behavior you encounter in your grammars.
+++
*Precedence and Associativity*
To resolve ambiguous grammars, you may set the variable
'precedence' to indicate both the precedence and the
associativity of tokens. Absent an explicit indication, [PLY]
always shifts a new symbol rather than reducing a rule when both
actions are allowed by the grammar.
The [PLY] documentation gives an example of an ambiguous
arithmetic expression, such as '3 * 4 + 5'. After the tokens '3',
'*', and '4' have been read from the token list, a 'p_mul()' rule
might allow reduction of the product. But at the same time, a
'p_add()' rule might contain 'NUMBER PLUS NUMBER', which would
allow a lookahead to the 'PLUS' token (since '4' is a 'NUMBER'
token). Moreover, the same token can have different meanings in
different contexts, such as the unary-minus and binary-subtraction
operators in '3 - 4 * -5'.
To solve both the precedence ambiguity and the ambiguous
meaning of the token 'MINUS', you can declare an explicit
precedence and associativity such as:
#--------- Declaring precedence and associativity ----------#
precedence = (
    ('left', 'PLUS', 'MINUS'),
    ('left', 'TIMES', 'DIVIDE'),
    ('right', 'UMINUS'),
)
def p_expr_uminus(t):
    'expr : MINUS expr %prec UMINUS'
    t[0] = -1 * t[2]
def p_expr_minus(t):
    'expr : expr MINUS expr'
    t[0] = t[1] - t[3]
def p_expr_plus(t):
    'expr : expr PLUS expr'
    t[0] = t[1] + t[3]