CHAPTER III -- REGULAR EXPRESSIONS
-------------------------------------------------------------------
Regular expressions allow extremely valuable text processing
techniques, but ones that warrant careful explanation. Python's
[re] module, in particular, allows numerous enhancements to basic
regular expressions (such as named backreferences, lookahead
assertions, backreference skipping, non-greedy quantifiers, and
others). A solid introduction to the subtleties of regular
expressions is valuable to programmers engaged in text processing
tasks.
The first part of this chapter contains a tutorial on regular
expressions that allows a reader unfamiliar with regular
expressions to move quickly from simple to complex elements of
regular expression syntax. This tutorial is aimed primarily at
beginners, but programmers familiar with regular expressions in
other programming tools can benefit from a quick read of the
tutorial, which explicates the particular regular expression
dialect in Python.
It is important to note up-front that regular expressions,
while very powerful, also have limitations. In brief, regular
expressions cannot match patterns that nest to arbitrary
depths. If that statement does not make sense, read Chapter 4,
which discusses parsers--to a large extent, parsing exists to
address the limitations of regular expressions. In general, if
you have doubts about whether a regular expression is
sufficient for your task, try to understand the examples in
Chapter 4, particularly the discussion of how you might spell a
floating point number.
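A minimal sketch of that limitation, shown here in modern Python syntax (the pattern and sample strings are invented for illustration): a regular expression can be written to handle any fixed depth of nested parentheses, but input nested one level deeper always escapes it.

```python
import re

# A pattern that handles exactly one level of nested parentheses:
# an open paren, then any mix of non-paren characters and complete
# "(...)" groups, then a close paren.
one_level = re.compile(r'\((?:[^()]|\([^()]*\))*\)')

shallow = "f(a, g(b))"      # nesting depth 2: within this pattern's reach
deep = "f(a, g(h(b)))"      # nesting depth 3: beyond it

print(one_level.search(shallow).group())  # the whole argument list matches
print(one_level.search(deep).group())     # only an inner fragment matches
```

Adding another alternative to the pattern would handle depth 3, but then depth 4 fails, and so on; no single regular expression covers arbitrary depth, which is exactly the gap parsers fill.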
Section 3.1 examines a number of text processing problems that
are solved most naturally using regular expressions. As in
other chapters, the solutions presented to problems can
generally be adopted directly as little utilities for performing
tasks. However, as elsewhere, the larger goal in presenting
problems and solutions is to address a style of thinking about
a wider class of problems than those whose solutions are
presented directly in this book. Readers who are interested
in a range of ready utilities and modules will probably want to
check additional resources on the Web, such as the Vaults of
Parnassus and the Python Cookbook.
Section 3.2 is a "reference with commentary" on the Python
standard library modules for doing regular expression tasks.
Several utility modules and backward-compatibility regular
expression engines are available, but for most readers, the only
important module will be [re] itself. The discussions
interspersed with each module try to give some guidance on why
you would want to use a given module or function, and the
reference documentation tries to contain more examples of actual
typical usage than does a plain reference. In many cases, the
examples and discussion of individual functions address common
and productive design patterns in Python. The cross-references
are intended to contextualize a given function (or other thing)
in terms of related ones (and to help a reader decide which is
right for her). The actual listing of functions, constants,
classes, and the like are in alphabetical order within each
category.
SECTION 0 -- A Regular Expression Tutorial
------------------------------------------------------------------------
Some people, when confronted with a problem, think "I know,
I'll use regular expressions." Now they have two problems.
-- Jamie Zawinski, '<alt.religion.emacs>' (08/12/1997)
TOPIC -- Just What is a Regular Expression, Anyway?
--------------------------------------------------------------------
Many readers will have some background with regular
expressions, but some will not have any. Those with
experience using regular expressions in other languages (or in
Python) can probably skip this tutorial section. But readers
new to regular expressions (affectionately called 'regexes' by
users) should read this section; even some with experience can
benefit from a refresher.
A regular expression is a compact way of describing complex
patterns in texts. You can use them to search for patterns
and, once found, to modify the patterns in complex ways. They
can also be used to launch programmatic actions that depend on
patterns.
Jamie Zawinski's tongue-in-cheek comment in the epigram is
worth thinking about. Regular expressions are amazingly
powerful and deeply expressive. That is the very reason that
writing them is just as error-prone as writing any other
complex programming code. It is always better to solve a
genuinely simple problem in a simple way; when you go beyond
simple, think about regular expressions.
A large number of tools other than Python incorporate regular
expressions as part of their functionality. Unix-oriented
command-line tools like 'grep', 'sed', and 'awk' are mostly
wrappers for regular expression processing. Many text editors
allow search and/or replacement based on regular expressions.
Many programming languages, especially other scripting languages
such as Perl and TCL, build regular expressions into the heart of
the language. Even most command-line shells, such as Bash or the
Windows console, allow restricted regular expressions as part of
their command syntax.
There are some variations in regular expression syntax between
different tools that use them, but for the most part regular
expressions are a "little language" that gets embedded inside
bigger languages like Python. The examples in this tutorial
section (and the documentation in the rest of the chapter) will
focus on Python syntax, but most of this chapter transfers
easily to working with other programming languages and tools.
As with most of this book, examples will be illustrated by use of
Python interactive shell sessions that readers can type
themselves, so that they can play with variations on the
examples. However, the [re] module does not include a function
that simply illustrates matches in the shell. Therefore, the
availability of the small wrapper program below is assumed in
the examples:
#---------- re_show.py ----------#
import re
def re_show(pat, s):
    print re.compile(pat, re.M).sub(r"{\g<0>}", s.rstrip()), '\n'
s = '''Mary had a little lamb
And everywhere that Mary
went, the lamb was sure
to go'''
Place the code in an external module and 'import' it. Those
new to regular expressions need not worry about what the above
function does for now. It is enough to know that the first
argument to 're_show()' will be a regular expression pattern,
and the second argument will be a string to be matched against.
The matches will treat each line of the string as a separate
pattern for purposes of matching beginnings and ends of lines.
The illustrated matches will be whatever is contained between
curly braces (and is typographically marked for emphasis).
TOPIC -- Matching Patterns in Text: The Basics
--------------------------------------------------------------------
The very simplest pattern matched by a regular expression is a
literal character or a sequence of literal characters. Anything
in the target text that consists of exactly those characters in
exactly the order listed will match. A lowercase character is not
identical with its uppercase version, and vice versa. A space in
a regular expression, by the way, matches a literal space in the
target (this is unlike most programming languages or command-line
tools, where a variable number of spaces separate keywords).
>>> from re_show import re_show, s
>>> re_show('a', s)
M{a}ry h{a}d {a} little l{a}mb
And everywhere th{a}t M{a}ry
went, the l{a}mb w{a}s sure
to go
>>> re_show('Mary', s)
{Mary} had a little lamb
And everywhere that {Mary}
went, the lamb was sure
to go
-*-
A number of characters have special meanings to regular
expressions. A symbol with a special meaning can be matched,
but to do so it must be prefixed with the backslash character
(this includes the backslash character itself: to match one
backslash in the target, the regular expression should include
'\\'). In Python, a special way of quoting a string is
available that will not perform string interpolation. Since
regular expressions use many of the same backslash-prefixed
codes as do Python strings, it is usually easier to compose
regular expression strings by quoting them as "raw strings"
with an initial "r".
>>> from re_show import re_show
>>> s = '''Special characters must be escaped.*'''
>>> re_show(r'.*', s)
{Special characters must be escaped.*}
>>> re_show(r'\.\*', s)
Special characters must be escaped{.*}
>>> re_show('\\\\', r'Python \ escaped \ pattern')
Python {\} escaped {\} pattern
>>> re_show(r'\\', r'Regex \ escaped \ pattern')
Regex {\} escaped {\} pattern
-*-
Two special characters are used to mark the beginning and end
of a line: caret ("^") and dollarsign ("$"). To match a caret
or dollarsign as a literal character, it must be escaped (i.e.,
precede it by a backslash "\").
An interesting thing about the caret and dollarsign is that
they match zero-width patterns. That is, the length of the
string matched by a caret or dollarsign by itself is zero (but
the rest of the regular expression can still depend on the
zero-width match). Many regular expression tools provide
another zero-width pattern for word-boundary ("\b"). Words
might be divided by whitespace like spaces, tabs, newlines, or
other characters like nulls; the word-boundary pattern matches
the actual point where a word starts or ends, not the
particular whitespace characters.
>>> from re_show import re_show, s
>>> re_show(r'^Mary', s)
{Mary} had a little lamb
And everywhere that Mary
went, the lamb was sure
to go
>>> re_show(r'Mary$', s)
Mary had a little lamb
And everywhere that {Mary}
went, the lamb was sure
to go
>>> re_show(r'$','Mary had a little lamb')
Mary had a little lamb{}
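The word-boundary pattern can be sketched the same way; this example uses modern Python syntax and `re.findall()` rather than 're_show()':

```python
import re

s = "Mary had a little lamb"
# '\b' is zero-width: it marks where a word starts or ends without
# consuming any character itself, so '\bl' anchors to words whose
# first letter is 'l'.
print(re.findall(r'\bl\w+', s))   # words beginning with 'l'
```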
-*-
In regular expressions, a period can stand for any character.
Normally, the newline character is not included, but optional
switches can force inclusion of the newline character also (see
later documentation of [re] module functions). Using a period
in a pattern is a way of requiring that "something" occurs
here, without having to decide what.
Readers who are familiar with DOS command-line wildcards will
know the question mark as filling the role of "some character"
in command masks. But in regular expressions, the
question mark has a different meaning, and the period is used
as a wildcard.
>>> from re_show import re_show, s
>>> re_show(r'.a', s)
{Ma}ry {ha}d{ a} little {la}mb
And everywhere t{ha}t {Ma}ry
went, the {la}mb {wa}s sure
to go
-*-
A regular expression can have literal characters in it and also
zero-width positional patterns. Each literal character or positional
pattern is an atom in a regular expression. One may also group
several atoms together into a small regular expression that is
part of a larger regular expression. One might be inclined to
call such a grouping a "molecule," but normally it is also
called an atom.
In older Unix-oriented tools like grep, subexpressions must be
grouped with escaped parentheses, for example, '\(Mary\)'. In
Python (as with most more recent tools), grouping is done with
bare parentheses, but matching a literal parenthesis requires
escaping it in the pattern.
>>> from re_show import re_show, s
>>> re_show(r'(Mary)( )(had)', s)
{Mary had} a little lamb
And everywhere that Mary
went, the lamb was sure
to go
>>> re_show(r'\(.*\)', 'spam (and eggs)')
spam {(and eggs)}
-*-
Rather than name only a single character, a pattern in a
regular expression can match any of a set of characters.
A set of characters can be given as a simple list inside square
brackets, for example, '[aeiou]' will match any single lowercase
vowel. For letter or number ranges it may also have the first and
last letter of a range, with a dash in the middle; for example,
'[A-Ma-m]' will match any lowercase or uppercase letter in the
first half of the alphabet.
Python (as with many tools) provides escape-style shortcuts to
the most commonly used character class, such as '\s' for a
whitespace character and '\d' for a digit. One could always
define these character classes with square brackets, but the
shortcuts can make regular expressions more compact and more
readable.
>>> from re_show import re_show, s
>>> re_show(r'[a-z]a', s)
Mary {ha}d a little {la}mb
And everywhere t{ha}t Mary
went, the {la}mb {wa}s sure
to go
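The shortcut classes did not appear in the example above; a quick sketch in modern Python syntax (the sample string is invented) shows '\d' behaving exactly like the bracketed class it abbreviates:

```python
import re

s = "Room 101, floor 2"
# '\d' is shorthand for the character class '[0-9]' (for ASCII text)
print(re.findall(r'\d+', s))      # runs of digits
print(re.findall(r'[0-9]+', s))   # the same thing spelled with brackets
```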
-*-
The caret symbol can actually have two different meanings in regular
expressions. Most of the time, it means to match the zero-length
pattern for line beginnings. But if it is used at the beginning of a
character class, it reverses the meaning of the character class.
Everything not included in the listed character set is matched.
>>> from re_show import re_show, s
>>> re_show(r'[^a-z]a', s)
{Ma}ry had{ a} little lamb
And everywhere that {Ma}ry
went, the lamb was sure
to go
-*-
Using character classes is a way of indicating that either one
thing or another thing can occur in a particular spot. But
what if you want to specify that either of two whole
subexpressions occur in a position in the regular expression?
For that, you use the alternation operator, the vertical bar
("|"). This is the symbol that is also used to indicate a pipe
in Unix/DOS shells and is sometimes called the pipe character.
The pipe character in a regular expression indicates an
alternation between everything in the group enclosing it. What
this means is that even if there are several groups to the left
and right of a pipe character, the alternation greedily asks
for everything on both sides. To select the scope of the
alternation, you must define a group that encompasses the
patterns that may match. The example illustrates this:
>>> from re_show import re_show
>>> s2 = 'The pet store sold cats, dogs, and birds.'
>>> re_show(r'cat|dog|bird', s2)
The pet store sold {cat}s, {dog}s, and {bird}s.
>>> s3 = '=first first= # =second second= # =first= # =second='
>>> re_show(r'=first|second=', s3)
{=first} first= # =second {second=} # {=first}= # ={second=}
>>> re_show(r'(=)(first)|(second)(=)', s3)
{=first} first= # =second {second=} # {=first}= # ={second=}
>>> re_show(r'=(first|second)=', s3)
=first first= # =second second= # {=first=} # {=second=}
-*-
One of the most powerful and common things you can do with
regular expressions is to specify how many times an atom occurs
in a complete regular expression. Sometimes you want to
specify something about the occurrence of a single character,
but very often you are interested in specifying the occurrence
of a character class or a grouped subexpression.
There is only one quantifier included with "basic" regular
expression syntax, the asterisk ("*"); in English this has the
meaning "some or none" or "zero or more." If you want to
specify that any number of an atom may occur as part of a
pattern, follow the atom by an asterisk.
Without quantifiers, grouping expressions doesn't serve much
purpose, but once we can add a quantifier to a
subexpression we can say something about the occurrence of the
subexpression as a whole. Take a look at the example:
>>> from re_show import re_show
>>> s = '''Match with zero in the middle: @@
... Subexpression occurs, but...: @=!=ABC@
... Lots of occurrences: @=!==!==!==!==!=@
... Must repeat entire pattern: @=!==!=!==!=@'''
>>> re_show(r'@(=!=)*@', s)
Match with zero in the middle: {@@}
Subexpression occurs, but...: @=!=ABC@
Lots of occurrences: {@=!==!==!==!==!=@}
Must repeat entire pattern: @=!==!=!==!=@
TOPIC -- Matching Patterns in Text: Intermediate
--------------------------------------------------------------------
In a certain way, the lack of any quantifier symbol after an atom
quantifies the atom anyway: It says the atom occurs exactly once.
Extended regular expressions add a few other useful numbers to
"once exactly" and "zero or more times." The plus sign ("+")
means "one or more times" and the question mark ("?") means
"zero or one times." These quantifiers are by far the most
common enumerations you wind up using.
If you think about it, you can see that the extended regular
expressions do not actually let you "say" anything the basic
ones do not. They just let you say it in a shorter and more
readable way. For example, '(ABC)+' is equivalent to
'(ABC)(ABC)*', and 'X(ABC)?Y' is equivalent to 'XABCY|XY'. If
the atoms being quantified are themselves complicated grouped
subexpressions, the question mark and plus sign can make things
a lot shorter.
>>> from re_show import re_show
>>> s = '''AAAD
... ABBBBCD
... BBBCD
... ABCCD
... AAABBBC'''
>>> re_show(r'A+B*C?D', s)
{AAAD}
{ABBBBCD}
BBBCD
ABCCD
AAABBBC
-*-
Using extended regular expressions, you can specify arbitrary
pattern occurrence counts using a more verbose syntax than the
question mark, plus sign, and asterisk quantifiers. The curly
braces ("{" and "}") can surround a precise count of how many
occurrences you are looking for.
The most general form of the curly-brace quantification uses two
range arguments (the first must be no larger than the second, and
both must be non-negative integers). The occurrence count is
specified this way to fall between the minimum and maximum
indicated (inclusive). As shorthand, either argument may be left
empty: If so, the minimum/maximum is specified as zero/infinity,
respectively. If only one argument is used (with no comma in
there), exactly that number of occurrences are matched.
>>> from re_show import re_show
>>> s2 = '''aaaaa bbbbb ccccc
... aaa bbb ccc
... aaaaa bbbbbbbbbbbbbb ccccc'''
>>> re_show(r'a{5} b{,6} c{4,8}', s2)
{aaaaa bbbbb ccccc}
aaa bbb ccc
aaaaa bbbbbbbbbbbbbb ccccc
>>> re_show(r'a+ b{3,} c?', s2)
{aaaaa bbbbb c}cccc
{aaa bbb c}cc
{aaaaa bbbbbbbbbbbbbb c}cccc
>>> re_show(r'a{5} b{6,} c{4,8}', s2)
aaaaa bbbbb ccccc
aaa bbb ccc
{aaaaa bbbbbbbbbbbbbb ccccc}
-*-
One powerful option in creating search patterns is specifying
that a subexpression that was matched earlier in a regular
expression is matched again later in the expression. We do
this using backreferences. Backreferences are named by the
numbers 1 through 99, preceded by the backslash/escape
character when used in this manner. These backreferences refer
to each successive group in the match pattern, as in
'(one)(two)(three) \1\2\3'. Each numbered backreference refers
to the group that, in this example, has the word corresponding
to the number.
It is important to note something the example illustrates. What
gets matched by a backreference is the same literal string
matched the first time, even if the pattern that matched the
string could have matched other strings. Simply repeating the
same grouped subexpression later in the regular expression does
not match the same targets as using a backreference (but you have
to decide what it is you actually want to match in either case).
Backreferences refer back to whatever occurred in the previous
grouped expressions, in the order those grouped expressions
occurred. Up to 99 numbered backreferences may be used. However,
Python also allows naming backreferences, which can make it much
clearer what the backreferences are pointing to. The initial
pattern group must begin with '(?P<name>)', and the corresponding
backreference must contain '(?P=name)'.
>>> from re_show import re_show
>>> s2 = '''jkl abc xyz
... jkl xyz abc
... jkl abc abc
... jkl xyz xyz
... '''
>>> re_show(r'(abc|xyz) \1', s2)
jkl abc xyz
jkl xyz abc
jkl {abc abc}
jkl {xyz xyz}
>>> re_show(r'(abc|xyz) (abc|xyz)', s2)
jkl {abc xyz}
jkl {xyz abc}
jkl {abc abc}
jkl {xyz xyz}
>>> re_show(r'(?P<let3>abc|xyz) (?P=let3)', s2)
jkl abc xyz
jkl xyz abc
jkl {abc abc}
jkl {xyz xyz}
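Named groups can also be read back by name after a match, and reused in replacement templates with '\g<name>'; a sketch in modern Python syntax (the group names are arbitrary):

```python
import re

# Retrieve a named group's match by name:
m = re.search(r'(?P<let3>abc|xyz) (?P=let3)', 'jkl xyz xyz')
print(m.group('let3'))

# Use named groups in a replacement template to swap two fields:
print(re.sub(r'(?P<a>\w+)-(?P<b>\w+)', r'\g<b>-\g<a>', 'left-right'))
```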
-*-
Quantifiers in regular expressions are greedy. That is, they
match as much as they possibly can.
Probably the easiest mistake to make in composing regular
expressions is to match too much. When you use a quantifier,
you want it to match everything (of the right sort) up to the
point where you want to finish your match. But when using the
'*', '+', or numeric quantifiers, it is easy to forget that the
last bit you are looking for might occur later in a line than
the one you are interested in.
>>> from re_show import re_show
>>> s2 = '''-- I want to match the words that start
... -- with 'th' and end with 's'.
... this
... thus
... thistle
... this line matches too much
... '''
>>> re_show(r'th.*s', s2)
-- I want to match {the words that s}tart
-- wi{th 'th' and end with 's}'.
{this}
{thus}
{this}tle
{this line matches} too much
-*-
Often if you find that regular expressions are matching too much,
a useful procedure is to reformulate the problem in your mind.
Rather than thinking about, "What am I trying to match later in
the expression?" ask yourself, "What do I need to avoid matching
in the next part?" This often leads to more parsimonious pattern
matches. Often the way to avoid a pattern is to use the
complement operator and a character class. Look at the example,
and think about how it works.
The trick here is that there are two different ways of
formulating almost the same sequence. Either you can think you
want to keep matching -until- you get to XYZ, or you can think you
want to keep matching -unless- you get to XYZ. These are subtly
different.
For people who have thought about basic probability, the same
pattern occurs. The chance of rolling a 6 on a die in one roll is
1/6. What is the chance of rolling a 6 in six rolls? A naive
calculation puts the odds at 1/6+1/6+1/6+1/6+1/6+1/6, or 100
percent. This is wrong, of course (after all, the chance after
twelve rolls isn't 200 percent). The correct calculation is, "How
do I avoid rolling a 6 for six rolls?" (i.e.,
5/6*5/6*5/6*5/6*5/6*5/6, or about 33 percent). The chance of
getting a 6 is the same chance as not avoiding it (or about 66
percent). In fact, if you imagine transcribing a series of die
rolls, you could apply a regular expression to the written
record, and similar thinking applies.
>>> from re_show import re_show
>>> s2 = '''-- I want to match the words that start
... -- with 'th' and end with 's'.
... this
... thus
... thistle
... this line matches too much
... '''
>>> re_show(r'th[^s]*.', s2)
-- I want to match {the words} {that s}tart
-- wi{th 'th' and end with 's}'.
{this}
{thus}
{this}tle
{this} line matches too much
-*-
Not all tools that use regular expressions allow you to modify
target strings. Some simply locate the matched pattern; the
most widely used regular expression tool is probably grep,
which is a tool for searching only. Text editors, for example,
may or may not allow replacement in their regular expression
search facility.
Python, being a general programming language, allows
sophisticated replacement patterns to accompany matches. Since
Python strings are immutable, [re] functions do not modify string
objects in place, but instead return the modified versions. But
as with functions in the [string] module, one can always rebind a
particular variable to the new string object that results from
[re] modification.
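A minimal sketch of that rebinding pattern, in modern Python syntax:

```python
import re

s = 'The cat sat'
result = re.sub('cat', 'dog', s)   # a new string is returned
# the original string object is untouched; rebind the name to keep the change
s = result
print(s)
```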
Replacement examples in this tutorial will call a function
're_new()' that is a wrapper for the module function `re.sub()`.
Original strings will be defined above the call, and the modified
results will appear below the call and with the same style of
additional markup of changed areas as 're_show()' used. Be
careful to notice that the curly braces in the results displayed
will not be returned by standard [re] functions, but are only
added here for emphasis (as is the typography). Simply import the
following function in the examples below:
#---------- re_new.py ----------#
import re
def re_new(pat, rep, s):
    print re.sub(pat, '{'+rep+'}', s)
-*-
Let us take a look at a couple of modification examples that
build on what we have already covered. This one simply
substitutes some literal text for some other literal text. Notice
that `string.replace()` can achieve the same result and will be
faster in doing so.
>>> from re_new import re_new
>>> s = 'The zoo had wild dogs, bobcats, lions, and other wild cats.'
>>> re_new('cat','dog',s)
The zoo had wild dogs, bob{dog}s, lions, and other wild {dog}s.
-*-
Most of the time, if you are using regular expressions to modify a
target text, you will want to match more general patterns than just
literal strings. Whatever is matched is what gets replaced (even if it
is several different strings in the target):
>>> from re_new import re_new
>>> s = 'The zoo had wild dogs, bobcats, lions, and other wild cats.'
>>> re_new('cat|dog','snake',s)
The zoo had wild {snake}s, bob{snake}s, lions, and other wild {snake}s.
>>> re_new(r'[a-z]+i[a-z]*','nice',s)
The zoo had {nice} dogs, bobcats, {nice}, and other {nice} cats.
-*-
It is nice to be able to insert a fixed string everywhere a
pattern occurs in a target text. But frankly, doing that is
not very context sensitive. A lot of times, we do not want
just to insert fixed strings, but rather to insert something
that bears much more relation to the matched patterns.
Fortunately, backreferences come to our rescue here. One can
use backreferences in the pattern matches themselves, but it is
even more useful to be able to use them in replacement
patterns. By using replacement backreferences, one can pick
and choose from the matched patterns to use just the parts of
interest.
As well as backreferencing, the examples below illustrate the
importance of whitespace in regular expressions. In most
programming code, whitespace is merely aesthetic. But the
examples differ solely in an extra space within the arguments
to the second call--and the return value is importantly
different.
>>> from re_new import re_new
>>> s = 'A37 B4 C107 D54112 E1103 XXX'
>>> re_new(r'([A-Z])([0-9]{2,4})',r'\2:\1',s)
{37:A} B4 {107:C} {5411:D}2 {1103:E} XXX
>>> re_new(r'([A-Z])([0-9]{2,4}) ',r'\2:\1 ',s)
{37:A }B4 {107:C }D54112 {1103:E }XXX
-*-
This tutorial has already warned about the danger of matching
too much with regular expression patterns. But the danger is
so much more serious when one does modifications, that it is
worth repeating. If you replace a pattern that matches a
larger string than you thought of when you composed the
pattern, you have potentially deleted some important data from
your target.
It is always a good idea to try out regular expressions on
diverse target data that is representative of production usage.
Make sure you are matching what you think you are matching. A
stray quantifier or wildcard can make a surprisingly wide
variety of texts match what you thought was a specific pattern.
And sometimes you just have to stare at your pattern for a
while, or find another set of eyes, to figure out what is
really going on even after you see what matches. Familiarity
might breed contempt, but it also instills competence.
TOPIC -- Advanced Regular Expression Extensions
--------------------------------------------------------------------
Some very useful enhancements to basic regular expressions are
included with Python (and with many other tools). Many of
these do not strictly increase the power of Python's regular
expressions, but they -do- manage to make expressing them far
more concise and clear.
Earlier in the tutorial, the problems of matching too much were
discussed, and some workarounds were suggested. Python is nice
enough to make this easier by providing optional "non-greedy"
quantifiers. These quantifiers grab as little as possible
while still matching whatever comes next in the pattern
(instead of as much as possible).
Non-greedy quantifiers have the same syntax as regular greedy
ones, except with the quantifier followed by a question mark.
For example, a non-greedy pattern might look like:
'A[A-Z]*?B'. In English, this means "match an A, followed by
only as many capital letters as are needed to find a B."
One little thing to look out for is the fact that the pattern
'[A-Z]*?.' will always match zero capital letters. No longer
matches are ever needed to find the following "any character"
pattern. If you use non-greedy quantifiers, watch out for
matching too little, which is a symmetric danger.
>>> from re_show import re_show
>>> s = '''-- I want to match the words that start
... -- with 'th' and end with 's'.
... this line matches just right
... this # thus # thistle'''
>>> re_show(r'th.*s',s)
-- I want to match {the words that s}tart
-- wi{th 'th' and end with 's}'.
{this line matches jus}t right
{this # thus # this}tle
>>> re_show(r'th.*?s',s)
-- I want to match {the words} {that s}tart
-- wi{th 'th' and end with 's}'.
{this} line matches just right
{this} # {thus} # {this}tle
>>> re_show(r'th.*?s ',s)
-- I want to match {the words }that start
-- with 'th' and end with 's'.
{this }line matches just right
{this }# {thus }# thistle
-*-
Modifiers can be used in regular expressions or as arguments to
many of the functions in [re]. A modifier affects, in one way
or another, the interpretation of a regular expression pattern.
A modifier, unlike an atom, is global to the particular
match--in itself, a modifier doesn't match anything, it instead
constrains or directs what the atoms match.
When used directly within a regular expression pattern, one or
more modifiers begin the whole pattern, as in '(?Limsux)'. For
example, to match the word 'cat' without regard to the case of
the letters, one could use '(?i)cat'. The same modifiers may
be passed in as the last argument as bitmasks (i.e., with a '|'
between each modifier), but only to some functions in the [re]
module, not to all. For example, the two calls below are
equivalent:
>>> import re
>>> re.search(r'(?Li)cat','The Cat in the Hat').start()
4
>>> re.search(r'cat','The Cat in the Hat',re.L|re.I).start()
4
However, some function calls in [re] have no argument for
modifiers. In such cases, you should either use the modifier
prefix pseudo-group or pre-compile the regular expression
rather than use it in string form. For example:
>>> import re
>>> re.split(r'(?i)th','Brillig and The Slithy Toves')
['Brillig and ', 'e Sli', 'y Toves']
>>> re.split(re.compile('th',re.I),'Brillig and the Slithy Toves')
['Brillig and ', 'e Sli', 'y Toves']
See the [re] module documentation for details on which
functions take which arguments.
-*-
The listed modifiers below are used in [re] expressions. Users
of other regular expression tools may be accustomed to a 'g'
option for "global" matching. These other tools take a line of
text as their default unit, and "global" means to match
multiple lines. Python takes the actual passed string as its
unit, so "global" is simply the default. To operate on a
single line, either the regular expressions have to be tailored
to look for appropriate begin-line and end-line characters, or
the strings being operated on should be split first using
`string.split()` or other means.
#*--------- Regular expression modifiers ---------------#
* L (re.L) - Locale customization of \w, \W, \b, \B
* i (re.I) - Case-insensitive match
* m (re.M) - Treat string as multiple lines
* s (re.S) - Treat string as single line
* u (re.U) - Unicode customization of \w, \W, \b, \B
* x (re.X) - Enable verbose regular expressions
The single-line option ("s") allows the wildcard to match a
newline character (it won't otherwise). The multiple-line
option ("m") causes "^" and "$" to match the beginning and end
of each line in the target, not just the begin/end of the
target as a whole (the default). The insensitive option ("i")
ignores differences between the case of letters. The Locale
and Unicode options ("L" and "u") give different
interpretations to the word-boundary ("\b") and alphanumeric
("\w") escaped patterns--and their inverse forms ("\B" and
"\W").
The verbose option ("x") is somewhat different from the others.
Verbose regular expressions may contain nonsignificant
whitespace and inline comments. In a sense, this is also just
a different interpretation of regular expression patterns, but
it allows you to produce far more easily readable complex
patterns. Some examples follow in the sections below.
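As a preview, a sketch in modern Python syntax (the pattern and sample number are invented for illustration): under 're.X', whitespace in the pattern is nonsignificant and '#' begins an inline comment, so each piece of the pattern can be annotated.

```python
import re

# A verbose pattern for a North American phone number, laid out readably:
phone = re.compile(r'''
    \(?(\d{3})\)?     # area code, optional parentheses
    [-\s]?            # optional separator
    (\d{3})           # exchange
    -                 # literal hyphen
    (\d{4})           # line number
    ''', re.X)
print(phone.search('call (212) 555-0187 today').groups())
```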
-*-
Let's take a look first at how case-insensitive and single-line
options change the match behavior.
>>> from re_show import re_show
>>> s = '''MAINE # Massachusetts # Colorado #
... mississippi # Missouri # Minnesota #'''
>>> re_show(r'M.*[ise] ', s)
{MAINE # Massachusetts }# Colorado #
mississippi # {Missouri }# Minnesota #
>>> re_show(r'(?i)M.*[ise] ', s)
{MAINE # Massachusetts }# Colorado #
{mississippi # Missouri }# Minnesota #
>>> re_show(r'(?si)M.*[ise] ', s)
{MAINE # Massachusetts # Colorado #
mississippi # Missouri }# Minnesota #
Looking back at the definition of 're_show()', we can see it was
defined to use the multiline option explicitly, so patterns
displayed with 're_show()' are always multiline. Let us contrast
its display with a couple of examples that use `re.findall()`
instead.
>>> from re_show import re_show
>>> s = '''MAINE # Massachusetts # Colorado #
... mississippi # Missouri # Minnesota #'''
>>> re_show(r'(?im)^M.*[ise] ', s)
{MAINE # Massachusetts }# Colorado #
{mississippi # Missouri }# Minnesota #
>>> import re
>>> re.findall(r'(?i)^M.*[ise] ', s)
['MAINE # Massachusetts ']
>>> re.findall(r'(?im)^M.*[ise] ', s)
['MAINE # Massachusetts ', 'mississippi # Missouri ']
-*-
Matching word characters and word boundaries depends on exactly
what gets counted as being alphanumeric. Character codepages
for letters outside the (US-English) ASCII range differ among
national alphabets. A Python installation is configured for a
particular locale, and regular expressions can optionally use
the current locale when matching words.
Of greater long-term significance is the [re] module's ability
(after Python 2.0) to look at the Unicode categories of
characters, and decide whether a character is alphabetic based on
that category. Locale settings work OK for European diacritics,
but for non-Roman sets, Unicode is clearer and less error prone.
The "u" modifier controls whether Unicode alphabetic characters
are recognized or merely ASCII ones:
>>> import re
>>> alef, omega = unichr(1488), unichr(969)
>>> u = alef +' A b C d '+omega+' X y Z'
>>> u, len(u.split()), len(u)
(u'\u05d0 A b C d \u03c9 X y Z', 9, 17)
>>> ':'.join(re.findall(ur'\b\w\b', u))
u'A:b:C:d:X:y:Z'
>>> ':'.join(re.findall(ur'(?u)\b\w\b', u))
u'\u05d0:A:b:C:d:\u03c9:X:y:Z'
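As an aside that goes beyond the Python 2.0 text: in Python 3
the default is reversed, with string patterns Unicode-aware from
the start and an ASCII flag available to restrict them. A small
sketch of the same distinction under those newer semantics:

```python
import re

u = '\u05d0 A b C d \u03c9 X y Z'   # alef ... omega, as above

# Python 3 str patterns treat \w and \b as Unicode-aware by default
assert re.findall(r'\b\w\b', u) == \
       ['\u05d0', 'A', 'b', 'C', 'd', '\u03c9', 'X', 'y', 'Z']

# The inline flag (?a) (re.ASCII) restores ASCII-only semantics
assert re.findall(r'(?a)\b\w\b', u) == ['A', 'b', 'C', 'd', 'X', 'y', 'Z']
```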
-*-
Backreferencing in replacement patterns is very powerful, but a
complex regular expression can easily contain many groups, whose
numbering becomes confusing to keep track of. It is often more
legible to refer to the parts of a replacement pattern in
sequential order. To handle this issue, Python's [re] patterns
allow "grouping without backreferencing."
A group that should not also be treated as a backreference has
a question mark colon at the beginning of the group, as in
'(?:pattern)'. In fact, you can use this syntax even when your
backreferences are in the search pattern itself:
>>> from re_new import re_new
>>> s = 'A-xyz-37 # B:abcd:142 # C-wxy-66 # D-qrs-93'
>>> re_new(r'([A-Z])(?:-[a-z]{3}-)([0-9]*)', r'\1\2', s)
{A37} # B:abcd:142 # {C66} # {D93}
>>> # Groups that are not of interest excluded from backref
...
>>> re_new(r'([A-Z])(-[a-z]{3}-)([0-9]*)', r'\1\2', s)
{A-xyz-} # B:abcd:142 # {C-wxy-} # {D-qrs-}
>>> # One could lose track of groups in a complex pattern
...
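The effect on group numbering is easy to verify with the
standard `re.match()` (a small sketch, separate from the
're_new' examples):

```python
import re

s = 'A-xyz-37'

# With (?:...), the middle group takes no number: \1='A', \2='37'
m = re.match(r'([A-Z])(?:-[a-z]{3}-)([0-9]*)', s)
assert m.groups() == ('A', '37')

# With plain parentheses, the middle group shifts the numbering
m2 = re.match(r'([A-Z])(-[a-z]{3}-)([0-9]*)', s)
assert m2.groups() == ('A', '-xyz-', '37')
```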
-*-
Python offers a particularly handy syntax for really complex
pattern backreferences. Rather than just play with the
numbering of matched groups, you can give them a name. Above
we pointed out the syntax for named backreferences in the
pattern space; for example, '(?P=name)'. However, a slightly
different syntax is necessary in replacement patterns. For that,
we use
the '\g' operator along with angle brackets and a name. For
example:
>>> from re_new import re_new
>>> s = "A-xyz-37 # B:abcd:142 # C-wxy-66 # D-qrs-93"
>>> re_new(r'(?P<prefix>[A-Z])(-[a-z]{3}-)(?P<id>[0-9]*)',
...         r'\g<prefix>\g<id>', s)
{A37} # B:abcd:142 # {C66} # {D93}
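Since 're_new()' is a display helper defined earlier in the
book, the same substitution can also be performed with
`re.sub()` directly, which makes the '\g<name>' syntax easy to
try out:

```python
import re

s = 'A-xyz-37 # B:abcd:142 # C-wxy-66 # D-qrs-93'

# Named groups in the pattern, '\g<name>' in the replacement
result = re.sub(r'(?P<prefix>[A-Z])(-[a-z]{3}-)(?P<id>[0-9]*)',
                r'\g<prefix>\g<id>', s)
assert result == 'A37 # B:abcd:142 # C66 # D93'
```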
-*-
Another trick of advanced regular expression tools is
"lookahead assertions." These are similar to regular grouped
subexpressions, except they do not actually grab what they
match. There are two advantages to using lookahead assertions.
On the one hand, a lookahead assertion can function in a
similar way to a group that is not backreferenced; that is, you
can match something without counting it in backreferences.
More significantly, however, a lookahead assertion can specify
that the next chunk of a pattern has a certain form, but let a
different (more general) subexpression actually grab it
(usually for purposes of backreferencing that other
subexpression).
There are two kinds of lookahead assertions: positive and
negative. As you would expect, a positive assertion specifies
that something does come next, and a negative one specifies
that something does not come next. Emphasizing their
connection with non-backreferenced groups, the syntax for
lookahead assertions is similar: '(?=pattern)' for positive
assertions, and '(?!pattern)' for negative assertions.
>>> from re_new import re_new
>>> s = 'A-xyz37 # B-ab6142 # C-Wxy66 # D-qrs93'
>>> # Assert that three lowercase letters occur after CAP-DASH
...
>>> re_new(r'([A-Z]-)(?=[a-z]{3})([\w\d]*)', r'\2\1', s)
{xyz37A-} # B-ab6142 # C-Wxy66 # {qrs93D-}
>>> # Assert three lowercase letts do NOT occur after CAP-DASH
...
>>> re_new(r'([A-Z]-)(?![a-z]{3})([\w\d]*)', r'\2\1', s)
A-xyz37 # {ab6142B-} # {Wxy66C-} # D-qrs93
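The same assertions can be checked with plain `re.findall()`,
without the custom 're_new()' wrapper (a sketch using a slightly
simpler pattern):

```python
import re

s = 'A-xyz37 # B-ab6142 # C-Wxy66 # D-qrs93'

# Positive lookahead: keep fields where CAP-DASH precedes three
# lowercase letters (the letters are then re-matched by \w*)
assert re.findall(r'[A-Z]-(?=[a-z]{3})\w*', s) == ['A-xyz37', 'D-qrs93']

# Negative lookahead: the complementary set of fields
assert re.findall(r'[A-Z]-(?![a-z]{3})\w*', s) == ['B-ab6142', 'C-Wxy66']
```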
-*-
Along with lookahead assertions, Python 2.0+ adds "lookbehind
assertions." The idea is similar--a pattern is of interest
only if it is (or is not) preceded by some other pattern.
Lookbehind assertions are somewhat more restricted than
lookahead assertions because they may only look backwards by a
fixed number of character positions. In other words, no
general quantifiers are allowed in lookbehind assertions.
Still, some patterns are most easily expressed using lookbehind
assertions.
As with lookahead assertions, lookbehind assertions come in a
negative and a positive flavor. The former assures that a certain
pattern does -not- precede the match, the latter assures that
the pattern -does- precede the match.
>>> from re_show import re_show
>>> re_show('Man', 'Manhandled by The Man')
{Man}handled by The {Man}
>>> re_show('(?<=The )Man', 'Manhandled by The Man')
Manhandled by The {Man}
>>> re_show('(?<!The )Man', 'Manhandled by The Man')
{Man}handled by The Man
-*-
To close out the tutorial, let us combine several of the
features discussed--including the verbose modifier--in a single
pattern that identifies URLs within a text:
>>> from re_show import re_show
>>> s = '''The URL for my site is: http://mysite.com/mydoc.html. You
... might also enjoy ftp://yoursite.com/index.html for a good
... place to download files.'''
>>> pat = r''' (?x)( # verbose identify URLs within text
... (http|ftp|gopher) # make sure we find a resource type
... :// # ...needs to be followed by colon-slash-slash
... [^ \n\r]+ # some stuff then space, newline, tab is URL
... \w # URL always ends in alphanumeric char
... (?=[\s\.,]) # assert: followed by whitespace/period/comma
... ) # end of match group'''
>>> re_show(pat, s)
The URL for my site is: {http://mysite.com/mydoc.html}. You
might also enjoy {ftp://yoursite.com/index.html} for a good
place to download files.
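The same pattern, collapsed onto one line and used with
`re.findall()` rather than the 're_show()' display helper,
extracts the URLs themselves (a sketch; `re.findall()` returns
tuples here because the pattern contains two groups):

```python
import re

s = ('The URL for my site is: http://mysite.com/mydoc.html. You '
     'might also enjoy ftp://yoursite.com/index.html for a good '
     'place to download files.')

pat = r'((http|ftp|gopher)://[^ \n\r]+\w(?=[\s.,]))'
urls = [groups[0] for groups in re.findall(pat, s)]
assert urls == ['http://mysite.com/mydoc.html',
                'ftp://yoursite.com/index.html']
```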
SECTION 1 -- Some Common Tasks
------------------------------------------------------------------------
PROBLEM: Making a text block flush left
--------------------------------------------------------------------
For visual clarity or to identify the role of text, blocks of
text are often indented--especially in prose-oriented documents
(but log files, configuration files, and the like might also
have unused initial fields). For downstream purposes,
indentation is often irrelevant, or even outright
incorrect, since the indentation is not part of the text itself
but only a decoration of the text. However, it often makes
matters even worse to perform the very most naive
transformation of indented text--simply remove leading
whitespace from every line. While block indentation may be
decoration, the relative indentations of lines within blocks
may serve important or essential functions (for example, the
blocks of text might be Python source code).
The general procedure you need to take in maximally unindenting
a block of text is fairly simple. But it is easy to throw more
code at it than is needed, and arrive at some inelegant and
slow nested loops of `string.find()` and `string.replace()`
operations. A bit of cleverness in the use of regular
expressions--combined with the conciseness of a functional
programming (FP) style--can give you a quick, short, and direct
transformation.
#---------- flush_left.py ----------#
# Remove as many leading spaces as possible from whole block
from re import findall,sub
# What is the minimum line indentation of a block?
indent = lambda s: reduce(min,map(len,findall('(?m)^ *(?=\S)',s)))
# Remove the block-minimum indentation from each line?
flush_left = lambda s: sub('(?m)^ {%d}' % indent(s),'',s)
if __name__ == '__main__':
    import sys
    print flush_left(sys.stdin.read())
The 'flush_left()' function assumes that blocks are indented
with spaces. If tabs are used--or used combined with
spaces--an initial pass through the utility 'untabify.py' (which
can be found at '$PYTHONPATH/tools/scripts/') can convert
blocks to space-only indentation.
A helpful adjunct to 'flush_left()' is likely to be the
'reformat_para()' function that was presented in Chapter 2,
Problem 2. Between the two of these, you could get a good part of
the way towards a "batch-oriented word processor." (What other
capabilities would be most useful?)
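A short worked example traces the two lambdas in
'flush_left.py' (written so it also runs under Python 3, where
`reduce()` lives in [functools]):

```python
import re
from functools import reduce   # a builtin in Python 2

block = '    def f():\n        return 42\n'

# Minimum indentation: lengths of leading-space runs, reduced by min
indent = reduce(min, map(len, re.findall(r'(?m)^ *(?=\S)', block)))
assert indent == 4

# Strip exactly that many spaces from the front of every line
assert re.sub(r'(?m)^ {%d}' % indent, '', block) == \
       'def f():\n    return 42\n'
```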
PROBLEM: Summarizing command-line option documentation
--------------------------------------------------------------------
Documentation of command-line options to programs is usually
in semi-standard formats in places like manpages, docstrings,
READMEs and the like. In general, within documentation you
expect to see command-line options indented a bit, followed by
a bit more indentation, followed by one or more lines of
description, and usually ended by a blank line. This style is
readable for users browsing documentation, but is of
sufficient complexity and variability that regular
expressions are well suited to finding the right descriptions
(simple string methods fall short).
A specific scenario where you might want a summary of
command-line options is as an aid to understanding
configuration files that call multiple child commands. The
file '/etc/inetd.conf' on Unix-like systems is a good example
of such a configuration file. Moreover, configuration files
themselves often have enough complexity and variability within
them that simple string methods have difficulty parsing them.
The utility below will look for every service launched by
'/etc/inetd.conf' and present to STDOUT summary documentation
of all the options used when the services are started.
#---------- show_services.py ----------#
import re, os, string, sys

def show_opts(cmdline):
    args = string.split(cmdline)
    cmd = args[0]
    if len(args) > 1:
        opts = args[1:]
        # might want to check error output, so use popen3()
        (in_, out_, err) = os.popen3('man %s | col -b' % cmd)
        manpage = out_.read()
        if len(manpage) > 2:    # found actual documentation
            print '\n%s' % cmd
            for opt in opts:
                pat_opt = r'(?sm)^\s*'+opt+r'.*?(?=\n\n)'
                opt_doc = re.search(pat_opt, manpage)
                if opt_doc is not None:
                    print opt_doc.group()
                else:           # try harder for something relevant
                    mentions = []
                    for para in string.split(manpage,'\n\n'):
                        if re.search(opt, para):
                            mentions.append('\n%s' % para)
                    if not mentions:
                        print '\n ',opt,' '*9,'Option docs not found'
                    else:
                        print '\n ',opt,' '*9,'Mentioned in below para:'
                        print '\n'.join(mentions)
        else:                   # no manpage available
            print cmdline
            print '  No documentation available'

def services(fname):
    conf = open(fname).read()
    pat_srv = r'''(?xm)(?=^[^#])         # lns that are not commented out
                  (?:(?:[\w/]+\s+){6})   # first six fields ignored
                  (.*$)                  # to end of ln is servc launch'''
    return re.findall(pat_srv, conf)

if __name__ == '__main__':
    for service in services(sys.argv[1]):
        show_opts(service)
The particular tasks performed by 'show_opts()' and 'services()'
are somewhat specific to Unix-like systems, but the general
techniques are more broadly applicable. For example, the
particular comment character and number of fields in
'/etc/inetd.conf' might be different for other launch scripts,
but the use of regular expressions to find the launch commands
would apply elsewhere. If the 'man' and 'col' utilities are not
on the relevant system, you might do something equivalent, such
as reading in the docstrings from Python modules with similar
option descriptions (most of the samples in '$PYTHONPATH/tools/'
use compatible documentation, for example).
Another thing worth noting is that even where regular expressions
are used in parsing some data, you need not do everything with
regular expressions. The simple `string.split()` operation to
identify paragraphs in 'show_opts()' is still the quickest and
easiest technique, even though `re.split()` could do the same
thing.
Note: Along the lines of paragraph splitting, here is a thought
problem. What is a regular expression that matches every whole
paragraph that contains within it some smaller pattern 'pat'? For
purposes of the puzzle, assume that a paragraph is some text that
both starts and ends with doubled newlines ("\n\n").
PROBLEM: Detecting duplicate words
--------------------------------------------------------------------
A common typo in prose texts is doubled words (hopefully they
have been edited out of this book except in those few cases
where they are intended). The same error occurs to a lesser
extent in programming language code, configuration files, or
data feeds. Regular expressions are well-suited to detecting
this occurrence, which just amounts to a backreference to a
word pattern. It's easy to wrap the regex in a small utility
with a few extra features:
#---------- dupwords.py ----------#
# Detect doubled words and display with context
# Include words doubled across lines but within paras
import sys, re, glob

for pat in sys.argv[1:]:
    for file in glob.glob(pat):
        newfile = 1
        for para in open(file).read().split('\n\n'):
            dups = re.findall(r'(?m)(^.*(\b\w+\b)\s*\b\2\b.*$)', para)
            if dups:
                if newfile:
                    print '%s\n%s\n' % ('-'*70, file)
                    newfile = 0
                for dup in dups:
                    print '[%s] -->' % dup[1], dup[0]
This particular version grabs the line or lines on which
duplicates occur and prints them for context (along with a prompt
for the duplicate itself). Variations are straightforward. The
assumption made by 'dupwords.py' is that a doubled word that
spans a line (from the end of one to the beginning of another,
ignoring whitespace) is a real doubling; but a duplicate that
spans paragraphs is not likewise noteworthy.
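The heart of 'dupwords.py' is the single `re.findall()` pattern;
a compact check (with invented sample text) exercises both the
same-line and the line-spanning case:

```python
import re

para = ('This line has has a doubled word,\n'
        'and one dup spans\nspans lines too')

# Within a paragraph, \s* lets the backreference \2 match a word
# repeated across a newline, not just on a single line
dups = re.findall(r'(?m)(^.*(\b\w+\b)\s*\b\2\b.*$)', para)
assert [d[1] for d in dups] == ['has', 'spans']
```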
PROBLEM: Checking for server errors
--------------------------------------------------------------------
Web servers are a ubiquitous source of information nowadays.
But finding URLs that lead to real documents is largely
hit-or-miss. Every Web maintainer seems to reorganize her site
every month or two, thereby breaking bookmarks and hyperlinks.
As bad as the chaos is for plain Web surfers, it is worse for
robots faced with the difficult task of recognizing the
difference between content and errors. By-the-by, it is easy
to accumulate downloaded Web pages that consist of error
messages rather than desired content.
In principle, Web servers can and should return error codes
indicating server errors. But in practice, Web servers almost
always return dynamically generated results pages for erroneous
requests. Such pages are basically perfectly normal HTML pages
that just happen to contain text like "Error 404: File not
found!" Most of the time these pages are a bit fancier than
this, containing custom graphics and layout, links to site
homepages, JavaScript code, cookies, meta tags, and all sorts
of other stuff. It is actually quite amazing just how much
content many Web servers send in response to requests for
nonexistent URLs.
Below is a very simple Python script to examine just what Web
servers return on valid or invalid requests. Getting an error
page is usually as simple as asking for a page called
'http://somewebsite.com/phony-url' or the like (anything that
doesn't really exist). [urllib] is discussed in Chapter 5, but
its details are not important here.
#---------- url_examine.py ----------#
import sys
from urllib import urlopen

if len(sys.argv) > 1:
    fpin = urlopen(sys.argv[1])
    print fpin.geturl()
    print fpin.info()
    print fpin.read()
else:
    print "No specified URL"
Given the diversity of error pages you might receive, it is
difficult or impossible to create a regular expression (or any
program) that determines with certainty whether a given HTML
document is an error page. Furthermore, some sites choose to
generate pages that are not really quite errors, but not
really quite content either (e.g., generic directories of site
information with suggestions on how to get to content). But
some heuristics come quite close to separating content from
errors. One noteworthy heuristic is that the interesting
errors are almost always 404 or 403 (not a sure thing, but good
enough to make smart guesses). Below is a utility to rate the
"error probability" of HTML documents:
#---------- error_page.py ----------#
import re, sys
page = sys.stdin.read()
# Mapping from patterns to probability contribution of pattern
err_pats = {r'(?is)<title>.*?(404|403).*?ERROR.*?</title>': 0.95,
            r'(?is)<title>.*?ERROR.*?(404|403).*?</title>': 0.95,
            r'(?is)<title>ERROR</title>': 0.30,
            r'(?is)<title>.*?ERROR.*?</title>': 0.10,
            r'(?is)<meta .*?(404|403).*?ERROR.*?>': 0.80,
            r'(?is)<meta .*?ERROR.*?(404|403).*?>': 0.80,
            r'(?is)<title>.*?File Not Found.*?</title>': 0.80,
            r'(?is)<title>.*?Not Found.*?</title>': 0.40,
            r'(?is)<body.*?(404|403).*?</body>': 0.10,
            r'(?is)<h1>.*?(404|403).*?</h1>': 0.15,
            r'(?is)<body.*?not found.*?</body>': 0.10,
            r'(?is)<h1>.*?not found.*?</h1>': 0.15,
            r'(?is)<body.*?the requested URL.*?</body>': 0.10,
            r'(?is)<body.*?the page you requested.*?</body>': 0.10,
            r'(?is)<body.*?page.{0,50}unavailable.*?</body>': 0.10,
            r'(?is)<body.*?site.{0,50}unavailable.*?</body>': 0.10,
            r'(?i)does not exist': 0.10,
           }
err_score = 0
for pat, prob in err_pats.items():
    if err_score > 0.9: break
    if re.search(pat, page):
        # print pat, prob
        err_score += prob
if err_score > 0.90:   print 'Page is almost surely an error report'
elif err_score > 0.75: print 'It is highly likely page is an error report'
elif err_score > 0.50: print 'Better-than-even odds page is error report'
elif err_score > 0.25: print 'Fair indication page is an error report'
else: print 'Page is probably real content'
Tested against a fair number of sites, a collection like this of
regular expression searches and threshold confidences works
quite well. Within the author's own judgment of just what is
really an error page, 'error_page.py' has gotten no false
positives and always arrived at at least the lowest warning
level for every true error page.
The patterns chosen are all fairly simple, and both the
patterns and their weightings were determined entirely
subjectively by the author. But something like this weighted
hit-or-miss technique can be used to solve many "fuzzy logic"
matching problems (most having nothing to do with Web server
errors).
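Reduced to its skeleton, the weighted hit-or-miss technique is
just a sum over pattern searches. The patterns and weights below
are simplified stand-ins, not those of 'error_page.py':

```python
import re

# Hypothetical weights; each piece of evidence adds its contribution
weights = {r'(?i)error': 0.3, r'(?i)404': 0.4, r'(?i)not found': 0.4}

def score(page):
    return sum(w for pat, w in weights.items() if re.search(pat, page))

assert score('<title>Error 404: Not Found</title>') > 0.9
assert score('Welcome to my homepage') == 0
```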
Code like that above can form a general approach to more
complete applications. But for what it is worth, the scripts
'url_examine.py' and 'error_page.py' may be used directly
together by piping from the first to the second. For example:
#*------ Using error_page.py -----#
% python url_examine.py http://gnosis.cx/nonesuch | python error_page.py
Page is almost surely an error report
PROBLEM: Reading lines with continuation characters
--------------------------------------------------------------------
Many configuration files and other types of computer code are
line oriented, but also have a facility to treat multiple lines
as if they were a single logical line. In processing such a
file it is usually desirable as a first step to turn all these
logical lines into actual newline-delimited lines (or more
likely, to transform both single and continued lines as
homogeneous list elements to iterate through later). A
continuation character is generally required to be the -last-
thing on a line before a newline, or possibly the last thing
other than some whitespace. A small (and very partial) table
of continuation characters used by some common and uncommon
formats is listed below:
#*----- Common continuation characters -----#
\ Python, JavaScript, C/C++, Bash, TCL, Unix config
_ Visual Basic, PAW
& Lyris, COBOL, IBIS
; Clipper, TOP
- XSPEC, NetREXX
= Oracle Express
Most of the formats listed are programming languages, and
parsing them takes quite a bit more than just identifying the
lines. More often, it is configuration files of various sorts
that are of interest in simple parsing, and most of the time
these files use a common Unix-style convention of using
trailing backslashes for continuation lines.
One -could- manage to parse logical lines with a [string]
module approach that looped through lines and performed
concatenations when needed. But a greater elegance is served
by reducing the problem to a single regular expression. The
module below provides this:
#---------- logical_lines.py ----------#
# Determine the logical lines in a file that might have
# continuation characters. 'logical_lines()' returns a
# list. The self-test prints the logical lines as
# physical lines (for all specified files and options).
import re
def logical_lines(s, continuation='\\', strip_trailing_space=0):
    c = re.escape(continuation)
    if strip_trailing_space:
        s = re.sub(r'(?m)(%s)(\s+)$' % c, r'\1', s)
    pat_log = r'(?sm)^.*?$(?<!%s)' % c     # e.g. r'(?sm)^.*?$(?<!\\)'
    return [t.replace(continuation+'\n', '')
            for t in re.findall(pat_log, s)]
if __name__ == '__main__':
    import sys
    strip = '--strip' in sys.argv[1:]
    for fname in [a for a in sys.argv[1:] if not a.startswith('--')]:
        for line in logical_lines(open(fname).read(),
                                  strip_trailing_space=strip):
            print line
The regular expression 'pat_log' is compact enough to deserve a
gloss. In verbose form, with the default backslash continuation
character, it reads:
>>> pat = r'''
... (?x) # This is the verbose version
... (?s) # In the pattern, let "." match newlines, if needed
... (?m) # Allow ^ and $ to match every begin- and end-of-line
... ^ # Start the match at the beginning of a line
... .*? # Non-greedily grab everything until the first place
... # where the rest of the pattern matches (if possible)
... $ # End the match at an end-of-line
... (?<!\\)  # ...but not an end-of-line that is preceded
...          # by the continuation character
... '''
PROBLEM: Identifying URLs and email addresses in texts
--------------------------------------------------------------------
A common task in working with arbitrary texts is to pick out the
URLs and email addresses that occur within them. Both follow
patterns regular enough for regular expressions to locate them
with good reliability, although a few subtleties are involved.
The utility below scans input files for both kinds of resource
and writes what it finds to STDOUT:
#---------- find_urls.py ----------#
# Functions to identify and extract URLs and email addresses
import re, fileinput
pat_url = re.compile(r'''
                 (?x)( # verbose identify URLs within text
     (http|ftp|gopher) # make sure we find a resource type
                   :// # ...needs to be followed by colon-slash-slash
        (\w+[:.]?){2,} # at least two domain groups, e.g. (gnosis.)(cx)
                  (/?| # could be just the domain name (maybe w/ slash)
            [^ \n\r"]+ # or stuff then space, newline, tab, quote
                [\w/]) # resource name ends in alphanumeric or slash
     (?=[\s\.,>)'"\]]) # assert: followed by white or clause ending
                     ) # end of match group
                       ''')
pat_email = re.compile(r'''
                    (?xm) # verbose identify URLs in text (and multiline)
                (?=^.{11} # Mail header matcher
        (?<!Message-ID:|  # rule out Message-ID's as best possible
            In-Reply-To)) # ...and also In-Reply-To
                   (.*?)( # must grab to email to allow prior lookbehind
       ([A-Za-z0-9-]+\.)? # maybe an initial part: DAVID.mertz@gnosis.cx
          [A-Za-z0-9-]+   # definitely some local user: MERTZ@gnosis.cx
                        @ # ...needs an at sign in the middle
             (\w+\.?){2,} # at least two domain groups, e.g. (gnosis.)(cx)
        (?=[\s\.,>)'"\]]) # assert: followed by white or clause ending
                        ) # end of match group
                          ''')
extract_urls = lambda s: [u[0] for u in re.findall(pat_url, s)]
extract_email = lambda s: [(e[1]) for e in re.findall(pat_email, s)]
if __name__ == '__main__':
for line in fileinput.input():
urls = extract_urls(line)
if urls:
for url in urls:
print fileinput.filename(),'=>',url
emails = extract_email(line)
if emails:
for email in emails:
print fileinput.filename(),'->',email
A number of features are notable in the utility above. One point
is that everything interesting is done within the regular
expressions themselves. The actual functions 'extract_urls()' and
'extract_email()' are each a single line, using the conciseness
of functional-style programming, especially list comprehensions
(four or five lines of more procedural code could be used, but
this style helps emphasize where the work is done). The utility
itself prints located resources to STDOUT, but you could do
something else with them just as easily.
A bit of testing of preliminary versions of the regular
expressions led me to add a few complications to them. In part
this lets readers see some more exotic features in action; but in
greater part, this helps weed out what I would consider "false
positives." For URLs we demand at least two domain groups--this
rules out LOCALHOST addresses, if present. However, by allowing a
colon to end a domain group, we allow for specified ports such as
'http://gnosis.cx:8080/resource/'.
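The domain-group subpattern can be checked in isolation (a
simplified fragment of the idea, not the full 'pat_url'):

```python
import re

# (\w+[:.]?){2,} -- two or more word-groups, each optionally ended
# by a dot (subdomain separator) or colon (port specification)
pat = r'(http|ftp)://((\w+[:.]?){2,})'
m = re.match(pat, 'http://gnosis.cx:8080/resource/')
assert m is not None
assert m.group(2) == 'gnosis.cx:8080'
```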
Email addresses have one particular special consideration. If
the files you are scanning for email addresses happen to be
actual mail archives, you will also find Message-ID strings.
The form of these headers is very similar to that of email
addresses ('In-Reply-To:' headers also contain Message-IDs).
By combining a negative lookbehind assertion with some
throwaway groups, we can make sure that everything that gets
extracted is not a 'Message-ID:' header line. It gets a little
complicated to combine these things correctly, but the power of
it is quite remarkable.
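A stripped-down version of the trick shows why the pattern
matches exactly eleven characters from the start of a line
before applying the lookbehind: both excluded header names
happen to be eleven characters wide, satisfying Python's
fixed-width lookbehind rule. This sketch substitutes a much
simplified email pattern for the real one:

```python
import re

# Match 11 chars from line start, then assert they were NOT one of
# the headers; each lookbehind alternative is exactly 11 chars
pat = r'(?m)(?=^.{11}(?<!Message-ID:|In-Reply-To))^.*?(\w+@\w+\.\w+)'
text = 'Message-ID: <x@y.z>\nFrom: someone <david@gnosis.cx>'
assert re.findall(pat, text) == ['david@gnosis.cx']
```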
PROBLEM: Pretty-printing numbers
--------------------------------------------------------------------
In producing human-readable documents, Python's default string
representation of numbers leaves something to be desired.
Specifically, the delimiters that normally occur between powers
of 1,000 in written large numerals are not produced by the
`str()` or `repr()` functions--which makes reading large
numbers difficult. For example:
>>> budget = 12345678.90
>>> print 'The company budget is $%s' % str(budget)
The company budget is $12345678.9
>>> print 'The company budget is %10.2f' % budget
The company budget is 12345678.90
Regular expressions can be used to transform numbers that are
already "stringified" (an alternative would be to process
numeric values by repeated division/remainder operations,
stringifying the chunks). A few basic utility functions are
contained in the module below.
#---------- pretty_nums.py ----------#
# Create/manipulate grouped string versions of numbers
import re
def commify(f, digits=2, maxgroups=5, european=0):
    template = '%%1.%df' % digits
    s = template % f
    pat = re.compile(r'(\d+)(\d{3})([.,]|$)([.,\d]*)')
    if european:
        repl = r'\1.\2\3\4'
    else:    # could also use locale.localeconv()['decimal_point']
        repl = r'\1,\2\3\4'
    for i in range(maxgroups):
        s = re.sub(pat, repl, s)
    return s

def uncommify(s):
    return s.replace(',','')

def eurify(s):
    s = s.replace('.','\000')   # place holder
    s = s.replace(',','.')      # change group delimiter
    s = s.replace('\000',',')   # decimal delimiter
    return s

def anglofy(s):
    s = s.replace(',','\000')   # place holder
    s = s.replace('.',',')      # change group delimiter
    s = s.replace('\000','.')   # decimal delimiter
    return s

vals = (12345678.90, 23456789.01, 34567890.12)
sample = '''The company budget is $%s.
Its debt is $%s, against assets
of $%s'''

if __name__ == '__main__':
    print sample % vals, '\n-----'
    print sample % tuple(map(commify, vals)), '\n-----'
    print eurify(sample % tuple(map(commify, vals))), '\n-----'
The technique used in 'commify()' has virtues and vices. It is
quick, simple, and it works. It is also slightly kludgey
inasmuch as it loops through the substitution (and with the
default 'maxgroups' argument, it is no good for numbers bigger
than a quintillion; most numbers you encounter are smaller
than this). If purity is a goal--and it probably should not
be--you could probably come up with a single regular expression
to do the whole job. Another quick and convenient technique is
the "place holder" idea that was mentioned in the introductory
discussion of the [string] module.
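As for the purist's single-regular-expression version alluded to
above, a zero-width lookahead can do the whole job in one pass.
This is one possible spelling, not the book's:

```python
import re

def commify_once(s):
    # Insert a comma after any digit that is followed by complete
    # groups of three digits running up to a decimal point or the end
    return re.sub(r'(?<=\d)(?=(\d{3})+(?:\.|$))', ',', s)

assert commify_once('12345678.90') == '12,345,678.90'
assert commify_once('1234567') == '1,234,567'
```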
SECTION 2 -- Standard Modules
------------------------------------------------------------------------
TOPIC -- Versions and Optimizations
--------------------------------------------------------------------
Rules of Optimization:
Rule 1: Don't do it.
Rule 2 (for experts only): Don't do it yet.
-- M.A. Jackson
Python has undergone several changes in its regular expression
support. [regex] was superseded by [pre] in Python 1.5; [pre],
in turn, by [sre] in Python 2.0. Although Python has continued
to include the older modules in its standard library for
backwards compatibility, the older ones are deprecated when the
newer versions are included. From Python 1.5 forward, the
module [re] has served as a wrapper to the underlying regular
expression engine ([sre] or [pre]). But even though Python
2.0+ has used [re] to wrap [sre], [pre] is still available (the
latter along with its own underlying [pcre] C extension
module that can technically be used directly).
Each version has generally improved upon its predecessor, but
with something as complicated as regular expressions there are
always a few losses with each gain. For example, [sre] adds
Unicode support and is faster for most operations--but [pre]
has better optimization of case-insensitive searches. Subtle
details of regular expression patterns might even let the
quite-old [regex] module perform faster than the newer ones.
Moreover, optimizing regular expressions can be extremely
complicated and dependent upon specific small version
differences.
Readers might start to feel their heads swim with these version
details. Don't panic. Other than out of historic interest,
you really do not need to worry about what implementations
underlie regular expression support. The simple rule is just
to use the module [re] and not think about what it wraps--the
interface is compatible between versions.
The real virtue of regular expressions is that they allow a
concise and precise (albeit somewhat cryptic) description of
complex patterns in text. Most of the time, regular expression
operations are -fast enough-; there is rarely any point in
optimizing an application past the point where it does what it
needs to do fast enough that speed is not a problem. As Knuth
famously remarks, "We should forget about small efficiencies, say
about 97% of the time: Premature optimization is the root of all
evil." ("Computer Programming as an Art" in _Literate
Programming_, CSLI Lecture Notes Number 27, Stanford University
Center for the Study of Languages and Information, 1992).
In case regular expression operations prove to be a genuinely
problematic performance bottleneck in an application, there are
four steps you should take in speeding things up. Try these in
order:
1. Think about whether there is a way to simplify the regular
expressions involved. Most especially, is it possible to
reduce the likelihood of backtracking during pattern
matching? You should always test your beliefs about such
simplification, however; performance characteristics rarely
turn out exactly as you expect.
2. Consider whether regular expressions are -really- needed
for the problem at hand. With surprising frequency, faster
and simpler operations in the [string] module (or,
occasionally, in other modules) do what needs to be done.
Actually, this step can often come earlier than the first
one.
3. Write the search or transformation in a faster and
lower-level engine, especially [mx.TextTools]. Low-level
modules will inevitably involve more work and considerably
more intense thinking about the problem. But
order-of-magnitude speed gains are often possible for the
work.
4. Code the application (or the relevant parts of it) in a
different programming language. If speed is the absolutely
first consideration in an application, Assembly, C, or C++
are going to win. Tools like swig--while outside the scope
of this book--can help you create custom extension modules
to perform bottleneck operations. There is a chance also
that if the problem -really must- be solved with regular
expressions that Perl's engine will be faster (but not
always, by any means).
TOPIC -- Simple Pattern Matching
--------------------------------------------------------------------
=================================================================
MODULE -- fnmatch : Glob-style pattern matching
=================================================================
The real purpose of the [fnmatch] module is to match filenames
against a pattern. Most typically, [fnmatch] is used indirectly
through the [glob] module, where the latter returns lists of
matching files (for example to process each matching file). But
[fnmatch] does not itself know anything about filesystems, it
simply provides a way of checking patterns against strings. The
pattern language used by [fnmatch] is much simpler than that used
by [re], which can be either good or bad, depending on your
needs. As a plus, most everyone who has used a DOS, Windows,
OS/2, or Unix command line is already familiar with the [fnmatch]
pattern language, which is simply shell-style expansions.
Four subpatterns are available in [fnmatch] patterns. In contrast
to [re] patterns, there is no grouping and no quantifiers.
Obviously, the discernment of matches is much less with [fnmatch]
than with [re]. The subpatterns are as follows:
#------------- Glob-style subpatterns --------------#
* Match any sequence of characters (possibly empty).
? Match any single character.
[set] Match one character from a set. A set generally
follows the same rules as a regular expression
character class. It may include zero or more ranges
and zero or more enumerated characters.
[!set] Match any one character that is not in the set.
A pattern is simply the concatenation of one or more
subpatterns.
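A few quick checks illustrate the subpatterns, using
`fnmatch.fnmatchcase()` to avoid platform-dependent case
folding:

```python
import fnmatch

assert fnmatch.fnmatchcase('this', '[Tt]?i*')         # set, ?, and *
assert not fnmatch.fnmatchcase('This', '[a-z]*')      # 'T' not in set
assert fnmatch.fnmatchcase('notes.txt', '*.txt')
assert not fnmatch.fnmatchcase('notes.txt', '[!n]*')  # negated set
```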
FUNCTIONS:
fnmatch.fnmatch(s, pat)
Test whether the pattern 'pat' matches the string 's'. On
case-insensitive filesystems, the match is case
insensitive. A cross-platform script should avoid
`fnmatch.fnmatch()` except when used to match actual
filenames.
>>> from fnmatch import fnmatch
>>> fnmatch('this', '[T]?i*') # On Unix-like system
0
>>> fnmatch('this', '[T]?i*') # On Win-like system
1
SEE ALSO, `fnmatch.fnmatchcase()`
fnmatch.fnmatchcase(s, pat)
Test whether the pattern 'pat' matches the string 's'.
The match is case-sensitive regardless of platform.
>>> from fnmatch import fnmatchcase
>>> fnmatchcase('this', '[T]?i*')
0
>>> from string import upper
>>> fnmatchcase(upper('this'), upper('[T]?i*'))
1
SEE ALSO, `fnmatch.fnmatch()`
fnmatch.filter(lst, pat)
Return a new list containing those elements of 'lst' that
match 'pat'. The matching behaves like `fnmatch.fnmatch()`
rather than like `fnmatch.fnmatchcase()`, so the results
can be OS-dependent. The example below shows a (slower)
means of performing a case-sensitive match on all
platforms.
>>> import fnmatch # Assuming Unix-like system
>>> fnmatch.filter(['This','that','other','thing'], '[Tt]?i*')
['This', 'thing']
>>> fnmatch.filter(['This','that','other','thing'], '[a-z]*')
['that', 'other', 'thing']
>>> from fnmatch import fnmatchcase # For all platforms
>>> mymatch = lambda s: fnmatchcase(s, '[a-z]*')
>>> filter(mymatch, ['This','that','other','thing'])
['that', 'other', 'thing']
For an explanation of the built-in function `filter()`, see
Appendix A.
SEE ALSO, `fnmatch.fnmatch()`, `fnmatch.fnmatchcase()`
SEE ALSO, [glob], [re]
TOPIC -- Regular Expression Modules
--------------------------------------------------------------------
=================================================================
MODULE -- pre : Pre-sre module
=================================================================
MODULE -- pcre : Underlying C module for pre
=================================================================
The Python-written module [pre], and the C-written [pcre]
module that implements the actual regular expression engine,
are the regular expression modules for Python 1.5-1.6. For
complete backwards compatibility, they continue to be included
in Python 2.0+. Importing the symbol space of [pre] is
intended to be equivalent to importing [re] (i.e., [sre] at one
level of indirection) in Python 2.0+, with the exception of the
handling of Unicode strings, which [pre] cannot do. That is,
the lines below are almost equivalent, other than potential
performance differences in specific operations:
>>> import pre as re
>>> import re
However, there is very rarely any reason to use [pre] in Python
2.0+. Anyone deciding to import [pre] should know far more
about the internals of regular expression engines than is
contained in this book. Of course, prior to Python 2.0,
importing [re] simply imported [pcre] itself (along with the
Python wrappers later renamed [pre]).
SEE ALSO, [re]
=================================================================
MODULE -- reconvert : Convert [regex] patterns to [re] patterns
=================================================================
This module exists solely for conversion of old regular
expressions from scripts written for pre-1.5 versions of
Python, or possibly from regular expression patterns used with
tools such as sed, awk, or grep. Conversions are not
guaranteed to be entirely correct, but [reconvert] provides a
starting point for a code update.
FUNCTIONS:
reconvert.convert(s)
Return as a string the modern [re]-style pattern that
corresponds to the [regex]-style pattern passed in argument
's'. For example:
>>> import reconvert
>>> reconvert.convert(r'\<\(cat\|dog\)\>')
'\\b(cat|dog)\\b'
>>> import re
>>> re.findall(r'\b(cat|dog)\b', "The dog chased a bobcat")
['dog']
SEE ALSO, [regex]
=================================================================
MODULE -- regex : Deprecated regular expression module
=================================================================
The [regex] module is distributed with recent Python versions
only to ensure strict backwards compatibility of scripts.
Starting with Python 2.1, importing [regex] will produce a
DeprecationWarning:
#*----------- Deprecation warning for regex --------------#
% python -c "import regex"
-c:1: DeprecationWarning: the regex module is deprecated;
please use the re module
For all users of Python 1.5+, [regex] should not be used in new
code, and efforts should be made to convert its usage to [re]
calls.
SEE ALSO, [reconvert]
=================================================================
MODULE -- sre : Secret Labs Regular Expression Engine
=================================================================
Support for regular expressions in Python 2.0+ is provided by
the module [sre]. The module [re] simply wraps [sre] in order
to have a backwards- and forwards-compatible name. There will
almost never be any reason to import [sre] itself; some later
version of Python might eventually deprecate [sre] also. As
with [pre], anyone deciding to import [sre] itself should know
far more about the internals of regular expression engines than
is contained in this book.
SEE ALSO, [re]
=================================================================
MODULE -- re : Regular expression operations
=================================================================
PATTERN SUMMARY:
The chart below lists regular expression patterns; following
that are explanations of each pattern. For more detailed
explanation of patterns in action, consult the tutorial and/or
problems contained in this chapter. The utility function
're_show()' defined in the tutorial is used in some
descriptions.
#----- Regular expression patterns -----#
ATOMIC OPERATORS:
Plain symbol
Any character not described below as having a special
meaning simply represents itself in the target string. An
"A" matches exactly one "A" in the target, for example.
Escape: "\"
The escape character starts a special sequence. The
special characters listed in this pattern summary must be
escaped to be treated as literal character values
(including the escape character itself). The letters "A",
"b", "B", "d", "D", "s", "S", "w", "W", and "Z" specify
special patterns if preceded by an escape. The escape
character may also introduce a backreference group with up
to two decimal digits. The escape is ignored if it
precedes a character with no special escaped meaning.
Since Python string escapes overlap regular expression
escapes, it is usually better to use raw strings for
regular expressions that potentially include escapes. For
example:
>>> from re_show import re_show
>>> re_show(r'\$ \\ \^', r'\$ \\ \^ $ \ ^')
\$ \\ \^ {$ \ ^}
>>> re_show(r'\d \w', '7 a 6 # ! C')
{7 a} 6 # ! C
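As a minimal sketch of the overlap between string escapes and regular
expression escapes, the two spellings below denote the same pattern:

```python
import re

# '\\d+' in a regular string and r'\d+' in a raw string denote
# the same regular expression
assert re.findall('\\d+', 'a12b345') == ['12', '345']
assert re.findall(r'\d+', 'a12b345') == ['12', '345']
# A literal dollar sign must be escaped within the pattern
assert re.findall(r'\$\d+', 'it costs $42') == ['$42']
```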
Grouping operators: "(", ")"
Parentheses surrounding any pattern turn that pattern into
a group (possibly within a larger pattern). Quantifiers
refer to the immediately preceding group, if one is
defined, otherwise to the preceding character or character
class. For example:
>>> from re_show import re_show
>>> re_show(r'abc+', 'abcabc abc abccc')
{abc}{abc} {abc} {abccc}
>>> re_show(r'(abc)+', 'abcabc abc abccc')
{abcabc} {abc} {abc}cc
Backreference: "\d", "\dd"
A backreference consists of the escape character followed
by one or two decimal digits. The first digit in a back
reference may not be a zero. A backreference refers to
the same string matched by an earlier group, where
the enumeration of previous groups starts with 1. For
example:
>>> from re_show import re_show
>>> re_show(r'([abc])(.*)\1', 'all the boys are coy')
{all the boys a}re coy
An attempt to reference an undefined group will raise an
error.
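A sketch of that error case (the exception raised is `re.error`):

```python
import re

# Only one group exists, so '\2' is an invalid backreference
try:
    re.compile(r'(a)\2')
    raised = False
except re.error:
    raised = True
assert raised
```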
Character classes: "[", "]"
Specify a set of characters that may occur at a position.
The list of allowable characters may be enumerated with no
delimiter. Predefined character classes, such as "\d", are
allowed within custom character classes. A range of
characters may be indicated with a dash. Multiple ranges
are allowed within a class. If a dash is meant to be
included in the character class itself, it should occur as
the first listed character. A character class may be
complemented by beginning it with a caret ("^"). If a
caret is meant to be included in the character class
itself, it should occur in a noninitial position. Most
special characters, such as "$", ".", and "(", lose their
special meaning inside a character class and are merely
treated as class members. The characters "]" and "\" should be
escaped with a backslash, however (and "-" may be escaped as an
alternative to listing it first). For
example:
>>> from re_show import re_show
>>> re_show(r'[a-fA-F]', 'A X c G')
{A} X {c} G
>>> re_show(r'[-A$BC\]]', r'A X - \ ] [ $')
{A} X {-} \ {]} [ {$}
>>> re_show(r'[^A-Fa-f]', r'A X c G')
A{ }{X}{ }c{ }{G}
Digit character class: "\d"
The set of decimal digits. Same as "[0-9]".
Non-digit character class: "\D"
The set of all characters -except- decimal digits. Same as
"[^0-9]".
Alphanumeric character class: "\w"
The set of alphanumeric characters. If re.LOCALE and
re.UNICODE modifiers are -not- set, this is the same as
[a-zA-Z0-9_]. Otherwise, the set includes any other
alphanumeric characters appropriate to the locale or with
an indicated Unicode character property of alphanumeric.
Non-alphanumeric character class: "\W"
The set of nonalphanumeric characters. If re.LOCALE and
re.UNICODE modifiers are -not- set, this is the same as
[^a-zA-Z0-9_]. Otherwise, the set includes any other
characters not indicated by the locale or Unicode character
properties as alphanumeric.
Whitespace character class: "\s"
The set of whitespace characters. Same as "[ \t\n\r\f\v]".
Non-whitespace character class: "\S"
The set of non-whitespace characters. Same as
"[^ \t\n\r\f\v]".
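For plain ASCII text (and default modifiers), the shorthand classes
above are interchangeable with their bracketed spellings, as this
sketch verifies:

```python
import re

s = 'ab 12\tC_3 #'
# Shorthand classes versus their explicit bracketed equivalents
assert re.findall(r'\d', s) == re.findall(r'[0-9]', s) == ['1', '2', '3']
assert re.findall(r'\w', s) == re.findall(r'[a-zA-Z0-9_]', s)
assert re.findall(r'\s', s) == re.findall(r'[ \t\n\r\f\v]', s)
```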
Wildcard character: "."
The period matches any single character at a position. If
the re.DOTALL modifier is specified, "." will match a
newline. Otherwise, it will match anything other than a
newline.
Beginning of line: "^"
The caret will match the beginning of the target string.
If the re.MULTILINE modifier is specified, "^" will match
the beginning of each line within the target string.
Beginning of string: "\A"
The "\A" will match the beginning of the target string.
If the re.MULTILINE modifier is -not- specified, "\A"
behaves the same as "^". But even if the modifier is
used, "\A" will match only the beginning of the entire
target.
End of line: "$"
The dollar sign will match the end of the target string.
If the re.MULTILINE modifier is specified, "$" will match
the end of each line within the target string.
End of string: "\Z"
The "\Z" will match the end of the target string. If the
re.MULTILINE modifier is -not- specified, "\Z" behaves the
same as "$". But even if the modifier is used, "\Z" will
match only the end of the entire target.
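The difference between the line-oriented and string-oriented anchors
can be sketched as:

```python
import re

text = 'one\ntwo\nthree'
# Without re.MULTILINE, '^' anchors only at the start of the target
assert re.findall(r'^\w+', text) == ['one']
# With re.MULTILINE, '^' anchors at the start of every line...
assert re.findall(r'^\w+', text, re.MULTILINE) == ['one', 'two', 'three']
# ...but '\A' still matches only the beginning of the whole target
assert re.findall(r'\A\w+', text, re.MULTILINE) == ['one']
```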
Word boundary: "\b"
The "\b" will match the beginning or end of a word (where a
word is defined as a sequence of alphanumeric characters
according to the current modifiers). Like "^" and "$",
"\b" is a zero-width match.
Non-word boundary: "\B"
The "\B" will match any position that is -not- the
beginning or end of a word (where a word is defined as a
sequence of alphanumeric characters according to the
current modifiers). Like "^" and "$", "\B" is a zero-width
match.
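A short sketch of both boundary patterns:

```python
import re

s = 'cat bobcat cats catalog'
# '\b...\b' requires word edges on both sides
assert re.findall(r'\bcat\b', s) == ['cat']
# '\B' requires the absence of a word edge (here: 'cats', 'catalog')
assert re.findall(r'\bcat\B', s) == ['cat', 'cat']
```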
Alternation operator: "|"
The pipe symbol indicates a choice of multiple atoms in a
position. Any of the atoms (including groups) separated by
a pipe will match. For example:
>>> from re_show import re_show
>>> re_show(r'A|c|G', r'A X c G')
{A} X {c} {G}
>>> re_show(r'(abc)|(xyz)', 'abc efg xyz lmn')
{abc} efg {xyz} lmn
QUANTIFIERS:
Universal quantifier: "*"
Match zero or more occurrences of the preceding atom. The
"*" quantifier is happy to match an empty string. For
example:
>>> from re_show import re_show
>>> re_show('a* ', ' a aa aaa aaaa b')
{ }{a }{aa }{aaa }{aaaa }b
Non-greedy universal quantifier: "*?"
Match zero or more occurrences of the preceding atom, but
try to match as few occurrences as allowable. For example:
>>> from re_show import re_show
>>> re_show('<.*>', '<> <tag>Text</tag>')
{<> <tag>Text</tag>}
>>> re_show('<.*?>', '<> <tag>Text</tag>')
{<>} {<tag>}Text{</tag>}
Existential quantifier: "+"
Match one or more occurrences of the preceding atom. A
pattern must actually occur in the target string to satisfy
the "+" quantifier. For example:
>>> from re_show import re_show
>>> re_show('a+ ', ' a aa aaa aaaa b')
{a }{aa }{aaa }{aaaa }b
Non-greedy existential quantifier: "+?"
Match one or more occurrences of the preceding atom, but
try to match as few occurrences as allowable. For example:
>>> from re_show import re_show
>>> re_show('<.+>', '<> <tag>Text</tag>')
{<> <tag>Text</tag>}
>>> re_show('<.+?>', '<> <tag>Text</tag>')
{<> <tag>}Text{</tag>}
Potentiality quantifier: "?"
Match zero or one occurrence of the preceding atom. The
"?" quantifier is happy to match an empty string. For
example:
>>> from re_show import re_show
>>> re_show('a? ', ' a aa aaa aaaa b')
{ }{a }a{a }aa{a }aaa{a }b
Non-greedy potentiality quantifier: "??"
Match zero or one occurrences of the preceding atom, but
match zero if possible. For example:
>>> from re_show import re_show
>>> re_show(' a?', ' a aa aaa aaaa b')
{ a}{ a}a{ a}aa{ a}aaa{ }b
>>> re_show(' a??', ' a aa aaa aaaa b')
{ }a{ }aa{ }aaa{ }aaaa{ }b
Exact numeric quantifier: "{num}"
Match exactly 'num' occurrences of the preceding atom. For
example:
>>> from re_show import re_show
>>> re_show('a{3} ', ' a aa aaa aaaa b')
a aa {aaa }a{aaa }b
Lower-bound quantifier: "{min,}"
Match -at least- 'min' occurrences of the preceding atom.
For example:
>>> from re_show import re_show
>>> re_show('a{3,} ', ' a aa aaa aaaa b')
a aa {aaa }{aaaa }b
Bounded numeric quantifier: "{min,max}"
Match -at least- 'min' and -no more than- 'max' occurrences
of the preceding atom. For example:
>>> from re_show import re_show
>>> re_show('a{2,3} ', ' a aa aaa aaaa b')
a {aa }{aaa }a{aaa }
Non-greedy bounded quantifier: "{min,max}?"
Match -at least- 'min' and -no more than- 'max' occurrences
of the preceding atom, but try to match as few occurrences
as allowable. Scanning is from the left, so a nonminimal
match may be produced in terms of right-side groupings.
For example:
>>> from re_show import re_show
>>> re_show(' a{2,4}?', ' a aa aaa aaaa b')
a{ aa}{ aa}a{ aa}aa b
>>> re_show('a{2,4}? ', ' a aa aaa aaaa b')
a {aa }{aaa }{aaaa }b
GROUP-LIKE PATTERNS:
Python regular expressions may contain a number of pseudo-group
elements that condition matches in some manner. With the
exception of named groups, pseudo-groups are not counted in
backreferencing. All pseudo-group patterns have the form
"(?...)".
Pattern modifiers: "(?Limsux)"
The pattern modifiers should occur at the very beginning of
a regular expression pattern. One or more letters in the
set "Limsux" may be included. If pattern modifiers are
given, the interpretation of the pattern is changed
globally. See the discussion of modifier constants below
or the tutorial for details.
Comments: "(?#...)"
Create a comment inside a pattern. The comment is not
enumerated in backreferences and has no effect on what is
matched. In most cases, use of the "(?x)" modifier allows
for more clearly formatted comments than does "(?#...)".
>>> from re_show import re_show
>>> re_show(r'The(?#words in caps) Cat', 'The Cat in the Hat')
{The Cat} in the Hat
Non-backreferenced atom: "(?:...)"
Match the pattern "...", but do not include the matched
string as a backreferencable group. Moreover, methods like
`re.match.group()` will not see the pattern inside a
non-backreferenced atom.
>>> from re_show import re_show
>>> re_show(r'(?:\w+) (\w+).* \1', 'abc xyz xyz abc')
{abc xyz xyz} abc
>>> re_show(r'(\w+) (\w+).* \1', 'abc xyz xyz abc')
{abc xyz xyz abc}
Positive Lookahead assertion: "(?=...)"
Match the entire pattern only if the subpattern "..."
occurs next. But do not include the target substring
matched by "..." as part of the match (however, some other
subpattern may claim the same characters, or some of them).
>>> from re_show import re_show
>>> re_show(r'\w+ (?=xyz)', 'abc xyz xyz abc')
{abc }{xyz }xyz abc
Negative Lookahead assertion: "(?!...)"
Match the entire pattern only if the subpattern "..." does
-not- occur next.
>>> from re_show import re_show
>>> re_show(r'\w+ (?!xyz)', 'abc xyz xyz abc')
abc xyz {xyz }abc
Positive Lookbehind assertion: "(?<=...)"
Match the rest of the entire pattern only if the subpattern
"..." occurs immediately prior to the current match point.
But do not include the target substring matched by "..." as
part of the match (the same characters may or may not be
claimed by some prior group(s) in the entire pattern). The
pattern "..." must match a fixed number of characters and
therefore not contain general quantifiers.
>>> from re_show import re_show
>>> re_show(r'\w+(?<=[A-Z]) ', 'Words THAT end in capS X')
Words {THAT }end in {capS }X
Negative Lookbehind assertion: "(?<!...)"
Match the rest of the entire pattern only if the subpattern
"..." does -not- occur immediately prior to the current match
point. The same characters may or may not be claimed by some
prior group(s) in the entire pattern. The pattern "..." must
match a fixed number of characters and therefore not contain
general quantifiers.
>>> from re_show import re_show
>>> re_show(r'\w+(?<![A-Z]) ', 'Words THAT end in capS X')
{Words }THAT {end }{in }capS X
Named group identifier: "(?P<name>...)"
Create a group that can be referred to by the name 'name'
as well as in enumerated backreferences. The forms below
are equivalent.
>>> from re_show import re_show
>>> re_show(r'(\w+) (\w+).* \1', 'abc xyz xyz abc')
{abc xyz xyz abc}
>>> re_show(r'(?P<first>\w+) (\w+).* (?P=first)', 'abc xyz xyz abc')
{abc xyz xyz abc}
>>> re_show(r'(?P<first>\w+) (\w+).* \1', 'abc xyz xyz abc')
{abc xyz xyz abc}
Named group backreference: "(?P=name)"
Backreference a group by the name 'name' rather than by
escaped group number. The group name must have been
defined earlier by "(?P<name>...)", or an error is raised.
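A sketch of named groups used with the standard [re] functions (the
names 'first', 'last', and 'w' are arbitrary):

```python
import re

m = re.search(r'(?P<first>\w+) (?P<last>\w+)', 'Ada Lovelace')
# A named group is still available under its number
assert m.group('first') == m.group(1) == 'Ada'
assert m.groupdict() == {'first': 'Ada', 'last': 'Lovelace'}
# Backreference by name within the pattern itself
assert re.search(r'(?P<w>\w+) (?P=w)', 'the the cat').group() == 'the the'
```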
CONSTANTS:
A number of constants are defined in the [re] modules that act
as modifiers to many [re] functions. These constants are
independent bit-values, so that multiple modifiers may be
selected by bitwise disjunction of modifiers. For example:
>>> import re
>>> c = re.compile('cat|dog', re.IGNORECASE | re.UNICODE)
re.I, re.IGNORECASE
Modifier for case-insensitive matching. Lowercase and
uppercase letters are interchangeable in patterns modified
with this modifier. The prefix '(?i)' may also be used
inside the pattern to achieve the same effect.
re.L, re.LOCALE
Modifier for locale-specific matching of '\w', '\W', '\b',
and '\B'. The prefix '(?L)' may also be used inside the
pattern to achieve the same effect.
re.M, re.MULTILINE
Modifier to make '^' and '$' match the beginning and end,
respectively, of -each- line in the target string rather
than the beginning and end of the entire target string.
The prefix '(?m)' may also be used inside the pattern to
achieve the same effect.
re.S, re.DOTALL
Modifier to allow '.' to match a newline character.
Otherwise, '.' matches every character -except- newline
characters. The prefix '(?s)' may also be used inside the
pattern to achieve the same effect.
re.U, re.UNICODE
Modifier for Unicode-property matching of '\w', '\W', '\b',
and '\B'. Only relevant for Unicode targets. The prefix
'(?u)' may also be used inside the pattern to achieve the
same effect.
re.X, re.VERBOSE
Modifier to allow patterns to contain insignificant
whitespace and end-of-line comments. Can significantly
improve readability of patterns. The prefix '(?x)' may
also be used inside the pattern to achieve the same effect.
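For instance, a sketch of a verbose pattern (the pattern itself is
invented for illustration):

```python
import re

pat = re.compile(r'''
    (\d{3})     # three-digit exchange
    -           # literal hyphen
    (\d{4})     # four-digit line number
''', re.VERBOSE)
assert pat.search('call 555-1234 today').groups() == ('555', '1234')
```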
re.engine
The regular expression engine currently in use. Only
supported in Python 2.0+, where it normally is set to the
string 'sre'. The presence and value of this constant can
be checked to make sure which underlying implementation is
running, but this check is rarely necessary.
FUNCTIONS:
For all [re] functions, where a regular expression pattern
'pattern' is an argument, 'pattern' may be either a compiled
regular expression or a string.
re.escape(s)
Return a string with all non-alphanumeric characters
escaped. This (slightly scattershot) conversion makes an
arbitrary string suitable for use in a regular expression
pattern (matching all literals in original string).
>>> import re
>>> print re.escape("(*@&^$@|")
\(\*\@\&\^\$\@\|
re.findall(pattern=..., string=...)
Return a list of all nonoverlapping occurrences of
'pattern' in 'string'. If 'pattern' consists of several
groups, return a list of tuples where each tuple contains a
match for each group. Length-zero matches are included in
the returned list, if they occur.
>>> import re
>>> re.findall(r'\b[a-z]+\d+\b', 'abc123 xyz666 lmn-11 def77')
['abc123', 'xyz666', 'def77']
>>> re.findall(r'\b([a-z]+)(\d+)\b', 'abc123 xyz666 lmn-11 def77')
[('abc', '123'), ('xyz', '666'), ('def', '77')]
SEE ALSO, `re.search()`, `mx.TextTools.findall()`
re.purge()
Clear the regular expression cache. The [re] module keeps
a cache of implicitly compiled regular expression patterns.
The number of patterns cached differs between Python
versions, with more recent versions generally keeping 100
items in the cache. When the cache space becomes full, it
is flushed automatically. You could use `re.purge()` to
tune the timing of cache flushes. However, such tuning is
approximate at best: patterns that are used repeatedly are
much better off explicitly compiled with `re.compile()` and
then used explicitly as named objects.
re.split(pattern=..., string=... [,maxsplit=0])
Return a list of substrings of the second argument 'string'.
The first argument 'pattern' is a regular expression that
delimits the substrings. If 'pattern' contains groups, the
groups are included in the resultant list. Otherwise,
those substrings that match 'pattern' are dropped, and only
the substrings between occurrences of 'pattern' are
returned.
If the third argument 'maxsplit' is specified as a positive
integer, no more than 'maxsplit' items are parsed into the
list, with any leftover contained in the final list
element.
>>> import re
>>> re.split(r'\s+', 'The Cat in the Hat')
['The', 'Cat', 'in', 'the', 'Hat']
>>> re.split(r'\s+', 'The Cat in the Hat', maxsplit=3)
['The', 'Cat', 'in', 'the Hat']
>>> re.split(r'(\s+)', 'The Cat in the Hat')
['The', ' ', 'Cat', ' ', 'in', ' ', 'the', ' ', 'Hat']
>>> re.split(r'(a)(t)', 'The Cat in the Hat')
['The C', 'a', 't', ' in the H', 'a', 't', '']
>>> re.split(r'a(t)', 'The Cat in the Hat')
['The C', 't', ' in the H', 't', '']
SEE ALSO, `string.split()`
re.sub(pattern=..., repl=..., string=... [,count=0])
Return the string produced by replacing every
nonoverlapping occurrence of the first argument 'pattern'
with the second argument 'repl' in the third argument
'string'. If the fourth argument 'count' is specified, no
more than 'count' replacements will be made.
The second argument 'repl' is most often a regular
expression pattern as a string. Backreferences to groups
matched by 'pattern' may be referred to by enumerated
backreferences using the usual escaped numbers. If
groups in 'pattern' are named, they may also be
referred to using the form "\g<name>" (where 'name' is the
name given the group in 'pattern'). As well, enumerated
backreferences may optionally be referred to using the
form "\g<num>", where 'num' is an integer between 1 and 99.
Some examples:
>>> import re
>>> s = 'abc123 xyz666 lmn-11 def77'
>>> re.sub(r'\b([a-z]+)(\d+)', r'\2\1 :', s)
'123abc : 666xyz : lmn-11 77def :'
>>> re.sub(r'\b(?P<lets>[a-z]+)(?P<nums>\d+)', r'\g<nums>\g<1> :', s)
'123abc : 666xyz : lmn-11 77def :'
>>> re.sub('A', 'X', 'AAAAAAAAAA', count=4)
'XXXXAAAAAA'
A variant manner of calling `re.sub()` uses a function
object as the second argument 'repl'. Such a callback
function should take a MatchObject as an argument and
return a string. The 'repl' function is invoked for each
match of 'pattern', and the string it returns is
substituted in the result for whatever 'pattern' matched.
For example:
>>> import re
>>> sub_cb = lambda pat: '('+`len(pat.group())`+')'+pat.group()
>>> re.sub(r'\w+', sub_cb, 'The length of each word')
'(3)The (6)length (2)of (4)each (4)word'
Of course, if 'repl' is a function object, you can take
advantage of side effects rather than (or instead of)
simply returning modified strings. For example:
>>> import re
>>> def side_effects(match):
... # Arbitrarily complicated behavior could go here...
... print len(match.group()), match.group()
... return match.group() # unchanged match
...
>>> new = re.sub(r'\w+', side_effects, 'The length of each word')
3 The
6 length
2 of
4 each
4 word
>>> new
'The length of each word'
Variants on callbacks with side effects could be turned
into complete string-driven programs (in principle, a
parser and execution environment for a whole programming
language could be contained in the callback function, for
example).
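A minimal sketch of this string-driven style: a toy template filler
whose callback looks each placeholder up in a dictionary (the
'%name%' placeholder syntax is invented for the example):

```python
import re

values = {'name': 'Ada', 'lang': 'Python'}
def fill(match):
    # Look up the captured placeholder name; default to empty string
    return values.get(match.group(1), '')

result = re.sub(r'%(\w+)%', fill, 'Hello %name%, welcome to %lang%')
assert result == 'Hello Ada, welcome to Python'
```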
SEE ALSO, `string.replace()`
re.subn(pattern=..., repl=..., string=... [,count=0])
Identical to `re.sub()`, except return a 2-tuple with the
new string and the number of replacements made.
>>> import re
>>> s = 'abc123 xyz666 lmn-11 def77'
>>> re.subn(r'\b([a-z]+)(\d+)', r'\2\1 :', s)
('123abc : 666xyz : lmn-11 77def :', 3)
SEE ALSO, `re.sub()`
CLASS FACTORIES:
As with some other Python modules, primarily ones written in C,
[re] does not contain true classes that can be specialized.
Instead, [re] has several factory-functions that return
instance objects. The practical difference is small for most
users, who will simply use the methods and attributes of
returned instances in the same manner as those produced by
true classes.
re.compile(pattern=... [,flags=...])
Return a PatternObject based on pattern string 'pattern'. If
the second argument 'flags' is specified, use the modifiers
indicated by 'flags'. A PatternObject is interchangeable
with a pattern string as an argument to [re] functions.
However, a pattern that will be used frequently within an
application should be compiled in advance to assure that it
will not need recompilation during execution. Moreover, a
compiled PatternObject has a number of methods and
attributes that achieve effects equivalent to [re]
functions, but which are somewhat more readable in some
contexts. For example:
>>> import re
>>> word = re.compile('[A-Za-z]+')
>>> word.findall('The Cat in the Hat')
['The', 'Cat', 'in', 'the', 'Hat']
>>> re.findall(word, 'The Cat in the Hat')
['The', 'Cat', 'in', 'the', 'Hat']
re.match(pattern=..., string=... [,flags=...])
Return a MatchObject if an initial substring of the second
argument 'string' matches the pattern in the first argument
'pattern'. Otherwise return None. A MatchObject, if
returned, has a variety of methods and attributes to
manipulate the matched pattern--but notably a MatchObject
is -not- itself a string.
Since `re.match()` only matches initial substrings,
`re.search()` is more general. `re.search()` can be
constrained to itself match only initial substrings by
prepending "\A" to the pattern matched.
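That relationship can be sketched as:

```python
import re

s = 'spam and eggs'
# re.match() anchors at the start; re.search() scans the target
assert re.match('eggs', s) is None
assert re.search('eggs', s) is not None
# Prepending '\A' makes re.search() behave like re.match()
assert re.search(r'\Aeggs', s) is None
assert re.search(r'\Aspam', s) is not None
```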
SEE ALSO, `re.search()`, `re.compile.match()`
re.search(pattern=..., string=... [,flags=...])
Return a MatchObject corresponding to the leftmost
substring of the second argument 'string' that matches the
pattern in the first argument 'pattern'. If no match is
possible, return None. A matched string can be of zero
length if the pattern allows that (usually not what is
actually desired). A MatchObject, if returned, has a
variety of methods and attributes to manipulate the matched
pattern--but notably a MatchObject is -not- itself a
string.
SEE ALSO, `re.match()`, `re.compile.search()`
METHODS AND ATTRIBUTES:
re.compile.findall(s)
Return a list of nonoverlapping occurrences of the
PatternObject in 's'. Same as 're.findall()' called with
the PatternObject.
SEE ALSO, `re.findall()`
re.compile.flags
The numeric sum of the flags passed to `re.compile()`
in creating the PatternObject. No formal guarantee is
given by Python as to the values assigned to modifier
flags, however. For example:
>>> import re
>>> re.I,re.L,re.M,re.S,re.X
(2, 4, 8, 16, 64)
>>> c = re.compile('a', re.I | re.M)
>>> c.flags
10
re.compile.groupindex
A dictionary mapping group names to group numbers. If no
named groups are used in the pattern, the dictionary is
empty. For example:
>>> import re
>>> c = re.compile(r'(\d+)([A-Z]+)([a-z]+)')
>>> c.groupindex
{}
>>> c = re.compile(r'(?P<nums>\d+)(?P<caps>[A-Z]+)(?P<lowers>[a-z]+)')
>>> c.groupindex
{'nums': 1, 'caps': 2, 'lowers': 3}
re.compile.match(s [,start [,end]])
Return a MatchObject if an initial substring of the first
argument 's' matches the PatternObject. Otherwise, return
None. A MatchObject, if returned, has a variety of methods
and attributes to manipulate the matched pattern--but
notably a MatchObject is -not- itself a string.
In contrast to the similar function `re.match()`, this
method accepts optional second and third arguments 'start'
and 'end' that limit the match to substring within 's'.
In most respects specifying 'start' and 'end' is similar to
taking a slice of 's' as the first argument. But when
'start' and 'end' are used, "^" will only match the true
start of 's'. For example:
>>> import re
>>> s = 'abcdefg'
>>> c = re.compile('^b')
>>> print c.match(s, 1)
None
>>> c.match(s[1:])
<SRE_Match object at 0x...>
>>> c = re.compile('.*f$')
>>> c.match(s[:-1])
<SRE_Match object at 0x...>
>>> c.match(s,1,6)
<SRE_Match object at 0x...>
SEE ALSO, `re.match()`, `re.compile.search()`
re.compile.pattern
The pattern string underlying the compiled MatchObject.
>>> import re
>>> c = re.compile('^abc$')
>>> c.pattern
'^abc$'
re.compile.search(s [,start [,end]])
Return a MatchObject corresponding to the leftmost
substring of the first argument 's' that matches the
PatternObject. If no match is possible, return None. A
matched string can be of zero length if the pattern allows
that (usually not what is actually desired). A
MatchObject, if returned, has a variety of methods and
attributes to manipulate the matched pattern--but notably a
MatchObject is -not- itself a string.
In contrast to the similar function `re.search()`, this
method accepts optional second and third arguments 'start'
and 'end' that limit the match to a substring within 's'.
In most respects specifying 'start' and 'end' is similar to
taking a slice of 's' as the first argument. But when
'start' and 'end' are used, "^" will only match the true
start of 's'. For example:
>>> import re
>>> s = 'abcdefg'
>>> c = re.compile('^b')
>>> print c.search(s, 1), c.search(s[1:])
None <SRE_Match object at 0x...>
>>> c = re.compile('.*f$')
>>> print c.search(s[:-1]), c.search(s,1,6)
<SRE_Match object at 0x...> <SRE_Match object at 0x...>
SEE ALSO, `re.search()`, `re.compile.match()`
re.compile.split(s [,maxsplit])
Return a list of substrings of the first argument 's'. If
the PatternObject contains groups, the groups are included
in the resultant list. Otherwise, those substrings that
match PatternObject are dropped, and only the substrings
between occurrences of 'pattern' are returned.
If the second argument 'maxsplit' is specified as a
positive integer, no more than 'maxsplit' items are parsed
into the list, with any leftover contained in the final
list element.
`re.compile.split()` is identical in behavior to
`re.split()`, simply spelled slightly differently. See the
documentation of the latter for examples of usage.
SEE ALSO, `re.split()`
re.compile.sub(repl, s [,count=0])
Return the string produced by replacing every
nonoverlapping occurrence of the PatternObject with the
first argument 'repl' in the second argument 's'. If
the third argument 'count' is specified, no more than
'count' replacements will be made.
The first argument 'repl' may be either a regular
expression pattern as a string or a callback function.
Backreferences may be named or enumerated.
`re.compile.sub()` is identical in behavior to `re.sub()`,
simply spelled slightly differently. See the documentation
of the latter for a number of examples of usage.
SEE ALSO, `re.sub()`, `re.compile.subn()`
re.compile.subn()
Identical to `re.compile.sub()`, except return a 2-tuple
with the new string and the number of replacements made.
`re.compile.subn()` is identical in behavior to
`re.subn()`, simply spelled slightly differently. See the
documentation of the latter for examples of usage.
SEE ALSO, `re.subn()`, `re.compile.sub()`
Note: The arguments to each "MatchObject" method are listed on
the `re.match()` line, with ellipses given on the `re.search()`
line. All arguments are identical since `re.match()` and
`re.search()` return the very same type of object.
re.match.end([group])
re.search.end(...)
The index of the end of the target substring matched by the
MatchObject. If the argument 'group' is specified, return
the ending index of that specific enumerated group.
Otherwise, return the ending index of group 0 (i.e., the
whole match). If 'group' exists but is part of an
alternation operator that is not used in the current
match, return -1. If `re.search.end()` returns the same
non-negative value as `re.search.start()`, then the group
matched a zero-width substring.
>>> import re
>>> m = re.search(r'(\w+)((\d*)| )(\w+)','The Cat in the Hat')
>>> m.groups()
('The', ' ', None, 'Cat')
>>> m.end(0), m.end(1), m.end(2), m.end(3), m.end(4)
(7, 3, 4, -1, 7)
re.match.endpos, re.search.endpos
The end position of the search. If `re.compile.search()`
specified an 'end' argument, this is that value; otherwise
it is the length of the target string. If `re.search()` or
`re.match()` is used for the search, the value is always
the length of the target string.
SEE ALSO, `re.compile.search()`, `re.search()`, `re.match()`
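A short sketch of the distinction (the positional start/end
arguments to the compiled method are spelled 'pos' and 'endpos'
in current Python releases):

```python
import re

pat = re.compile(r'\w+')
# Restrict the search to indexes 0 through 7 of the target.
m = pat.search('The Cat in the Hat', 0, 7)
print(m.group(), m.endpos)
# The 7

# With the module-level function, endpos is always len(target).
m2 = re.search(r'\w+', 'The Cat in the Hat')
print(m2.endpos)
# 18
```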
re.match.expand(template)
re.search.expand(...)
Expand backreferences and escapes in the argument 'template'
based on the patterns matched by the MatchObject. The
expansion rules are the same as for the 'repl' argument to
`re.sub()`. Any nonescaped characters may also be
included as part of the resultant string. For example:
>>> import re
>>> m = re.search(r'(\w+) (\w+)','The Cat in the Hat')
>>> m.expand(r'\g<2> : \1')
'Cat : The'
re.match.group([group [,...]])
re.search.group(...)
Return a group or groups from the MatchObject. If no
arguments are specified, return the entire matched
substring. If one argument 'group' is specified, return
the corresponding substring of the target string. If
multiple arguments 'group1, group2, ...' are specified,
return a tuple of corresponding substrings of the target.
>>> import re
>>> m = re.search(r'(\w+)(/)(\d+)','abc/123')
>>> m.group()
'abc/123'
>>> m.group(1)
'abc'
>>> m.group(1,3)
('abc', '123')
SEE ALSO, `re.search.groups()`, `re.search.groupdict()`
re.match.groupdict([defval])
re.search.groupdict(...)
Return a dictionary whose keys are the named groups in the
pattern used for the match. Enumerated but unnamed groups
are not included in the returned dictionary. The values of
the dictionary are the substrings matched by each group in
the MatchObject. If a named group is part of an
alternation operator that is not used in the current match,
the value corresponding to that key is None, or 'defval' if
an argument is specified.
>>> import re
>>> m = re.search(r'(?P<one>\w+)((?P<tab>\t)|( ))(?P<two>\d+)','abc 123')
>>> m.groupdict()
{'one': 'abc', 'tab': None, 'two': '123'}
>>> m.groupdict('---')
{'one': 'abc', 'tab': '---', 'two': '123'}
SEE ALSO, `re.search.groups()`
re.match.groups([defval])
re.search.groups(...)
Return a tuple of the substrings matched by groups in the
MatchObject. If a group is part of an alternation operator
that is not used in the current match, the tuple element at
that index is None, or 'defval' if an argument is
specified.
>>> import re
>>> m = re.search(r'(\w+)((\t)|(/))(\d+)','abc/123')
>>> m.groups()
('abc', '/', None, '/', '123')
>>> m.groups('---')
('abc', '/', '---', '/', '123')
SEE ALSO, `re.search.group()`, `re.search.groupdict()`
re.match.lastgroup, re.search.lastgroup
The name of the last matching group, or None if the last
group is not named or if no groups compose the match.
re.match.lastindex, re.search.lastindex
The index of the last matching group, or None if no groups
compose the match.
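A brief sketch covering both attributes:

```python
import re

# Group 2 matches last but is unnamed, so lastgroup is None.
m = re.search(r'(?P<word>\w+)/(\d+)', 'abc/123')
print(m.lastindex, m.lastgroup)
# 2 None

# Here the last matching group carries a name.
m2 = re.search(r'(\d+)/(?P<word>\w+)', '123/abc')
print(m2.lastindex, m2.lastgroup)
# 2 word

# With no groups at all, both attributes are None.
m3 = re.search(r'\w+', 'abc')
print(m3.lastindex, m3.lastgroup)
# None None
```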
re.match.pos, re.search.pos
The start position of the search. If `re.compile.search()`
specified a 'start' argument, this is that value; otherwise
it is 0. If `re.search()` or `re.match()` is used for the
search, the value is always 0.
SEE ALSO, `re.compile.search()`, `re.search()`, `re.match()`
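A minimal sketch (the start argument to the compiled method is
spelled 'pos' in current Python releases):

```python
import re

pat = re.compile(r'\w+')
# Begin the search at index 4 of the target string.
m = pat.search('The Cat in the Hat', 4)
print(m.group(), m.pos)
# Cat 4

# With the module-level function, pos is always 0.
m2 = re.search(r'\w+', 'The Cat in the Hat')
print(m2.pos)
# 0
```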
re.match.re, re.search.re
The PatternObject used to produce the match. The actual
regular expression pattern string must be retrieved from
the PatternObject's 'pattern' attribute:
>>> import re
>>> m = re.search('a','The Cat in the Hat')
>>> m.re.pattern
'a'
re.match.span([group])
re.search.span(...)
Return the tuple composed of the return values of
're.search.start(group)' and 're.search.end(group)'. If
the argument 'group' is not specified, it defaults to 0.
>>> import re
>>> m = re.search(r'(\w+)((\d*)| )(\w+)','The Cat in the Hat')
>>> m.groups()
('The', ' ', None, 'Cat')
>>> m.span(0), m.span(1), m.span(2), m.span(3), m.span(4)
((0, 7), (0, 3), (3, 4), (-1, -1), (4, 7))
re.match.start([group])
re.search.start(...)
The index of the start of the target substring matched by
the MatchObject. If the argument 'group' is specified,
return the starting index of that specific enumerated
group. Otherwise, return the starting index of group 0
(i.e., the whole match). If 'group' exists but is part of
an alternation operator that is not used in the current
match, return -1. If `re.search.end()` returns the same
non-negative value as `re.search.start()`, then the group
matched a zero-width substring.
>>> import re
>>> m = re.search(r'(\w+)((\d*)| )(\w+)','The Cat in the Hat')
>>> m.groups()
('The', ' ', None, 'Cat')
>>> m.start(0), m.start(1), m.start(2), m.start(3), m.start(4)
(0, 0, 3, -1, 4)
re.match.string, re.search.string
The target string in which the match occurs.
>>> import re
>>> m = re.search('a','The Cat in the Hat')
>>> m.string
'The Cat in the Hat'
EXCEPTIONS:
re.error
Exception raised when an invalid regular expression string
is passed to a function that would produce a compiled
regular expression (including implicitly).
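For instance, an unbalanced parenthesis is an invalid pattern;
compiling it, whether explicitly or implicitly via functions
like `re.search()`, raises this exception:

```python
import re

try:
    re.compile('(unbalanced')
except re.error as exc:
    caught = exc

# The exception message describes the syntax problem.
print('invalid pattern:', caught)
```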