Note: You are welcome to read this older introductory article, of course; but if you found your way here, you might be interested in reading the book I wrote with the same main title [http://gnosis.cx/TPiP/]. |
David Mertz, Ph.D.
Assistant Snake Handler, Gnosis Software, Inc.
June 2000
Python shares a strength in text processing with several popular scripting languages. Python excels as a tool for searching, modifying, and otherwise manipulating textual data. This article reviews for a programmer fist learning Python the various text processing facilities built into Python. Some general concepts of regular expressions are explained, as well as some advice given on when to use, and not to use, regular expressions in text processing tasks.
Python is a freely available, very-high-level, interpreted language developed by Guido van Rossum. It combines a clear syntax with powerful (but optional) object-oriented semantics. Python is available for almost every computer platform you might find yourself working on, and has strong portability between platforms.
As in most programming languages, strings are a basic type in Python. In common with most high-level languages (and especially scripting languages), Python strings are of indefinite length. All issues of declarations and memory allocation to hold strings (or other values) goes on "behind the scenes" where a Python programmer does not need to give much thought to it. Python also has several convenient behaviors surrounding string variables that do not exist in other high-level languages.
In Python, strings are "immutable sequences." One can refer to elements or subsequences of strings in the same manner as with any sequence. However, strings (like tuples) cannot be modified "in place." A great flexibility with Python sequences comes with the "slice" operation. In a natural-looking way (similar to a spreadsheet format), one can refer to a slice (i.e. subsequence) of a string. The below interacive session illustrates the use of strings and slicing:
>>> s = "mary had a little lamb" >>> s[0] # index is zero-based 'm' >>> s[3] = 'x' # changing element in-place fails Traceback (innermost last): File "<stdin>", line 1, in ? TypeError: object doesn't support item assignment >>> s[11:18] # 'slice' a subsequence 'little ' >>> s[:4] # empty slice-begin assumes zero 'mary' >>> s[4] # index 4 is not included in slice [:4] ' ' >>> s[5:-5] # can use "from end" index with negs 'had a little' >>> s[:5]+s[5:] # slice-begin & slice-end are complementary 'mary had a little lamb'
Another powerful operation on strings is the simple in
keyword. Two intuitive and useful constructs on strings come
with the keyword:
>>> s = "mary had a little lamb" >>> for c in s[11:18]: print c, # print each char in slice ... l i t t l e >>> if 'x' in s: print 'got x' # test for char occurence ... >>> if 'y' in s: print 'got y' # test for char occurence ... got y
There are several variations on composing string literals in Python. Single and double quotes may both be used, just so long as opening and closing tokens match. Python offers two variations on quoting that are frequently useful. Triple-quoting is often the easiest means of composing strings that contain line breaks (or contain quotes as literals), for example:
>>> s2 = """Mary had a little lamb ... its fleece was white as snow ... and everywhere that Mary went ... the lamb was sure to go""" >>> print s2 Mary had a little lamb its fleece was white as snow and everywhere that Mary went the lamb was sure to go
Either single quoted or triple-quoted strings may be preceded by the letter "r" to indicate that regular expression special characters should not be interpreted by Python. I.e.:
>>> s3 = "this \n and \n that" >>> print s3 this and that >>> s4 = r"this \n and \n that" >>> print s4 this \n and \n that
In r-strings, the backslash that might otherwise compose an escaped character in a Python string is treated as a regular backslash. See the below discussion of regular expressions to see why this is useful.
Most of the time when we talk about "text processing," what we
want to process is the content of a file. It is quite easy in
Python to pull the contents out of a text file and into string
variables (which is where they need to be for most
manipulations, at some point). File objects have three methods
related to reading: .read()
, .readline()
, .readlines()
.
Each of these may take an argument to limit the amount of data
read at one time, but the most common use is without an
argument. .read()
reads in a file's entire contents at once,
generally in the context of placing those contents into a
string variable. For sequential line-oriented processing, or
if a file is likely to be larger than available memory, don't
use this method. But use .read()
to get the most direct
string representation of a file's contents. .readline()
and
.readlines()
are very similar. They are both used in
constructs like:
fh = open('c:\\autoexec.bat') for line in fh.readlines(): print line
The difference between .readline()
and .readlines()
is that
the latter, like .read()
, reads in an entire file at once.
.readlines()
automatically parses the read contents into a
list of lines, thereby enabling the for ... in ...
construct
common in Python. Using .readline()
reads in just a single
line from a file at a time, and is generally much slower than
.readlines()
. Really the only reason to use the
.readline()
version is if you expect to read very large
files that might exceed available memory.
Sometimes one wants to "reverse" the usual process of reading (or
writing) strings from files, and instead treat strings
themselves in a file-like manner. This would usually occur in
a context where one has a high-level function (including a
number of standard modules) that wants to do something with a
file object. Fortunately, creating a "virtual file" in memory
may be easily done using the cStringIO
module (the StringIO
module can be used instead in cases where subclassing the
module is required; but a beginner is unlikely to need to do
this).
>>> import cStringIO >>> fh = cStringIO.StringIO() >>> fh.write("mary had a little lamb") >>> fh.getvalue() 'mary had a little lamb' >>> fh.seek(5) >>> fh.write('ATE') >>> fh.getvalue() 'mary ATE a little lamb'
Keep in mind, however, that a cStringIO
"virtual file",
unlike a real file, is not persistent. It will be gone when
the program completes execution if other steps are not taken to
save it (such as saving it to a real file, or using the
shelve
module, or using a database system).
string
The string
module is probably the most generally useful
module in Python 1.5.* standard distributions. In fact, it
appears that many of the facilities of the string
module will
exist as built-in methods of strings in 1.6 and above (but
those have not been released at the time of this writing).
Most certainly, any program performing text processing
tasks should probably begin with the line:
import string
A general rule-of-thumb is that if you can do a task using the
string
module, that is the right way to do it. In contrast
to re
, string
functions are generally much faster, and in
most cases they are easier to understand and maintain.
Third-party Python modules (some fast ones written in C) are
available for specialized tasks. But portability and
familiarity still suggest sticking with string
wherever
possible (which is not always, but is probably more often than
programmers coming from some other languages think is
possible).
The string
module contains several types of things. One type
of thing in string
is strings of common constants. For
example,
>>> import string >>> string.whitespace '\011\012\013\014\015 ' >>> string.uppercase 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
Although one could write these constants by hand, the string
versions more-or-less assure that the constants used will be
correct for the national language and platform the Python
script gets run on.
The next type of useful thing in string
is functions to
transform strings in common ways (and uncommon ways can
generally be composed of several common transformations). For
example:
>>> import string >>> s = "mary had a little lamb" >>> string.capwords(s) 'Mary Had A Little Lamb' >>> string.replace(s, 'little', 'ferocious') 'mary had a ferocious lamb'
There are many other tranformations that are not specifically illustrated, and the Python manuals contain details on them.
Yet another useful type of thing in string
is functions to
report features of strings without themselves returning
strings. These functions return numbers indicating various
features, e.g.:
>>> import string >>> s = "mary had a little lamb" >>> string.find(s, 'had') 5 >>> string.count(s, 'a') 4
The final type of thing in string
is a very Pythonic oddball.
The pair .split()
and .join()
provide a quick way to
convert between strings and tuples. This is useful to do
remarkably often. Usage is straightforward:
>>> import string >>> s = "mary had a little lamb" >>> L = string.split(s) >>> L ['mary', 'had', 'a', 'little', 'lamb'] >>> string.join(L, "-") 'mary-had-a-little-lamb'
Of course, in real-life usage, we would be likely to do
something else with a list besides .join()
it right back
together (probably something involving our familiar for ... in
...
construct).
re
The re
module obsoletes the regex
and regsub
modules that
you may see used in some older Python code. While there are a
few, limited advantages to regex
still, they are minor and
not worth using in new code. The obsolete modules are likely
to be dropped from future Python releases, and 1.6 is also
likely to have an interface-compatible improved re
module.
So stick with re
for regular expressions.
Regular expressions are a complicated topic. One could write a book on such a topic; in fact, a number of people have! However, this article will try to capture the "gestalt" of regular expressions, and let the reader work futher from there. A regular expression is a way of describing a pattern that might occur in a text. Do these characters occur? In this order? Are subpatterns repeated the right number of times? Do other subpatterns exclude a match? Conceptually, regular expressions are actually very close to the way one would intuitively describe a pattern in a natural language. The trick is encoding this description in the compact syntax of regular expressions.
When approaching a regular expression, treat it as its own little (or big) programming problem. Even though only one or two lines of code may be involved, those lines will effectively incorporate a small program. The first thing to start with is the smallest bits. Any regular expression, at its lowest level, will involve matching particular "character classes." The simplest character class is a single character, which is just included in the pattern as a literal. Frequently, we want to allow matching of a class of characters. One means of indicating a class is by surrounding it in square braces; within the braces both an enumeration of characters and ranges indicated with a dash may be used. There are also a number of named character classes that may be abbreviated, and that will be accurate for platform and national language. Some examples:
>>> import re >>> s = "mary had a little lamb" >>> if re.search("m", s): print "Match!" # char literal ... Match! >>> if re.search("[@A-Z]", s): print "Match!" # char class ... # match either at-sign or capital letter ... >>> if re.search("\d", s): print "Match!" # digits class ...
Character classes are "atomic" in regular expressions. Usually what we want to do in useful expressions is compose "molecules" out of different character classes. We compose larger expressions by a combination of grouping and by indicating repetition. Grouping is performed with parentheses: any subexpression contained in parentheses is treated as if it were atomic for purposes of further grouping or repetition. Repetition is indicated by one of several operators. "*" means "zero or more"; "+" means "one or more"; "?" means "zero or one". For example, look at the expression:
ABC([d-w]*\d\d?)+XYZ
For a string to match this expression, it must contain
something that starts with "ABC" and ends with "XYZ"--but what
else must it have? The subexpression in the middle is
([d-w]*\d\d?)
, and that is followed by the "one or more"
operator. So at least one thing matching the subexpression
must occur... or it could be a thousand things matching the
subexpression. So the string, "ABCXYZ" will not match, because
it does not have the requisite stuff in the middle.
Just what is the requisite middle subexpression? It must
contain zero or more letters in the range d-w
. It is
important to notice that zero letters is a valid match, which
may be counterintuitive if you use the English word "some" to
describe it. Next we must have exactly one digit; then zero
or one additional digits. The first digit character class has
no repitition operator, so it simply occurs once. The second
digit character class has the "?" operator. Overall, it
amounts to either one or two digits. Some strings matched by
the regular expression are:
ABC1234567890XYZ ABCd12e1f37g3XYZ ABC1XYZ
A few expressions not matched by the regular expression are below (try to think through why these do not match):
ABC123456789dXYZ ABCdefghijklmnopqrstuvwXYZ ABcd12e1f37g3XYZ ABC12345%67890XYZ ABCD12E1F37G3XYZ
It takes a bit of practice to get used to creating and
understanding regular expressions. But once they are mastered,
a great deal of expressive power is obtained. That said, it is
often easy to jump into using a regular expression to solve a
problem that could actually be solved using simpler (and
faster) tools, such as string
.
Friedl, Jeffrey E. F., Mastering Regular Expressions, O'Reilly, Cambridge, MA 1997 is a fairly standard and definitive reference on RegEx's.
David Mertz has been a programmer and a writer for nearly two decades; but David Mertz has only written about programming of late (and enjoys it greatly). David Mertz, in "real life," is a wayward humanities academic, lured by lucre to IT. David Mertz is fond of anaphora (and of alliteration). David may be reached at [email protected]; his life pored over at http://gnosis.cx/publish/. Suggestions and recommendations on this, past, or future, columns are welcomed.