=============================================================== NOTE: You are welcome to read this older introductory article, of course; but if you found your way here, you might be interested in reading the book I wrote with the same main title, at http://gnosis.cx/TPiP/ =============================================================== CHARMING PYTHON #5 Text Processing in Python: Tips for Beginners David Mertz, Ph.D. Assistant Snake Handler, Gnosis Software, Inc. June 2000 Python shares a strength in text processing with several popular scripting languages. Python excels as a tool for searching, modifying, and otherwise manipulating textual data. This article reviews for a programmer fist learning Python the various text processing facilities built into Python. Some general concepts of regular expressions are explained, as well as some advice given on when to use, and not to use, regular expressions in text processing tasks. WHAT IS PYTHON? ------------------------------------------------------------------------ Python is a freely available, very-high-level, interpreted language developed by Guido van Rossum. It combines a clear syntax with powerful (but optional) object-oriented semantics. Python is available for almost every computer platform you might find yourself working on, and has strong portability between platforms. LANGUAGE FEATURES ------------------------------------------------------------------------ As in most programming languages, strings are a basic type in Python. In common with most high-level languages (and especially scripting languages), Python strings are of indefinite length. All issues of declarations and memory allocation to hold strings (or other values) goes on "behind the scenes" where a Python programmer does not need to give much thought to it. Python also has several convenient behaviors surrounding string variables that do not exist in other high-level languages. In Python, strings are "immutable sequences." One can refer to elements or subsequences of strings in the same manner as with any sequence. However, strings (like tuples) cannot be modified "in place." A great flexibility with Python sequences comes with the "slice" operation. In a natural-looking way (similar to a spreadsheet format), one can refer to a slice (i.e. subsequence) of a string. The below interacive session illustrates the use of strings and slicing: #--------------- Python interactive session ------------# >>> s = "mary had a little lamb" >>> s[0] # index is zero-based 'm' >>> s[3] = 'x' # changing element in-place fails Traceback (innermost last): File "", line 1, in ? TypeError: object doesn't support item assignment >>> s[11:18] # 'slice' a subsequence 'little ' >>> s[:4] # empty slice-begin assumes zero 'mary' >>> s[4] # index 4 is not included in slice [:4] ' ' >>> s[5:-5] # can use "from end" index with negs 'had a little' >>> s[:5]+s[5:] # slice-begin & slice-end are complementary 'mary had a little lamb' Another powerful operation on strings is the simple 'in' keyword. Two intuitive and useful constructs on strings come with the keyword: #--------------- Python interactive session ------------# >>> s = "mary had a little lamb" >>> for c in s[11:18]: print c, # print each char in slice ... l i t t l e >>> if 'x' in s: print 'got x' # test for char occurence ... >>> if 'y' in s: print 'got y' # test for char occurence ... got y There are several variations on composing string literals in Python. Single and double quotes may both be used, just so long as opening and closing tokens match. Python offers two variations on quoting that are frequently useful. Triple-quoting is often the easiest means of composing strings that contain line breaks (or contain quotes as literals), for example: #--------------- Python interactive session ------------# >>> s2 = """Mary had a little lamb ... its fleece was white as snow ... and everywhere that Mary went ... the lamb was sure to go""" >>> print s2 Mary had a little lamb its fleece was white as snow and everywhere that Mary went the lamb was sure to go Either single quoted or triple-quoted strings may be preceded by the letter "r" to indicate that regular expression special characters should not be interpreted by Python. I.e.: #--------------- Python interactive session ------------# >>> s3 = "this \n and \n that" >>> print s3 this and that >>> s4 = r"this \n and \n that" >>> print s4 this \n and \n that In r-strings, the backslash that might otherwise compose an escaped character in a Python string is treated as a regular backslash. See the below discussion of regular expressions to see why this is useful. FILES AND STRING VARIABLES ------------------------------------------------------------------------ Most of the time when we talk about "text processing," what we want to process is the content of a file. It is quite easy in Python to pull the contents out of a text file and into string variables (which is where they need to be for most manipulations, at some point). File objects have three methods related to reading: '.read()', '.readline()', '.readlines()'. Each of these may take an argument to limit the amount of data read at one time, but the most common use is without an argument. '.read()' reads in a file's entire contents at once, generally in the context of placing those contents into a string variable. For sequential line-oriented processing, or if a file is likely to be larger than available memory, don't use this method. But use '.read()' to get the most direct string representation of a file's contents. '.readline()' and '.readlines()' are very similar. They are both used in constructs like: #------------- Python .readlines() example -------------# fh = open('c:\\autoexec.bat') for line in fh.readlines(): print line The difference between '.readline()' and '.readlines()' is that the latter, like '.read()', reads in an entire file at once. '.readlines()' automatically parses the read contents into a list of lines, thereby enabling the 'for ... in ...' construct common in Python. Using '.readline()' reads in just a single line from a file at a time, and is generally much slower than '.readlines()'. Really the only reason to use the '.readline()' version is if you expect to read very large files that might exceed available memory. Sometimes one wants to "reverse" the usual process of reading (or writing) strings from files, and instead treat strings themselves in a file-like manner. This would usually occur in a context where one has a high-level function (including a number of standard modules) that wants to do something with a file object. Fortunately, creating a "virtual file" in memory may be easily done using the [cStringIO] module (the [StringIO] module can be used instead in cases where subclassing the module is required; but a beginner is unlikely to need to do this). #--------------- Python interactive session ------------# >>> import cStringIO >>> fh = cStringIO.StringIO() >>> fh.write("mary had a little lamb") >>> fh.getvalue() 'mary had a little lamb' >>> fh.seek(5) >>> fh.write('ATE') >>> fh.getvalue() 'mary ATE a little lamb' Keep in mind, however, that a [cStringIO] "virtual file", unlike a real file, is not persistent. It will be gone when the program completes execution if other steps are not taken to save it (such as saving it to a real file, or using the [shelve] module, or using a database system). STANDARD MODULE [string] ------------------------------------------------------------------------ The [string] module is probably the most generally useful module in Python 1.5.* standard distributions. In fact, it appears that many of the facilities of the [string] module will exist as built-in methods of strings in 1.6 and above (but those have not been released at the time of this writing). Most certainly, any program performing text processing tasks should probably begin with the line: import string A general rule-of-thumb is that if you *can* do a task using the [string] module, that is the *right* way to do it. In contrast to [re], [string] functions are generally much faster, and in most cases they are easier to understand and maintain. Third-party Python modules (some fast ones written in C) are available for specialized tasks. But portability and familiarity still suggest sticking with [string] wherever possible (which is not always, but is probably more often than programmers coming from some other languages think is possible). The [string] module contains several types of things. One type of thing in [string] is strings of common constants. For example, #--------------- Python interactive session ------------# >>> import string >>> string.whitespace '\011\012\013\014\015 ' >>> string.uppercase 'ABCDEFGHIJKLMNOPQRSTUVWXYZ' Although one could write these constants by hand, the [string] versions more-or-less assure that the constants used will be correct for the national language and platform the Python script gets run on. The next type of useful thing in [string] is functions to transform strings in common ways (and uncommon ways can generally be composed of several common transformations). For example: #--------------- Python interactive session ------------# >>> import string >>> s = "mary had a little lamb" >>> string.capwords(s) 'Mary Had A Little Lamb' >>> string.replace(s, 'little', 'ferocious') 'mary had a ferocious lamb' There are many other tranformations that are not specifically illustrated, and the Python manuals contain details on them. Yet another useful type of thing in [string] is functions to report features of strings without themselves returning strings. These functions return numbers indicating various features, e.g.: #--------------- Python interactive session ------------# >>> import string >>> s = "mary had a little lamb" >>> string.find(s, 'had') 5 >>> string.count(s, 'a') 4 The final type of thing in [string] is a very Pythonic oddball. The pair '.split()' and '.join()' provide a quick way to convert between strings and tuples. This is useful to do remarkably often. Usage is straightforward: #--------------- Python interactive session ------------# >>> import string >>> s = "mary had a little lamb" >>> L = string.split(s) >>> L ['mary', 'had', 'a', 'little', 'lamb'] >>> string.join(L, "-") 'mary-had-a-little-lamb' Of course, in real-life usage, we would be likely to do something else with a list besides '.join()' it right back together (probably something involving our familiar 'for ... in ...' construct). STANDARD MODULE [re] ------------------------------------------------------------------------ The [re] module obsoletes the [regex] and [regsub] modules that you may see used in some older Python code. While there are a few, limited advantages to [regex] still, they are minor and not worth using in new code. The obsolete modules are likely to be dropped from future Python releases, and 1.6 is also likely to have an interface-compatible improved [re] module. So stick with [re] for regular expressions. Regular expressions are a complicated topic. One could write a book on such a topic; in fact, a number of people have! However, this article will try to capture the "gestalt" of regular expressions, and let the reader work futher from there. A regular expression is a way of describing a pattern that might occur in a text. Do these characters occur? In this order? Are subpatterns repeated the right number of times? Do other subpatterns exclude a match? Conceptually, regular expressions are actually very close to the way one would intuitively describe a pattern in a natural language. The trick is encoding this description in the compact syntax of regular expressions. When approaching a regular expression, treat it as its own little (or big) programming problem. Even though only one or two lines of code may be involved, those lines will effectively incorporate a small program. The first thing to start with is the smallest bits. Any regular expression, at its lowest level, will involve matching particular "character classes." The simplest character class is a single character, which is just included in the pattern as a literal. Frequently, we want to allow matching of a class of characters. One means of indicating a class is by surrounding it in square braces; within the braces both an enumeration of characters and ranges indicated with a dash may be used. There are also a number of named character classes that may be abbreviated, and that will be accurate for platform and national language. Some examples: #--------------- Python interactive session ------------# >>> import re >>> s = "mary had a little lamb" >>> if re.search("m", s): print "Match!" # char literal ... Match! >>> if re.search("[@A-Z]", s): print "Match!" # char class ... # match either at-sign or capital letter ... >>> if re.search("\d", s): print "Match!" # digits class ... Character classes are "atomic" in regular expressions. Usually what we want to do in useful expressions is compose "molecules" out of different character classes. We compose larger expressions by a combination of *grouping* and by indicating *repetition*. Grouping is performed with parentheses: any subexpression contained in parentheses is treated as if it were atomic for purposes of further grouping or repetition. Repetition is indicated by one of several operators. "*" means "zero or more"; "+" means "one or more"; "?" means "zero or one". For example, look at the expression: ABC([d-w]*\d\d?)+XYZ For a string to match this expression, it must contain something that starts with "ABC" and ends with "XYZ"--but what else must it have? The subexpression in the middle is '([d-w]*\d\d?)', and that is followed by the "one or more" operator. So at least one thing matching the subexpression must occur... or it could be a thousand things matching the subexpression. So the string, "ABCXYZ" will not match, because it does not have the requisite stuff in the middle. Just what is the requisite middle subexpression? It must contain *zero or more* letters in the range 'd-w'. It is important to notice that zero letters is a valid match, which may be counterintuitive if you use the English word "some" to describe it. Next we must have *exactly one* digit; then *zero or one* additional digits. The first digit character class has no repitition operator, so it simply occurs once. The second digit character class has the "?" operator. Overall, it amounts to either one or two digits. Some strings matched by the regular expression are: ABC1234567890XYZ ABCd12e1f37g3XYZ ABC1XYZ A few expressions *not* matched by the regular expression are below (try to think through why these do not match): ABC123456789dXYZ ABCdefghijklmnopqrstuvwXYZ ABcd12e1f37g3XYZ ABC12345%67890XYZ ABCD12E1F37G3XYZ It takes a bit of practice to get used to creating and understanding regular expressions. But once they are mastered, a great deal of expressive power is obtained. That said, it is often easy to jump into using a regular expression to solve a problem that could actually be solved using simpler (and faster) tools, such as [string]. RESOURCES ------------------------------------------------------------------------ Friedl, Jeffrey E. F., _Mastering Regular Expressions_, O'Reilly, Cambridge, MA 1997 is a fairly standard and definitive reference on RegEx's. ABOUT THE AUTHOR ------------------------------------------------------------------------ {Picture of Author: http://gnosis.cx/cgi-bin/img_dqm.cgi} David Mertz has been a programmer and a writer for nearly two decades; but David Mertz has only written *about* programming of late (and enjoys it greatly). David Mertz, in "real life," is a wayward humanities academic, lured by lucre to IT. David Mertz is fond of anaphora (and of alliteration). David may be reached at mertz@gnosis.cx; his life pored over at http://gnosis.cx/publish/. Suggestions and recommendations on this, past, or future, columns are welcomed.