CHAPTER II -- BASIC STRING OPERATIONS
-------------------------------------------------------------------

    The cheapest, fastest and most reliable components of a
    computer system are those that aren't there.
        --Gordon Bell, Encore Computer Corporation

If you are writing programs in Python to accomplish text processing tasks, most of what you need to know is in this chapter. Sure, you will probably need to know how to do some basic things with pipes, files, and arguments to get your text to process (covered in Chapter 1); but for actually -processing- the text you have gotten, the [string] module and string methods--and Python's basic data structures--do most of what you need done, almost all the time. To a lesser extent, the various custom modules to perform encodings, encryptions, and compressions are handy to have around (and you certainly do not want the work of implementing them yourself). But at the heart of text processing are basic transformations of bits of text. That's what [string] functions and string methods do.

There are a lot of interesting techniques elsewhere in this book. I wouldn't have written about them if I did not find them important. But be cautious before doing interesting things. Specifically, given a fixed task in mind, before cracking this book open to any of the other chapters, consider very carefully whether your problem can be solved using the techniques in this chapter. If you can answer this question affirmatively, you should usually eschew the complications of using the higher-level modules and techniques that other chapters discuss. By all means read all of this book for the insight and edification that I hope it provides; but still focus on the "Zen of Python," and prefer simple to complex when simple is enough.

This chapter does several things. Section 2.1 looks at a number of common problems in text processing that can (and should) be solved using (predominantly) the techniques documented in this chapter. Each of these "Problems" presents working solutions that can often be adopted with little change to real-life jobs. But a larger goal is to provide readers with a starting point for adaptation of the examples. It is not my goal to provide mere collections of packaged utilities and modules--plenty of those exist on the Web, and resources like the Vaults of Parnassus and the Python Cookbook are worth investigating as part of any project/task (and new and better utilities will be written between the time I write this and when you read it). It is better for readers to receive a solid foundation and starting point from which to develop the functionality they need for their own projects and tasks. And even better than spurring adaptation, these examples aim to encourage contemplation. In presenting examples, this book tries to embody a way of thinking about problems and an attitude towards solving them. More than any individual technique, such ideas are what I would most like to share with readers.

Section 2.2 is a "reference with commentary" on the Python standard library modules for doing basic text manipulations. The discussions interspersed with each module try to give some guidance on why you would want to use a given module or function, and the reference documentation tries to contain more examples of actual typical usage than does a plain reference. In many cases, the examples and discussion of individual functions address common and productive design patterns in Python.
The cross-references are intended to contextualize a given function (or other thing) in terms of related ones (and to help you decide which is right for you). The actual listing of functions, constants, classes, and the like is in alphabetical order within each type of thing.

Section 2.3 in many ways continues Section 2.1, but also provides some aids for using this book in a learning context. The problems and solutions presented in Section 2.3 are somewhat more open-ended than those in Section 2.1. As well, each section labeled as "Discussion" is followed by one labeled "Questions." These questions are ones that could be assigned by a teacher to students; but they are also intended to be issues that general readers will enjoy and benefit from contemplating. In many cases, the questions point to limitations of the approaches initially presented, and ask readers to think about ways to address or move beyond these limitations--exactly what readers need to do when writing their own custom code to accomplish outside tasks. However, each Discussion in Section 2.3 should stand on its own, even if the Questions are skipped over by the reader.

SECTION 1 -- Some Common Tasks
------------------------------------------------------------------------

PROBLEM: Quickly sorting lines on custom criteria
--------------------------------------------------------------------

Sorting is one of the real meat-and-potatoes algorithms of text processing and, in fact, of most programming. Fortunately for Python developers, the native `[].sort()` method is extraordinarily fast. Moreover, Python lists with almost any heterogeneous objects as elements can be sorted--Python cannot rely on the uniform arrays of a language like C. (An unfortunate exception to this general power was introduced in recent Python versions: comparisons of complex numbers raise a 'TypeError', so '[1+1j,2+2j].sort()' dies for the same reason; Unicode strings in lists can cause similar problems.)

SEE ALSO, [complex]

The list sort method is wonderful when you want to sort items in their "natural" order--or in the order that Python considers natural, in the case of items of varying types. Unfortunately, a lot of times, you want to sort things in "unnatural" orders. For lines of text, in particular, any order that is not simple alphabetization of the lines is "unnatural." But often text lines contain meaningful bits of information in positions other than the first character position: A last name may occur as the second word of a list of people (for example, with first name as the first word); an IP address may occur several fields into a server log file; a money total may occur at position 70 of each line; and so on. What if you want to sort lines based on this style of meaningful order that Python doesn't quite understand?

The list sort method `[].sort()` supports an optional custom comparison function argument. The job of this function is to return -1 if the first thing should come first, return 0 if the two things are equal order-wise, and return 1 if the first thing should come second. The built-in function `cmp()` does this in a manner identical to the default `[].sort()` (except in terms of speed: 'lst.sort()' is much faster than 'lst.sort(cmp)'). For short lists and quick solutions, a custom comparison function is probably the best thing. In a lot of cases, one can even get by with an in-line 'lambda' function as the custom comparison function, which is a pleasant and handy idiom.
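For instance, to sort lines on their second whitespace-delimited word, something like the following works (a quick sketch; the sample lines are invented for illustration):

>>> lines = ['Tom Woo x', 'Kevin Smith y', 'Sally Jones z']
>>> lines.sort(lambda s,t: cmp(s.split()[1], t.split()[1]))
>>> lines
['Sally Jones z', 'Kevin Smith y', 'Tom Woo x']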
When it comes to speed, however, use of custom comparison functions is fairly awful. Part of the problem is Python's function call overhead, but a lot of other factors contribute to the slowness. Fortunately, a technique called "Schwartzian Transforms" can make for much faster custom sorts. Schwartzian Transforms are named after Randal Schwartz, who proposed the technique for working with Perl; but the technique is equally applicable to Python.

The pattern involved in the Schwartzian Transform technique consists of three steps (these can more precisely be called the Guttman-Rosler Transform, which is based on the Schwartzian Transform):

  1. Transform the list in a reversible way into one that sorts "naturally."

  2. Call Python's native `[].sort()` method.

  3. Reverse the transformation in (1) to restore the original list items (in new sorted order).

The reason this technique works is that, for a list of size N, it only requires O(2N) transformation operations, which is easy to amortize over the necessary O(N log N) compare/flip operations for large lists. The sort dominates computational time, so anything that makes the sort more efficient is a win in the limit case (this limit is reached quickly).

Below is an example of a simple, but plausible, custom sorting algorithm. The sort is on the fourth and subsequent words of a list of input lines. Lines that are shorter than four words sort to the bottom. Running the test against a file with about 20,000 lines--about 1 megabyte--performed the Schwartzian Transform sort in less than 2 seconds, while taking over 12 seconds for the custom comparison function sort (outputs were verified as identical). Any number of factors will change the exact relative timings, but a better than six times gain can generally be expected.

#---------- schwartzian_sort.py ----------#
# Timing test for "sort on fourth word"
# Specifically, lines with >= 4 words will be sorted
# lexicographically on the 4th, 5th, etc. words.
# Any line with fewer than four words will be sorted to
# the end, and will occur in "natural" order.
import sys, string, time
wrerr = sys.stderr.write

# naive custom sort
def fourth_word(ln1, ln2):
    lst1 = string.split(ln1)
    lst2 = string.split(ln2)
    #-- Compare "long" lines
    if len(lst1) >= 4 and len(lst2) >= 4:
        return cmp(lst1[3:], lst2[3:])
    #-- Long lines before short lines
    elif len(lst1) >= 4 and len(lst2) < 4:
        return -1
    #-- Short lines after long lines
    elif len(lst1) < 4 and len(lst2) >= 4:
        return 1
    else:                   # Natural order
        return cmp(ln1, ln2)

# Don't count the read itself in the time
lines = open(sys.argv[1]).readlines()

# Time the custom comparison sort
start = time.time()
lines.sort(fourth_word)
end = time.time()
wrerr("Custom comparison func in %3.2f secs\n" % (end-start))
# open('tmp.custom','w').writelines(lines)

# Don't count the read itself in the time
lines = open(sys.argv[1]).readlines()

# Time the Schwartzian sort
start = time.time()
for n in range(len(lines)):     # Create the transform
    lst = string.split(lines[n])
    if len(lst) >= 4:           # Tuple w/ sort info first
        lines[n] = (lst[3:], lines[n])
    else:                       # Short lines to end
        lines[n] = (['\377'], lines[n])
lines.sort()                    # Native sort
for n in range(len(lines)):     # Restore original lines
    lines[n] = lines[n][1]
end = time.time()
wrerr("Schwartzian transform sort in %3.2f secs\n" % (end-start))
# open('tmp.schwartzian','w').writelines(lines)

Only one particular example is presented, but readers should be able to generalize this technique to any sort they need to perform frequently or on large files.
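One hedged way to do that generalization (the helper name 'dsu_sort' is my own invention, not a standard function) is to parameterize the three transform steps on a key-extraction function:

#---------- generalized DSU sketch ----------#
# A sketch only: wrap the decorate-sort-undecorate steps
# around any function that maps an item to a sort key.
def dsu_sort(lines, keyfunc):
    decorated = [(keyfunc(ln), ln) for ln in lines]   # decorate
    decorated.sort()                                  # native sort
    return [ln for (key, ln) in decorated]            # undecorate

# e.g., the "fourth word" sort above becomes:
#   sorted_lines = dsu_sort(lines,
#                      lambda ln: ln.split()[3:] or ['\377'])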
PROBLEM: Reformatting paragraphs of text
--------------------------------------------------------------------

While I mourn the decline of plaintext ASCII as a communication format--and its eclipse by unnecessarily complicated and large (and often proprietary) formats--there is still plenty of life left in text files full of prose. READMEs, HOWTOs, email, Usenet posts, and this book itself are written in plaintext (or at least something close enough to plaintext that generic processing techniques are valuable). Moreover, many formats like HTML and LaTeX are frequently enough hand-edited that their plaintext appearance is important.

One task that is extremely common when working with prose text files is reformatting paragraphs to conform to desired margins. Python 2.3 adds the module [textwrap], which performs more limited reformatting than the code below. Most of the time, this task gets done within text editors, which are indeed quite capable of performing the task. However, sometimes it would be nice to automate the formatting process. The task is simple enough that it is slightly surprising that Python has no standard module function to do this. There -is- the class `formatter.DumbWriter`, or the possibility of inheriting from and customizing `formatter.AbstractWriter`. These classes are discussed in Chapter 5; but frankly, the amount of customization and sophistication needed to use these classes and their many methods is way out of proportion for the task at hand.

Below is a simple solution that can be used either as a command-line tool (reading from STDIN and writing to STDOUT) or by import to a larger application.

#---------- reformat_para.py ----------#
# Simple paragraph reformatter.  Allows specification
# of left and right margins, and of justification style
# (using constants defined in module).

LEFT,RIGHT,CENTER = 'LEFT','RIGHT','CENTER'

def reformat_para(para='',left=0,right=72,just=LEFT):
    words = para.split()
    if not words: return ''     # guard: nothing to do for an empty paragraph
    lines = []
    line  = ''
    word = 0
    end_words = 0
    while not end_words:
        if len(words[word]) > right-left:  # Handle very long words
            line = words[word]
            word +=1
            if word >= len(words):
                end_words = 1
        else:                              # Compose line of words
            while len(line)+len(words[word]) <= right-left:
                line += words[word]+' '
                word += 1
                if word >= len(words):
                    end_words = 1
                    break
        lines.append(line)
        line = ''
    if just==CENTER:
        r, l = right, left
        return '\n'.join([' '*left+ln.center(r-l) for ln in lines])
    elif just==RIGHT:
        return '\n'.join([line.rjust(right) for line in lines])
    else: # left justify
        return '\n'.join([' '*left+line for line in lines])

if __name__=='__main__':
    import sys
    if len(sys.argv) <> 4:
        print "Please specify left_margin, right_marg, justification"
    else:
        left = int(sys.argv[1])
        right = int(sys.argv[2])
        just = sys.argv[3].upper()

        # Simplistic approach to finding initial paragraphs
        for p in sys.stdin.read().split('\n\n'):
            print reformat_para(p,left,right,just),'\n'

A number of enhancements are left to readers, if needed. You might want to allow hanging indents or indented first lines, for example. Or paragraphs meeting certain criteria might not be appropriate for wrapping (e.g., headers). A custom application might also determine the input paragraphs differently, either by a different parsing of an input file, or by generating paragraphs internally in some manner.
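And if you can rely on Python 2.3+, the [textwrap] module mentioned above covers the simplest (left-justified, zero left margin) case directly; a minimal sketch:

>>> import textwrap    # Python 2.3+
>>> para = "Mary had a little lamb, its fleece was white as snow."
>>> print textwrap.fill(para, width=30)
Mary had a little lamb, its
fleece was white as snow.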
PROBLEM: Column statistics for delimited or flat-record files
--------------------------------------------------------------------

Data feeds, DBMS dumps, log files, and flat-file databases all tend to contain ontologically similar records--one per line--with a collection of fields in each record. Usually such fields are separated either by a specified delimiter or by specific column positions where fields are to occur.

Parsing these structured text records is quite easy, and performing computations on fields is equally straightforward. But in working with a variety of such "structured text databases," it is easy to keep writing almost the same code over again for each variation in format and computation. The example below provides a generic framework for every similar computation on a structured text database.

#---------- fields_stats.py ----------#
# Perform calculations on one or more of the
# fields in a structured text database.
import operator
from types import *
from xreadlines import xreadlines   # req 2.1, but is much faster...
                                    # could use .readline() meth < 2.1
#-- Symbolic Constants
DELIMITED = 1
FLATFILE = 2

#-- Some sample "statistical" funcs (in functional programming style)
nillFunc = lambda lst: None
toFloat = lambda lst: map(float, lst)
avg_lst = lambda lst: reduce(operator.add, toFloat(lst))/len(lst)
sum_lst = lambda lst: reduce(operator.add, toFloat(lst))
max_lst = lambda lst: reduce(max, toFloat(lst))

class FieldStats:
    """Gather statistics about structured text database fields

    text_db may be either string (incl. Unicode) or file-like object
    style may be in (DELIMITED, FLATFILE)
    delimiter specifies the field separator in DELIMITED style text_db
    column_positions lists all field positions for FLATFILE style,
        using one-based indexing (first column is 1).
        E.g.: (1, 7, 40) would take fields one, two, three
        from columns 1, 7, 40 respectively.
    field_funcs is a dictionary with column positions as keys,
        and functions on lists as values.
        E.g.: {1:avg_lst, 4:sum_lst, 5:max_lst} would specify the
        average of column one, the sum of column 4, and the max
        of column 5.  All other cols--incl 2, 3, >=6--are ignored.
    """
    def __init__(self,
                 text_db='',
                 style=DELIMITED,
                 delimiter=',',
                 column_positions=(1,),
                 field_funcs={} ):
        self.text_db = text_db
        self.style = style
        self.delimiter = delimiter
        self.column_positions = column_positions
        self.field_funcs = field_funcs

    def calc(self):
        """Calculate the column statistics"""
        #-- 1st, create a list of lists for data (incl. unused flds)
        used_cols = self.field_funcs.keys()
        used_cols.sort()
        # one-based column naming: column[0] is always unused
        columns = []
        for n in range(1+used_cols[-1]):
            # hint: '[[]]*num' creates refs to same list
            columns.append([])
        #-- 2nd, fill lists used for calculated fields
        # might use a string directly for text_db
        if type(self.text_db) in (StringType, UnicodeType):
            for line in self.text_db.split('\n'):
                fields = self.splitter(line)
                for col in used_cols:
                    field = fields[col-1]       # zero-based index
                    columns[col].append(field)
        else:   # Something file-like for text_db
            for line in xreadlines(self.text_db):
                fields = self.splitter(line)
                for col in used_cols:
                    field = fields[col-1]       # zero-based index
                    columns[col].append(field)
        #-- 3rd, apply the field funcs to column lists
        results = [None] * (1+used_cols[-1])
        for col in used_cols:
            results[col] = \
                apply(self.field_funcs[col], (columns[col],))
        #-- Finally, return the result list
        return results

    def splitter(self, line):
        """Split a line into fields according to curr inst specs"""
        if self.style == DELIMITED:
            return line.split(self.delimiter)
        elif self.style == FLATFILE:
            fields = []
            # Adjust offsets to Python zero-based indexing,
            # and also add final position after the line
            num_positions = len(self.column_positions)
            offsets = [(pos-1) for pos in self.column_positions]
            offsets.append(len(line))
            for pos in range(num_positions):
                start = offsets[pos]
                end = offsets[pos+1]
                fields.append(line[start:end])
            return fields
        else:
            raise ValueError, \
                  "Text database must be DELIMITED or FLATFILE"

#-- Test data
# First Name, Last Name, Salary, Years Seniority, Department
delim = '''
Kevin,Smith,50000,5,Media Relations
Tom,Woo,30000,7,Accounting
Sally,Jones,62000,10,Management
'''.strip()     # no leading/trailing newlines

# Comment     First     Last      Salary    Years  Dept
flat = '''
tech note     Kevin     Smith     50000     5      Media Relations
more filler   Tom       Woo       30000     7      Accounting
yet more...   Sally     Jones     62000     10     Management
'''.strip()     # no leading/trailing newlines

#-- Run self-test code
if __name__ == '__main__':
    getdelim = FieldStats(delim, field_funcs={3:avg_lst, 4:max_lst})
    print 'Delimited Calculations:'
    results = getdelim.calc()
    print '  Average salary -', results[3]
    print '  Max years worked -', results[4]

    getflat = FieldStats(flat, field_funcs={3:avg_lst, 4:max_lst},
                         style=FLATFILE,
                         column_positions=(15, 25, 35, 45, 52))
    print 'Flat Calculations:'
    results = getflat.calc()
    print '  Average salary -', results[3]
    print '  Max years worked -', results[4]

The example above includes some efficiency considerations that make it a good model for working with large data sets. In the first place, class 'FieldStats' can (optionally) deal with a file-like object, rather than keeping the whole structured text database in memory. The generator `xreadlines.xreadlines()` is an extremely fast and efficient file reader, but it requires Python 2.1+--otherwise use `FILE.readline()` or `FILE.readlines()` (for either memory or speed efficiency, respectively). Moreover, only the data that is actually of interest is collected into lists, in order to save memory. However, rather than require multiple passes to collect statistics on multiple fields, as many field columns and summary functions as wanted can be used in one pass. One possible improvement would be to allow multiple summary functions against the same field during a pass. But that is left as an exercise to the reader, if she desires to do it.
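For instance, to run the same statistics over a large file on disk rather than over an in-memory string, only the constructor argument changes (the data file name here is hypothetical):

>>> from fields_stats import FieldStats, avg_lst, max_lst
>>> fp = open('salaries.txt')    # hypothetical file, delimited style
>>> stats = FieldStats(fp, field_funcs={3:avg_lst, 4:max_lst})
>>> results = stats.calc()       # a single pass over the file
>>> print results[3], results[4]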
PROBLEM: Counting characters, words, lines, and paragraphs
--------------------------------------------------------------------

There is a wonderful utility under Unix-like systems called 'wc'. What it does is so basic, and so obvious, that it is hard to imagine working without it. 'wc' simply counts the characters, words, and lines of files (or STDIN). A few command-line options control which results are displayed, but I rarely use them.

In writing this chapter, I found myself on a system without 'wc', and felt a remedy was in order. The example below is actually an "enhanced" 'wc' since it also counts paragraphs (but it lacks the command-line switches). Unlike the external 'wc', it is easy to use the technique directly within Python, and it is available anywhere Python is. The main trick--inasmuch as there is one--is a compact use of the `"".join()` and `"".split()` methods (`string.join()` and `string.split()` could also be used, for example, to be compatible with Python 1.5.2 or below).

#---------- wc.py ----------#
# Report the chars, words, lines, paragraphs
# on STDIN or in wildcard filename patterns
import sys, glob
if len(sys.argv) > 1:
    c, w, l, p = 0, 0, 0, 0
    for pat in sys.argv[1:]:
        for file in glob.glob(pat):
            s = open(file).read()
            wc = len(s), len(s.split()), \
                 len(s.split('\n')), len(s.split('\n\n'))
            print '\t'.join(map(str, wc)),'\t'+file
            c, w, l, p = c+wc[0], w+wc[1], l+wc[2], p+wc[3]
    wc = (c,w,l,p)
    print '\t'.join(map(str, wc)), '\tTOTAL'
else:
    s = sys.stdin.read()
    wc = len(s), len(s.split()), len(s.split('\n')), \
         len(s.split('\n\n'))
    print '\t'.join(map(str, wc)), '\tSTDIN'

This little functionality could be wrapped up in a function, but it is almost too compact to bother with doing so. Most of the work is in the interaction with the shell environment, with the counting basically taking only two lines.

The solution above is quite likely the "one obvious way to do it," and therefore Pythonic. On the other hand, a slightly more adventurous reader might consider this assignment (if only for fun):

>>> wc = map(len, [s]+map(s.split, (None,'\n','\n\n')))

A real daredevil might be able to reduce the entire program to a single 'print' statement.

PROBLEM: Transmitting binary data as ASCII
--------------------------------------------------------------------

Many channels require that the information that travels over them is 7-bit ASCII. Any bytes with a high-order first bit of one will be handled unpredictably when transmitting data over protocols like Simple Mail Transport Protocol (SMTP), Network News Transport Protocol (NNTP), or HTTP (depending on content encoding), or even just when displaying them in many standard tools like editors. In order to encode 8-bit binary data as ASCII, a number of techniques have been invented over time.

An obvious, but obese, encoding technique is to translate each binary byte into its hexadecimal digits. UUencoding is an older standard that developed around the need to transmit binary files over the Usenet and on BBSs. Binhex is a similar technique from the MacOS world. In recent years, base64--which is specified by RFC 1521--has edged out the other styles of encoding. All of the techniques are basically 4/3 encodings--that is, four ASCII bytes are used to represent three binary bytes--but they differ somewhat in line-ending and header conventions (as well as in the encoding as such). Quoted printable is yet another format, but of variable encoding length.
In quoted printable encoding, most plain ASCII bytes are left unchanged, but a few special characters and all high-bit bytes are escaped.

Python provides modules for all the encoding styles mentioned. The high-level wrappers [uu], [binhex], [base64], and [quopri] all operate on input and output file-like objects, encoding the data therein. They also each have slightly different method names and arguments. [binhex], for example, closes its output file after encoding, which makes it unusable in conjunction with a [cStringIO] file-like object. All of the high-level encoders utilize the services of the low-level C module [binascii]. [binascii], in turn, implements the actual low-level block conversions, but assumes that it will be passed the right size blocks for a given encoding.

The standard library, therefore, does not contain quite the right intermediate-level functionality for when the goal is just encoding the binary data in arbitrary strings. It is easy to wrap that up, though:

#---------- encode_binary.py ----------#
# Provide encoders for arbitrary binary data
# in Python strings.  Handles block size issues
# transparently, and returns a string.
# Precompression of the input string can reduce
# or eliminate any size penalty for encoding.
import sys
import zlib
import binascii

UU = 45
BASE64 = 57
BINHEX = sys.maxint

def ASCIIencode(s='', type=BASE64, compress=1):
    """ASCII encode a binary string"""
    # First, decide the encoding style
    if type == BASE64:   encode = binascii.b2a_base64
    elif type == UU:     encode = binascii.b2a_uu
    elif type == BINHEX: encode = binascii.b2a_hqx
    else: raise ValueError, "Encoding must be in UU, BASE64, BINHEX"
    # Second, compress the source if specified
    if compress: s = zlib.compress(s)
    # Third, encode the string, block-by-block
    offset = 0
    blocks = []
    while 1:
        blocks.append(encode(s[offset:offset+type]))
        offset += type
        if offset > len(s): break
    # Fourth, return the concatenated blocks
    return ''.join(blocks)

def ASCIIdecode(s='', type=BASE64, compress=1):
    """Decode ASCII to a binary string"""
    # First, decide the encoding style
    if type == BASE64:   s = binascii.a2b_base64(s)
    elif type == BINHEX: s = binascii.a2b_hqx(s)
    elif type == UU:
        s = ''.join([binascii.a2b_uu(line) for line in s.split('\n')])
    # Second, decompress the source if specified
    if compress: s = zlib.decompress(s)
    # Third, return the decoded binary string
    return s

# Encode/decode STDIN for self-test
if __name__ == '__main__':
    decode, TYPE = 0, BASE64
    for arg in sys.argv:
        if arg.lower()=='-d':       decode = 1
        elif arg.upper()=='UU':     TYPE=UU
        elif arg.upper()=='BINHEX': TYPE=BINHEX
        elif arg.upper()=='BASE64': TYPE=BASE64
    if decode:
        print ASCIIdecode(sys.stdin.read(),type=TYPE)
    else:
        print ASCIIencode(sys.stdin.read(),type=TYPE)

The example above does not attach any headers or delimit the encoded block (by design); for that, a wrapper like [uu], [mimify], or [MimeWriter] is a better choice--or a custom wrapper around 'encode_binary.py'.
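A quick round-trip check at the interactive prompt confirms the behavior (Python 2.2 and earlier display the Boolean results below as 1):

>>> from encode_binary import ASCIIencode, ASCIIdecode
>>> raw = ''.join([chr(n) for n in range(256)])  # every byte value
>>> coded = ASCIIencode(raw)         # BASE64 style, compressed
>>> max(map(ord, coded)) < 128       # 7-bit safe?
1
>>> ASCIIdecode(coded) == raw        # round trip intact?
1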
PROBLEM: Creating word or letter histograms
--------------------------------------------------------------------

A histogram is an analysis of the relative occurrence frequency of each of a number of possible values. In terms of text processing, the occurrences in question are almost always either words or byte values. Creating histograms is quite simple using Python dictionaries, but the technique is not always immediately obvious to people thinking about it. The example below has a good generality, provides several utility functions associated with histograms, and can be used in a command-line operation mode.

#---------- histogram.py ----------#
# Create occurrence counts of words or characters
# A few utility functions for presenting results
# Avoids requirement of recent Python features
from string import split, maketrans, translate, punctuation, digits
import sys
from types import *

def word_histogram(source):
    """Create histogram of normalized words (no punct or digits)"""
    hist = {}
    trans = maketrans('','')
    if type(source) in (StringType,UnicodeType): # String-like src
        for word in split(source):
            word = translate(word, trans, punctuation+digits)
            if len(word) > 0:
                hist[word] = hist.get(word,0) + 1
    elif hasattr(source,'read'):                 # File-like src
        try:
            from xreadlines import xreadlines    # Check for module
            for line in xreadlines(source):
                for word in split(line):
                    word = translate(word, trans, punctuation+digits)
                    if len(word) > 0:
                        hist[word] = hist.get(word,0) + 1
        except ImportError:                      # Older Python ver
            line = source.readline()     # Slow but mem-friendly
            while line:
                for word in split(line):
                    word = translate(word, trans, punctuation+digits)
                    if len(word) > 0:
                        hist[word] = hist.get(word,0) + 1
                line = source.readline()
    else:
        raise TypeError, \
              "source must be a string-like or file-like object"
    return hist

def char_histogram(source, sizehint=1024*1024):
    hist = {}
    if type(source) in (StringType,UnicodeType): # String-like src
        for char in source:
            hist[char] = hist.get(char,0) + 1
    elif hasattr(source,'read'):                 # File-like src
        chunk = source.read(sizehint)
        while chunk:
            for char in chunk:
                hist[char] = hist.get(char,0) + 1
            chunk = source.read(sizehint)
    else:
        raise TypeError, \
              "source must be a string-like or file-like object"
    return hist

def most_common(hist, num=1):
    pairs = []
    for pair in hist.items():
        pairs.append((pair[1],pair[0]))
    pairs.sort()
    pairs.reverse()
    return pairs[:num]

def first_things(hist, num=1):
    pairs = []
    things = hist.keys()
    things.sort()
    for thing in things:
        pairs.append((thing,hist[thing]))
    pairs.sort()
    return pairs[:num]

if __name__ == '__main__':
    if len(sys.argv) > 1:
        hist = word_histogram(open(sys.argv[1]))
    else:
        hist = word_histogram(sys.stdin)

    print "Ten most common words:"
    for pair in most_common(hist, 10):
        print '\t', pair[1], pair[0]

    print "First ten words alphabetically:"
    for pair in first_things(hist, 10):
        print '\t', pair[0], pair[1]

    # a more practical command-line version might use:
    # for pair in most_common(hist,len(hist)):
    #     print pair[1],'\t',pair[0]

Several of the design choices are somewhat arbitrary. Words have all their punctuation stripped to identify "real" words. But on the other hand, words are still case-sensitive, which may not be what is desired. The sorting functions 'first_things()' and 'most_common()' only return an initial sublist. Perhaps it would be better to return the whole list, and let the user slice the result. It is simple to customize around these sorts of issues, though.
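For example, if case-insensitive counting is wanted for string sources, a small wrapper (a sketch only; the helper name is my own) can normalize before counting:

>>> from histogram import word_histogram, most_common
>>> def word_histogram_ci(source):
...     # hypothetical helper: fold case before counting
...     return word_histogram(source.lower())
...
>>> hist = word_histogram_ci("The lamb the Lamb THE LAMB")
>>> most_common(hist, 2)
[(3, 'the'), (3, 'lamb')]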
PROBLEM: Reading a file backwards by record, line, or paragraph
--------------------------------------------------------------------

Reading a file line by line is a common task in Python, or in most any language. Files like server logs, configuration files, structured text databases, and others frequently arrange information into logical records, one per line. Very often, the job of a program is to perform some calculation on each record in turn.

Python provides a number of convenient methods on file-like objects for such line-by-line reading. `FILE.readlines()` reads a whole file at once and returns a list of lines. The technique is very fast, but requires the whole contents of the file be kept in memory. For very large files, this can be a problem. `FILE.readline()` is memory-friendly--it just reads a line at a time and can be called repeatedly until the EOF is reached--but it is also much slower. The best solution for recent Python versions is `xreadlines.xreadlines()` or `FILE.xreadlines()` in Python 2.1+. These techniques are memory-friendly, while still being fast and presenting a "virtual list" of lines (by way of Python's new generator/iterator interface).

The above techniques work nicely for reading a file in its natural order, but what if you want to start at the end of a file and work backwards from there? This need is frequently encountered when you want to read log files that have records appended over time (and when you want to look at the most recent records first). It comes up in other situations also. There is a very easy technique if memory usage is not an issue:

>>> open('lines','w').write('\n'.join([`n` for n in range(100)]))
>>> fp = open('lines')
>>> lines = fp.readlines()
>>> lines.reverse()
>>> for line in lines[1:5]:
...     # Processing suite here
...     print line,
...
98
97
96
95

For large input files, however, this technique is not feasible. It would be nice to have something analogous to [xreadlines] here. The example below provides a good starting point (the example works equally well for file-like objects).

#---------- read_backwards.py ----------#
# Read blocks of a file from end to beginning.
# Blocks may be defined by any delimiter, but the
# constants LINE and PARA are useful ones.
# Works much like the file object method '.readline()':
# repeated calls continue to get "next" part, and
# function returns empty string once BOF is reached.

# Define constants
from os import linesep
LINE = linesep
PARA = linesep*2
READSIZE = 1000

# Global variables
buffer = ''

def read_backwards(fp, mode=LINE, sizehint=READSIZE, _init=[0]):
    """Read blocks of file backwards (return empty string when done)"""
    # Trick of mutable default argument to hold state between calls
    if not _init[0]:
        fp.seek(0,2)
        _init[0] = 1
    # Find a block (using global buffer)
    global buffer
    while 1:
        # first check for block in buffer
        delim = buffer.rfind(mode)
        if delim <> -1:         # block is in buffer, return it
            block = buffer[delim+len(mode):]
            buffer = buffer[:delim]
            return block+mode
        #-- BOF reached, return remainder (or empty string)
        elif fp.tell()==0:
            block = buffer
            buffer = ''
            return block
        else:                   # Read some more data into the buffer
            readsize = min(fp.tell(),sizehint)
            fp.seek(-readsize,1)
            buffer = fp.read(readsize) + buffer
            fp.seek(-readsize,1)

#-- Self test of read_backwards()
if __name__ == '__main__':
    # Let's create a test file to read in backwards
    fp = open('lines','wb')
    fp.write(LINE.join(['--- %d ---'%n for n in range(15)]))
    fp.close()      # be sure the data is flushed before reopening
    # Now open for reading backwards
    fp = open('lines','rb')
    # Read the blocks in, one per call (block==line by default)
    block = read_backwards(fp)
    while block:
        print block,
        block = read_backwards(fp)

Notice that -anything- could serve as a block delimiter. The constants provided just happen to work for lines and block paragraphs (and block paragraphs only with the current OS's style of line breaks). But other delimiters could be used.
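For example, nothing stops you from passing a delimiter of your own (the '%%'-separated 'records' file and the 'process()' function here are hypothetical):

>>> from read_backwards import read_backwards
>>> fp = open('records','rb')       # hypothetical '%%'-delimited file
>>> rec = read_backwards(fp, mode='%%')
>>> while rec:
...     process(rec)                # hypothetical per-record handler
...     rec = read_backwards(fp, mode='%%')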
It would -not- be immediately possible to read backwards word-by-word--a space delimiter would come close, but would not be quite right for other whitespace. However, reading a line (and maybe reversing its words) is generally good enough.

Another enhancement is possible with Python 2.2+. Using the new 'yield' keyword, 'read_backwards()' could be programmed as an iterator rather than as a multi-call function. The performance will not differ significantly, but the function might be expressed more clearly (and a "list-like" interface like `FILE.readlines()` makes the application's loop simpler).

QUESTIONS:

  1. Write a generator-based version of 'read_backwards()' that uses the 'yield' keyword. Modify the self-test code to utilize the generator instead.

  2. Explore and explain some pitfalls with the use of a mutable default value as a function argument. Explain also how the style allows functions to encapsulate data, and contrast it with the encapsulation of class instances.

SECTION 2 -- Standard Modules
------------------------------------------------------------------------

TOPIC -- Basic String Transformations
--------------------------------------------------------------------

The module [string] forms the core of Python's text manipulation libraries. That module is certainly the place to look before other modules. Note that most of the functions in the [string] module have been copied to methods of string objects from Python 1.6 onwards. Moreover, methods of string objects are a little bit faster to use than are the corresponding module functions. A few new methods of string objects do not have equivalents in the [string] module, but are still documented here.

SEE ALSO, [str], [UserString]

=================================================================
MODULE -- string : A collection of string operations
=================================================================

There are a number of general things to notice about the functions in the [string] module (which is composed entirely of functions and constants; no classes).

1. Strings are immutable (as discussed in Chapter 1). This means that there is no such thing as changing a string "in place" (as we might do in many other languages, such as C, by changing the bytes at certain offsets within the string). Whenever a [string] module function takes a string object as an argument, it returns a brand-new string object and leaves the original one as is. However, the very common pattern of binding the same name on the left of an assignment as was passed on the right side within the [string] module function somewhat conceals this fact. For example:

    >>> import string
    >>> str = "Mary had a little lamb"
    >>> str = string.replace(str, 'had', 'ate')
    >>> str
    'Mary ate a little lamb'

The first string object never gets modified per se; but since the first string object is no longer bound to any name after the example runs, the object is subject to garbage collection and will disappear from memory. In short, calling a [string] module function will not change any existing strings, but rebinding a name can make it look like they changed.

2. Many [string] module functions are now also available as string object methods. To use these string object methods, there is no need to import the [string] module, and the expression is usually slightly more concise. Moreover, using a string object method is usually slightly faster than the corresponding [string] module function.
However, the most thorough documentation of each function/method that exists as both a [string] module function and a string object method is contained in this reference to the [string] module.

3. The form 'string.join(string.split(...))' is a frequent Python idiom. A more thorough discussion is contained in the reference items for `string.join()` and `string.split()`, but in general, combining these two functions is very often a useful way of breaking down a text, processing the parts, then putting together the pieces.

4. Think about clever `string.replace()` patterns. By combining multiple `string.replace()` calls with use of "place holder" string patterns, a surprising range of results can be achieved (especially when also manipulating the intermediate strings with other techniques). See the reference item for `string.replace()` for some discussion and examples.

5. A mutable string of sorts can be obtained by using built-in lists, or the [array] module. Lists can contain a collection of substrings, each one of which may be replaced or modified individually. The [array] module can define arrays of individual characters, each position modifiable, including with slice notation. The function `string.join()` or the method `"".join()` may be used to re-create true strings; for example:

    >>> lst = ['spam','and','eggs']
    >>> lst[2] = 'toast'
    >>> print ''.join(lst)
    spamandtoast
    >>> print ' '.join(lst)
    spam and toast

Or:

    >>> import array
    >>> a = array.array('c','spam and eggs')
    >>> print ''.join(a)
    spam and eggs
    >>> a[0] = 'S'
    >>> print ''.join(a)
    Spam and eggs
    >>> a[-4:] = array.array('c','toast')
    >>> print ''.join(a)
    Spam and toast

CONSTANTS:

The [string] module contains constants for a number of frequently used collections of characters. Each of these constants is itself simply a string (rather than a list, tuple, or other collection). As such, it is easy to define constants alongside those provided by the [string] module, should you need them. For example:

    >>> import string
    >>> string.brackets = "[]{}()<>"
    >>> print string.brackets
    []{}()<>

string.digits
    The decimal numerals ("0123456789").

string.hexdigits
    The hexadecimal numerals ("0123456789abcdefABCDEF").

string.octdigits
    The octal numerals ("01234567").

string.lowercase
    The lowercase letters; can vary by language. In English versions of Python (most systems):

    >>> import string
    >>> string.lowercase
    'abcdefghijklmnopqrstuvwxyz'

    You should not modify `string.lowercase` for a source text language, but rather define a new attribute, such as 'string.spanish_lowercase', with an appropriate string (some methods depend on this constant).

string.uppercase
    The uppercase letters; can vary by language. In English versions of Python (most systems):

    >>> import string
    >>> string.uppercase
    'ABCDEFGHIJKLMNOPQRSTUVWXYZ'

    You should not modify `string.uppercase` for a source text language, but rather define a new attribute, such as 'string.spanish_uppercase', with an appropriate string (some methods depend on this constant).

string.letters
    All the letters (string.lowercase+string.uppercase).

string.punctuation
    The characters normally considered as punctuation; can vary by language. In English versions of Python (most systems):

    >>> import string
    >>> string.punctuation
    '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

string.whitespace
    The "empty" characters.
    Normally these consist of tab, linefeed, vertical tab, formfeed, carriage return, and space (in that order):

    >>> import string
    >>> string.whitespace
    '\011\012\013\014\015 '

    You should not modify `string.whitespace` (some methods depend on this constant).

string.printable
    All the characters that can be printed to any device; can vary by language (string.digits+string.letters+string.punctuation+string.whitespace).

FUNCTIONS:

string.atof(s=...)
    Deprecated. Use `float()`. Converts a string to a floating point value.

    SEE ALSO, `eval()`, `float()`

string.atoi(s=... [,base=10])
    Deprecated with Python 2.0. Use `int()` if no custom base is needed or if using Python 2.0+. Converts a string to an integer value (if the string should be assumed to be in a base other than 10, the base may be specified as the second argument).

    SEE ALSO, `eval()`, `int()`, `long()`

string.atol(s=... [,base=10])
    Deprecated with Python 2.0. Use `long()` if no custom base is needed or if using Python 2.0+. Converts a string to an unlimited length integer value (if the string should be assumed to be in a base other than 10, the base may be specified as the second argument).

    SEE ALSO, `eval()`, `long()`, `int()`

string.capitalize(s=...)
"".capitalize()
    Return a string consisting of the initial character converted to uppercase (if applicable), and all other characters converted to lowercase (if applicable):

    >>> import string
    >>> string.capitalize("mary had a little lamb!")
    'Mary had a little lamb!'
    >>> string.capitalize("Mary had a Little Lamb!")
    'Mary had a little lamb!'
    >>> string.capitalize("2 Lambs had Mary!")
    '2 lambs had mary!'

    For Python 1.6+, use of a string object method is marginally faster and is stylistically preferred in most cases:

    >>> "mary had a little lamb".capitalize()
    'Mary had a little lamb'

    SEE ALSO, `string.capwords()`, `string.lower()`

string.capwords(s=...)
"".title()
    Return a string consisting of the capitalized words. An equivalent expression is:

      #*----- equivalent expression -----#
      string.join(map(string.capitalize, string.split(s)))

    But `string.capwords()` is a clearer way of writing it. An effect of this implementation is that whitespace is "normalized" by the process:

    >>> import string
    >>> string.capwords("mary HAD a little lamb!")
    'Mary Had A Little Lamb!'
    >>> string.capwords("Mary had a Little Lamb!")
    'Mary Had A Little Lamb!'

    With the creation of string methods in Python 1.6, the module function `string.capwords()` was renamed as a string method to `"".title()`.

    SEE ALSO, `string.capitalize()`, `string.lower()`, `"".istitle()`

string.center(s=..., width=...)
"".center(width)
    Return a string with 's' padded with symmetrical leading and trailing spaces (but not truncated) to occupy length 'width' (or more).

    >>> import string
    >>> string.center(width=30, s="Mary had a little lamb")
    '    Mary had a little lamb    '
    >>> string.center("Mary had a little lamb", 5)
    'Mary had a little lamb'

    For Python 1.6+, use of a string object method is stylistically preferred in many cases:

    >>> "Mary had a little lamb".center(25)
    '  Mary had a little lamb '

    SEE ALSO, `string.ljust()`, `string.rjust()`

string.count(s, sub [,start [,end]])
"".count(sub [,start [,end]])
    Return the number of nonoverlapping occurrences of 'sub' in 's'. If the optional third or fourth arguments are specified, only the corresponding slice of 's' is examined.
    >>> import string
    >>> string.count("mary had a little lamb", "a")
    4
    >>> string.count("mary had a little lamb", "a", 3, 10)
    2

    For Python 1.6+, use of a string object method is stylistically preferred in many cases:

    >>> 'mary had a little lamb'.count("a")
    4

"".endswith(suffix [,start [,end]])
    This string method does not have an equivalent in the [string] module. Return a Boolean value indicating whether the string ends with the suffix 'suffix'. If the optional second argument 'start' is specified, only consider the terminal substring after offset 'start'. If the optional third argument 'end' is given, only consider the slice '[start:end]'.

    SEE ALSO, `"".startswith()`, `string.find()`

string.expandtabs(s=... [,tabsize=8])
"".expandtabs([tabsize=8])
    Return a string with tabs replaced by a variable number of spaces. The replacement causes text blocks to line up at "tab stops." If no second argument is given, the new string will line up at multiples of 8 spaces. A newline implies a new set of tab stops.

    >>> import string
    >>> s = 'mary\011had a little lamb'
    >>> print s
    mary    had a little lamb
    >>> string.expandtabs(s, 16)
    'mary            had a little lamb'
    >>> string.expandtabs(tabsize=1, s=s)
    'mary had a little lamb'

    For Python 1.6+, use of a string object method is stylistically preferred in many cases:

    >>> 'mary\011had a little lamb'.expandtabs(25)
    'mary                     had a little lamb'

string.find(s, sub [,start [,end]])
"".find(sub [,start [,end]])
    Return the index position of the first occurrence of 'sub' in 's'. If the optional third or fourth arguments are specified, only the corresponding slice of 's' is examined (but the result is a position in 's' as a whole). Return -1 if no occurrence is found. Position is zero-based, as with Python list indexing:

    >>> import string
    >>> string.find("mary had a little lamb", "a")
    1
    >>> string.find("mary had a little lamb", "a", 3, 10)
    6
    >>> string.find("mary had a little lamb", "b")
    21
    >>> string.find("mary had a little lamb", "b", 3, 10)
    -1

    For Python 1.6+, use of a string object method is stylistically preferred in many cases:

    >>> 'mary had a little lamb'.find("ad")
    6

    SEE ALSO, `string.index()`, `string.rfind()`

string.index(s, sub [,start [,end]])
"".index(sub [,start [,end]])
    Return the same value as does `string.find()` with the same arguments, except raise 'ValueError' instead of returning -1 when 'sub' does not occur in 's'.

    >>> import string
    >>> string.index("mary had a little lamb", "b")
    21
    >>> string.index("mary had a little lamb", "b", 3, 10)
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
      File "d:/py20sl/lib/string.py", line 139, in index
        return s.index(*args)
    ValueError: substring not found in string.index

    For Python 1.6+, use of a string object method is stylistically preferred in many cases:

    >>> 'mary had a little lamb'.index("ad")
    6

    SEE ALSO, `string.find()`, `string.rindex()`

Several string methods return Boolean values indicating whether a string has a certain property. None of the '.is*()' methods, however, have equivalents in the [string] module:

"".isalpha()
    Return a true value if all the characters are alphabetic.

"".isalnum()
    Return a true value if all the characters are alphanumeric.

"".isdigit()
    Return a true value if all the characters are digits.

"".islower()
    Return a true value if all the characters are lowercase and there is at least one cased character:

    >>> "ab123".islower(), '123'.islower(), 'Ab123'.islower()
    (1, 0, 0)

    SEE ALSO, `"".lower()`

"".isspace()
    Return a true value if all the characters are whitespace.

"".istitle()
    Return a true value if the string has title casing (each word capitalized).

    SEE ALSO, `"".title()`

"".isupper()
    Return a true value if all the characters are uppercase and there is at least one cased character.

    SEE ALSO, `"".upper()`
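A quick interactive session illustrates several of these predicates together (Python 2.2 and earlier display the results as integers, as shown):

    >>> "Lamb123".isalnum(), "Lamb!".isalnum()
    (1, 0)
    >>> "MARY".isupper(), "Mary Had A Lamb".istitle()
    (1, 1)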
"".istitle() Return a true value if all the string has title casing (each word capitalized). SEE ALSO, `"".title()` "".isupper() Return a true value if all the characters are uppercase and there is at least one cased character. SEE ALSO, `"".upper()` string.join(words=... [,sep=" "]) "".join(words) Return a string that results from concatenating the elements of the list 'words' together, with 'sep' between each. The function `string.join()` differs from all other [string] module functions in that it takes a list (of strings) as a primary argument, rather than a string. It is worth noting `string.join()` and `string.split()` are inverse functions if 'sep' is specified to both; in other words, 'string.join(string.split(s,sep),sep)==s' for all 's' and 'sep'. Typically, `string.join()` is used in contexts where it is natural to generate lists of strings. For example, here is a small program to output the list of all-capital words from STDIN to STDOUT, one per line: #---------- list_capwords.py ----------# import string,sys capwords = [] #*--- fix linebreak ---# for line in sys.stdin.readlines(): for word in line.split(): if word == word.upper() and word.isalpha(): capwords.append(word) print string.join(capwords, '\n') The technique in the sample 'list_capwords.py' script can be considerably more efficient than building up a string by direct concatenation. However, Python 2.0's augmented assignment reduces the performance difference: >>> import string >>> s = "Mary had a little lamb" >>> t = "its fleece was white as snow" >>> s = s +" "+ t # relatively "expensive" for big strings >>> s += " " + t # "cheaper" than Python 1.x style >>> lst = [s] >>> lst.append(t) # "cheapest" way of building long string >>> s = string.join(lst) For Python 1.6+, use of a string object method is stylistically preferred in some cases. However, just as `string.join()` is special in taking a list as a first argument, the string object method `"".join()` is unusual in being an operation on the (optional) 'sep' string, not on the (required) 'words' list (this surprises many new Python programmers). SEE ALSO, `string.split()` string.joinfields(...) Identical to `string.join()`. string.ljust(s=..., width=...) "".ljust(width) Return a string with 's' padded with trailing spaces (but not truncated) to occupy length 'width' (or more). >>> import string >>> string.ljust(width=30,s="Mary had a little lamb") 'Mary had a little lamb ' >>> string.ljust("Mary had a little lamb", 5) 'Mary had a little lamb' For Python 1.6+, use of a string object method is stylistically preferred in many cases: >>> "Mary had a little lamb".ljust(25) 'Mary had a little lamb ' SEE ALSO, `string.rjust()`, `string.center()` string.lower(s=...) "".lower() Return a string with any uppercase letters converted to lowercase. >>> import string >>> string.lower("mary HAD a little lamb!") 'mary had a little lamb!' >>> string.lower("Mary had a Little Lamb!") 'mary had a little lamb!' For Python 1.6+, use of a string object method is stylistically preferred in many cases: >>> "Mary had a Little Lamb!".lower() 'mary had a little lamb!' SEE ALSO, `string.upper()` string.lstrip(s=...) "".lstrip([chars=string.whitespace]) Return a string with leading whitespace characters removed. For Python 1.6+, use of a string object method is stylistically preferred in many cases: >>> import string >>> s = """ ... 
    Python 2.3+ accepts the optional argument 'chars' to the string object method. All characters in the string 'chars' will be removed.

    SEE ALSO, `string.rstrip()`, `string.strip()`

string.maketrans(from, to)
    Return a translation table string, for use with `string.translate()`. The strings 'from' and 'to' must be the same length. A translation table is a string of 256 successive byte values, where each position defines a translation from the `chr()` value of the index to the character contained at that index position.

    >>> import string
    >>> ord('A')
    65
    >>> ord('z')
    122
    >>> string.maketrans('ABC','abc')[65:123]
    'abcDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz'
    >>> string.maketrans('ABCxyz','abcXYZ')[65:123]
    'abcDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwXYZ'

    SEE ALSO, `string.translate()`

string.replace(s=..., old=..., new=... [,maxsplit=...])
"".replace(old, new [,maxsplit])
    Return a string based on 's' with occurrences of 'old' replaced by 'new'. If the fourth argument 'maxsplit' is specified, only replace 'maxsplit' initial occurrences.

    >>> import string
    >>> string.replace("Mary had a little lamb", "a little", "some")
    'Mary had some lamb'

    For Python 1.6+, use of a string object method is stylistically preferred in many cases:

    >>> "Mary had a little lamb".replace("a little", "some")
    'Mary had some lamb'

    A common "trick" involving `string.replace()` is to use it multiple times to achieve a goal. Obviously, simply to replace several different substrings in a string, multiple `string.replace()` operations are almost inevitable. But there is another class of cases where `string.replace()` can be used to create an intermediate string with "placeholders" for an original substring in a particular context. The same goal can always be achieved with regular expressions, but sometimes staged `string.replace()` operations are both faster and easier to program:

    >>> import string
    >>> line = 'variable = val # see comments #3 and #4'
    >>> # we'd like '#3' and '#4' spelled out within comment
    >>> string.replace(line, '#', 'number ')   # doesn't work
    'variable = val number  see comments number 3 and number 4'
    >>> place_holder = string.replace(line, ' # ', ' !!! ')  # insert placeholder
    >>> place_holder
    'variable = val !!! see comments #3 and #4'
    >>> place_holder = place_holder.replace('#', 'number ')  # almost there
    >>> place_holder
    'variable = val !!! see comments number 3 and number 4'
    >>> line = string.replace(place_holder, '!!!', '#')      # restore orig
    >>> line
    'variable = val # see comments number 3 and number 4'

    Obviously, for jobs like this, a placeholder must be chosen so as not ever to occur within the strings undergoing "staged transformation"; but that should be possible generally, since placeholders may be as long as needed.

    SEE ALSO, `string.translate()`, `mx.TextTools.replace()`

string.rfind(s, sub [,start [,end]])
"".rfind(sub [,start [,end]])
    Return the index position of the last occurrence of 'sub' in 's'. If the optional third or fourth arguments are specified, only the corresponding slice of 's' is examined (but the result is a position in 's' as a whole). Return -1 if no occurrence is found.
    Position is zero-based, as with Python list indexing:

    >>> import string
    >>> string.rfind("mary had a little lamb", "a")
    19
    >>> string.rfind("mary had a little lamb", "a", 3, 10)
    9
    >>> string.rfind("mary had a little lamb", "b")
    21
    >>> string.rfind("mary had a little lamb", "b", 3, 10)
    -1

    For Python 1.6+, use of a string object method is stylistically preferred in many cases:

    >>> 'mary had a little lamb'.rfind("ad")
    6

    SEE ALSO, `string.rindex()`, `string.find()`

string.rindex(s, sub [,start [,end]])
"".rindex(sub [,start [,end]])
    Return the same value as does `string.rfind()` with the same arguments, except raise 'ValueError' instead of returning -1 when 'sub' does not occur in 's'.

    >>> import string
    >>> string.rindex("mary had a little lamb", "b")
    21
    >>> string.rindex("mary had a little lamb", "b", 3, 10)
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
      File "d:/py20sl/lib/string.py", line 148, in rindex
        return s.rindex(*args)
    ValueError: substring not found in string.rindex

    For Python 1.6+, use of a string object method is stylistically preferred in many cases:

    >>> 'mary had a little lamb'.rindex("ad")
    6

    SEE ALSO, `string.rfind()`, `string.index()`

string.rjust(s=..., width=...)
"".rjust(width)
    Return a string with 's' padded with leading spaces (but not truncated) to occupy length 'width' (or more).

    >>> import string
    >>> string.rjust(width=30, s="Mary had a little lamb")
    '        Mary had a little lamb'
    >>> string.rjust("Mary had a little lamb", 5)
    'Mary had a little lamb'

    For Python 1.6+, use of a string object method is stylistically preferred in many cases:

    >>> "Mary had a little lamb".rjust(25)
    '   Mary had a little lamb'

    SEE ALSO, `string.ljust()`, `string.center()`

string.rstrip(s=...)
"".rstrip([chars=string.whitespace])
    Return a string with trailing whitespace characters removed. For Python 1.6+, use of a string object method is stylistically preferred in many cases:

    >>> import string
    >>> s = """
    ...    Mary had a little lamb \011"""
    >>> string.rstrip(s)
    '\012   Mary had a little lamb'
    >>> s.rstrip()
    '\012   Mary had a little lamb'

    Python 2.3+ accepts the optional argument 'chars' to the string object method. All characters in the string 'chars' will be removed.

    SEE ALSO, `string.lstrip()`, `string.strip()`

string.split(s=... [,sep=... [,maxsplit=...]])
"".split([sep [,maxsplit]])
    Return a list of nonoverlapping substrings of 's'. If the second argument 'sep' is specified, the substrings are divided around the occurrences of 'sep'. If 'sep' is not specified, the substrings are divided around -any- whitespace characters. The dividing strings do not appear in the resultant list. If the third argument 'maxsplit' is specified, everything "left over" after splitting 'maxsplit' parts is appended to the list, giving the list length 'maxsplit'+1.

    >>> import string
    >>> s = 'mary had a little lamb ...with a glass of sherry'
    >>> string.split(s, ' a ')
    ['mary had', 'little lamb ...with', 'glass of sherry']
    >>> string.split(s)
    ['mary', 'had', 'a', 'little', 'lamb', '...with', 'a', 'glass', 'of', 'sherry']
    >>> string.split(s, maxsplit=5)
    ['mary', 'had', 'a', 'little', 'lamb', '...with a glass of sherry']

    For Python 1.6+, use of a string object method is stylistically preferred in many cases:

    >>> "Mary had a Little Lamb!".split()
    ['Mary', 'had', 'a', 'Little', 'Lamb!']

    The `string.split()` function (and the corresponding string object method) is surprisingly versatile for working with texts, especially ones that resemble prose.
Its default behavior of treating all whitespace as a single divider allows `string.split()` to act as a quick-and-dirty word parser: >>> wc = lambda s: len(s.split()) >>> wc("Mary had a Little Lamb") 5 >>> s = """Mary had a Little Lamb ... its fleece as white as snow. ... And everywhere that Mary went ... the lamb was sure to go.""" >>> print s Mary had a Little Lamb its fleece as white as snow. And everywhere that Mary went the lamb was sure to go. >>> wc(s) 22 The function `string.split()` is very often used in conjunction with `string.join()`. The pattern involved is "pull the string apart, modify the parts, put it back together." Often the parts will be words, but this also works with lines (dividing on '\n') or other chunks. For example: >>> import string >>> s = """Mary had a Little Lamb ... its fleece as white as snow. ... And everywhere that Mary went ... the lamb was sure to go.""" >>> string.join(string.split(s)) 'Mary had a Little Lamb its fleece as white as snow. And everywhere that Mary went the lamb was sure to go.' A Python 1.6+ idiom for string object methods expresses this technique compactly: >>> "-".join(s.split()) 'Mary-had-a-Little-Lamb-its-fleece-as-white-as-snow.-And-everywhere-that-Mary-went-the-lamb-was-sure-to-go.' SEE ALSO, `string.join()`, `mx.TextTools.setsplit()`, `mx.TextTools.charsplit()`, `mx.TextTools.splitat()`, `mx.TextTools.splitlines()` string.splitfields(...) Identical to `string.split()`. "".splitlines([keepends=0]) This string method does not have an equivalent in the [string] module. Return a list of lines in the string. The optional argument 'keepends' determines whether line break character(s) are included in the line strings. "".startswith(prefix [,start [,end]]) This string method does not have an equivalent in the [string] module. Return a Boolean value indicating whether the string begins with the prefix 'prefix'. If the optional second argument 'start' is specified, only consider the terminal substring after the offset 'start'. If the optional third argument 'end' is given, only consider the slice '[start:end]'. SEE ALSO, `"".endswith()`, `string.find()` string.strip(s=...) "".strip() Return a string with leading and trailing whitespace characters removed. For Python 1.6+, use of a string object method is stylistically preferred in many cases: >>> import string >>> s = """ ... Mary had a little lamb \011""" >>> string.strip(s) 'Mary had a little lamb' >>> s.strip() 'Mary had a little lamb' Python 2.3+ accepts the optional argument 'chars' to the string object method. All characters in the string 'chars' will be removed. >>> s = "MARY had a LITTLE lamb STEW" >>> s.strip("ABCDEFGHIJKLMNOPQRSTUVWXYZ") # strip caps ' had a LITTLE lamb ' SEE ALSO, `string.rstrip()`, `string.lstrip()` string.swapcase(s=...) "".swapcase() Return a string with any uppercase letters converted to lowercase and any lowercase letters converted to uppercase. >>> import string >>> string.swapcase("mary HAD a little lamb!") 'MARY had A LITTLE LAMB!' For Python 1.6+, use of a string object method is stylistically preferred in many cases: >>> "Mary had a Little Lamb!".swapcase() 'mARY HAD A lITTLE lAMB!' SEE ALSO, `string.upper()`, `string.lower()` string.translate(s=..., table=... [,deletechars=""]) "".translate(table [,deletechars=""]) Return a string, based on 's', with 'deletechars' deleted (if the third argument is specified) and with any remaining characters translated according to the translation 'table'.
>>> import string >>> tab = string.maketrans('ABC','abc') >>> string.translate('MARY HAD a little LAMB', tab, 'Atl') 'MRY HD a ie LMb' For Python 1.6+, use of a string object method is stylistically preferred in many cases. However, if `string.maketrans()` is used to create the translation table, one will need to import the [string] module anyway: >>> 'MARY HAD a little LAMB'.translate(tab, 'Atl') 'MRY HD a ie LMb' The `string.translate()` function is a -very- fast way to modify a string. Setting up the translation table takes some getting used to, but the resultant transformation is much faster than a procedural technique such as: >>> (new,frm,to,dlt) = ("",'ABC','abc','Atl') >>> for c in 'MARY HAD a little LAMB': ... if c not in dlt: ... pos = frm.find(c) ... if pos == -1: new += c ... else: new += to[pos] ... >>> new 'MRY HD a ie LMb' SEE ALSO, `string.maketrans()` string.upper(s=...) "".upper() Return a string with any lowercase letters converted to uppercase. >>> import string >>> string.upper("mary HAD a little lamb!") 'MARY HAD A LITTLE LAMB!' >>> string.upper("Mary had a Little Lamb!") 'MARY HAD A LITTLE LAMB!' For Python 1.6+, use of a string object method is stylistically preferred in many cases: >>> "Mary had a Little Lamb!".upper() 'MARY HAD A LITTLE LAMB!' SEE ALSO, `string.lower()` string.zfill(s=..., width=...) Return a string with 's' padded with leading zeros (but not truncated) to occupy length 'width' (or more). If a leading sign is present, it "floats" to the beginning of the return value. In general, `string.zfill()` is designed for alignment of numeric values, but no checking is done that a string looks number-like. >>> import string >>> string.zfill("this", 20) '0000000000000000this' >>> string.zfill("-37", 20) '-0000000000000000037' >>> string.zfill("+3.7", 20) '+00000000000000003.7' Based on the example of `string.rjust()`, one might expect a string object method `"".zfill()`; however, no such method exists prior to Python 2.2.2 (later versions add `"".zfill()`). SEE ALSO, `string.rjust()` TOPIC -- Strings as Files, and Files as Strings -------------------------------------------------------------------- In many ways, strings and files do a similar job. Both provide a storage container for an unlimited amount of (textual) information that is directly structured only by linear position of the bytes. A first inclination is to suppose that the difference between files and strings is one of persistence--files hang around when the current program is no longer running. But that distinction is not really tenable. On the one hand, standard Python modules like [shelve], [pickle], and [marshal]--and third-party modules like [xml_pickle] and [ZODB]--provide simple ways of making strings persist (without thereby corresponding in any direct way to a filesystem). On the other hand, many files are not particularly persistent: Special files like STDIN and STDOUT under Unix-like systems exist only for the life of a program; other peculiar files like '/dev/cua0' and similar "device files" are really just streams; and even files that live on transient memory disks, or get deleted with program cleanup, are not very persistent. The real difference between files and strings in Python is no more or less than the set of techniques available to operate on them. File objects can do things like '.read()' and '.seek()' on themselves. Notably, file objects have a concept of a "current position" that emulates an imaginary "read-head" passing over the physical storage media.
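A short interactive session makes this stateful "read-head" concrete (a minimal sketch, using a throwaway file named 'test'): >>> open('test','w').write('Mary had a little lamb') >>> fp = open('test') >>> fp.read(4) # reading advances the current position 'Mary' >>> fp.tell() # report the current position 4 >>> fp.seek(11) # jump the read-head to a new offset >>> fp.read(6) 'little'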
Strings, on the other hand, can be sliced and indexed--for example 'str[4:10]' or 'for c in str:'--and can be processed with string object methods and by functions of modules like [string] and [re]. Moreover, a number of special-purpose Python objects act "file-like" without quite being files; for example, `gzip.open()` and `urllib.urlopen()`. Of course, Python itself does not impose any strict condition for just how "file-like" something has to be to work in a file-like context. A programmer has to figure that out for each type of object she wishes to apply techniques to (but most of the time things "just work" right). Happily, Python provides some standard modules to make files and strings easily interoperable. ================================================================= MODULE -- mmap : Memory-mapped file support ================================================================= The [mmap] module allows a programmer to create "memory-mapped" file objects. These special [mmap] objects enable most of the techniques you might apply to "true" file objects and simultaneously most of the techniques one might apply to "true" strings. Keep in mind the hinted caveat about "most," however: Many [string] module functions are implemented using the corresponding string object methods. Since a [mmap] object is only somewhat "string-like," it basically only implements the '.find()' method and those "magic" methods associated with slicing and indexing. This is enough to support most string object idioms. When a string-like change is made to a [mmap] object, that change is propagated to the underlying file, and the change is persistent (assuming the underlying file is persistent, and that '.flush()' is called before the object is destroyed). [mmap] thereby provides an efficient route to "persistent strings." Some examples of working with memory-mapped file objects are worth looking at: >>> # Create a file with some test data >>> open('test','w').write(' #'.join(map(str, range(1000)))) >>> fp = open('test','r+') >>> import mmap >>> mm = mmap.mmap(fp.fileno(),1000) >>> len(mm) 1000 >>> mm[-20:] '218 #219 #220 #221 #' >>> import string # apply a string module method >>> mm.seek(string.find(mm, '21')) >>> mm.read(10) '21 #22 #23' >>> mm.read(10) # next ten bytes ' #24 #25 #' >>> mm.find('21') # object method to find next occurrence 402 >>> try: string.rfind(mm, '21') ... except AttributeError: print "Unsupported string function" ... Unsupported string function >>> import re # regular expressions work nicely >>> '/'.join(re.findall('..21..',mm)) ' #21 #/#121 #/ #210 / #212 / #214 / #216 / #218 /#221 #' It is worth emphasizing that the bytes in a file on disk are in fixed positions. You may use the `mmap.mmap.resize()` method to change the size of the region mapped, but you cannot expand a file from the middle, only by adding to the end. CLASSES: mmap.mmap(fileno, length [,tagname]) (Windows) mmap.mmap(fileno, length [,flags=MAP_SHARED, prot=PROT_READ|PROT_WRITE]) (Unix) Create a new memory-mapped file object. 'fileno' is the numeric file handle to base the mapping on. Generally this number should be obtained using the '.fileno()' method of a file object. 'length' specifies the length of the mapping. Under Windows, the value 0 may be given for 'length' to specify the current length of the file. If a 'length' smaller than the current file size is specified, only the initial portion of the file will be mapped. If a 'length' larger than the current file size is specified, the file can be extended with additional string content.
The underlying file for a memory-mapped file object must be opened for updating, using the "+" mode modifier. According to the official Python documentation for Python 2.1, a third argument 'tagname' may be specified. If it is, multiple memory-maps against the same file are created. In practice, however, each instance of `mmap.mmap()` creates a new memory-map whether or not a 'tagname' is specified. In any case, this allows multiple file-like updates to the same underlying file, generally at different positions in the file. >>> open('test','w').write(' #'.join([str(n) for n in range(1000)])) >>> fp = open('test','r+') >>> import mmap >>> mm1 = mmap.mmap(fp.fileno(),1000) >>> mm2 = mmap.mmap(fp.fileno(),1000) >>> mm1.seek(500) >>> mm1.read(10) '122 #123 #' >>> mm2.read(10) '0 #1 #2 #3' Under Unix, the third argument 'flags' may be MAP_PRIVATE or MAP_SHARED. If MAP_SHARED is specified for 'flags', all processes mapping the file will see the changes made to a [mmap] object. Otherwise, the changes are restricted to the current process. The fourth argument, 'prot', may be used to disallow certain types of access by other processes to the mapped file regions. METHODS: mmap.mmap.close() Close the memory-mapped file object. Subsequent calls to the other methods of the [mmap] object will raise an exception. Under Windows, the behavior of a [mmap] object after '.close()' is somewhat erratic, however. Note that closing the memory-mapped file object is not the same as closing the underlying file object. Closing the underlying file will make the contents inaccessible, but closing the memory-mapped file object will not affect the underlying file object. SEE ALSO, `FILE.close()` mmap.mmap.find(sub [,pos]) Similar to `string.find()`. Return the index position of the first occurrence of 'sub' in the [mmap] object. If the optional second argument 'pos' is specified, the result is the offset returned relative to 'pos'. Return -1 if no occurrence is found: >>> open('test','w').write(' #'.join([str(n) for n in range(1000)])) >>> fp = open('test','r+') >>> import mmap >>> mm = mmap.mmap(fp.fileno(), 0) >>> mm.find('21') 74 >>> mm.find('21',100) -26 >>> mm.tell() 0 SEE ALSO, `mmap.mmap.seek()`, `string.find()` mmap.mmap.flush([offset, size]) Write changes made in memory to the [mmap] object back to disk. The first argument 'offset' and second argument 'size' must either both be specified or both omitted. If 'offset' and 'size' are specified, only the slice starting at position 'offset' and of length 'size' will be written back to disk. `mmap.mmap.flush()` is necessary to guarantee that changes are written to disk; however, no guarantee is given that changes -will not- be written to disk as part of normal Python interpreter housekeeping. [mmap] should not be used for systems with "cancelable" changes (since changes may not be cancelable). SEE ALSO, `FILE.flush()` mmap.mmap.move(target, source, length) Copy a substring within a memory-mapped file object. The length of the substring is the third argument 'length'. The target location is the first argument 'target'. The substring is copied from the position 'source'. It is allowable to have the substring's original position overlap its target range, but it must not go past the last position of the [mmap] object.
>>> open('test','w').write(''.join([c*10 for c in 'ABCDE'])) >>> fp = open('test','r+') >>> import mmap >>> mm = mmap.mmap(fp.fileno(),0) >>> mm[:] 'AAAAAAAAAABBBBBBBBBBCCCCCCCCCCDDDDDDDDDDEEEEEEEEEE' >>> mm.move(40,0,5) >>> mm[:] 'AAAAAAAAAABBBBBBBBBBCCCCCCCCCCDDDDDDDDDDAAAAAEEEEE' mmap.mmap.read(num) Return a string containing 'num' bytes, starting at the current file position. The file position is moved to the end of the read string. In contrast to the '.read()' method of file objects, `mmap.mmap.read()` always requires that a byte count be specified, which makes a memory-mapped file object not fully substitutable for a file object when data is read. However, the following is safe for both true file objects and memory-mapped file objects: >>> open('test','w').write(' #'.join([str(n) for n in range(1000)])) >>> fp = open('test','r+') >>> import mmap >>> mm = mmap.mmap(fp.fileno(),0) >>> def safe_readall(file): ... try: ... length = len(file) ... return file.read(length) ... except TypeError: ... return file.read() ... >>> s1 = safe_readall(fp) >>> s2 = safe_readall(mm) >>> s1 == s2 1 SEE ALSO, `mmap.mmap.read_byte()`, `mmap.mmap.readline()`, `mmap.mmap.write()`, `FILE.read()` mmap.mmap.read_byte() Return a one-byte string from the current file position and advance the current position by one. Same as 'mmap.mmap.read(1)'. SEE ALSO, `mmap.mmap.read()`, `mmap.mmap.readline()` mmap.mmap.readline() Return a string from the memory-mapped file object, starting from the current file position and going to the next newline character. Advance the current file position by the amount read. SEE ALSO, `mmap.mmap.read()`, `mmap.mmap.read_byte()`, `FILE.readline()` mmap.mmap.resize(newsize) Change the size of a memory-mapped file object. This may be used to expand the size of an underlying file or merely to expand the area of a file that is memory-mapped. An expanded file is padded with null bytes ('\000') unless otherwise filled with content. As with other operations on [mmap] objects, changes to the underlying file system may not occur until a '.flush()' is performed. SEE ALSO, `mmap.mmap.flush()` mmap.mmap.seek(offset [,mode]) Change the current file position. If a second argument 'mode' is given, a different seek mode can be selected. The default is 0, absolute file positioning. Mode 1 seeks relative to the current file position. Mode 2 is relative to the end of the memory-mapped file (which may be smaller than the whole size of the underlying file). The first argument 'offset' specifies the distance to move the current file position--in mode 0 it should be positive, in mode 2 it should be negative, in mode 1 the current position can be moved either forward or backward. SEE ALSO, `FILE.seek()` mmap.mmap.size() Return the length of the underlying file. The size of the actual memory-map may be smaller if less than the whole file is mapped: >>> open('test','w').write('X'*100) >>> fp = open('test','r+') >>> import mmap >>> mm = mmap.mmap(fp.fileno(),50) >>> mm.size() 100 >>> len(mm) 50 SEE ALSO, `len()`, `mmap.mmap.seek()`, `mmap.mmap.tell()` mmap.mmap.tell() Return the current file position. >>> open('test','w').write('X'*100) >>> fp = open('test','r+') >>> import mmap >>> mm = mmap.mmap(fp.fileno(), 0) >>> mm.tell() 0 >>> mm.seek(20) >>> mm.tell() 20 >>> mm.read(20) 'XXXXXXXXXXXXXXXXXXXX' >>> mm.tell() 40 SEE ALSO, `FILE.tell()`, `mmap.mmap.seek()` mmap.mmap.write(s) Write 's' into the memory-mapped file object at the current file position.
The current file position is updated to the position following the write. The method `mmap.mmap.write()` is useful for functions that expect to be passed a file-like object with a '.write()' method. However, for new code, it is generally more natural to use the string-like index and slice operations to write contents. For example: >>> open('test','w').write('X'*50) >>> fp = open('test','r+') >>> import mmap >>> mm = mmap.mmap(fp.fileno(), 0) >>> mm.write('AAAAA') >>> mm.seek(10) >>> mm.write('BBBBB') >>> mm[30:35] = 'SSSSS' >>> mm[:] 'AAAAAXXXXXBBBBBXXXXXXXXXXXXXXXSSSSSXXXXXXXXXXXXXXX' >>> mm.tell() 15 SEE ALSO, `FILE.write()`, `mmap.mmap.read()` mmap.mmap.write_byte(c) Write a one-byte string to the current file position, and advance the current position by one. Same as 'mmap.mmap.write(c)' where 'c' is a one-byte string. SEE ALSO, `mmap.mmap.write()` ================================================================= MODULE -- StringIO : File-like objects that read from or write to a string buffer ================================================================= MODULE -- cStringIO : Fast, but incomplete, StringIO replacement ================================================================= The [StringIO] and [cStringIO] modules allow a programmer to create "memory files," that is, "string buffers." These special [StringIO] objects enable most of the techniques you might apply to "true" file objects, but without any connection to a filesystem. The most common use of string buffer objects is when some existing techniques for working with byte-streams in files are to be applied to strings that do not come from files. A string buffer object behaves in a file-like manner and can "drop in" to most functions that want file objects. [cStringIO] is much faster than [StringIO] and should be used in most cases. Both modules provide a 'StringIO' class whose instances are the string buffer objects. `cStringIO.StringIO` cannot be subclassed (and therefore cannot provide additional methods), and it cannot handle Unicode strings. One rarely needs to subclass [StringIO], but the absence of Unicode support in [cStringIO] could be a problem for many developers. As well, a [cStringIO] buffer initialized with string content is read-only--to obtain a writable buffer, instantiate it with no argument--which makes its string buffers somewhat less general (the effect of a write against a read-only in-memory file can, however, be accomplished by normal string operations). A string buffer object may be initialized with a string (or Unicode for [StringIO]) argument. If so, that is the initial content of the buffer. Below are examples of usage (including Unicode handling): >>> from cStringIO import StringIO as CSIO >>> from StringIO import StringIO as SIO >>> alef, omega = unichr(1488), unichr(969) >>> sentence = "In set theory, the Greek "+omega+" represents the ordinal \n"+\ ... "limit of the integers, while the Hebrew \n"+\ ... alef+" represents their cardinality." >>> sio = SIO(sentence) >>> try: ... csio = CSIO(sentence) ... print "New string buffer from raw string" ... except TypeError: ... csio = CSIO(sentence.encode('utf-8')) ... print "New string buffer from ENCODED string" ... New string buffer from ENCODED string >>> sio.getvalue() == unicode(csio.getvalue(),'utf-8') 1 >>> try: ... sio.getvalue() == csio.getvalue() ... except UnicodeError: ... print "Cannot even compare Unicode with string, in general" ...
Cannot even compare Unicode with string, in general >>> lines = csio.readlines() >>> len(lines) 3 >>> sio.seek(0) >>> print sio.readline().encode('utf-8'), In set theory, the Greek ω represents the ordinal >>> sio.tell(), csio.tell() (51, 124) CONSTANTS: cStringIO.InputType The type of a `cStringIO.StringIO` instance that has been opened in "read" mode. All `StringIO.StringIO` instances are simply InstanceType. SEE ALSO, `cStringIO.StringIO` cStringIO.OutputType The type of a `cStringIO.StringIO` instance that has been opened in "write" mode (actually read/write). All `StringIO.StringIO` instances are simply InstanceType. SEE ALSO, `cStringIO.StringIO` CLASSES: StringIO.StringIO([buf=...]) cStringIO.StringIO([buf]) Create a new string buffer. If the first argument 'buf' is specified, the buffer is initialized with string content. If the [cStringIO] module is used, the presence of the 'buf' argument determines whether write access to the buffer is enabled. A `cStringIO.StringIO` buffer with write access must be initialized with no argument, otherwise it becomes read-only. A `StringIO.StringIO` buffer, however, is always read/write. METHODS: StringIO.StringIO.close() cStringIO.StringIO.close() Close the string buffer. No access is permitted after close. SEE ALSO, `FILE.close()` StringIO.StringIO.flush() cStringIO.StringIO.flush() Compatibility method for file-like behavior. Data in a string buffer is already in memory, so there is no need to finalize a write to disk. SEE ALSO, `FILE.close()` StringIO.StringIO.getvalue() cStringIO.StringIO.getvalue() Return the entire string held by the string buffer. Does not affect the current file position. Basically, this is the way you convert back from a string buffer to a string. StringIO.StringIO.isatty() cStringIO.StringIO.isatty() Return 0. Compatibility method for file-like behavior. SEE ALSO, `FILE.isatty()` StringIO.StringIO.read([num]) cStringIO.StringIO.read([num]) If the first argument 'num' is specified, return a string containing the next 'num' characters. If 'num' characters are not available, return as many as possible. If 'num' is not specified, return all the characters from the current file position to the end of the string buffer. Advance the current file position by the amount read. SEE ALSO, `FILE.read()`, `mmap.mmap.read()`, `StringIO.StringIO.readline()` StringIO.StringIO.readline([length=...]) cStringIO.StringIO.readline([length]) Return a string from the string buffer, starting from the current file position and going to the next newline character. Advance the current file position by the amount read. SEE ALSO, `mmap.mmap.readline()`, `StringIO.StringIO.read()`, `StringIO.StringIO.readlines()`, `FILE.readline()` StringIO.StringIO.readlines([sizehint=...]) cStringIO.StringIO.readlines([sizehint]) Return a list of strings from the string buffer. Each list element consists of a single line, including the trailing newline character(s). If an argument 'sizehint' is specified, only read approximately 'sizehint' characters worth of lines (full lines will always be read). SEE ALSO, `StringIO.StringIO.readline()`, `FILE.readlines()` cStringIO.StringIO.reset() Set the current file position to the beginning of the string buffer. Same as 'cStringIO.StringIO.seek(0)'. SEE ALSO, `StringIO.StringIO.seek()` StringIO.StringIO.seek(offset [,mode=0]) cStringIO.StringIO.seek(offset [,mode]) Change the current file position. If the second argument 'mode' is given, a different seek mode can be selected. The default is 0, absolute file positioning.
Mode 1 seeks relative to the current file position. Mode 2 is relative to the end of the string buffer. The first argument 'offset' specifies the distance to move the current file position--in mode 0 it should be positive, in mode 2 it should be negative, in mode 1 the current position can be moved either forward or backward. SEE ALSO, `FILE.seek()`, `mmap.mmap.seek()` StringIO.StringIO.tell() cStringIO.StringIO.tell() Return the current file position in the string buffer. SEE ALSO, `StringIO.StringIO.seek()` StringIO.StringIO.truncate([len=0]) cStringIO.StringIO.truncate([len]) Reduce the length of the string buffer to the first argument 'len' characters. Truncate can only reduce characters later than the current file position (an initial 'cStringIO.StringIO.reset()' can be used to assure truncation from the beginning). SEE ALSO, `StringIO.StringIO.seek()`, `cStringIO.StringIO.reset()`, `StringIO.StringIO.close()` StringIO.StringIO.write(s=...) cStringIO.StringIO.write(s) Write the first argument 's' into the string buffer at the current file position. The current file position is updated to the position following the write. SEE ALSO, `StringIO.StringIO.writelines()`, `mmap.mmap.write()`, `StringIO.StringIO.read()`, `FILE.write()` StringIO.StringIO.writelines(list=...) cStringIO.StringIO.writelines(list) Write each element of 'list' into the string buffer at the current file position. The current file position is updated to the position following the write. For the [cStringIO] method, 'list' must be an actual list. For the [StringIO] method, other sequence types are allowed. To be safe, it is best to coerce an argument into an actual list first. In either case, 'list' must contain only strings, or a 'TypeError' will occur. Contrary to what might be expected from the method name, `StringIO.StringIO.writelines()` never inserts newline characters. For the list elements actually to occupy separate lines in the string buffer, each element string must already have a newline terminator. Consider the following variants on writing a list to a string buffer: >>> from StringIO import StringIO >>> sio = StringIO() >>> lst = [c*5 for c in 'ABC'] >>> sio.writelines(lst) >>> sio.write(''.join(lst)) >>> sio.write('\n'.join(lst)) >>> print sio.getvalue() AAAAABBBBBCCCCCAAAAABBBBBCCCCCAAAAA BBBBB CCCCC SEE ALSO, `FILE.writelines()`, `StringIO.StringIO.write()` TOPIC -- Converting Between Binary and ASCII -------------------------------------------------------------------- The Python standard library provides several modules for converting between binary data and 7-bit ASCII. At the low level, [binascii] is a C extension to produce fast string conversions. At a high level, [base64], [binhex], [quopri], and [uu] provide file-oriented wrappers to the facilities in [binascii]. ================================================================= MODULE -- base64 : Convert to/from base64 encoding (RFC1521) ================================================================= The [base64] module is a wrapper around the functions `binascii.a2b_base64()` and `binascii.b2a_base64()`. As well as providing a file-based interface on top of the underlying string conversions, [base64] handles the chunking of binary files into base64 line blocks and provides for the direct encoding of arbitrary input strings. Unlike [uu], [base64] adds no content headers to encoded data; MIME standards for headers and message-wrapping are handled by other modules that utilize [base64]. Base64 encoding is specified in RFC1521. 
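A quick round-trip through the string-oriented functions shows the encoding at work (the plaintext here is arbitrary): >>> import base64 >>> base64.encodestring('Mary had a little lamb') 'TWFyeSBoYWQgYSBsaXR0bGUgbGFtYg==\n' >>> base64.decodestring('TWFyeSBoYWQgYSBsaXR0bGUgbGFtYg==\n') 'Mary had a little lamb'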
FUNCTIONS: base64.encode(input=..., output=...) Encode the contents of the first argument 'input' to the second argument 'output'. Arguments 'input' and 'output' should be file-like objects; 'input' must be readable and 'output' must be writable. base64.encodestring(s=...) Return the base64 encoding of the string passed in the first argument 's'. base64.decode(input=..., output=...) Decode the contents of the first argument 'input' to the second argument 'output'. Arguments 'input' and 'output' should be file-like objects; 'input' must be readable and 'output' must be writable. base64.decodestring(s=...) Return the decoding of the base64-encoded string passed in the first argument 's'. SEE ALSO, [email], `rfc822`, `mimetools`, [mimetypes], `MimeWriter`, `mimify`, [binascii], [quopri] ================================================================= MODULE -- binascii : Convert between binary data and ASCII ================================================================= The [binascii] module is a C implementation of a number of styles of ASCII encoding of binary data. Each function in the [binascii] module takes either encoded ASCII or raw binary strings as an argument, and returns the string result of converting back or forth. Some restrictions apply to the length of strings passed to some functions in the module (for encodings that operate on specific block sizes). FUNCTIONS: binascii.a2b_base64(s) Return the decoded version of a base64-encoded string. A string consisting of one or more encoding blocks should be passed as the argument 's'. binascii.a2b_hex(s) Return the decoded version of a hexadecimal-encoded string. A string consisting of an even number of hexadecimal digits should be passed as the argument 's'. binascii.a2b_hqx(s) Return the decoded version of a binhex-encoded string. A string containing a complete number of encoded binary bytes should be passed as the argument 's'. binascii.a2b_qp(s [,header=0]) Return the decoded version of a quoted printable string. A string containing a complete number of encoded binary bytes should be passed as the argument 's'. If the optional argument 'header' is specified, underscores will be decoded as spaces. New to Python 2.2. binascii.a2b_uu(s) Return the decoded version of a UUencoded string. A string consisting of exactly one encoding block should be passed as the argument 's' (for a full block, 62 bytes input, 45 bytes returned). binascii.b2a_base64(s) Return the base64 encoding of a binary string (including the newline after block). A binary string no longer than 57 bytes should be passed as the argument 's'. binascii.b2a_hex(s) Return the hexadecimal encoding of a binary string. A binary string of any length should be passed as the argument 's'. binascii.b2a_hqx(s) Return the binhex4 encoding of a binary string. A binary string of any length should be passed as the argument 's'. Run-length compression of 's' is not performed by this function (use `binascii.rlecode_hqx()` first, if needed). binascii.b2a_qp(s [,quotetabs=0 [,istext=1 [,header=0]]]) Return the quoted printable encoding of a binary string. A binary string of any length should be passed as the argument 's'. The optional argument 'quotetabs' specifies whether to escape spaces and tabs; 'istext' specifies that newlines are -not- to be escaped; 'header' specifies whether to encode spaces as underscores (and escape underscores). New to Python 2.2. binascii.b2a_uu(s) Return the UUencoding of a binary string (including the initial block specifier--"M" for full blocks--and newline after block).
A binary string no longer than 45 bytes should be passed as the argument 's'. binascii.crc32(s [,crc]) Return the CRC32 checksum of the first argument 's'. If the second argument 'crc' is specified, it will be used as an initial checksum. This allows partial computation of a checksum and continuation. For example: >>> import binascii >>> crc = binascii.crc32('spam') >>> binascii.crc32(' and eggs', crc) 739139840 >>> binascii.crc32('spam and eggs') 739139840 binascii.crc_hqx(s, crc) Return the binhex4 checksum of the first argument 's', using the initial checksum value in the second argument 'crc'. This allows partial computation of a checksum and continuation. For example: >>> import binascii >>> binascii.crc_hqx('spam and eggs', 0) 17918 >>> crc = binascii.crc_hqx('spam', 0) >>> binascii.crc_hqx(' and eggs', crc) 17918 SEE ALSO, `binascii.crc32()` binascii.hexlify(s) Identical to `binascii.b2a_hex()`. binascii.rlecode_hqx(s) Return the binhex4 run-length encoding (RLE) of the first argument 's'. Under this RLE technique, '0x90' is used as an indicator byte. Outside the context of the binhex4 standard, this is a poor choice of precompression for encoded strings. SEE ALSO, `zlib.compress()` binascii.rledecode_hqx(s) Return the expansion of a binhex4 run-length encoded string. binascii.unhexlify(s) Identical to `binascii.a2b_hex()`. EXCEPTIONS: binascii.Error Generic exception that should only result from programming errors. binascii.Incomplete Exception raised when a data block is incomplete. Usually this results from programming errors in reading blocks, but it could indicate data or channel corruption. SEE ALSO, [base64], [binhex], [uu] ================================================================= MODULE -- binhex : Encode and decode binhex4 files ================================================================= The [binhex] module is a wrapper around the functions `binascii.a2b_hqx()`, `binascii.b2a_hqx()`, `binascii.rlecode_hqx()`, `binascii.rledecode_hqx()`, and `binascii.crc_hqx()`. As well as providing a file-based interface on top of the underlying string conversions, [binhex] handles run-length encoding of encoded files and attaches the needed header and footer information. Under MacOS, the resource fork of a file is encoded along with the data fork (not applicable under other platforms). FUNCTIONS: binhex.binhex(inp=..., out=...) Encode the contents of the first argument 'inp' to the second argument 'out'. Argument 'inp' is a filename; 'out' may be either a filename or a file-like object. However, a `cStringIO.StringIO` object is not "file-like" enough, since it will be closed after the conversion--and its value therefore lost. You could override the '.close()' method in a subclass of `StringIO.StringIO` to solve this limitation. binhex.hexbin(inp=... [,out=...]) Decode the contents of the first argument to an output file. If the second argument 'out' is specified, it will be used as the output filename, otherwise the filename will be taken from the binhex header. The argument 'inp' may be either a filename or a file-like object. CLASSES: A number of internal classes are used by [binhex]. They are not documented here, but can be examined in '$PYTHONHOME/lib/binhex.py' if desired (it is unlikely readers will need to do this).
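In most cases, the two wrapper functions above are all that is needed; a minimal sketch of a round-trip (the filenames are hypothetical, and 'myfile.txt' must already exist): >>> import binhex >>> binhex.binhex('myfile.txt','myfile.hqx') # encode to binhex4 >>> binhex.hexbin('myfile.hqx','myfile.out') # decode back to a copy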
SEE ALSO, [binascii] ================================================================= MODULE -- quopri : Convert to/from quoted printable encoding (RFC1521) ================================================================= The [quopri] module is a wrapper around the functions `binascii.a2b_qp()` and `binascii.b2a_qp()`. The module [quopri] has the same methods as [base64]. Unlike [uu], [quopri] adds no content headers to encoded data; MIME standards for headers and message wrapping are handled by other modules that utilize [quopri]. Quoted printable encoding is specified in RFC1521. FUNCTIONS: quopri.encode(input, output, quotetabs) Encode the contents of the first argument 'input' to the second argument 'output'. Arguments 'input' and 'output' should be file-like objects; 'input' must be readable and 'output' must be writable. If 'quotetabs' is a true value, escape tabs and spaces. quopri.encodestring(s [,quotetabs=0]) Return the quoted printable encoding of the string passed in the first argument 's'. If 'quotetabs' is a true value, escape tabs and spaces. quopri.decode(input=..., output=... [,header=0]) Decode the contents of the first argument 'input' to the second argument 'output'. Arguments 'input' and 'output' should be file-like objects; 'input' must be readable and 'output' must be writable. If 'header' is a true value, decode underscores as spaces. quopri.decodestring(s [,header=0]) Return the decoding of the quoted printable string passed in the first argument 's'. If 'header' is a true value, decode underscores as spaces. SEE ALSO, [email], `rfc822`, `mimetools`, [mimetypes], `MimeWriter`, `mimify`, [binascii], [base64] ================================================================= MODULE -- uu : UUencode and UUdecode files ================================================================= The [uu] module is a wrapper around the functions `binascii.a2b_uu()` and `binascii.b2a_uu()`. As well as providing a file-based interface on top of the underlying string conversions, [uu] handles the chunking of binary files into UUencoded line blocks and attaches the needed header and footer. FUNCTIONS: uu.encode(in, out [,name=... [,mode=0666]]) Encode the contents of the first argument 'in' to the second argument 'out'. Arguments 'in' and 'out' should be file objects, but filenames are also accepted (the latter is deprecated). The special filename "-" can be used to specify STDIN or STDOUT, as appropriate. When file objects are passed as arguments, 'in' must be readable and 'out' must be writable. The third argument 'name' can be used to specify the filename that appears in the UUencoding header; by default it is the name of 'in'. The fourth argument 'mode' is the octal filemode to store in the UUencoding header. uu.decode(in [,out_file=... [,mode=...]]) Decode the contents of the first argument 'in' to an output file. If the second argument 'out_file' is specified, it will be used as the output file; otherwise, the filename will be taken from the UUencoding header. Arguments 'in' and 'out_file' should be file objects, but filenames are also accepted (the latter is deprecated). If the third argument 'mode' is specified (and if 'out_file' is either unspecified or is a filename), open the created file in mode 'mode'. SEE ALSO, [binascii] TOPIC -- Cryptography -------------------------------------------------------------------- Python does not come with any standard and general cryptography modules.
The few included capabilities are fairly narrow in purpose and limited in scope. The capabilities in the standard library consist of several cryptographic hashes and one weak symmetrical encryption algorithm. A quick survey of cryptographic techniques shows what capabilities are absent from the standard library: *Symmetrical Encryption:* Any technique by which a plaintext message M is "encrypted" with a key K to produce a cyphertext C. Application of K--or some K' easily derivable from K--to C is called "decryption" and produces as output M. The standard module [rotor] provides a form of symmetrical encryption. *Cryptographic Hash:* Any technique by which a short "hash" H is produced from a plaintext message M that has several additional properties: (1) Given only H, it is difficult to obtain any M' such that the cryptographic hash of M' is H; (2) Given two plaintext messages M and M', there is a very low probability that the cryptographic hashes of M and M' are the same. Sometimes a third property is included: (3) Given M, its cryptographic hash H, and another hash H', examining the relationship between H and H' does not make it easier to find an M' whose hash is H'. The standard modules [crypt], [md5], and [sha] provide forms of cryptographic hashes. *Asymmetrical Encryption:* Also called "public-key cryptography." Any technique by which a pair of keys K{pub} and K{priv} can be generated that have several properties. The algorithm for an asymmetrical encryption technique will be called "P(M,K)" in the following. (1) For any plaintext message M, M equals P(P(M,K{pub}),K{priv}). (2) Given only a public-key K{pub}, it is difficult to obtain a private-key K{priv} that assures the equality in (1). (3) Given only P(M,K{pub}), it is difficult to obtain M. In general, in an asymmetrical encryption system, a user generates K{pub} and K{priv}, then releases K{pub} to other users but retains K{priv} as a secret. There is no support for asymmetrical encryption in the standard library. *Digital Signatures:* Digital signatures are really just "public-keys in reverse." In many cases, the same underlying algorithm is used for each. A digital signature is any technique by which a pair of keys K{ver} and K{sig} can be generated that have several properties. The algorithm for a digital signature will be called S(M,K) in the following. (1) For any message M, M equals S(S(M,K{sig}),K{ver}). (2) Given only a verification key K{ver}, it is difficult to obtain a signature key K{sig} that assures the equality in (1). (3) Given only S(M,K{sig}), it is difficult to find any C' such that S(C',K{ver}) is a plausible message (in other words, the signature shows the message is not a forgery). In general, in a digital signature system, a user generates K{ver} and K{sig}, then releases K{ver} to other users but retains K{sig} as a secret. There is no support for digital signatures in the standard library. -*- Those outlined are the most important cryptographic techniques. More detailed general introductions to cryptology and cryptography can be found at the author's Web site. A first tutorial is _Introduction to Cryptology Concepts I_: Further material is in _Introduction to Cryptology Concepts II_: And more advanced material is in _Intermediate Cryptology: Specialized Protocols_: A number of third-party modules have been created to handle cryptographic tasks; a good guide to these third-party tools is the Vaults of Parnassus Encryption/Encoding index at .
Only the tools in the standard library will be covered here specifically, since all the third-party tools are somewhat far afield of the topic of text processing as such. Moreover, third-party tools often rely on additional non-Python libraries, which will not be present on most platforms; and these tools will not necessarily be maintained as new Python versions introduce changes. The most important third-party modules are listed below. These are modules that the author believes are likely to be maintained and that provide access to a wide range of cryptographic algorithms. mxCrypto amkCrypto Marc-Andre Lemburg and Andrew Kuchling--both valuable contributors of many Python modules--have played a game of leapfrog with each other by releasing [mxCrypto] and [amkCrypto], respectively. Each release of either module builds on the work of the other, providing compatible interfaces and overlapping source code. Whatever is newest at the time you read this is the best bet. Current information on both should be obtainable from: Python Cryptography Andrew Kuchling, who has provided a great deal of excellent Python documentation, documents these cryptography modules at: M2Crypto The [mxCrypto] and [amkCrypto] modules are most readily available for Unix-like platforms. A similar range of cryptographic capabilities for a Windows platform is available in Ng Pheng Siong's [M2Crypto]. Information and documentation can be found at: fcrypt Carey Evans has created [fcrypt], which is a pure-Python, single-module replacement for the standard library's [crypt] module. While probably orders-of-magnitude slower than a C implementation, [fcrypt] will run anywhere that Python does (and speed is rarely an issue for this functionality). [fcrypt] may be obtained at: ================================================================= MODULE -- crypt : Create and verify Unix-style passwords ================================================================= The 'crypt()' function is a frequently used, but somewhat antiquated, password creation/verification tool. Under Unix-like systems, 'crypt()' is contained in system libraries and may be called from wrapper functions in languages like Python. 'crypt()' is a form of cryptographic hash based on the Data Encryption Standard (DES). The hash produced by 'crypt()' is based on an 8-byte key and a 2-byte "salt." The output of 'crypt()' is produced by repeated encryption of a constant string, using the user key as a DES key and the salt to perturb the encryption in one of 4,096 ways. Both the key and the salt are restricted to alphanumerics plus dot and slash. By using a cryptographic hash, passwords may be stored in a relatively insecure location. An imposter cannot easily produce a false password that will hash to the same value as the one stored in the password file, even given access to the password file. The salt is used to make "dictionary attacks" more difficult. If an imposter has access to the password file, she might try applying 'crypt()' to a candidate password and compare the result to every entry in the password file. Without a salt, the chances of matching -some- encrypted password would be higher. The salt (a random value should be used) decreases the chance of such a random guess by 4,096 times. The [crypt] module is only installed on some Python systems (even only some Unix systems). Moreover, the module, if installed, relies on an underlying system library. 
For a portable approach to password creation, the third-party [fcrypt] module provides a portable, pure-Python reimplementation. FUNCTIONS: crypt.crypt(passwd, salt) Return an ASCII 13-byte encrypted password. The first argument 'passwd' must be a string up to eight characters in length (extra characters are truncated and do not affect the result). The second argument 'salt' must be a string up to two characters in length (extra characters are truncated). The value of 'salt' forms the first two characters of the result. >>> from crypt import crypt >>> crypt('mypassword','XY') 'XY5XuULXk4pcs' >>> crypt('mypasswo','XY') 'XY5XuULXk4pcs' >>> crypt('mypassword...more.characters','XY') 'XY5XuULXk4pcs' >>> crypt('mypasswo','AB') 'AB06lnfYxWIKg' >>> crypt('diffpass','AB') 'ABlO5BopaFYNs' SEE ALSO, `fcrypt`, [md5], [sha] ================================================================= MODULE -- md5 : Create MD5 message digests ================================================================= RSA Data Security, Inc.'s MD5 cryptographic hash is a popular algorithm that is codified by RFC1321. Like [sha], and unlike [crypt], [md5] allows one to find the cryptographic hash of arbitrary strings (Unicode strings may not be hashed, however). Absent any other considerations--such as compatibility with other programs--Secure Hash Algorithm (SHA) is currently considered a better algorithm than MD5, and the [sha] module should be used for cryptographic hashes. The operation of [md5] objects is similar to `binascii.crc32()` hashes in that the final hash value may be built progressively from partial concatenated strings. The MD5 algorithm produces a 128-bit hash. CONSTANTS: md5.MD5Type The type of an `md5.new` instance. CLASSES: md5.new([s]) Create an [md5] object. If the first argument 's' is specified, initialize the MD5 digest buffer with the initial string 's'. An MD5 hash can be computed in a single line with: >>> import md5 >>> md5.new('Mary had a little lamb').hexdigest() 'e946adb45d4299def2071880d30136d4' md5.md5([s]) Identical to `md5.new`. METHODS: md5.copy() Return a new [md5] object that is identical to the current state of the current object. Different terminal strings can be concatenated to the clone objects after they are copied. For example: >>> import md5 >>> m = md5.new('spam and eggs') >>> m.digest() '\xb5\x81f\x0c\xff\x17\xe7\x8c\x84\xc3\xa8J\xd0.g\x85' >>> m2 = m.copy() >>> m2.digest() '\xb5\x81f\x0c\xff\x17\xe7\x8c\x84\xc3\xa8J\xd0.g\x85' >>> m.update(' are tasty') >>> m2.update(' are wretched') >>> m.digest() '*\x94\xa2\xc5\xceq\x96\xef&\x1a\xc9#\xac98\x16' >>> m2.digest() 'h\x8c\xfam\xe3\xb0\x90\xe8\x0e\xcb\xbf\xb3\xa7N\xe6\xbc' md5.digest() Return the 128-bit digest of the current state of the [md5] object as a 16-byte string. Each byte will contain a full 8-bit range of possible values. >>> import md5 # Python 2.1+ >>> m = md5.new('spam and eggs') >>> m.digest() '\xb5\x81f\x0c\xff\x17\xe7\x8c\x84\xc3\xa8J\xd0.g\x85' >>> import md5 # Python <= 2.0 >>> m = md5.new('spam and eggs') >>> m.digest() '\265\201f\014\377\027\347\214\204\303\250J\320.g\205' md5.hexdigest() Return the 128-bit digest of the current state of the [md5] object as a 32-byte hexadecimal-encoded string. Each byte will contain only values in `string.hexdigits`. Each pair of bytes represents 8-bits of hash, and this format may be transmitted over 7-bit ASCII channels like email. 
>>> import md5 >>> m = md5.new('spam and eggs') >>> m.hexdigest() 'b581660cff17e78c84c3a84ad02e6785' md5.update(s) Concatenate additional strings to the [md5] object. Current hash state is adjusted accordingly. The number of concatenation steps that go into an MD5 hash does not affect the final hash, only the actual string that would result from concatenating each part in a single string. However, for large strings that are determined incrementally, it may be more practical to call `md5.update()` numerous times. For example: >>> import md5 >>> m1 = md5.new('spam and eggs') >>> m2 = md5.new('spam') >>> m2.update(' and eggs') >>> m3 = md5.new('spam') >>> m3.update(' and ') >>> m3.update('eggs') >>> m1.hexdigest() 'b581660cff17e78c84c3a84ad02e6785' >>> m2.hexdigest() 'b581660cff17e78c84c3a84ad02e6785' >>> m3.hexdigest() 'b581660cff17e78c84c3a84ad02e6785' SEE ALSO, [sha], [crypt], `binascii.crc32()` ================================================================= MODULE -- rotor : Perform Enigma-like encryption and decryption ================================================================= The [rotor] module is a bit of a curiosity in the Python standard library. The symmetric encryption performed by [rotor] is similar to that performed by the extremely historically interesting and important Enigma algorithm. Given Alan Turing's famous role not just in inventing the theory of computability, but also in cracking German encryption during WWII, there is a nice literary quality to the inclusion of [rotor] in Python. However, [rotor] should not be mistaken for a robust modern encryption algorithm. Bruce Schneier has commented that there are two types of encryption algorithms: those that will stop your little sister from reading your messages, and those that will stop major governments and powerful organizations from reading your messages. [rotor] is in the first category--albeit allowing for rather bright little sisters. But [rotor] will not help much against TLAs (three letter agencies). On the other hand, there is nothing else in the Python standard library that performs actual military-grade encryption, either. CLASSES: rotor.newrotor(key [,numrotors]) Return a [rotor] object with rotor permutations and positions based on the first argument 'key'. If the second argument 'numrotors' is specified, a number of rotors other than the default of 6 can be used (more is stronger). A rotor encryption can be computed in a single line with: >>> import rotor >>> rotor.newrotor('mypassword').encrypt('Mary had a lamb') '\x10\xef\xf1\x1e\xeaor\xe9\xf7\xe5\xad,r\xc6\x9f' Object-style encryption and decryption is performed like the following: >>> import rotor >>> C = rotor.newrotor('pass2').encrypt('Mary had a little lamb') >>> r1 = rotor.newrotor('mypassword') >>> C2 = r1.encrypt('Mary had a little lamb') >>> r1.decrypt(C2) 'Mary had a little lamb' >>> r1.decrypt(C) # Let's try it '\217R$\217/sE\311\330~#\310\342\200\025F\221\245\263\036\220O' >>> r1.setkey('pass2') >>> r1.decrypt(C) # Let's try it 'Mary had a little lamb' METHODS: rotor.decrypt(s) Return a decrypted version of cyphertext string 's'. Prior to decryption, rotors are set to their initial positions. rotor.decryptmore(s) Return a decrypted version of cyphertext string 's'. Prior to decryption, rotors are left in their current positions. rotor.encrypt(s) Return an encrypted version of plaintext string 's'. Prior to encryption, rotors are set to their initial positions. rotor.encryptmore(s) Return an encrypted version of plaintext string 's'.
Prior to encryption, rotors are left in their current positions. rotor.setkey(key) Set a new key for a [rotor] object. ================================================================= MODULE -- sha : Create SHA message digests ================================================================= The National Institute of Standards and Technology's (NIST's) Secure Hash Algorithm is the best of the well-known cryptographic hashes for most purposes. Like [md5], and unlike [crypt], [sha] allows one to find the cryptographic hash of arbitrary strings (Unicode strings may not be hashed, however). Absent any other considerations--such as compatibility with other programs--SHA is currently considered a better algorithm than MD5, and the [sha] module should be used for cryptographic hashes. The operation of [sha] objects is similar to `binascii.crc32()` hashes in that the final hash value may be built progressively from partial concatenated strings. The SHA algorithm produces a 160-bit hash. CLASSES: sha.new([s]) Create an [sha] object. If the first argument 's' is specified, initialize the SHA digest buffer with the initial string 's'. An SHA hash can be computed in a single line with: >>> import sha >>> sha.new('Mary had a little lamb').hexdigest() 'bac9388d0498fb378e528d35abd05792291af182' sha.sha([s]) Identical to `sha.new`. METHODS: sha.copy() Return a new [sha] object that is identical to the current state of the current object. Different terminal strings can be concatenated to the clone objects after they are copied. For example: >>> import sha >>> s = sha.new('spam and eggs') >>> s.digest() '\276\207\224\213\255\375x\024\245b\036C\322\017\2528 @\017\246' >>> s2 = s.copy() >>> s2.digest() '\276\207\224\213\255\375x\024\245b\036C\322\017\2528 @\017\246' >>> s.update(' are tasty') >>> s2.update(' are wretched') >>> s.digest() '\013^C\366\253?I\323\206nt\2443\251\227\204-kr6' >>> s2.digest() '\013\210\237\216\014\3337X\333\221h&+c\345\007\367\326\274\321' sha.digest() Return the 160-bit digest of the current state of the [sha] object as a 20-byte string. Each byte will contain a full 8-bit range of possible values. >>> import sha # Python 2.1+ >>> s = sha.new('spam and eggs') >>> s.digest() '\xbe\x87\x94\x8b\xad\xfdx\x14\xa5b\x1eC\xd2\x0f\xaa8 @\x0f\xa6' >>> import sha # Python <= 2.0 >>> s = sha.new('spam and eggs') >>> s.digest() '\276\207\224\213\255\375x\024\245b\036C\322\017\2528 @\017\246' sha.hexdigest() Return the 160-bit digest of the current state of the [sha] object as a 40-byte hexadecimal-encoded string. Each byte will contain only values in `string.hexdigits`. Each pair of bytes represents 8-bits of hash, and this format may be transmitted over 7-bit ASCII channels like email. >>> import sha >>> s = sha.new('spam and eggs') >>> s.hexdigest() 'be87948badfd7814a5621e43d20faa3820400fa6' sha.update(s) Concatenate additional strings to the [sha] object. Current hash state is adjusted accordingly. The number of concatenation steps that go into an SHA hash does not affect the final hash, only the actual string that would result from concatenating each part in a single string. However, for large strings that are determined incrementally, it may be more practical to call `sha.update()` numerous times.
For example: >>> import sha >>> s1 = sha.sha('spam and eggs') >>> s2 = sha.sha('spam') >>> s2.update(' and eggs') >>> s3 = sha.sha('spam') >>> s3.update(' and ') >>> s3.update('eggs') >>> s1.hexdigest() 'be87948badfd7814a5621e43d20faa3820400fa6' >>> s2.hexdigest() 'be87948badfd7814a5621e43d20faa3820400fa6' >>> s3.hexdigest() 'be87948badfd7814a5621e43d20faa3820400fa6' SEE ALSO, [md5], [crypt], `binascii.crc32()` TOPIC -- Compression -------------------------------------------------------------------- Over the history of computers, a large number of data compression formats have been invented, mostly as variants on Lempel-Ziv and Huffman techniques. Compression is useful for all sorts of data streams, but file-level archive formats have been the most widely used and known application. Under MS-DOS and Windows we have seen ARC, PAK, ZOO, LHA, ARJ, CAB, RAR, and other formats--but the ZIP format has become the most widespread variant. Under Unix-like systems, 'compress' (.Z) mostly gave way to 'gzip' (GZ); 'gzip' is still the most popular format on these systems, but 'bzip2' (BZ2) generally obtains better compression rates. Under MacOS, the most popular format is SIT. Other platforms have additional variants on archive formats, but ZIP--and to a lesser extent GZ--are widely supported on a number of platforms. The Python standard library includes support for several styles of compression. The [zlib] module performs low-level compression of raw string data and has no concept of a file. [zlib] is itself called by the high-level modules below for its compression services. The modules [gzip] and [zipfile] provide file-level interfaces to compressed archives. However, a notable difference in the operation of [gzip] and [zipfile] arises out of a difference in the underlying GZ and ZIP formats. 'gzip' (GZ) operates exclusively on single files--leaving the work of concatenating collections of files to tools like 'tar'. One frequently encounters (especially on Unix-like systems) files like 'foo.tar.gz' or 'foo.tgz' that are produced by first applying 'tar' to a collection of files, then applying 'gzip' to the result. ZIP, however, handles both the compression and archiving aspects in a single tool and format. As a consequence, [gzip] is able to create file-like objects based directly on the compressed contents of a GZ file. [zipfile] needs to provide more specialized methods for navigating archive contents and for working with individual compressed file images therein. Also see Appendix B (A Data Compression Primer). ================================================================= MODULE -- gzip : Functions that read and write gzipped files ================================================================= The [gzip] module allows the treatment of the compressed data inside 'gzip' compressed files directly in a file-like manner. Uncompressed data can be read out, and compressed data written back in, all without a caller knowing or caring that the file is a GZ-compressed file. A simple example illustrates this: #---------- gzip_file.py ----------# # Treat a GZ as "just another file" import gzip, glob print "Size of data in files:" for fname in glob.glob('*'): try: if fname[-3:] == '.gz': s = gzip.open(fname).read() else: s = open(fname).read() print ' ',fname,'-',len(s),'bytes' except IOError: print 'Skipping',fname The module [gzip] is a wrapper around [zlib], with the latter performing the actual compression and decompression tasks.
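Since [zlib] does the underlying work, raw strings may also be compressed and decompressed with no file involved at all; a minimal sketch (the sample string is arbitrary): >>> import zlib >>> s = 'Mary had a little lamb ' * 100 >>> c = zlib.compress(s) >>> len(c) < len(s) # repetitive text compresses well 1 >>> zlib.decompress(c) == s # round-trip restores the original 1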
In many respects, [gzip] is similar to [mmap] and [StringIO] in emulating and/or wrapping a file object.

SEE ALSO, [mmap], [StringIO], [cStringIO]

CLASSES:

gzip.GzipFile([filename=... [,mode="rb" [,compresslevel=9 [,fileobj=...]]]])
Create a [gzip] file-like object. Such an object supports most file object operations, with the exception of '.seek()' and '.tell()'. Either the first argument 'filename' or the fourth argument 'fileobj' should be specified (likely by argument name, especially if fourth argument 'fileobj'). The second argument 'mode' takes the mode of 'fileobj' if specified, otherwise it defaults to 'rb' ('r', 'rb', 'a', 'ab', 'w', or 'wb' may be specified, with the same meanings as with the built-in `open()` function). The third argument 'compresslevel' specifies the level of compression. The default is the highest level, 9; an integer down to 1 may be selected for less compression but faster operation (the compression level of a read file comes from the file itself, however).

gzip.open(filename=... [,mode='rb' [,compresslevel=9]])
Same as `gzip.GzipFile`, but without the 'fileobj' argument. A GZ file object opened with `gzip.open` is always opened by name, not by underlying file object.

METHODS AND ATTRIBUTES:

gzip.close()
Close the [gzip] object. No access is permitted after close. If the object was opened by file object, the underlying file object is not closed, only the [gzip] interface to the file.
SEE ALSO, `FILE.close()`

gzip.flush()
Write outstanding data from memory to disk.
SEE ALSO, `FILE.flush()`

gzip.isatty()
Return 0. Compatibility method for file-like behavior.
SEE ALSO, `FILE.isatty()`

gzip.myfileobj
Attribute holding the underlying file object.

gzip.read([num])
If the first argument 'num' is specified, return a string containing the next 'num' characters. If 'num' characters are not available, return as many as possible. If 'num' is not specified, return all the characters from current file position to end of string buffer. Advance the current file position by the amount read.
SEE ALSO, `FILE.read()`

gzip.readline([length])
Return a string from the [gzip] object, starting from the current file position and going to the next newline character. The argument 'length' limits the read if specified. Advance the current file position by the amount read.
SEE ALSO, `FILE.readline()`

gzip.readlines([sizehint=...])
Return a list of strings from the [gzip] object. Each list element consists of a single line, including the trailing newline character(s). If an argument 'sizehint' is specified, read only approximately 'sizehint' characters worth of lines (full lines will always be read).
SEE ALSO, `FILE.readlines()`

gzip.write(s)
Write the first argument 's' into the [gzip] object at the current file position. The current file position is updated to the position following the write.
SEE ALSO, `FILE.write()`

gzip.writelines(list)
Write each element of 'list' into the [gzip] object at the current file position. The current file position is updated to the position following the write. Most sequence types are allowed, but 'list' must contain only strings, or a 'TypeError' will occur. Contrary to what might be expected from the method name, `gzip.writelines()` never inserts newline characters. For the list elements actually to occupy separate lines in the string buffer, each element string must already have a newline terminator. See `StringIO.StringIO.writelines()` for an example.
SEE ALSO, `FILE.writelines()`, `StringIO.StringIO.writelines()`
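Before leaving [gzip], it is worth observing that the 'fileobj' argument to `gzip.GzipFile` means the underlying "file" need not live on disk at all. A sketch of compressing into an in-memory [StringIO] buffer (the names and sample data are illustrative):

#*-------- Compressing to an in-memory buffer --------#
import gzip, StringIO
buf = StringIO.StringIO()
z = gzip.GzipFile(mode='wb', fileobj=buf)  # 'buf' stands in for a real file
z.write('spam and eggs\n' * 50)
z.close()                                  # flush compressed data into 'buf'
print len(buf.getvalue()), 'compressed bytes held in memory'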
SEE ALSO, [zlib], [zipfile]

=================================================================
MODULE -- zipfile : Read and write ZIP files
=================================================================

The [zipfile] module enables a variety of operations on ZIP files and is compatible with archives created by applications such as PKZip, Info-Zip, and WinZip. Since the ZIP format allows inclusion of multiple file images within a single archive, the [zipfile] module does not behave in a directly file-like manner as [gzip] does. Nonetheless, it is possible to view the contents of an archive, add new file images to one, create a new ZIP archive, or manipulate the contents and directory information of a ZIP file.

An initial example of working with the [zipfile] module gives a feel for its usage.

>>> for name in 'ABC':
...     open(name,'w').write(name*1000)
...
>>> import zipfile
>>> z = zipfile.ZipFile('new.zip','w',zipfile.ZIP_DEFLATED) # new archv
>>> z.write('A')                     # write files to archive
>>> z.write('B','B.newname',zipfile.ZIP_STORED)
>>> z.write('C','C.newname')
>>> z.close()                        # close the written archive
>>> z = zipfile.ZipFile('new.zip')   # reopen archive in read mode
>>> z.testzip()                      # 'None' returned means OK
>>> z.namelist()                     # What's in it?
['A', 'B.newname', 'C.newname']
>>> z.printdir()                     # details
File Name                                       Modified             Size
A                                        2001-07-18 21:39:36         1000
B.newname                                2001-07-18 21:39:36         1000
C.newname                                2001-07-18 21:39:36         1000
>>> A = z.getinfo('A')               # bind ZipInfo object
>>> B = z.getinfo('B.newname')       # bind ZipInfo object
>>> A.compress_size
11
>>> B.compress_size
1000
>>> z.read(A.filename)[:40]          # Check what's in A
'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA'
>>> z.read(B.filename)[:40]          # Check what's in B
'BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB'
>>> # For comparison, see what Info-Zip reports on created archive
>>> import os
>>> print os.popen('unzip -v new.zip').read()
Archive:  new.zip
 Length   Method    Size  Ratio   Date   Time   CRC-32     Name
 ------   ------    ----  -----   ----   ----   ------     ----
    1000  Defl:N      11   99%  07-18-01 21:39  51a02e01   A
    1000  Stored    1000    0%  07-18-01 21:39  7d9c564d   B.newname
    1000  Defl:N      11   99%  07-18-01 21:39  66778189   C.newname
 ------           ------  ----                             -------
    3000            1022   66%                             3 files

CONSTANTS:

Several string constants ([struct] formats) are used to recognize signature identifiers in the ZIP format. These constants are not normally used directly by end-users of [zipfile].

#*----- zipfile constants -----#
zipfile.stringCentralDir = 'PK\x01\x02'
zipfile.stringEndArchive = 'PK\x05\x06'
zipfile.stringFileHeader = 'PK\x03\x04'
zipfile.structCentralDir = '<4s4B4H3l5H2l'
zipfile.structEndArchive = '<4s4H2lH'
zipfile.structFileHeader = '<4s2B4H3l2H'

Symbolic names for the two supported compression methods are also defined.

#*----- zipfile constants -----#
zipfile.ZIP_STORED = 0
zipfile.ZIP_DEFLATED = 8

FUNCTIONS:

zipfile.is_zipfile(filename=...)
Check if the argument 'filename' is a valid ZIP archive. Archives with appended comments are not recognized as valid archives. Return 1 if valid, None otherwise. This function does not guarantee an archive is fully intact, but it does provide a sanity check on the file type.

CLASSES:

zipfile.PyZipFile(pathname)
Create a `zipfile.ZipFile` object that has the extra method `zipfile.ZipFile.writepy()`. This extra method allows you to recursively add all '*.py[oc]' files to an archive. This class is not general purpose, but a special feature to aid [distutils].

zipfile.ZipFile(file=... [,mode='r' [,compression=ZIP_STORED]])
Create a new `zipfile.ZipFile` object. This object is used for management of a ZIP archive. The first argument 'file' must be specified and is simply the filename of the archive to be manipulated. The second argument 'mode' may have one of three string values: 'r' to open the archive in read-only mode; 'w' to truncate any existing file and create a new archive; 'a' to read an existing archive and add to it. The third argument 'compression' indicates the compression method--ZIP_DEFLATED requires that [zlib] and the zlib system library be present.

zipfile.ZipInfo()
Create a new `zipfile.ZipInfo` object. This object contains information about an individual archived filename and its file image. Normally, one will not directly instantiate `zipfile.ZipInfo` but only look at the `zipfile.ZipInfo` objects that are returned by methods like `zipfile.ZipFile.infolist()`, `zipfile.ZipFile.getinfo()`, and `zipfile.ZipFile.NameToInfo`. However, in special cases like `zipfile.ZipFile.writestr()`, it is useful to create a `zipfile.ZipInfo` directly.
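As a sketch of that special case, one might add a file image to an archive directly from a string in memory, filling in a `zipfile.ZipInfo` by hand; the filenames and date values are merely illustrative, and the attributes set are those flagged as writestr()-relevant in the attribute list below:

#*-------- Adding an in-memory string with writestr() --------#
import zipfile
z = zipfile.ZipFile('new2.zip', 'w')       # illustrative archive name
zi = zipfile.ZipInfo()                     # directory info built by hand
zi.filename = 'from_memory.txt'
zi.date_time = (2001, 7, 18, 21, 39, 36)   # yr, month, day, hr, min, sec
zi.compress_type = zipfile.ZIP_STORED
z.writestr(zi, 'This file image never existed on disk')
z.close()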
METHODS AND ATTRIBUTES:

zipfile.ZipFile.close()
Close the `zipfile.ZipFile` object, and flush any changes made to it. An object must be explicitly closed to perform updates.

zipfile.ZipFile.getinfo(name=...)
Return the `zipfile.ZipInfo` object corresponding to the filename 'name'. If 'name' is not in the ZIP archive, a 'KeyError' is raised.

zipfile.ZipFile.infolist()
Return a list of `zipfile.ZipInfo` objects contained in the `zipfile.ZipFile` object. The return value is simply a list of instances of the same type. If the filename within the archive is known, `zipfile.ZipFile.getinfo()` is a better method to use. For enumerating over all archived files, however, `zipfile.ZipFile.infolist()` provides a nice sequence.

zipfile.ZipFile.namelist()
Return a list of the filenames of all the archived files (including nested relative directories).

zipfile.ZipFile.printdir()
Print to STDOUT a pretty summary of archived files and information about them. The results are similar to running Info-Zip's 'unzip' with the '-l' option.

zipfile.ZipFile.read(name=...)
Return the contents of the archived file with filename 'name'.

zipfile.ZipFile.testzip()
Test the integrity of the current archive. Return the filename of the first `zipfile.ZipInfo` object with corruption. If everything is valid, return None.

zipfile.ZipFile.write(filename=... [,arcname=... [,compress_type=...]])
Add the file 'filename' to the `zipfile.ZipFile` object. If the second argument 'arcname' is specified, use 'arcname' as the stored filename (otherwise, use 'filename' itself). If the third argument 'compress_type' is specified, use the indicated compression method. The current archive must be opened in 'w' or 'a' mode.

zipfile.ZipFile.writestr(zinfo=..., bytes=...)
Write the data contained in the second argument 'bytes' to the `zipfile.ZipFile` object. Directory meta-information must be contained in attributes of the first argument 'zinfo' (a filename, date, and time should be included; other information is optional). The current archive must be opened in 'w' or 'a' mode.

zipfile.ZipFile.NameToInfo
Dictionary that maps filenames in the archive to corresponding `zipfile.ZipInfo` objects. The method `zipfile.ZipFile.getinfo()` is simply a wrapper for a dictionary lookup in this attribute.
zipfile.ZipFile.compression
Compression type currently in effect for new `zipfile.ZipFile.write()` operations. Modify with due caution (most likely not at all after initialization).

zipfile.ZipFile.debug = 0
Attribute for level of debugging information sent to STDOUT. Values range from the default 0 (no output) to 3 (verbose). May be modified.

zipfile.ZipFile.filelist
List of `zipfile.ZipInfo` objects contained in the `zipfile.ZipFile` object. The method `zipfile.ZipFile.infolist()` is simply a wrapper to retrieve this attribute. Modify with due caution (most likely not at all).

zipfile.ZipFile.filename
Filename of the `zipfile.ZipFile` object. DO NOT modify!

zipfile.ZipFile.fp
Underlying file object for the `zipfile.ZipFile` object. DO NOT modify!

zipfile.ZipFile.mode
Access mode of current `zipfile.ZipFile` object. DO NOT modify!

zipfile.ZipFile.start_dir
Position of start of central directory. DO NOT modify!

zipfile.ZipInfo.CRC
Hash value of this archived file. DO NOT modify!

zipfile.ZipInfo.comment
Comment attached to this archived file. Modify with due caution (e.g., for use with `zipfile.ZipFile.writestr()`).

zipfile.ZipInfo.compress_size
Size of the compressed data of this archived file. DO NOT modify!

zipfile.ZipInfo.compress_type
Compression type used with this archived file. Modify with due caution (e.g., for use with `zipfile.ZipFile.writestr()`).

zipfile.ZipInfo.create_system
System that created this archived file. Modify with due caution (e.g., for use with `zipfile.ZipFile.writestr()`).

zipfile.ZipInfo.create_version
PKZip version that created the archive. Modify with due caution (e.g., for use with `zipfile.ZipFile.writestr()`).

zipfile.ZipInfo.date_time
Timestamp of this archived file. Modify with due caution (e.g., for use with `zipfile.ZipFile.writestr()`).

zipfile.ZipInfo.external_attr
File attribute of archived file when extracted.

zipfile.ZipInfo.extract_version
PKZip version needed to extract the archive. Modify with due caution (e.g., for use with `zipfile.ZipFile.writestr()`).

zipfile.ZipInfo.file_offset
Byte offset to start of file data. DO NOT modify!

zipfile.ZipInfo.file_size
Size of the uncompressed data in the archived file. DO NOT modify!

zipfile.ZipInfo.filename
Filename of archived file. Modify with due caution (e.g., for use with `zipfile.ZipFile.writestr()`).

zipfile.ZipInfo.header_offset
Byte offset to file header of the archived file. DO NOT modify!

zipfile.ZipInfo.volume
Volume number of the archived file. DO NOT modify!

EXCEPTIONS:

zipfile.error
Exception that is raised when a corrupt ZIP file is processed.

zipfile.BadZipfile
Alias for `zipfile.error`.

SEE ALSO, [zlib], [gzip]

=================================================================
MODULE -- zlib : Compress and decompress with zlib library
=================================================================

[zlib] is the underlying compression engine for all Python standard library compression modules. Moreover, [zlib] is extremely useful in itself for compression and decompression of data that does not necessarily live in files (or where data does not map directly to files, even if it winds up in them indirectly). The Python [zlib] module relies on the availability of the zlib system library.

There are two basic modes of operation for [zlib]. In the simplest mode, one can simply pass an uncompressed string to `zlib.compress()` and have the compressed version returned. Using `zlib.decompress()` is symmetrical.
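A quick round-trip sketches the simple mode (the sample string is arbitrary, and the exact compressed size may vary with the zlib version):

#*-------- One-shot zlib compression --------#
import zlib
s = 'Mary had a little lamb ' * 40    # repetitive text compresses well
c = zlib.compress(s)
print len(s), '->', len(c), 'bytes'
assert zlib.decompress(c) == s        # round-trip restores the exact string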
In a more complicated mode, one can create compression or decompression objects that are able to receive incremental raw or compressed byte-streams, and return partial results based on what they have seen so far. This mode of operation is similar to the way one uses `sha.sha.update()`, `md5.md5.update()`, `rotor.encryptmore()`, or `binascii.crc32()` (albeit for a different purpose from each of those). For large byte-streams that are determined incrementally, it may be more practical to utilize compression/decompression objects than it would be to compress/decompress an entire string at once (for example, if the input or result is bound to a slow channel).

CONSTANTS:

zlib.ZLIB_VERSION
The installed zlib system library version.

zlib.Z_BEST_COMPRESSION = 9
Highest compression level.

zlib.Z_BEST_SPEED = 1
Fastest compression level.

zlib.Z_HUFFMAN_ONLY = 2
Compression strategy (not a level) that uses Huffman codes only, without Lempel-Ziv string matching.

FUNCTIONS:

zlib.adler32(s [,crc])
Return the Adler-32 checksum of the first argument 's'. If the second argument 'crc' is specified, it will be used as an initial checksum. This allows partial computation of a checksum and continuation. An Adler-32 checksum can be computed much more quickly than a CRC32 checksum. Unlike [md5] or [sha], an Adler-32 checksum is not sufficient for cryptographic hashes, but merely for detection of accidental corruption of data.
SEE ALSO, `zlib.crc32()`, [md5], [sha]

zlib.compress(s [,level])
Return the zlib compressed version of the string in the first argument 's'. If the second argument 'level' is specified, the compression technique can be fine-tuned. The compression level ranges from 1 to 9 and may also be specified using symbolic constants such as Z_BEST_COMPRESSION and Z_BEST_SPEED. The default value for 'level' is 6, which is usually the desired compression level (usually within a few percent of the speed of Z_BEST_SPEED and within a few percent of the size of Z_BEST_COMPRESSION).
SEE ALSO, `zlib.decompress()`, `zlib.compressobj()`

zlib.crc32(s [,crc])
Return the CRC32 checksum of the first argument 's'. If the second argument 'crc' is specified, it will be used as an initial checksum. This allows partial computation of a checksum and continuation. Unlike [md5] or [sha], a CRC32 checksum is not sufficient for cryptographic hashes, but merely for detection of accidental corruption of data. Identical to `binascii.crc32()` (an example appears there).
SEE ALSO, `binascii.crc32()`, `zlib.adler32()`, [md5], [sha]

zlib.decompress(s [,winsize [,buffsize]])
Return the decompressed version of the zlib compressed string in the first argument 's'. If the second argument 'winsize' is specified, it determines the base 2 logarithm of the history buffer size. The default 'winsize' is 15. If the third argument 'buffsize' is specified, it determines the size of the decompression buffer. The default 'buffsize' is 16384, but more is dynamically allocated if needed. One rarely needs to use 'winsize' and 'buffsize' values other than the defaults.
SEE ALSO, `zlib.compress()`, `zlib.decompressobj()`

CLASS FACTORIES:

[zlib] does not define true classes that can be specialized. `zlib.compressobj()` and `zlib.decompressobj()` are actually factory-functions rather than classes. That is, they return instance objects, just as classes do, but they do not have unbound data and methods. For most users, the difference is not important: To get a `zlib.compressobj` or `zlib.decompressobj` object, you just call that factory-function in the same manner you would a class object.
zlib.compressobj([level])
Create a compression object. A compression object is able to incrementally compress new strings that are fed to it while maintaining the seeded symbol table from previously compressed byte-streams. If the argument 'level' is specified, the compression technique can be fine-tuned. The compression level ranges from 1 to 9. The default value for 'level' is 6, which is usually the desired compression level.
SEE ALSO, `zlib.compress()`, `zlib.decompressobj()`

zlib.decompressobj([winsize])
Create a decompression object. A decompression object is able to incrementally decompress new strings that are fed to it while maintaining the seeded symbol table from previously decompressed byte-streams. If the argument 'winsize' is specified, it determines the base 2 logarithm of the history buffer size. The default 'winsize' is 15.
SEE ALSO, `zlib.decompress()`, `zlib.compressobj()`

METHODS AND ATTRIBUTES:

zlib.compressobj.compress(s)
Add more data to the compression object. If the symbol table becomes full, compressed data is returned; otherwise, an empty string. All returned output from each repeated call to `zlib.compressobj.compress()` should be concatenated to a decompression byte-stream (either a string or a decompression object). The example below, if run in a directory with some files, lets one examine the buffering behavior of compression objects:

#---------- zlib_objs.py ----------#
# Demonstrate compression object streams
import zlib, glob
decom = zlib.decompressobj()
com = zlib.compressobj()
for fname in glob.glob('*'):
    s = open(fname).read()
    c = com.compress(s)
    print 'COMPRESSED:', len(c), 'bytes out'
    d = decom.decompress(c)
    print 'DECOMPRESS:', len(d), 'bytes out'
    print 'UNUSED DATA:', len(decom.unused_data), 'bytes'
    raw_input('-- %s (%s bytes) --' % (fname, `len(s)`))
f = com.flush()
m = decom.decompress(f)
print 'DECOMPRESS:', len(m), 'bytes out'
print 'UNUSED DATA:', len(decom.unused_data), 'bytes'

SEE ALSO, `zlib.compressobj.flush()`, `zlib.decompressobj.decompress()`, `zlib.compress()`

zlib.compressobj.flush([mode])
Flush any buffered data from the compression object. As in the example in `zlib.compressobj.compress()`, the output of a `zlib.compressobj.flush()` should be concatenated to the same decompression byte-stream as the `zlib.compressobj.compress()` calls are. If the first argument 'mode' is left empty, or the default Z_FINISH is specified, the compression object cannot be used further, and one should `del` it. Otherwise, if Z_SYNC_FLUSH or Z_FULL_FLUSH is specified, the compression object can still be used, but some uncompressed data may not be recovered by the decompression object.
SEE ALSO, `zlib.compress()`, `zlib.compressobj.compress()`

zlib.decompressobj.unused_data
As indicated, `zlib.decompressobj.unused_data` is an instance attribute rather than a method. If any partial compressed stream cannot be decompressed immediately based on the byte-stream received, the remainder is buffered in this instance attribute. Normally, any output of a compression object forms a complete decompression block, and nothing is left in this instance attribute. However, if data is received in bits over a channel, only partial decompression may be possible on a particular `zlib.decompressobj.decompress()` call.
SEE ALSO, `zlib.decompress()`, `zlib.decompressobj.decompress()`

zlib.decompressobj.decompress(s)
Return the decompressed data that may be derived from the current decompression object state and the argument 's' data passed in.
If all of 's' cannot be decompressed in this pass, the remainder is left in `zlib.decompressobj.unused_data`.

zlib.decompressobj.flush()
Return the decompressed data from any bytes buffered by the decompression object. After this call, the decompression object cannot be used further, and you should `del` it.

EXCEPTIONS:

zlib.error
Exception that is raised by compression or decompression errors.

SEE ALSO, [gzip], [zipfile]

TOPIC -- Unicode
--------------------------------------------------------------------

Note that Appendix C (Understanding Unicode) also discusses Unicode issues.

Unicode is an enhanced set of character entities, well beyond the basic 128 characters defined in ASCII encoding and the codepage-specific national language sets that contain 128 characters each. The full Unicode character set--evolving continuously, but with a large number of codepoints already fixed--can contain literally millions of distinct characters. This allows the representation of a large number of national character sets within a unified encoding space, even the large character sets of Chinese-Japanese-Korean (CJK) alphabets.

Although Unicode defines a unique codepoint for each distinct character in its range, there are numerous -encodings- that correspond to each character. The encoding called 'UTF-8' defines ASCII characters as single bytes with standard ASCII values. However, for non-ASCII characters, a variable number of bytes (up to 6) are used to encode characters, with the "escape" to Unicode being indicated by high-bit values in initial bytes of multibyte sequences. 'UTF-16' is similar, but uses either 2 or 4 bytes to encode each character (but never just 1). 'UTF-32' is a format that uses a fixed 4-byte value for each Unicode character. 'UTF-32', however, is not currently supported by Python.

Native Unicode support was added to Python 2.0. On the face of it, it is a happy situation that Python supports Unicode--it brings the world closer to multinational language support in computer applications. But in practice, you have to be careful when working with Unicode, because it is all too easy to encounter glitches like the ones below:

>>> import unicodedata
>>> alef, omega = unichr(1488), unichr(969)
>>> unicodedata.name(alef)
'HEBREW LETTER ALEF'
>>> print alef
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: ASCII encoding error: ordinal not in range(128)
>>> print chr(170)
ª
>>> if alef == chr(170): print "Hebrew is Roman diacritic"
...
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: ASCII decoding error: ordinal not in range(128)

A Unicode string that is composed of only ASCII characters, however, is considered equal (but not identical) to a Python string of the same characters.

>>> u"spam" == "spam"
1
>>> u"spam" is "spam"
0
>>> "spam" is "spam"     # string interning is not guaranteed
1
>>> u"spam" is u"spam"   # unicode interning not guaranteed
1

Still, the care you take should not discourage you from working with multilanguage strings, as Unicode enables. It is really amazingly powerful to be able to do so. As one says of a talking dog: It is not that he speaks so -well-, but that he speaks at all.
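Both glitches above arise from Python's implicit use of the default ASCII codec; being explicit about encodings sidesteps them. Continuing the session above:

>>> alef.encode('utf-8')                   # encode explicitly before output
'\xd7\x90'
>>> alef == unicode(chr(170), 'latin-1')   # compare within Unicode space
0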
=================================================================
Built-In Unicode Functions/Methods
=================================================================

The Unicode string method `u"".encode()` and the built-in function `unicode()` are inverse operations. The Unicode string method returns a plain string with the 8-bit bytes needed to represent it (using the specified or default encoding). The built-in `unicode()` takes one of these encoded strings and produces the Unicode object represented by the encoding. Specifically, suppose we define the function:

>>> chk_eq = lambda u,enc: u == unicode(u.encode(enc), enc)

The call `chk_eq(u,enc)` should return 1 for every value of 'u' and 'enc'--as long as 'enc' is a valid encoding name and 'u' is capable of being represented in that encoding.

The set of encodings supported by both built-ins is listed below. Additional encodings may be registered using the [codecs] module. Each encoding is indicated by the string that names it, and the case of the string is normalized before comparison (case-insensitive naming of encodings):

ascii, us-ascii
Encode using 7-bit ASCII.

base64
Encode Unicode strings using the base64 3-to-4 encoding format.

latin-1, iso-8859-1
Encode using common European accent characters in high-bit values of 8-bit bytes. Latin-1 characters' `ord()` values are identical to their Unicode codepoints.

quopri
Encode in quoted printable format.

rot13
Not really a Unicode encoding, but "rotate 13 chars" is included with Python 2.2+ as an example and convenience.

utf-7
Encode using a variable byte-length encoding that is restricted to 7-bit ASCII octets. As with 'utf-8', ASCII characters encode themselves.

utf-8
Encode using a variable byte-length encoding that preserves ASCII value bytes.

utf-16
Encode using a 2/4 byte encoding. Include "endian" lead bytes (platform-specific selection).

utf-16-le
Encode using a 2/4 byte encoding. Assume "little endian," and do not prepend "endian" indicator bytes.

utf-16-be
Encode using a 2/4 byte encoding. Assume "big endian," and do not prepend "endian" indicator bytes.

unicode-escape
Encode using Python-style Unicode string constants ('u"\uXXXX"').

raw-unicode-escape
Encode using Python-style Unicode raw string constants ('ur"\uXXXX"').

The error modes for both built-ins are listed below. Errors in encoding transformations may be handled in any of several ways:

strict
Raise 'UnicodeError' for all encoding/decoding errors. Default handling.

ignore
Skip all invalid characters.

replace
Replace invalid characters with '?' (string target) or 'u"\ufffd"' (Unicode target).

u"".encode([enc [,errmode]])
"".encode([enc [,errmode]])
Return an encoded string representation of a Unicode string (or of a plain string). The representation is in the style of the encoding 'enc' (or the system default). This string is suitable for writing to a file or stream that other applications will treat as Unicode data. Examples show several encodings:

>>> alef = unichr(1488)
>>> s = 'A'+alef
>>> s
u'A\u05d0'
>>> s.encode('unicode-escape')
'A\\u05d0'
>>> s.encode('utf-8')
'A\xd7\x90'
>>> s.encode('utf-16')
'\xff\xfeA\x00\xd0\x05'
>>> s.encode('utf-16-le')
'A\x00\xd0\x05'
>>> s.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: ASCII encoding error: ordinal not in range(128)
>>> s.encode('ascii','ignore')
'A'

unicode(s [,enc [,errmode]])
Return a Unicode string object corresponding to the encoded string passed in the first argument 's'. The string 's' might be a string that is read from another Unicode-aware application. The representation is treated as conforming to the style of the encoding 'enc' if the second argument is specified, or the system default otherwise (usually 'ascii'). Errors can be handled in the default 'strict' style or in a style specified in the third argument 'errmode'.
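A short session sketches the decoding direction, using byte values that match the `u"".encode()` examples above:

>>> s8 = 'A\xd7\x90'                 # UTF-8 bytes, e.g., read from a file
>>> unicode(s8, 'utf-8')
u'A\u05d0'
>>> unicode(s8, 'ascii', 'ignore')   # wrong encoding, errors ignored
u'A'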
unichr(cp)
Return a Unicode string object containing the single Unicode character whose integer codepoint is passed in the argument 'cp'.

=================================================================
MODULE -- codecs : Python Codec Registry, API, and helpers
=================================================================

The [codecs] module contains a lot of sophisticated functionality to get at the internals of Python's Unicode handling. Most of those capabilities are at a lower level than programmers who are just interested in text processing need to worry about. The documentation of this module, therefore, will break slightly with the style of most of the documentation and present only two very useful wrapper functions within the [codecs] module.

codecs.open(filename=... [,mode='rb' [,encoding=... [,errors='strict' [,buffering=1]]]])
This wrapper function provides a simple and direct means of opening a Unicode file, and treating its contents directly as Unicode. In contrast, when a file is opened with the built-in `open()` function, its contents are written and read as plain strings; reading or writing Unicode data with such a file involves multiple passes through `u"".encode()` and `unicode()`. The first argument 'filename' specifies the name of the file to access. If the second argument 'mode' is specified, the read/write mode can be selected. These arguments work identically to those used by `open()`. If the third argument 'encoding' is specified, this encoding will be used to interpret the file (an incorrect encoding will probably result in a 'UnicodeError'). Error handling may be modified by specifying the fourth argument 'errors' (the options are the same as with the built-in `unicode()` function). A fifth argument 'buffering' may be specified to use a specific buffer size (on platforms that support this).

An example of usage clarifies the difference between `codecs.open()` and the built-in `open()`:

>>> import codecs
>>> alef = unichr(1488)
>>> open('unicode_test','wb').write(('A'+alef).encode('utf-8'))
>>> open('unicode_test').read()     # Read as plain string
'A\xd7\x90'
>>> # Now read directly as Unicode
>>> codecs.open('unicode_test', encoding='utf-8').read()
u'A\u05d0'

Data written back to a file opened with `codecs.open()` should likewise be Unicode data.

SEE ALSO, `open()`

codecs.EncodedFile(file=..., data_encoding=... [,file_encoding=... [,errors='strict']])
This function allows an already opened file to be wrapped inside an "encoding translation" layer. The mode and buffering are taken from the underlying file. By specifying a second argument 'data_encoding' and a third argument 'file_encoding', it is possible to generate strings in one encoding within an application, then write them directly into the appropriate file encoding. As with `codecs.open()` and `unicode()`, an error handling style may be specified with the fourth argument 'errors'.

The most likely purpose for `codecs.EncodedFile()` is where an application is likely to receive byte-streams from multiple sources, encoded according to multiple Unicode encodings. By wrapping file objects (or file-like objects) in an encoding translation layer, the strings coming in one encoding can be transparently written to an output in the format the output expects.
An example clarifies:

>>> import codecs
>>> alef = unichr(1488)
>>> open('unicode_test','wb').write(('A'+alef).encode('utf-8'))
>>> fp = open('unicode_test','rb+')
>>> fp.read()       # Plain string w/ two-byte UTF-8 char in it
'A\xd7\x90'
>>> utf16_writer = codecs.EncodedFile(fp,'utf-16','utf-8')
>>> ascii_writer = codecs.EncodedFile(fp,'ascii','utf-8')
>>> utf16_writer.tell()     # Wrapper keeps same current position
3
>>> s = alef.encode('utf-16')
>>> s                       # Plain string as UTF-16 encoding
'\xff\xfe\xd0\x05'
>>> utf16_writer.write(s)
>>> ascii_writer.write('XYZ')
>>> fp.close()              # File should be UTF-8 encoded
>>> open('unicode_test').read()
'A\xd7\x90\xd7\x90XYZ'

SEE ALSO, `codecs.open()`

=================================================================
MODULE -- unicodedata : Database of Unicode characters
=================================================================

The module [unicodedata] is a database of Unicode character entities. Most of the functions in [unicodedata] take as an argument one Unicode character and return some information about the character contained in a plain (non-Unicode) string. The function of [unicodedata] is essentially informational, rather than transformational. Of course, an application might make decisions about the transformations performed based on the information returned by [unicodedata]. The short utility below provides all the information available for any Unicode codepoint:

#------------------ unichr_info.py ----------------------#
# Return all the information [unicodedata] has
# about the single unicode character whose codepoint
# is specified as a command-line argument.
# Arg may be any expression evaluating to an integer
from unicodedata import *
import sys
char = unichr(eval(sys.argv[1]))
print 'bidirectional', bidirectional(char)
print 'category     ', category(char)
print 'combining    ', combining(char)
print 'decimal      ', decimal(char,0)
print 'decomposition', decomposition(char)
print 'digit        ', digit(char,0)
print 'mirrored     ', mirrored(char)
print 'name         ', name(char,'NOT DEFINED')
print 'numeric      ', numeric(char,0)
try:
    print 'lookup       ', `lookup(name(char))`
except:
    print "Cannot lookup"

The usage of 'unichr_info.py' is illustrated below by runs with two possible arguments (the second is quoted so that the shell passes the expression through intact):

#*--------------- Using unichr_info.py ------------------#
% python unichr_info.py 1488
bidirectional R
category      Lo
combining     0
decimal       0
decomposition
digit         0
mirrored      0
name          HEBREW LETTER ALEF
numeric       0
lookup        u'\u05d0'

% python unichr_info.py "ord('1')"
bidirectional EN
category      Nd
combining     0
decimal       1
decomposition
digit         1
mirrored      0
name          DIGIT ONE
numeric       1.0
lookup        u'1'

For additional information on current Unicode character codepoints and attributes, consult the Unicode Consortium's character database at <http://www.unicode.org>.

FUNCTIONS:

unicodedata.bidirectional(unichr)
Return the bidirectional characteristic of the character specified in the argument 'unichr'. Possible values are AL, AN, B, BN, CS, EN, ES, ET, L, LRE, LRO, NSM, ON, PDF, R, RLE, RLO, S, and WS. Consult the URL above for details on these. Particularly notable values are L (left-to-right), R (right-to-left), and WS (whitespace).

unicodedata.category(unichr)
Return the category of the character specified in the argument 'unichr'. Possible values are Cc, Cf, Cn, Ll, Lm, Lo, Lt, Lu, Mc, Me, Mn, Nd, Nl, No, Pc, Pd, Pe, Pf, Pi, Po, Ps, Sc, Sk, Sm, So, Zl, Zp, and Zs. The first (capital) letter indicates L (letter), M (mark), N (number), P (punctuation), S (symbol), Z (separator), or C (other). The second letter is generally mnemonic within the major category of the first letter. Consult the URL above for details.
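Categories make quick work of classifying characters in text processing. For instance, a small sketch that keeps only the letters of a Unicode string (the sample string is arbitrary; results follow the standard Unicode database):

>>> from unicodedata import category
>>> s = u'spam-and-eggs #42'
>>> ''.join([c for c in s if category(c)[0] == 'L'])
u'spamandeggs'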
unicodedata.combining(unichr)
Return the numeric combining class of the character specified in the argument 'unichr'. These include values such as 218 (below left) or 210 (right attached). Consult the URL above for details.

unicodedata.decimal(unichr [,default])
Return the numeric decimal value assigned to the character specified in the argument 'unichr'. If the second argument 'default' is specified, return that if no value is assigned (otherwise raise 'ValueError').

unicodedata.decomposition(unichr)
Return the decomposition mapping of the character specified in the argument 'unichr', or an empty string if none exists. Consult the URL above for details. An example shows that some characters may be broken into component characters:

>>> from unicodedata import *
>>> name(unichr(190))
'VULGAR FRACTION THREE QUARTERS'
>>> decomposition(unichr(190))
'<fraction> 0033 2044 0034'
>>> name(unichr(0x33)), name(unichr(0x2044)), name(unichr(0x34))
('DIGIT THREE', 'FRACTION SLASH', 'DIGIT FOUR')

unicodedata.digit(unichr [,default])
Return the numeric digit value assigned to the character specified in the argument 'unichr'. If the second argument 'default' is specified, return that if no value is assigned (otherwise raise 'ValueError').

unicodedata.lookup(name)
Return the Unicode character with the name specified in the first argument 'name'. Matches must be exact, and a 'KeyError' is raised if no match is found. For example:

>>> from unicodedata import *
>>> lookup('GREEK SMALL LETTER ETA')
u'\u03b7'
>>> lookup('ETA')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
KeyError: undefined character name

SEE ALSO, `unicodedata.name()`

unicodedata.mirrored(unichr)
Return 1 if the character specified in the argument 'unichr' is a mirrored character in bidirectional text. Return 0 otherwise.

unicodedata.name(unichr)
Return the name of the character specified in the argument 'unichr'. Names are in all caps and have a regular form by descending category importance. Consult the URL above for details.
SEE ALSO, `unicodedata.lookup()`

unicodedata.numeric(unichr [,default])
Return the floating point numeric value assigned to the character specified in the argument 'unichr'. If the second argument 'default' is specified, return that if no value is assigned (otherwise raise 'ValueError').

SECTION 3 -- Solving Problems
------------------------------------------------------------------------

EXERCISE: Many ways to take out the garbage
--------------------------------------------------------------------

DISCUSSION:

Recall, if you will, the dictum in "The Zen of Python" that "There should be one--and preferably only one--obvious way to do it." As with most dictums, the real world sometimes fails our ideals. Also as with most dictums, this is not necessarily such a bad thing.

A discussion on the newsgroup 'comp.lang.python' in 2001 posed an apparently rather simple problem. The immediate problem was that one might encounter telephone numbers with a variety of dividers and delimiters inside them. For example, '(123) 456-7890', '123-456-7890', or '123/456-7890' might all represent the same telephone number, and all forms might be encountered in textual data sources (such as ones entered by users of a free-form entry field). For purposes of this problem, the canonical form of this number should be '1234567890'.
The problem mentioned here can be generalized in some natural ways: Maybe we are interested in only some of the characters within a longer text field (in this case, the digits), and the rest is simply filler. So the general problem is how to extract the content out from the filler.

The first and "obvious" approach might be a procedural loop through the initial string. One version of this approach might look like:

>>> s = '(123)/456-7890'
>>> result = ''
>>> for c in s:
...     if c in '0123456789':
...         result = result + c
...
>>> result
'1234567890'

This first approach works fine, but it might seem a bit bulky for what is, after all, basically a single action. And it might also seem odd that you need to loop through character-by-character rather than just transform the whole string.

One possibly simpler approach is to use a regular expression. For readers who have skipped to the next chapter, or who know regular expressions already, this approach seems obvious:

>>> import re
>>> s = '(123)/456-7890'
>>> re.sub(r'\D', '', s)
'1234567890'

The actual work done (excluding defining the initial string and importing the [re] module) is just one short expression. Good enough, but one catch with regular expressions is that they are frequently far slower than basic string operations. This makes no difference for the tiny example presented, but for processing megabytes, it could start to matter.

Using a functional style of programming is one way to express the "filter" in question rather tersely, and perhaps more efficiently. For example:

>>> s = '(123)/456-7890'
>>> filter(lambda c: c.isdigit(), s)
'1234567890'

We also get something short, without needing to use regular expressions. Here is another technique that utilizes string object methods and list comprehensions, and also pins some hopes on the great efficiency of Python dictionaries:

>>> isdigit = {'0':1,'1':1,'2':1,'3':1,'4':1,
...            '5':1,'6':1,'7':1,'8':1,'9':1}.has_key
>>> ''.join([x for x in s if isdigit(x)])
'1234567890'

QUESTIONS:

1. Which content extraction technique seems most natural to you? Which would you prefer to use? Explain why.

2. What intuitions do you have about the performance of these different techniques, if applied to large data sets? Are there differences in comparative efficiency of techniques between operating on one single large string input and operating on a large number of small string inputs?

3. Construct a program to verify or refute your intuitions about performance of the constructs.

4. Can you think of ways of combining these techniques to maximize efficiency? Are there any other techniques available that might be even better (hint: think about what `string.translate()` does)? Construct a faster technique, and demonstrate its efficiency. A starting sketch appears after these questions.

5. Are there reasons other than raw processing speed to prefer some of these techniques over others? Explain these reasons, if they exist.
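As a head start on Question 4, here is a sketch of the `string.translate()` approach: Build the table of characters to delete once, then let a single C-speed call do the filtering (the helper name 'keep_digits' is merely illustrative):

>>> import string
>>> ident = string.maketrans('', '')     # identity translation table
>>> nondigits = string.translate(ident, ident, string.digits)
>>> keep_digits = lambda s: string.translate(s, ident, nondigits)
>>> keep_digits('(123)/456-7890')
'1234567890'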
EXERCISE: Making sure things are what they should be
--------------------------------------------------------------------

DISCUSSION:

The concept of a "digital signature" was introduced in Section 2.2.4. As was mentioned, the Python standard library does not include (directly) any support for digital signatures. One way to characterize a digital signature is as some information that -proves- or -verifies- that some other information really is what it purports to be. But this characterization actually applies to a broader set of things than just digital signatures. In cryptology literature one is accustomed to talking about the "threat model" a crypto-system defends against. Let us look at a few threats.

Data may be altered by malicious tampering, but it may also be altered by packet loss, storage-media errors, or by program errors. The threat of accidental damage to data is the easiest threat to defend against. The standard technique is to compute a hash of the correct data and send that also. The receiver of the data can simply calculate the hash of the data herself--using the same algorithm--and compare it with the hash sent. A very simple utility like the one below does this:

#---------- crc32.py ----------#
# Calculate CRC32 hash of input files or STDIN
# Incremental read for large input sources
# Usage: python crc32.py [file1 [file2 [...]]]
#    or: python crc32.py < STDIN
import binascii
import fileinput
filelist = []
crc = binascii.crc32('')
for line in fileinput.input():
    if fileinput.isfirstline():
        if fileinput.isstdin():
            filelist.append('STDIN')
        else:
            filelist.append(fileinput.filename())
    crc = binascii.crc32(line, crc)
print 'Files:', ' '.join(filelist)
print 'CRC32:', crc

A slightly faster version could use `zlib.adler32()` instead of `binascii.crc32()`. The chance that a randomly corrupted file would have the right CRC32 hash is approximately 2**-32--unlikely enough not to worry about most times.

A CRC32 hash, however, is far too weak to be used cryptographically. While a random data error will almost surely not create a chance hash collision, a malicious tamperer--Mallory, in crypto-parlance--can find one relatively easily. Specifically, suppose the true message is M; Mallory can find an M' such that CRC32(M) equals CRC32(M'). Moreover, even imposing the condition that M' appears plausible as a message to the receiver does not make Mallory's task particularly difficult.

To thwart fraudulent messages, it is necessary to use a cryptographically strong hash, such as SHA or MD5. Doing so requires almost the same utility as above (the script is named 'sha_sum.py' rather than 'sha.py', so that it does not shadow the [sha] module itself on the import path):

#---------- sha_sum.py ----------#
# Calculate SHA hash of input files or STDIN
# Usage: python sha_sum.py [file1 [file2 [...]]]
#    or: python sha_sum.py < STDIN
import sha, fileinput, os, sys
filelist = []
hasher = sha.sha()
for line in fileinput.input():
    if fileinput.isfirstline():
        if fileinput.isstdin():
            filelist.append('STDIN')
        else:
            filelist.append(fileinput.filename())
    hasher.update(line[:-1]+os.linesep)     # same as binary read
sys.stderr.write('Files: '+' '.join(filelist)+'\nSHA: ')
print hasher.hexdigest()

An SHA or MD5 hash cannot be forged practically, but if our threat model includes a malicious tamperer, we need to worry about whether the hash itself is authentic. Mallory, our tamperer, can produce a false SHA hash that matches her false message. With CRC32 hashes, a very common procedure is to attach the hash to the data message itself--for example, as the first or last line of the data file, or within some wrapper lines. This is called an "in band" or "in channel" transmission.

One alternative is "out of band" or "off channel" transmission of cryptographic hashes. For example, a set of cryptographic hashes matching data files could be placed on a Web page. Merely transmitting the hash off channel does not guarantee security, but it does require Mallory to attack both channels effectively.

By using encryption, it is possible to transmit a secured hash in channel. The key here is to encrypt the hash and attach that encrypted version.
If the hash is appended with some identifying information before the encryption, that can be recovered to prove identity. Otherwise, one could simply include both the hash and its encrypted version. For the encryption of the hash, an asymmetrical encryption algorithm is ideal; however, with the Python standard library, the best we can do is to use the (weak) symmetrical encryption in [rotor]. For example, we could use the utility below:

#---------- hash_rotor.py ----------#
#!/usr/bin/env python
# Encrypt hash on STDIN using sys.argv[1] as password
import rotor, sys, binascii
cipher = rotor.newrotor(sys.argv[1])
hexhash = sys.stdin.read()[:-1]     # no newline
print hexhash
hash = binascii.unhexlify(hexhash)
sys.stderr.write('Encryption: ')
print binascii.hexlify(cipher.encrypt(hash))

The utilities could then be used like:

#*-------- hash_rotor at work --------#
% cat mary.txt
Mary had a little lamb
% python sha_sum.py mary.txt | python hash_rotor.py mypassword >> mary.txt
Files: mary.txt
SHA: Encryption:
% cat mary.txt
Mary had a little lamb
c49bf9a7840f6c07ab00b164413d7958e0945941
63a9d3a2f4493d957397178354f21915cb36f8f8

The penultimate line of the file now has its SHA hash, and the last line has an encryption of the hash. The password used will somehow need to be transmitted securely for the receiver to validate the appended document (obviously, the whole system makes more sense with longer and more proprietary documents than in the example).

QUESTIONS:

1. How would you wrap up the suggestions in the small utilities above into a more robust and complete "digital_signatures.py" utility or module? What concerns would come into a completed utility?

2. Why is CRC32 not suitable for cryptographic purposes? What sets SHA and MD5 apart (you should not need to know the details of the algorithms for this answer)? Why is uniformity of coverage of hash results important for any hash algorithm?

3. Explain in your own words why hashes serve to verify documents. If you were actually the malicious attacker in the scenarios above, how would you go about interfering with the crypto-systems outlined here? What lines of attack are left open by the system you sketched out or programmed in (1)?

4. If messages are subject to corruption, including accidental corruption, so are hashes. The short length of hashes may make problems in them less likely, but not impossible. How might you enhance the document verification systems above to detect corruption within a hash itself? How might you allow more accurate targeting of corrupt versus intact portions of a large document (it may be desirable to recover as much as possible from a corrupt document)?

5. Advanced: The RSA public-key algorithm is actually quite simple; it just involves some modulo exponentiation operations and some large primes. An explanation can be found, among other places, in the author's -Introduction to Cryptology Concepts II-. Try implementing an RSA public-key algorithm in Python, and use this to enrich the digital signature system you developed above.

EXERCISE: Finding needles in haystacks (full-text indexing)
--------------------------------------------------------------------

DISCUSSION:

Many texts you deal with are loosely structured and prose-like, rather than composed of well-ordered records. For documents of that sort, a very frequent question you want answered is, "What is (or isn't) in the documents?"--at a more general level than the semantic richness you might obtain by actually -reading- the documents.
In particular, you often want to check a large collection of documents to determine the (comparatively) small subset of them that are relevant to a given area of interest.

A certain category of questions about document collections has nothing much to do with text processing. For example, to locate all the files modified within a certain time period, and having a certain file size, some basic use of the [os.path] module suffices. Below is a sample utility to do such a search, which includes some typical argument parsing and help screens. The search itself is only a few lines of code:

#---------- findfile1.py ----------#
# Find files matching date and size
_usage = """
Usage: python findfile1.py [-start=days_ago] [-end=days_ago]
                           [-small=min_size] [-large=max_size] [pattern]
Example: python findfile1.py -start=10 -end=5 -small=1000 -large=5000 "*.txt"
"""
import os.path
import time
import glob
import sys

def parseargs(args):
    """Somewhat flexible argument parser for multiple platforms.

    Switches can start with - or /, keywords can end with = or :.
    No error checking for bad arguments is performed, however.
    """
    now = time.time()
    secs_in_day = 60*60*24
    start = 0             # start of epoch
    end = now             # right now
    small = 0             # empty files
    large = sys.maxint    # max file size
    pat = '*'             # match all
    for arg in args:
        if arg[0] in '-/':
            if arg[1:6]=='start':   start = now-(secs_in_day*int(arg[7:]))
            elif arg[1:4]=='end':   end = now-(secs_in_day*int(arg[5:]))
            elif arg[1:6]=='small': small = int(arg[7:])
            elif arg[1:6]=='large': large = int(arg[7:])
            elif arg[1] in 'h?':    print _usage
        else:
            pat = arg
    return (start, end, small, large, pat)

if __name__ == '__main__':
    if len(sys.argv) > 1:
        (start, end, small, large, pat) = parseargs(sys.argv[1:])
        for fname in glob.glob(pat):
            if not os.path.isfile(fname):
                continue              # don't check directories
            modtime = os.path.getmtime(fname)
            size = os.path.getsize(fname)
            if small <= size <= large and start <= modtime <= end:
                print time.ctime(modtime), '%8d ' % size, fname
    else:
        print _usage

What about searching for text inside files? The `string.find()` function is good for locating contents quickly and could be used to search files for contents. But for large document collections, hits may be common. To make sense of search results, ranking the results by number of hits can help. The utility below performs a match-accuracy ranking (for brevity, without the argument parsing of 'findfile1.py'):

#---------- findfile2.py ----------#
# Find files that contain a word
_usage = "Usage: python findfile2.py word"
import os.path
import glob
import sys

if len(sys.argv) == 2:
    search_word = sys.argv[1]
    results = []
    for fname in glob.glob('*'):
        if os.path.isfile(fname):     # don't check directories
            text = open(fname).read()
            fsize = len(text)
            hits = text.count(search_word)
            density = (fsize > 0) and float(hits)/fsize
            if density > 0:           # consider when density==0
                results.append((density, fname))
    results.sort()
    results.reverse()
    print 'RANKING  FILENAME'
    print '-------  --------------------------'
    for match in results:
        print '%6d ' % int(match[0]*1000000), match[1]
else:
    print _usage

Variations on these are, of course, possible. But generally you could build pretty sophisticated searches and rankings by adding new search options incrementally to 'findfile2.py'. For example, adding some regular expression options could give the utility capabilities similar to the 'grep' utility.

The place where a word search program like the one above falls terribly short is in speed of locating documents in -very- large document collections.
Even something as fast, and well optimized, as 'grep' simply takes a while to search a lot of source text. Fortunately, it is possible to -shortcut- this search time, as well as add some additional capabilities.

A technique for rapid searching is to perform a generic search just once (or periodically) and create an index--i.e., database--of those generic search results. Performing a later search need not -really- search contents, but only check the abstracted and structured index of possible searches. The utility 'indexer.py' is a functional example of such a computed search index. The most current version may be downloaded from the book's Web site.

The utility 'indexer.py' allows very rapid searching for the simultaneous occurrence of multiple words within a file. For example, one might want to locate all the document files (or other text sources, such as VARCHAR database fields) that contain the words 'Python', 'index', and 'search'. Supposing there are many thousands of candidate documents, searching them on an ad hoc basis could be slow. But 'indexer.py' creates a comparatively compact collection of persistent dictionaries that provide answers to such inquiries.

The full source code to 'indexer.py' is worth reading, but most of it deals with a variety of persistence mechanisms and with an object-oriented programming (OOP) framework for reuse. The underlying idea is simple, however. Create three dictionaries based on scanning a collection of documents:

#*---------- Index dictionaries ----------#
*Indexer.fileids:  fileid --> filename
*Indexer.files:    filename --> (fileid, wordcount)
*Indexer.words:    word --> {fileid1:occurs, fileid2:occurs, ...}

The essential mapping is '*Indexer.words'. For each word, what files does it occur in and how often? The mappings '*Indexer.fileids' and '*Indexer.files' are ancillary. The first just allows shorter numeric aliases to be used instead of long filenames in the '*Indexer.words' mapping (a performance boost and storage saver). The second, '*Indexer.files', also holds a total wordcount for each file. This allows a ranking of the importance of different matches. The thought is that a megabyte file with ten occurrences of 'Python' is less focused on the topic of Python than is a kilobyte file with the same ten occurrences.

Both generating and utilizing the mappings above is straightforward. To search multiple words, one simply needs the intersection of the results of several values of the '*Indexer.words' dictionary, one value for each word key. Generating the mappings involves incrementing counts in the nested dictionary of '*Indexer.words', but is not complicated. A sketch of both halves appears below.
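The sketch below shares only the idea, not the code, of 'indexer.py': It builds the word mapping from a dictionary of document texts, and intersects per-word results for a multiword query. For simplicity it skips the fileid aliasing and uses filenames directly; all names and sample data are illustrative:

#*-------- Sketch of word-index generation and search --------#
def make_index(doc_texts):          # doc_texts: {filename: text}
    words = {}
    for fname, text in doc_texts.items():
        for word in text.split():
            fids = words.setdefault(word, {})
            fids[fname] = fids.get(fname, 0) + 1
    return words                    # word --> {filename: occurrences}

def find_all(words, query):         # files containing every query word
    hits = None
    for word in query:
        occurs = words.get(word, {})
        if hits is None:
            hits = occurs.copy()
        else:                       # intersect with files seen so far
            for fname in hits.keys():
                if not occurs.has_key(fname):
                    del hits[fname]
    return hits or {}

texts = {'a.txt': 'python index search tool',
         'b.txt': 'python snake'}
print find_all(make_index(texts), ['python', 'index'])   # -> {'a.txt': 1}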
QUESTIONS:

1. One of the most significant--and surprisingly subtle--concerns in generating useful word indexes is figuring out just what a "word" is. What considerations would you bring to determine word identities? How might you handle capitalization? Punctuation? Whitespace? How might you disallow binary strings that are not "real" words? Try performing word-identification tests against real-world documents. How successful were you?

2. Could other data structures be used to store word index information than those proposed above? If other data structures are used, what efficiency (speed) advantages or disadvantages do you expect to encounter? Are there other data structures that would allow for additional search capabilities than the multiword search of 'indexer.py'? If so, what other indexed search capabilities would have the most practical benefit?

3. Consider adding integrity guarantees to index results. What if an index falls out of synchronization with the underlying documents? How might you address referential integrity? Hint: consider `binascii.crc32`, [sha], and [md5]. What changes to the data structures would be needed for integrity checks? Implement such an improvement.

4. The utility 'indexer.py' has some ad hoc exclusions of nontextual files from inclusion in an index, based simply on some file extensions. How might one perform accurate exclusion of nontextual data? What does it mean for a document to contain text? Try writing a utility 'istextual.py' that will identify text and nontext real-world documents. Does it work to your satisfaction?

5. Advanced: 'indexer.py' implements several different persistence mechanisms. What other mechanisms might you use besides those implemented? Benchmark your mechanism. Does it do better than 'SlicedZPickleIndexer' (the included variant that does best in both speed and space)?