CHAPTER II -- BASIC STRING OPERATIONS
-------------------------------------------------------------------
The cheapest, fastest and most reliable components of a
computer system are those that aren't there.
--Gordon Bell, Encore Computer Corporation
If you are writing programs in Python to accomplish text
processing tasks, most of what you need to know is in this
chapter. Sure, you will probably need to know how to do some
basic things with pipes, files, and arguments to get your text
to process (covered in Chapter 1); but for actually
-processing- the text you have gotten, the [string] module and
string methods--and Python's basic data structures--do most
all of what you need done, almost all the time. To a lesser
extent, the various custom modules to perform encodings,
encryptions, and compressions are handy to have around (and you
certainly do not want the work of implementing them yourself).
But at the heart of text processing are basic transformations of
bits of text. That's what [string] functions and string
methods do.
There are a lot of interesting techniques elsewhere in this
book. I wouldn't have written about them if I did not find
them important. But be cautious before doing interesting
things. Specifically, given a fixed task in mind, before
cracking this book open to any of the other chapters, consider
very carefully whether your problem can be solved using the
techniques in this chapter. If you can answer this question
affirmatively, you should usually eschew the complications of
using the higher-level modules and techniques that other
chapters discuss. By all means read all of this book for
the insight and edification that I hope it provides; but still
focus on the "Zen of Python," and prefer simple to complex when
simple is enough.
This chapter does several things. Section 2.1 looks at a number
of common problems in text processing that can (and should) be
solved using (predominantly) the techniques documented in this
chapter. Each of these "Problems" presents working solutions that
can often be adopted with little change to real-life jobs. But a
larger goal is to provide readers with a starting point for
adaptation of the examples. It is not my goal to provide mere
collections of packaged utilities and modules--plenty of those
exist on the Web, and resources like the Vaults of Parnassus
and the Python Cookbook are worth investigating as part of any
project/task (and new and better
utilities will be written between the time I write this and when
you read it). It is better for readers to receive a solid
foundation and starting point from which to develop the
functionality they need for their own projects and tasks. And
even better than spurring adaptation, these examples aim to
encourage contemplation. In presenting examples, this book tries
to embody a way of thinking about problems and an attitude
towards solving them. More than any individual technique, such
ideas are what I would most like to share with readers.
Section 2.2 is a "reference with commentary" on the Python
standard library modules for doing basic text manipulations. The
discussions interspersed with each module try to give some
guidance on why you would want to use a given module or function,
and the reference documentation tries to contain more examples of
actual typical usage than does a plain reference. In many cases,
the examples and discussion of individual functions address
common and productive design patterns in Python. The
cross-references are intended to contextualize a given function
(or other thing) in terms of related ones (and to help you decide
which is right for you). The actual listing of functions,
constants, classes, and the like is in alphabetical order within
type of thing.
Section 2.3 in many ways continues Section 2.1, but also provides
some aids for using this book in a learning context. The
problems and solutions presented in Section 2.3 are somewhat more
open-ended than those in Section 2.1. As well, each section
labeled as "Discussion" is followed by one labeled
"Questions." These questions are ones that could be assigned
by a teacher to students; but they are also intended to be
issues that general readers will enjoy and benefit from
contemplating. In many cases, the questions point to
limitations of the approaches initially presented, and ask
readers to think about ways to address or move beyond these
limitations--exactly what readers need to do when writing their
own custom code to accomplish outside tasks. However, each
Discussion in Section 2.3 should stand on its own, even if the
Questions are skipped over by the reader.
SECTION 1 -- Some Common Tasks
------------------------------------------------------------------------
PROBLEM: Quickly sorting lines on custom criteria
--------------------------------------------------------------------
Sorting is one of the real meat-and-potatoes algorithms of text
processing and, in fact, of most programming. Fortunately for
Python developers, the native `[].sort` method is extraordinarily
fast. Moreover, Python lists with almost any heterogeneous
objects as elements can be sorted--Python cannot rely on the
uniform arrays of a language like C. (An unfortunate exception to
this general power was introduced in recent Python versions,
where comparisons of complex numbers raise a 'TypeError', so
'[1+1j,2+2j].sort()' fails; Unicode strings in lists can cause
similar problems.)
SEE ALSO, [complex]
+++
The list sort method is wonderful when you want to sort items in
their "natural" order--or in the order that Python considers
natural, in the case of items of varying types. Unfortunately, a
lot of times, you want to sort things in "unnatural" orders. For
lines of text, in particular, any order that is not simple
alphabetization of the lines is "unnatural." But often text lines
contain meaningful bits of information in positions other than
the first character position: A last name may occur as the second
word of a list of people (for example, with first name as the
first word); an IP address may occur several fields into a server
log file; a money total may occur at position 70 of each line;
and so on. What if you want to sort lines based on this style of
meaningful order that Python doesn't quite understand?
The list sort method `[].sort()` supports an optional custom
comparison function argument. The job this function has is to
return -1 if the first thing should come first, return 0 if the
two things are equal order-wise, and return 1 if the first thing
should come second. The built-in function `cmp()` does this in a
manner identical to the default `[].sort()` (except in terms of
speed, 'lst.sort()' is much faster than 'lst.sort(cmp)'). For
short lists and quick solutions, a custom comparison function is
probably the best thing. In a lot of cases, one can even get by
with an in-line 'lambda' function as the custom comparison
function, which is a pleasant and handy idiom.
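As a minimal sketch of the comparison-function style, the lines
below sort names on their second word. The sketch is written for
Python 3, which is an assumption on my part (the listings in this
chapter target Python 2.x): there `[].sort()` no longer accepts a
comparison function directly, so `functools.cmp_to_key()` adapts
one. The names 'by_last_name' and 'people' are purely illustrative.

```python
from functools import cmp_to_key

def by_last_name(ln1, ln2):
    """Return -1, 0, or 1 by comparing the second word of each line."""
    last1, last2 = ln1.split()[1], ln2.split()[1]
    # Same contract as cmp(): negative, zero, or positive
    return (last1 > last2) - (last1 < last2)

people = ["Mary Smith", "Adam Jones", "Zoe Adams"]
people.sort(key=cmp_to_key(by_last_name))
print(people)
```

Under the Python 2.x interpreters this chapter assumes, the
equivalent call would simply be 'people.sort(by_last_name)'.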
When it comes to speed, however, use of custom comparison
functions is fairly awful. Part of the problem is Python's
function call overhead, but a lot of other factors contribute to
the slowness. Fortunately, a technique called "Schwartzian
Transforms" can make for much faster custom sorts. Schwartzian
Transforms are so named after Randal Schwartz, who proposed the
technique for working with Perl; but the technique is equally
applicable to Python.
The pattern involved in the Schwartzian Transform technique
consists of three steps (these can more precisely be called the
Guttman-Rosler Transform, which is based on the Schwartzian
Transform):
1. Transform the list in a reversible way into one that sorts
"naturally."
2. Call Python's native `[].sort()` method.
3. Reverse the transformation in (1) to restore the
original list items (in new sorted order).
The reason this technique works is that, for a list of size N,
it requires only about 2N transformation operations (one pass to
transform, one to restore), which is easy to amortize over the
O(N log N) comparisons the sort itself performs on large lists.
The sort dominates computational time, so anything that makes
each comparison cheaper is a win in the limit case (this limit
is reached quickly).
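The three steps can be sketched in a few lines. This is only an
illustration under my own assumptions: the function name
'sort_by_field' and the sample log lines are hypothetical, and
the field is compared as a plain string (so the IP addresses
below are ordered character by character, not numerically).

```python
def sort_by_field(lines, field):
    """Sort lines on the whitespace-separated word at index `field`."""
    # 1. Transform: pair each line with its sort key, key first
    decorated = [(ln.split()[field], ln) for ln in lines]
    # 2. Native sort: tuples compare element by element
    decorated.sort()
    # 3. Reverse the transformation: keep only the original line
    return [ln for (key, ln) in decorated]

log = ["GET /a 10.0.0.7", "GET /b 10.0.0.2", "PUT /c 10.0.0.5"]
print(sort_by_field(log, 2))
```

Because the original line is the second tuple element, lines
with equal keys fall back to their natural order.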
Below is an example of a simple, but plausible, custom sorting
algorithm. The sort is on the fourth and subsequent words of
a list of input lines. Lines that are shorter than four words
sort to the bottom. Running the test against a file with about
20,000 lines--about 1 megabyte--performed the Schwartzian
Transform sort in less than 2 seconds, while taking over 12
seconds for the custom comparison function sort (outputs were
verified as identical). Any number of factors will change the
exact relative timings, but a better than six times gain can
generally be expected.
#---------- schwartzian_sort.py ----------#
# Timing test for "sort on fourth word"
# Specifically, two lines >= 4 words will be sorted
# lexicographically on the 4th, 5th, etc. words.
# Any line with fewer than four words will be sorted to
# the end, and will occur in "natural" order.
import sys, string, time
wrerr = sys.stderr.write
# naive custom sort
def fourth_word(ln1,ln2):
    lst1 = string.split(ln1)
    lst2 = string.split(ln2)
    #-- Compare "long" lines
    if len(lst1) >= 4 and len(lst2) >= 4:
        return cmp(lst1[3:],lst2[3:])
    #-- Long lines before short lines
    elif len(lst1) >= 4 and len(lst2) < 4:
        return -1
    #-- Short lines after long lines
    elif len(lst1) < 4 and len(lst2) >= 4:
        return 1
    else:                   # Natural order
        return cmp(ln1,ln2)
# Don't count the read itself in the time
lines = open(sys.argv[1]).readlines()
# Time the custom comparison sort
start = time.time()
lines.sort(fourth_word)
end = time.time()
wrerr("Custom comparison func in %3.2f secs\n" % (end-start))
# open('tmp.custom','w').writelines(lines)
# Don't count the read itself in the time
lines = open(sys.argv[1]).readlines()
# Time the Schwartzian sort
start = time.time()
for n in range(len(lines)):         # Create the transform
    lst = string.split(lines[n])
    if len(lst) >= 4:               # Tuple w/ sort info first
        lines[n] = (lst[3:], lines[n])
    else:                           # Short lines to end
        lines[n] = (['\377'], lines[n])
lines.sort()                        # Native sort
for n in range(len(lines)):         # Restore original lines
    lines[n] = lines[n][1]
end = time.time()
wrerr("Schwartzian transform sort in %3.2f secs\n" % (end-start))
# open('tmp.schwartzian','w').writelines(lines)
Only one particular example is presented, but readers should be
able to generalize this technique to any sort they need to
perform frequently or on large files.
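One way to generalize, assuming an interpreter newer than the
ones this chapter was written against (Python 2.4 and later), is
the 'key' argument to `[].sort()`, which applies the same
transform-sort-restore idea internally. The sketch below is
written for Python 3; the name 'fourth_word_key' is hypothetical,
and it uses a leading 0/1 marker instead of the listing's '\377'
sentinel to push short lines to the end.

```python
def fourth_word_key(line):
    """Sort key: long lines by 4th+ words, short lines afterwards."""
    words = line.split()
    if len(words) >= 4:
        return (0, words[3:])   # long lines first, by 4th, 5th, ... word
    return (1, [line])          # short lines last, in natural order

lines = ["a b c zebra", "x y", "a b c apple pie", "m n"]
lines.sort(key=fourth_word_key)
print(lines)
```

Unlike the listing above, lines whose keys compare equal keep
their input order here rather than being ordered by the full line.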
PROBLEM: Reformatting paragraphs of text
--------------------------------------------------------------------
While I mourn the decline of plaintext ASCII as a communication
format--and its eclipse by unnecessarily complicated and large
(and often proprietary) formats--there is still plenty of life
left in text files full of prose. READMEs, HOWTOs, email,
Usenet posts, and this book itself are written in plaintext (or
at least something close enough to plaintext that generic
processing techniques are valuable). Moreover, many formats like
HTML and LaTeX are frequently enough hand-edited that their
plaintext appearance is important.
One task that is extremely common when working with prose text
files is reformatting paragraphs to conform to desired margins.
Most of the time, this task gets done within text editors, which
are indeed quite capable of performing it. However, sometimes it
would be nice to automate the formatting process. The task is
simple enough that it is slightly surprising that Python, before
version 2.3, had no standard module function to do this. Python
2.3 adds the module [textwrap], which performs more limited
reformatting than the code below. There -is- also the class
`formatter.DumbWriter`, or the possibility of inheriting from and
customizing `formatter.AbstractWriter`. These classes are
discussed in Chapter 5; but frankly, the amount of customization
and sophistication needed to use these classes and their many
methods is way out of proportion for the task at hand.
Below is a simple solution that can be used either as a
command-line tool (reading from STDIN and writing to STDOUT) or
by import to a larger application.
#---------- reformat_para.py ----------#
# Simple paragraph reformatter. Allows specification
# of left and right margins, and of justification style
# (using constants defined in module).
LEFT,RIGHT,CENTER = 'LEFT','RIGHT','CENTER'
def reformat_para(para='',left=0,right=72,just=LEFT):
    words = para.split()
    if not words:                         # Nothing to format
        return ''
    lines = []
    line = ''
    word = 0
    end_words = 0
    while not end_words:
        if len(words[word]) > right-left: # Handle very long words
            line = words[word]
            word +=1
            if word >= len(words):
                end_words = 1
        else:                             # Compose line of words
            while len(line)+len(words[word]) <= right-left:
                line += words[word]+' '
                word += 1
                if word >= len(words):
                    end_words = 1
                    break
        lines.append(line.rstrip())       # Drop the trailing space
        line = ''
    if just==CENTER:
        r, l = right, left
        return '\n'.join([' '*left+ln.center(r-l) for ln in lines])
    elif just==RIGHT:
        return '\n'.join([line.rjust(right) for line in lines])
    else: # left justify
        return '\n'.join([' '*left+line for line in lines])
if __name__=='__main__':
    import sys
    if len(sys.argv) != 4:
        print "Please specify left_margin, right_margin, justification"
    else:
        left = int(sys.argv[1])
        right = int(sys.argv[2])
        just = sys.argv[3].upper()
        # Simplistic approach to finding initial paragraphs
        for p in sys.stdin.read().split('\n\n'):
            print reformat_para(p,left,right,just),'\n'
A number of enhancements are left to readers, if needed. You
might want to allow hanging indents or indented first lines, for
example. Or paragraphs meeting certain criteria might not be
appropriate for wrapping (e.g., headers). A custom application
might also determine the input paragraphs differently, either
by a different parsing of an input file, or by generating
paragraphs internally in some manner.
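For the hanging-indent enhancement, one possible shortcut is to
lean on the [textwrap] module rather than extend the code above.
The sketch below is written for Python 3 and is not part of
'reformat_para.py'; the function name 'hanging_para' and its
default width and indent are my own assumptions.

```python
import textwrap

def hanging_para(para, width=40, indent=4):
    """Wrap para to width, indenting every line but the first."""
    return textwrap.fill(" ".join(para.split()),   # normalize whitespace
                         width=width,
                         initial_indent="",
                         subsequent_indent=" " * indent)

text = "A number of enhancements are left to readers, if needed. " * 2
print(hanging_para(text))
```

Swapping the two indent arguments gives an indented first line
instead of a hanging indent.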
PROBLEM: Column statistics for delimited or flat-record files
--------------------------------------------------------------------
Data feeds, DBMS dumps, log files, and flat-file databases all
tend to contain ontologically similar records--one per line--with
a collection of fields in each record. Usually such fields are
separated either by a specified delimiter or by specific column
positions where fields are to occur.
Parsing these structured text records is quite easy, and
performing computations on fields is equally straightforward. But
in working with a variety of such "structured text databases," it
is easy to keep writing almost the same code over again for each
variation in format and computation.
The example below provides a generic framework for every
similar computation on a structured text database.
#---------- fields_stats.py ----------#
# Perform calculations on one or more of the
# fields in a structured text database.
import operator
from types import *
from xreadlines import xreadlines # req 2.1, but is much faster...
# could use .readline() meth < 2.1
#-- Symbolic Constants
DELIMITED = 1
FLATFILE = 2
#-- Some sample "statistical" func (in functional programming style)
nillFunc = lambda lst: None
toFloat = lambda lst: map(float, lst)
avg_lst = lambda lst: reduce(operator.add, toFloat(lst))/len(lst)
sum_lst = lambda lst: reduce(operator.add, toFloat(lst))
max_lst = lambda lst: reduce(max, toFloat(lst))
class FieldStats:
    """Gather statistics about structured text database fields
    text_db may be either string (incl. Unicode) or file-like object
    style may be in (DELIMITED, FLATFILE)
    delimiter specifies the field separator in DELIMITED style text_db
    column_positions lists all field positions for FLATFILE style,
                     using one-based indexing (first column is 1).
              E.g.:  (1, 7, 40) would take fields one, two, three
                     from columns 1, 7, 40 respectively.
    field_funcs is a dictionary with column positions as keys,
                and functions on lists as values.
         E.g.:  {1:avg_lst, 4:sum_lst, 5:max_lst} would specify the
                average of column one, the sum of column 4, and the
                max of column 5. All other cols--incl 2,3, >=6--
                are ignored.
    """
    def __init__(self,
                 text_db='',
                 style=DELIMITED,
                 delimiter=',',
                 column_positions=(1,),
                 field_funcs={} ):
        self.text_db = text_db
        self.style = style
        self.delimiter = delimiter
        self.column_positions = column_positions
        self.field_funcs = field_funcs
    def calc(self):
        """Calculate the column statistics
        """
        #-- 1st, create a list of lists for data (incl. unused flds)
        used_cols = self.field_funcs.keys()
        used_cols.sort()
        # one-based column naming: column[0] is always unused
        columns = []
        for n in range(1+used_cols[-1]):
            # hint: '[[]]*num' creates refs to same list
            columns.append([])
        #-- 2nd, fill lists used for calculated fields
        # might use a string directly for text_db
        if type(self.text_db) in (StringType,UnicodeType):
            for line in self.text_db.split('\n'):
                fields = self.splitter(line)
                for col in used_cols:
                    field = fields[col-1] # zero-based index
                    columns[col].append(field)
        else: # Something file-like for text_db
            for line in xreadlines(self.text_db):
                fields = self.splitter(line)
                for col in used_cols:
                    field = fields[col-1] # zero-based index
                    columns[col].append(field)
        #-- 3rd, apply the field funcs to column lists
        results = [None] * (1+used_cols[-1])
        for col in used_cols:
            results[col] = \
                apply(self.field_funcs[col],(columns[col],))
        #-- Finally, return the result list
        return results
    def splitter(self, line):
        """Split a line into fields according to curr inst specs"""
        if self.style == DELIMITED:
            return line.split(self.delimiter)
        elif self.style == FLATFILE:
            fields = []
            # Adjust offsets to Python zero-based indexing,
            # and also add final position after the line
            num_positions = len(self.column_positions)
            offsets = [(pos-1) for pos in self.column_positions]
            offsets.append(len(line))
            for pos in range(num_positions):
                start = offsets[pos]
                end = offsets[pos+1]
                fields.append(line[start:end])
            return fields
        else:
            raise ValueError, \
                  "Text database must be DELIMITED or FLATFILE"
#-- Test data
# First Name, Last Name, Salary, Years Seniority, Department
delim = '''
Kevin,Smith,50000,5,Media Relations
Tom,Woo,30000,7,Accounting
Sally,Jones,62000,10,Management
'''.strip() # no leading/trailing newlines
# Comment     First     Last      Salary    Years  Dept
flat = '''
tech note     Kevin     Smith     50000     5      Media Relations
more filler   Tom       Woo       30000     7      Accounting
yet more...   Sally     Jones     62000     10     Management
'''.strip() # no leading/trailing newlines
#-- Run self-test code
if __name__ == '__main__':
    getdelim = FieldStats(delim, field_funcs={3:avg_lst,4:max_lst})
    print 'Delimited Calculations:'
    results = getdelim.calc()
    print '  Average salary -', results[3]
    print '  Max years worked -', results[4]
    getflat = FieldStats(flat, field_funcs={3:avg_lst,4:max_lst},
                         style=FLATFILE,
                         column_positions=(15,25,35,45,52))
    print 'Flat Calculations:'
    results = getflat.calc()
    print '  Average salary -', results[3]
    print '  Max years worked -', results[4]
The example above includes some efficiency considerations that
make it a good model for working with large data sets. In the
first place, class 'FieldStats' can (optionally) deal with a
file-like object, rather than keeping the whole structured text
database in memory. The generator `xreadlines.xreadlines()` is
an extremely fast and efficient file reader, but it requires
Python 2.1+--otherwise use `FILE.readline()` or
`FILE.readlines()` (for either memory or speed efficiency,
respectively). Moreover, only the data that is actually of
interest is collected into lists, in order to save memory.
However, rather than require multiple passes to collect
statistics on multiple fields, as many field columns and
summary functions as wanted can be used in one pass.
One possible improvement would be to allow multiple summary
functions against the same field during a pass. But that is
left as an exercise to the reader, if she desires to do it.
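For readers who want a head start on that exercise, here is one
possible shape for it. This is a standalone sketch written for
Python 3, not a patch to 'FieldStats': the function
'column_stats', the helper 'avg', and the choice to map each
column to a -list- of functions are all my own assumptions, and
only delimited records are handled.

```python
def column_stats(lines, field_funcs, delimiter=","):
    """Apply several summary functions per column in a single pass.

    field_funcs maps a one-based column number to a list of
    functions, each taking a list of floats.
    """
    columns = {col: [] for col in field_funcs}
    for line in lines:                        # one pass over the records
        fields = line.rstrip("\n").split(delimiter)
        for col in columns:
            columns[col].append(float(fields[col - 1]))
    # Each column yields one result per function, in the order given
    return {col: [func(columns[col]) for func in funcs]
            for col, funcs in field_funcs.items()}

def avg(lst):
    return sum(lst) / len(lst)

records = ["Kevin,Smith,50000,5,Media Relations",
           "Tom,Woo,30000,7,Accounting",
           "Sally,Jones,62000,10,Management"]
stats = column_stats(records, {3: [avg, max], 4: [min, max]})
print(stats)
```

Converting to floats while reading, rather than inside each
summary function, avoids repeating the conversion once per
function.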
PROBLEM: Counting characters, words, lines, and paragraphs
--------------------------------------------------------------------
There is a wonderful utility under Unix-like systems called
'wc'. What it does is so basic, and so obvious, that it is
hard to imagine working without it. 'wc' simply counts the
characters, words, and lines of files (or STDIN). A few
command-line options control which results are displayed, but I
rarely use them.
In writing this chapter, I found myself on a system without
'wc', and felt a remedy was in order. The example below is
actually an "enhanced" 'wc' since it also counts paragraphs
(but it lacks the command-line switches). Unlike the external
'wc', it is easy to use the technique directly within Python
and is available anywhere Python is. The main trick--inasmuch
as there is one--is a compact use of the `"".join()` and
`"".split()` methods (`string.join()` and `string.split()` could
also be used, for example, to be compatible with Python 1.5.2 or
below).
#---------- wc.py ----------#
# Report the chars, words, lines, paragraphs
# on STDIN or in wildcard filename patterns
import sys, glob
if len(sys.argv) > 1:
    c, w, l, p = 0, 0, 0, 0
    for pat in sys.argv[1:]:
        for file in glob.glob(pat):
            s = open(file).read()
            wc = len(s), len(s.split()), \
                 len(s.split('\n')), len(s.split('\n\n'))
            print '\t'.join(map(str, wc)),'\t'+file
            c, w, l, p = c+wc[0], w+wc[1], l+wc[2], p+wc[3]
    wc = (c,w,l,p)
    print '\t'.join(map(str, wc)), '\tTOTAL'
else:
    s = sys.stdin.read()
    wc = len(s), len(s.split()), len(s.split('\n')), \
         len(s.split('\n\n'))
    print '\t'.join(map(str, wc)), '\tSTDIN'
This little functionality could be wrapped up in a function,
but it is almost too compact to bother with doing so. Most of
the work is in the interaction with the shell environment, with
the counting basically taking only two lines.
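If you do want the function form, a sketch might look like the
following. The name 'wc' for the function and the sample string
are my own choices; the counting expressions are the same ones
used in 'wc.py', so text that ends in a newline is still
reported as having one extra (empty) line.

```python
def wc(s):
    """Return (chars, words, lines, paragraphs) for string s."""
    return (len(s), len(s.split()),
            len(s.split('\n')), len(s.split('\n\n')))

sample = "one two\nthree\n\nfour five six"
print(wc(sample))
```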
The solution above is quite likely the "one obvious way to do
it," and therefore Pythonic. On the other hand, a slightly more
adventurous reader might consider this assignment (if only for
fun):
>>> wc = map(len,[s]+map(s.split,(None,'\n','\n\n')))
A real daredevil might be able to reduce the entire program to
a single 'print' statement.
PROBLEM: Transmitting binary data as ASCII
--------------------------------------------------------------------
Many channels require that the information that travels over them
is 7-bit ASCII. Any bytes with a high-order first bit of one will
be handled unpredictably when transmitting data over protocols
like Simple Mail Transfer Protocol (SMTP), Network News
Transfer Protocol (NNTP), or HTTP (depending on content
encoding), or even just when displaying them in many standard
tools like editors. In order to encode 8-bit binary data as
ASCII, a number of techniques have been invented over time.
An obvious, but obese, encoding technique is to translate each
binary byte into its hexadecimal digits. UUencoding is an older
standard that developed around the need to transmit binary files
over the Usenet and on BBSs. Binhex is a similar technique from
the MacOS world. In recent years, base64--which is specified by
RFC1521--has edged out the other styles of encoding. Apart from
hexadecimal, which needs two ASCII bytes per binary byte, all of
the techniques are basically 4/3 encodings--that is, four ASCII
bytes are used to represent three binary bytes--but they differ
somewhat in line ending and header conventions (as well as in the
encoding as such). Quoted printable is yet another format, but of
variable encoding length. In quoted printable encoding, most
plain ASCII bytes are left unchanged, but a few special
characters and all high-bit bytes are escaped.
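To make the size differences concrete, the sketch below encodes
the same few bytes three ways. It is written for Python 3 (an
assumption; the chapter's listings are Python 2.x), where these
functions take and return byte strings. The sample data is
arbitrary.

```python
import binascii
import quopri

data = b"Python\xff"                # six ASCII bytes plus one high-bit byte

hexed = binascii.hexlify(data)      # two ASCII characters per input byte
b64 = binascii.b2a_base64(data)     # four characters per three bytes, plus newline
qp = quopri.encodestring(data)      # only the high-bit byte gets escaped

print(len(data), len(hexed), len(b64), len(qp))
print(hexed, b64, qp)
```

On mostly-ASCII input like this, quoted printable is the most
compact of the three; on arbitrary binary data the balance tips
toward base64.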
Python provides modules for all the encoding styles mentioned.
The high-level wrappers [uu], [binhex], [base64], and [quopri]
all operate on input and output file-like objects, encoding the
data therein. They also each have slightly different method names
and arguments. [binhex], for example, closes its output file
after encoding, which makes it unusable in conjunction with a
[cStringIO] file-like object. All of the high-level encoders
utilize the services of the low-level C module [binascii].
[binascii], in turn, implements the actual low-level block
conversions, but assumes that it will be passed the right size
blocks for a given encoding.
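The block-size concern can be illustrated with base64 alone. In
the sketch below (Python 3, with the hypothetical names
'b64_blocks' and 'BLOCK'), the input is cut into 57-byte pieces
before each call to `binascii.b2a_base64()`. Since 57 is a
multiple of three, padding can only occur at the very end, and
the concatenated lines still decode as one stream.

```python
import binascii

BLOCK = 57   # a multiple of 3, so no '=' padding appears mid-stream

def b64_blocks(data, size=BLOCK):
    """Encode bytes block by block, one base64 line per block."""
    return b"".join(binascii.b2a_base64(data[i:i + size])
                    for i in range(0, len(data), size))

raw = bytes(range(256)) * 2
encoded = b64_blocks(raw)
print(len(raw), len(encoded), encoded.count(b"\n"))
print(binascii.a2b_base64(encoded) == raw)   # round-trip check
```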
The standard library, therefore, does not contain quite the
right intermediate-level functionality for when the goal is
just encoding the binary data in arbitrary strings. It is easy
to wrap that up though:
#---------- encode_binary.py ----------#
# Provide encoders for arbitrary binary data
# in Python strings. Handles block size issues
# transparently, and returns a string.
# Precompression of the input string can reduce
# or eliminate any size penalty for encoding.
import sys
import zlib
import binascii
UU = 45
BASE64 = 57
BINHEX = sys.maxint
def ASCIIencode(s='', type=BASE64, compress=1):
    """ASCII encode a binary string"""
    # First, decide the encoding style
    if type == BASE64:   encode = binascii.b2a_base64
    elif type == UU:     encode = binascii.b2a_uu
    elif type == BINHEX: encode = binascii.b2a_hqx
    else: raise ValueError, "Encoding must be in UU, BASE64, BINHEX"
    # Second, compress the source if specified
    if compress: s = zlib.compress(s)
    # Third, encode the string, block-by-block
    offset = 0
    blocks = []
    while 1:
        blocks.append(encode(s[offset:offset+type]))
        offset += type
        if offset > len(s):
            break
    # Fourth, return the concatenated blocks
    return ''.join(blocks)
def ASCIIdecode(s='', type=BASE64, compress=1):
    """Decode ASCII to a binary string"""
    # First, decide the encoding style
    if type == BASE64:   s = binascii.a2b_base64(s)
    elif type == BINHEX: s = binascii.a2b_hqx(s)
    elif type == UU:
        s = ''.join([binascii.a2b_uu(line) for line in s.split('\n')])
    # Second, decompress the source if specified
    if compress: s = zlib.decompress(s)
    # Third, return the decoded binary string
    return s
# Encode/decode STDIN for self-test
if __name__ == '__main__':
    decode, TYPE = 0, BASE64
    for arg in sys.argv:
        if   arg.lower()=='-d': decode = 1
        elif arg.upper()=='UU': TYPE=UU
        elif arg.upper()=='BINHEX': TYPE=BINHEX
        elif arg.upper()=='BASE64': TYPE=BASE64
    if decode:
        print ASCIIdecode(sys.stdin.read(),type=TYPE)
    else:
        print ASCIIencode(sys.stdin.read(),type=TYPE)
The example above does not attach any headers or delimit the
encoded block (by design); for that, a wrapper like [uu],
[mimify], or [MimeWriter] is a better choice, as is a custom
wrapper around 'encode_binary.py'.
PROBLEM: Creating word or letter histograms
--------------------------------------------------------------------
A histogram is an analysis of the relative occurrence frequency
of each of a number of possible values. In terms of text
processing, the occurrences in question are almost always
either words or byte values. Creating histograms is quite
simple using Python dictionaries, but the technique is not
always immediately obvious to people thinking about it. The
example below has a good generality, provides several utility
functions associated with histograms, and can be used in a
command-line operation mode.
#---------- histogram.py ----------#
# Create occurrence counts of words or characters
# A few utility functions for presenting results
# Avoids requirement of recent Python features
from string import split, maketrans, translate, punctuation, digits
import sys
from types import *
import types
def word_histogram(source):
    """Create histogram of normalized words (no punct or digits)"""
    hist = {}
    trans = maketrans('','')
    if type(source) in (StringType,UnicodeType):  # String-like src
        for word in split(source):
            word = translate(word, trans, punctuation+digits)
            if len(word) > 0:
                hist[word] = hist.get(word,0) + 1
    elif hasattr(source,'read'):                  # File-like src
        try:
            from xreadlines import xreadlines     # Check for module
            for line in xreadlines(source):
                for word in split(line):
                    word = translate(word, trans, punctuation+digits)
                    if len(word) > 0:
                        hist[word] = hist.get(word,0) + 1
        except ImportError:                       # Older Python ver
            line = source.readline()          # Slow but mem-friendly
            while line:
                for word in split(line):
                    word = translate(word, trans, punctuation+digits)
                    if len(word) > 0:
                        hist[word] = hist.get(word,0) + 1
                line = source.readline()
    else:
        raise TypeError, \
              "source must be a string-like or file-like object"
    return hist
def char_histogram(source, sizehint=1024*1024):
    hist = {}
    if type(source) in (StringType,UnicodeType):  # String-like src
        for char in source:
            hist[char] = hist.get(char,0) + 1
    elif hasattr(source,'read'):                  # File-like src
        chunk = source.read(sizehint)
        while chunk:
            for char in chunk:
                hist[char] = hist.get(char,0) + 1
            chunk = source.read(sizehint)
    else:
        raise TypeError, \
              "source must be a string-like or file-like object"
    return hist
def most_common(hist, num=1):
    pairs = []
    for pair in hist.items():
        pairs.append((pair[1],pair[0]))
    pairs.sort()
    pairs.reverse()
    return pairs[:num]
def first_things(hist, num=1):
    pairs = []
    things = hist.keys()
    things.sort()
    for thing in things:
        pairs.append((thing,hist[thing]))
    pairs.sort()
    return pairs[:num]
if __name__ == '__main__':
    if len(sys.argv) > 1:
        hist = word_histogram(open(sys.argv[1]))
    else:
        hist = word_histogram(sys.stdin)
    print "Ten most common words:"
    for pair in most_common(hist, 10):
        print '\t', pair[1], pair[0]
    print "First ten words alphabetically:"
    for pair in first_things(hist, 10):
        print '\t', pair[0], pair[1]
    # a more practical command-line version might use:
    # for pair in most_common(hist,len(hist)):
    #     print pair[1],'\t',pair[0]
Several of the design choices are somewhat arbitrary. Words
have all their punctuation stripped to identify "real" words.
But on the other hand, words are still case-sensitive, which
may not be what is desired. The sorting functions
'first_things()' and 'most_common()' only return an initial
sublist. Perhaps it would be better to return the whole list,
and let the user slice the result. It is simple to customize
around these sorts of issues, though.
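As one example of such a customization, the sketch below folds
case before counting. It is written for Python 3 and swaps in
`collections.Counter` and the three-argument `str.maketrans()`,
neither of which appears in 'histogram.py'; the name
'folded_histogram' is hypothetical.

```python
import string
from collections import Counter

# Translation table that deletes punctuation and digits
STRIP = str.maketrans("", "", string.punctuation + string.digits)

def folded_histogram(text):
    """Count words after stripping punctuation/digits and lowercasing."""
    words = (w.translate(STRIP).lower() for w in text.split())
    return Counter(w for w in words if w)   # skip words stripped to nothing

hist = folded_histogram("The cat, the hat; THE 3 mice.")
print(hist.most_common(1))
```

A 'Counter' returns the whole ordering from '.most_common()'
when called without an argument, which leaves the slicing to the
caller.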
PROBLEM: Reading a file backwards by record, line, or paragraph
--------------------------------------------------------------------
Reading a file line by line is a common task in Python, or in
most any language. Files like server logs, configuration files,
structured text databases, and others frequently arrange
information into logical records, one per line. Very often, the
job of a program is to perform some calculation on each record
in turn.
Python provides a number of convenient methods on file-like
objects for such line-by-line reading. `FILE.readlines()`
reads a whole file at once and returns a list of lines. The
technique is very fast, but requires the whole contents of the
file be kept in memory. For very large files, this can be a
problem. `FILE.readline()` is memory-friendly--it just reads a
line at a time and can be called repeatedly until the EOF is
reached--but it is also much slower. The best solution for
recent Python versions is `xreadlines.xreadlines()` or
`FILE.xreadlines()` in Python 2.1+. These techniques are
memory-friendly, while still being fast and presenting a
"virtual list" of lines (by way of Python's new
generator/iterator interface).
The above techniques work nicely for reading a file in its
natural order, but what if you want to start at the end of a
file and work backwards from there? This need is frequently
encountered when you want to read log files that have records
appended over time (and when you want to look at the most
recent records first). It comes up in other situations also.
There is a very easy technique if memory usage is not an issue:
>>> open('lines','w').write('\n'.join([`n` for n in range(100)]))
>>> fp = open('lines')
>>> lines = fp.readlines()
>>> lines.reverse()
>>> for line in lines[1:5]:
...     # Processing suite here
...     print line,
...
98
97
96
95
For large input files, however, this technique is not feasible.
It would be nice to have something analogous to [xreadlines]
here. The example below provides a good starting point (the
example works equally well for file-like objects).
#---------- read_backwards.py ----------#
# Read blocks of a file from end to beginning.
# Blocks may be defined by any delimiter, but the
# constants LINE and PARA are useful ones.
# Works much like the file object method '.readline()':
# repeated calls continue to get "next" part, and
# function returns empty string once BOF is reached.
# Define constants
from os import linesep
LINE = linesep
PARA = linesep*2
READSIZE = 1000
# Global variables
buffer = ''
def read_backwards(fp, mode=LINE, sizehint=READSIZE, _init=[0]):
    """Read blocks of file backwards (return empty string when done)"""
    # Trick of mutable default argument to hold state between calls
    if not _init[0]:
        fp.seek(0,2)
        _init[0] = 1
    # Find a block (using global buffer)
    global buffer
    while 1:
        # first check for block in buffer
        delim = buffer.rfind(mode)
        if delim != -1:     # block is in buffer, return it
            block = buffer[delim+len(mode):]
            buffer = buffer[:delim]
            return block+mode
        #-- BOF reached, return remainder (or empty string)
        elif fp.tell()==0:
            block = buffer
            buffer = ''
            return block
        else:               # Read some more data into the buffer
            readsize = min(fp.tell(),sizehint)
            fp.seek(-readsize,1)
            buffer = fp.read(readsize) + buffer
            fp.seek(-readsize,1)
#-- Self test of read_backwards()
if __name__ == '__main__':
    # Let's create a test file to read in backwards
    fp = open('lines','wb')
    fp.write(LINE.join(['--- %d ---'%n for n in range(15)]))
    fp.close()    # make sure the data is written before rereading
    # Now open for reading backwards
    fp = open('lines','rb')
    # Read the blocks in, one per call (block==line by default)
    block = read_backwards(fp)
    while block:
        print block,
        block = read_backwards(fp)
Notice that -anything- could serve as a block delimiter. The
constants provided just happened to work for lines and block
paragraphs (and block paragraphs only with current OS's style
of line breaks). But other delimiters could be used. It would
-not- be immediately possible to read backwards word-by-word--a
space delimiter would come close, but would not be quite right
for other whitespace. However, reading a line (and maybe
reversing its words) is generally good enough.
Another enhancement is possible with Python 2.2+. Using the
new 'yield' keyword, 'read_backwards()' could be programmed as
an iterator rather than as a multi-call function. The
performance will not differ significantly, but the function
might be expressed more clearly (and a "list-like" interface
like `FILE.readlines()` makes the application's loop simpler).
QUESTIONS:
1. Write a generator-based version of 'read_backwards()' that
uses the 'yield' keyword. Modify the self-test code to
utilize the generator instead.
2. Explore and explain some pitfalls with the use of a mutable
default value as a function argument. Explain also how the
style allows functions to encapsulate data and contrast
with the encapsulation of class instances.
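One possible starting point for question 1 is sketched here. It is written for current Python 3 (an assumption; the listing above is Python 2), works on files opened in binary mode, and keeps its state in local variables rather than in a global buffer and a mutable default argument. The parameter name 'delim' and the test data are choices made for this sketch:

```python
import io
import os

def read_backwards(fp, delim=b"\n", sizehint=1000):
    """Yield blocks of a binary file, last block first."""
    fp.seek(0, os.SEEK_END)
    pos = fp.tell()          # everything before 'pos' is still unread
    buf = b""
    while True:
        cut = buf.rfind(delim)
        if cut != -1:        # a whole block is in the buffer
            block = buf[cut + len(delim):]
            buf = buf[:cut]
            yield block + delim
        elif pos == 0:       # BOF reached: emit the remainder, if any
            if buf:
                yield buf
            return
        else:                # read some more data in front of the buffer
            readsize = min(pos, sizehint)
            pos -= readsize
            fp.seek(pos)
            buf = fp.read(readsize) + buf

# Self test on an in-memory "file" with 15 lines
data = b"\n".join(("--- %d ---" % n).encode("ascii") for n in range(15))
blocks = list(read_backwards(io.BytesIO(data), sizehint=7))
```

Because each call to the generator function creates fresh state, two files can be read backwards at the same time, which the multi-call version cannot do.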
SECTION 2 -- Standard Modules
------------------------------------------------------------------------
TOPIC -- Basic String Transformations
--------------------------------------------------------------------
The module [string] forms the core of Python's text manipulation
libraries. That module is certainly the place to look before
other modules. Most of the functions in the [string] module, you
should note, are also available as methods of string objects in
Python 1.6+. Moreover, methods of string objects are a little bit
faster to use than are the corresponding module functions. A few
new methods of string objects do not have equivalents in the
[string] module, but are still documented here.
SEE ALSO, [str], [UserString]
=================================================================
MODULE -- string : A collection of string operations
=================================================================
There are a number of general things to notice about the
functions in the [string] module (which is composed entirely
of functions and constants; no classes).
1. Strings are immutable (as discussed in Chapter 1). This
means that there is no such thing as changing a string "in
place" (as we might do in many other languages, such as C,
by changing the bytes at certain offsets within the
string). Whenever a [string] module function takes a
string object as an argument, it returns a brand-new
string object and leaves the original one as is. However,
the very common pattern of binding the same name on the
left of an assignment as was passed on the right side
within the [string] module function somewhat conceals this
fact. For example:
>>> import string
>>> str = "Mary had a little lamb"
>>> str = string.replace(str, 'had', 'ate')
>>> str
'Mary ate a little lamb'
The first string object never gets modified per se; but
since the first string object is no longer bound to any
name after the example runs, the object is subject to
garbage collection and will disappear from memory. In
short, calling a [string] module function will not change
any existing strings, but rebinding a name can make it
look like they changed.
2. Many [string] module functions are now also available as
string object methods. To use these string object
methods, there is no need to import the [string] module,
and the expression is usually slightly more concise.
Moreover, using a string object method is usually slightly
faster than the corresponding [string] module function.
However, the most thorough documentation of each
function/method that exists as both a [string] module
function and a string object method is contained in this
reference to the [string] module.
3. The form 'string.join(string.split(...))' is a frequent
Python idiom. A more thorough discussion is contained in
the reference items for `string.join()` and
`string.split()`, but in general, combining these two
functions is very often a useful way of breaking down a
text, processing the parts, then putting together the
pieces.
4. Think about clever `string.replace()` patterns. By
combining multiple `string.replace()` calls with use of
"place holder" string patterns, a surprising range of
results can be achieved (especially when also manipulating
the intermediate strings with other techniques). See the
reference item for `string.replace()` for some discussion
and examples.
5. A mutable string of sorts can be obtained by using built-in
lists, or the [array] module. Lists can contain a
collection of substrings, each one of which may be replaced
or modified individually. The [array] module can define
arrays of individual characters, each position modifiable,
including with slice notation. The function `string.join()`
or the method `"".join()` may be used to re-create true
strings; for example:
>>> lst = ['spam','and','eggs']
>>> lst[2] = 'toast'
>>> print ''.join(lst)
spamandtoast
>>> print ' '.join(lst)
spam and toast
Or:
>>> import array
>>> a = array.array('c','spam and eggs')
>>> print ''.join(a)
spam and eggs
>>> a[0] = 'S'
>>> print ''.join(a)
Spam and eggs
>>> a[-4:] = array.array('c','toast')
>>> print ''.join(a)
Spam and toast
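Points 1 and 5 can be checked side by side. The following is a minimal sketch for current Python 3 (an assumption; there the [array] typecode 'c' is no longer available, so a plain list of characters is used instead). The variable names are invented for the example:

```python
original = "Mary had a little lamb"
alias = original                            # second name for the same object
original = original.replace("had", "ate")   # rebinds 'original' only

# A list of characters can be changed in place, then joined back
chars = list("spam and eggs")
chars[0] = "S"
chars[-4:] = "toast"            # slice assignment may change the length
rebuilt = "".join(chars)
```

The name 'alias' still refers to the unmodified first string, which shows that `replace()` created a new object rather than changing the old one.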
CONSTANTS:
The [string] module contains constants for a number of frequently
used collections of characters. Each of these constants is itself
simply a string (rather than a list, tuple, or other collection).
As such, it is easy to define constants alongside those provided
by the [string] module, should you need them. For example:
>>> import string
>>> string.brackets = "[]{}()<>"
>>> print string.brackets
[]{}()<>
string.digits
The decimal numerals ("0123456789").
string.hexdigits
The hexadecimal numerals ("0123456789abcdefABCDEF").
string.octdigits
The octal numerals ("01234567").
string.lowercase
The lowercase letters; can vary by language. In English
versions of Python (most systems):
>>> import string
>>> string.lowercase
'abcdefghijklmnopqrstuvwxyz'
You should not modify `string.lowercase` for a source
text language, but rather define a new attribute, such as
'string.spanish_lowercase' with an appropriate string
(some methods depend on this constant).
string.uppercase
The uppercase letters; can vary by language. In English
versions of Python (most systems):
>>> import string
>>> string.uppercase
'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
You should not modify `string.uppercase` for a source
text language, but rather define a new attribute, such as
'string.spanish_uppercase' with an appropriate string
(some methods depend on this constant).
string.letters
All the letters (string.lowercase+string.uppercase).
string.punctuation
The characters normally considered as punctuation; can
vary by language. In English versions of Python (most
systems):
>>> import string
>>> string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
string.whitespace
The "empty" characters. Normally these consist of tab,
linefeed, vertical tab, formfeed, carriage return, and
space (in that order):
>>> import string
>>> string.whitespace
'\011\012\013\014\015 '
You should not modify `string.whitespace` (some methods
depend on this constant).
string.printable
All the characters that can be printed to any device; can
vary by language
(string.digits+string.letters+string.punctuation+string.whitespace)
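Because each constant is an ordinary string, membership tests against it are a simple way to classify characters. A minimal sketch for current Python 3 follows; it uses only `string.hexdigits` and `string.punctuation`, which exist there under the same names, and the helper names 'is_hex' and 'strip_punctuation' are invented for the example:

```python
import string

def is_hex(s):
    """True if s is non-empty and every character is a hexadecimal digit."""
    return len(s) > 0 and all(c in string.hexdigits for c in s)

def strip_punctuation(s):
    """Drop every character listed in string.punctuation."""
    return "".join(c for c in s if c not in string.punctuation)

checks = (is_hex("DeadBeef"), is_hex("lamb"), is_hex(""))
cleaned = strip_punctuation("Mary had a little lamb!")
```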
FUNCTIONS:
string.atof(s=...)
Deprecated. Use `float()`.
Converts a string to a floating point value.
SEE ALSO, `eval()`, `float()`
string.atoi(s=... [,base=10])
Deprecated with Python 2.0. Use `int()` if no custom
base is needed or if using Python 2.0+.
Converts a string to an integer value (if the string
should be assumed to be in a base other than 10, the base
may be specified as the second argument).
SEE ALSO, `eval()`, `int()`, `long()`
string.atol(s=... [,base=10])
Deprecated with Python 2.0. Use `long()` if no custom
base is needed or if using Python 2.0+.
Converts a string to an unlimited length integer value
(if the string should be assumed to be in a base other
than 10, the base may be specified as the second argument).
SEE ALSO, `eval()`, `long()`, `int()`
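The built-in replacements cover the same ground, including the optional base. A short sketch, assuming current Python 3 (where `int()` also handles values of unlimited length, so no separate `long()` is needed):

```python
as_float = float("3.75")
as_int = int("37")
from_hex = int("ff", 16)        # explicit base 16
from_binary = int("1011", 2)    # explicit base 2
huge = int("123456789012345678901234567890")   # no overflow
```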
string.capitalize(s=...)
"".capitalize()
Return a string consisting of the initial character
converted to uppercase (if applicable), and all other
characters converted to lowercase (if applicable):
>>> import string
>>> string.capitalize("mary had a little lamb!")
'Mary had a little lamb!'
>>> string.capitalize("Mary had a Little Lamb!")
'Mary had a little lamb!'
>>> string.capitalize("2 Lambs had Mary!")
'2 lambs had mary!'
For Python 1.6+, use of a string object method is
marginally faster and is stylistically preferred in most
cases:
>>> "mary had a little lamb".capitalize()
'Mary had a little lamb'
SEE ALSO, `string.capwords()`, `string.lower()`
string.capwords(s=...)
"".title()
Return a string consisting of the capitalized words.
An equivalent expression is:
#*----- equivalent expression -----#
string.join(map(string.capitalize,string.split(s)))
But `string.capwords()` is a clearer way of writing it. An
effect of this implementation is that whitespace is
"normalized" by the process:
>>> import string
>>> string.capwords("mary HAD a little lamb!")
'Mary Had A Little Lamb!'
>>> string.capwords("Mary had a Little Lamb!")
'Mary Had A Little Lamb!'
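With the creation of string methods in Python 1.6, a similar
operation became available as the string method `"".title()`.
Unlike `string.capwords()`, however, `"".title()` does not
normalize whitespace, and it treats every non-letter
character as a word boundary.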
SEE ALSO, `string.capitalize()`, `string.lower()`,
`"".istitle()`
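The two spellings are close but not interchangeable. The sketch below, for current Python 3 (where `string.capwords()` still exists), runs both on a sample sentence chosen to expose the differences; the sentence itself is invented for the example:

```python
import string

text = "mary  HAD a little lamb, didn't she?"   # note the double space
by_capwords = string.capwords(text)
by_title = text.title()
```

Here 'by_capwords' collapses the double space and leaves the apostrophe alone, while 'by_title' keeps the spacing and capitalizes the letter after the apostrophe.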
string.center(s=..., width=...)
"".center(width)
Return a string with 's' padded with symmetrical leading
and trailing spaces (but not truncated) to occupy length
'width' (or more).
>>> import string
>>> string.center(width=30,s="Mary had a little lamb")
'    Mary had a little lamb    '
>>> string.center("Mary had a little lamb", 5)
'Mary had a little lamb'
For Python 1.6+, use of a string object method is
stylistically preferred in many cases:
>>> "Mary had a little lamb".center(25)
'  Mary had a little lamb '
SEE ALSO, `string.ljust()`, `string.rjust()`
string.count(s, sub [,start [,end]])
"".count(sub [,start [,end]])
Return the number of nonoverlapping occurrences of 'sub'
in 's'. If the optional third or fourth arguments are
specified only the corresponding slice of 's' is
examined.
>>> import string
>>> string.count("mary had a little lamb", "a")
4
>>> string.count("mary had a little lamb", "a", 3, 10)
2
For Python 1.6+, use of a string object method is
stylistically preferred in many cases:
>>> 'mary had a little lamb'.count("a")
4
"".endswith(suffix [,start [,end]])
This string method does not have an equivalent in the
[string] module. Return a Boolean value indicating whether
the string ends with the suffix 'suffix'. If the optional
second argument 'start' is specified, only consider the
terminal substring after offset 'start'. If the optional
third argument 'end' is given, only consider the slice
'[start:end]'.
SEE ALSO, `"".startswith()`, `string.find()`
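A short sketch of the optional arguments, assuming current Python 3 (the method behaves the same way there); the sample string is the one used throughout this reference:

```python
s = "mary had a little lamb"
ends_lamb = s.endswith("lamb")        # whole string
ends_had = s.endswith("had", 0, 8)    # only s[0:8] == 'mary had'
ends_from_5 = s.endswith("lamb", 5)   # only s[5:], which still ends in 'lamb'
too_short = s.endswith("lamb", 20)    # s[20:] == 'mb' cannot match
```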
string.expandtabs(s=... [,tabsize=8])
"".expandtabs([tabsize=8])
Return a string with tabs replaced by a variable number
of spaces. The replacement causes text blocks to line up
at "tab stops." If no second argument is given, the new
string will line up at multiples of 8 spaces. A newline
implies a new set of tab stops.
>>> import string
>>> s = 'mary\011had a little lamb'
>>> print s
mary    had a little lamb
>>> string.expandtabs(s, 16)
'mary            had a little lamb'
>>> string.expandtabs(tabsize=1, s=s)
'mary had a little lamb'
For Python 1.6+, use of a string object method is
stylistically preferred in many cases:
>>> 'mary\011had a little lamb'.expandtabs(25)
'mary                     had a little lamb'
string.find(s, sub [,start [,end]])
"".find(sub [,start [,end]])
Return the index position of the first occurrence of 'sub'
in 's'. If the optional third or fourth arguments are
specified, only the corresponding slice of 's' is examined
(but result is position in s as a whole). Return -1 if
no occurrence is found. Position is zero-based, as with
Python list indexing:
>>> import string
>>> string.find("mary had a little lamb", "a")
1
>>> string.find("mary had a little lamb", "a", 3, 10)
6
>>> string.find("mary had a little lamb", "b")
21
>>> string.find("mary had a little lamb", "b", 3, 10)
-1
For Python 1.6+, use of a string object method is
stylistically preferred in many cases:
>>> 'mary had a little lamb'.find("ad")
6
SEE ALSO, `string.index()`, `string.rfind()`
string.index(s, sub [,start [,end]])
"".index(sub [,start [,end]])
Return the same value as does `string.find()` with same
arguments, except raise 'ValueError' instead of returning
-1 when sub does not occur in s.
>>> import string
>>> string.index("mary had a little lamb", "b")
21
>>> string.index("mary had a little lamb", "b", 3, 10)
Traceback (most recent call last):
File "", line 1, in ?
File "d:/py20sl/lib/string.py", line 139, in index
return s.index(*args)
ValueError: substring not found in string.index
For Python 1.6+, use of a string object method is
stylistically preferred in many cases:
>>> 'mary had a little lamb'.index("ad")
6
SEE ALSO, `string.find()`, `string.rindex()`
Several string methods return Boolean values indicating
whether a string has a certain property. None of the '.is*()'
methods, however, have equivalents in the [string] module:
"".isalpha()
Return a true value if all the characters are alphabetic.
"".isalnum()
Return a true value if all the characters are alphanumeric.
"".isdigit()
Return a true value if all the characters are digits.
"".islower()
Return a true value if all the characters are lowercase
and there is at least one cased character:
>>> "ab123".islower(), '123'.islower(), 'Ab123'.islower()
(1, 0, 0)
SEE ALSO, `"".lower()`
"".isspace()
Return a true value if all the characters are whitespace.
"".istitle()
Return a true value if the string has title casing
(each word capitalized).
SEE ALSO, `"".title()`
"".isupper()
Return a true value if all the characters are uppercase
and there is at least one cased character.
SEE ALSO, `"".upper()`
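The '.is*()' methods are easiest to understand by contrast. The sketch below, for current Python 3 (an assumption; there the methods return 'True' and 'False' rather than 1 and 0), pairs a passing and a failing string for each test, and adds the empty string, for which every test is false:

```python
results = {
    "alpha": ("Mary".isalpha(), "Mary had".isalpha()),   # space is not a letter
    "alnum": ("lamb2".isalnum(), "lamb 2".isalnum()),
    "digit": ("1234".isdigit(), "12.34".isdigit()),
    "space": (" \t\n".isspace(), " x ".isspace()),
    "title": ("Mary Had A Lamb".istitle(), "Mary had a lamb".istitle()),
    "upper": ("MARY 123".isupper(), "123".isupper()),    # needs a cased character
}
empty = ("".isalpha(), "".isdigit(), "".isspace())
```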
string.join(words=... [,sep=" "])
"".join(words)
Return a string that results from concatenating the
elements of the list 'words' together, with 'sep' between
each. The function `string.join()` differs from all
other [string] module functions in that it takes a list
(of strings) as a primary argument, rather than a string.
It is worth noting `string.join()` and `string.split()`
are inverse functions if 'sep' is specified to both; in
other words, 'string.join(string.split(s,sep),sep)==s'
for all 's' and 'sep'.
Typically, `string.join()` is used in contexts where it
is natural to generate lists of strings. For example,
here is a small program to output the list of
all-capital words from STDIN to STDOUT, one per line:
#---------- list_capwords.py ----------#
import string,sys
capwords = []
for line in sys.stdin.readlines():
    for word in line.split():
        if word == word.upper() and word.isalpha():
            capwords.append(word)
print string.join(capwords, '\n')
The technique in the sample 'list_capwords.py' script can
be considerably more efficient than building up a string
by direct concatenation. However, Python 2.0's augmented
assignment reduces the performance difference:
>>> import string
>>> s = "Mary had a little lamb"
>>> t = "its fleece was white as snow"
>>> s = s +" "+ t # relatively "expensive" for big strings
>>> s += " " + t # "cheaper" than Python 1.x style
>>> lst = [s]
>>> lst.append(t) # "cheapest" way of building long string
>>> s = string.join(lst)
For Python 1.6+, use of a string object method is
stylistically preferred in some cases. However, just as
`string.join()` is special in taking a list as a first
argument, the string object method `"".join()` is unusual
in being an operation on the (optional) 'sep' string, not
on the (required) 'words' list (this surprises many new
Python programmers).
SEE ALSO, `string.split()`
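The inverse relationship, and its limit, can be sketched with string object methods. This assumes current Python 3, and the sample records are invented for the example:

```python
record = "mary,had,,a,little,lamb"
fields = record.split(",")              # the empty field is preserved
round_trip = ",".join(fields)           # same separator restores the original

ragged = "mary   had\ta little\n lamb"
normalized = " ".join(ragged.split())   # no 'sep': runs of whitespace collapse
```

The round trip is exact only when the same explicit 'sep' is given to both operations; with the default whitespace splitting, the original spacing is lost.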
string.joinfields(...)
Identical to `string.join()`.
string.ljust(s=..., width=...)
"".ljust(width)
Return a string with 's' padded with trailing spaces (but
not truncated) to occupy length 'width' (or more).
>>> import string
>>> string.ljust(width=30,s="Mary had a little lamb")
'Mary had a little lamb        '
>>> string.ljust("Mary had a little lamb", 5)
'Mary had a little lamb'
For Python 1.6+, use of a string object method is
stylistically preferred in many cases:
>>> "Mary had a little lamb".ljust(25)
'Mary had a little lamb   '
SEE ALSO, `string.rjust()`, `string.center()`
string.lower(s=...)
"".lower()
Return a string with any uppercase letters converted to
lowercase.
>>> import string
>>> string.lower("mary HAD a little lamb!")
'mary had a little lamb!'
>>> string.lower("Mary had a Little Lamb!")
'mary had a little lamb!'
For Python 1.6+, use of a string object method is
stylistically preferred in many cases:
>>> "Mary had a Little Lamb!".lower()
'mary had a little lamb!'
SEE ALSO, `string.upper()`
string.lstrip(s=...)
"".lstrip([chars=string.whitespace])
Return a string with leading whitespace characters
removed. For Python 1.6+, use of a string object method
is stylistically preferred in many cases:
>>> import string
>>> s = """
... Mary had a little lamb \011"""
>>> string.lstrip(s)
'Mary had a little lamb \011'
>>> s.lstrip()
'Mary had a little lamb \011'
Python 2.3+ accepts the optional argument 'chars' to the
string object method. Any leading characters that occur in
the string 'chars' are removed (instead of whitespace).
SEE ALSO, `string.rstrip()`, `string.strip()`
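The 'chars' argument is a set of characters, not a prefix to match. A minimal sketch, assuming current Python 3 (the argument works the same way there); the sample line is invented for the example:

```python
line = "  --> Mary had a little lamb <--  "
no_space = line.lstrip()         # default: leading whitespace only
no_arrow = line.lstrip(" ->")    # any leading run of ' ', '-' or '>'
tail = line.rstrip(" -<")        # same idea at the other end
```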
string.maketrans(from, to)
Return a translation table string, for use with
`string.translate()`. The strings 'from' and 'to' must
be the same length. A translation table is a string of
256 successive byte values, where each position defines a
translation from the `chr()` value of the index to the
character contained at that index position.
>>> import string
>>> ord('A')
65
>>> ord('z')
122
>>> string.maketrans('ABC','abc')[65:123]
'abcDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz'
>>> string.maketrans('ABCxyz','abcXYZ')[65:123]
'abcDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwXYZ'
SEE ALSO, `string.translate()`
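The structure of a translation table is easy to inspect directly. The sketch below assumes current Python 3, where the module function is gone and the closest equivalent is `bytes.maketrans()`, which builds the same kind of 256-entry table for byte strings:

```python
table = bytes.maketrans(b"ABC", b"abc")
table_len = len(table)                  # one entry per possible byte value
changed = table[ord("A"):ord("D")]      # entries for 'A', 'B', 'C'
unchanged = table[ord("D"):ord("G")]    # other bytes map to themselves
translated = b"MARY HAD a little LAMB".translate(table, b"Atl")
```

The last line deletes the bytes 'A', 't' and 'l' and then applies the table to what remains.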
string.replace(s=..., old=..., new=... [,maxsplit=...])
"".replace(old, new [,maxsplit])
Return a string based on 's' with occurrences of 'old'
replaced by 'new'. If the fourth argument 'maxsplit' is
specified, only replace 'maxsplit' initial occurrences.
>>> import string
>>> string.replace("Mary had a little lamb", "a little", "some")
'Mary had some lamb'
For Python 1.6+, use of a string object method is
stylistically preferred in many cases:
>>> "Mary had a little lamb".replace("a little", "some")
'Mary had some lamb'
A common "trick" involving `string.replace()` is to use
it multiple times to achieve a goal. Obviously, simply
to replace several different substrings in a string,
multiple `string.replace()` operations are almost
inevitable. But there is another class of cases where
`string.replace()` can be used to create an intermediate
string with "placeholders" for an original substring in
a particular context. The same goal can always be
achieved with regular expressions, but sometimes staged
`string.replace()` operations are both faster and easier
to program:
>>> import string
>>> line = 'variable = val # see comments #3 and #4'
>>> # we'd like '#3' and '#4' spelled out within comment
>>> string.replace(line,'#','number ') # doesn't work
'variable = val number  see comments number 3 and number 4'
>>> place_holder=string.replace(line,' # ',' !!! ') # insrt plcholder
>>> place_holder
'variable = val !!! see comments #3 and #4'
>>> place_holder=place_holder.replace('#','number ') # almost there
>>> place_holder
'variable = val !!! see comments number 3 and number 4'
>>> line = string.replace(place_holder,'!!!','#') # restore orig
>>> line
'variable = val # see comments number 3 and number 4'
Obviously, for jobs like this, a place holder must be
chosen so as not ever to occur within the strings
undergoing "staged transformation"; but that should be
possible generally since place holders may be as long as
needed.
SEE ALSO, `string.translate()`, `mx.TextTools.replace()`
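Both the count argument and the staged technique can be sketched with string object methods. This assumes current Python 3; the function name 'spell_out_numbers' and the default placeholder are hypothetical, and the placeholder is assumed never to occur in the input:

```python
path = "a-b-c-d"
first_two = path.replace("-", "+", 2)    # only the first two occurrences

def spell_out_numbers(line, placeholder="\0HASH\0"):
    """Expand '#3' to 'number 3' but keep the ' # ' comment marker.

    'placeholder' is a hypothetical marker assumed never to occur in 'line'.
    """
    staged = line.replace(" # ", placeholder)    # protect the comment marker
    staged = staged.replace("#", "number ")      # expand the remaining '#'
    return staged.replace(placeholder, " # ")    # restore the marker

fixed = spell_out_numbers("variable = val # see comments #3 and #4")
```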
string.rfind(s, sub [,start [,end]])
"".rfind(sub [,start [,end]])
Return the index position of the last occurrence of 'sub'
in 's'. If the optional third or fourth arguments are
specified only the corresponding slice of 's' is examined
(but result is position in 's' as a whole). Return -1 if
no occurrence is found. Position is zero-based, as with
Python list indexing:
>>> import string
>>> string.rfind("mary had a little lamb", "a")
19
>>> string.rfind("mary had a little lamb", "a", 3, 10)
9
>>> string.rfind("mary had a little lamb", "b")
21
>>> string.rfind("mary had a little lamb", "b", 3, 10)
-1
For Python 1.6+, use of a string object method is
stylistically preferred in many cases:
>>> 'mary had a little lamb'.rfind("ad")
6
SEE ALSO, `string.rindex()`, `string.find()`
string.rindex(s, sub [,start [,end]])
"".rindex(sub [,start [,end]])
Return the same value as does `string.rfind()` with same
arguments, except raise 'ValueError' instead of returning
-1 when sub does not occur in 's'.
>>> import string
>>> string.rindex("mary had a little lamb", "b")
21
>>> string.rindex("mary had a little lamb", "b", 3, 10)
Traceback (most recent call last):
File "", line 1, in ?
File "d:/py20sl/lib/string.py", line 148, in rindex
return s.rindex(*args)
ValueError: substring not found in string.rindex
For Python 1.6+, use of a string object method is
stylistically preferred in many cases:
>>> 'mary had a little lamb'.rindex("ad")
6
SEE ALSO, `string.rfind()`, `string.index()`
string.rjust(s=..., width=...)
"".rjust(width)
Return a string with 's' padded with leading spaces (but
not truncated) to occupy length 'width' (or more).
>>> import string
>>> string.rjust(width=30,s="Mary had a little lamb")
'        Mary had a little lamb'
>>> string.rjust("Mary had a little lamb", 5)
'Mary had a little lamb'
For Python 1.6+, use of a string object method is
stylistically preferred in many cases:
>>> "Mary had a little lamb".rjust(25)
'   Mary had a little lamb'
SEE ALSO, `string.ljust()`, `string.center()`
string.rstrip(s=...)
"".rstrip([chars=string.whitespace])
Return a string with trailing whitespace characters
removed. For Python 1.6+, use of a string object method
is stylistically preferred in many cases:
>>> import string
>>> s = """
... Mary had a little lamb \011"""
>>> string.rstrip(s)
'\012 Mary had a little lamb'
>>> s.rstrip()
'\012 Mary had a little lamb'
Python 2.3+ accepts the optional argument 'chars' to the
string object method. Any trailing characters that occur in
the string 'chars' are removed (instead of whitespace).
SEE ALSO, `string.lstrip()`, `string.strip()`
string.split(s=... [,sep=... [,maxsplit=...]])
"".split([sep [,maxsplit]])
Return a list of nonoverlapping substrings of 's'. If the
second argument 'sep' is specified, the substrings are
divided around the occurrences of 'sep'. If 'sep' is not
specified, the substrings are divided around -any-
whitespace characters. The dividing strings do not
appear in the resultant list. If the third argument
'maxsplit' is specified, everything "left over" after
splitting 'maxsplit' parts is appended to the list,
giving the list length 'maxsplit'+1.
>>> import string
>>> s = 'mary had a little lamb ...with a glass of sherry'
>>> string.split(s, ' a ')
['mary had', 'little lamb ...with', 'glass of sherry']
>>> string.split(s)
['mary', 'had', 'a', 'little', 'lamb', '...with', 'a', 'glass',
'of', 'sherry']
>>> string.split(s,maxsplit=5)
['mary', 'had', 'a', 'little', 'lamb', '...with a glass of sherry']
For Python 1.6+, use of a string object method is
stylistically preferred in many cases:
>>> "Mary had a Little Lamb!".split()
['Mary', 'had', 'a', 'Little', 'Lamb!']
The `string.split()` function (and corresponding string
object method) is surprisingly versatile for working with
texts, especially ones that resemble prose. Its default
behavior of treating all whitespace as a single divider
allows `string.split()` to act as a quick-and-dirty word
parser:
>>> wc = lambda s: len(s.split())
>>> wc("Mary had a Little Lamb")
5
>>> s = """Mary had a Little Lamb
... its fleece as white as snow.
... And everywhere that Mary went ... the lamb was sure to go."""
>>> print s
Mary had a Little Lamb
its fleece as white as snow.
And everywhere that Mary went ... the lamb was sure to go.
>>> wc(s)
23
The function `string.split()` is very often used in
conjunction with `string.join()`. The pattern involved is
"pull the string apart, modify the parts, put it back
together." Often the parts will be words, but this also
works with lines (dividing on '\n') or other chunks. For
example:
>>> import string
>>> s = """Mary had a Little Lamb
... its fleece as white as snow.
... And everywhere that Mary went ... the lamb was sure to go."""
>>> string.join(string.split(s))
'Mary had a Little Lamb its fleece as white as snow. And everywhere
... that Mary went ... the lamb was sure to go.'
A Python 1.6+ idiom for string object methods expresses
this technique compactly:
>>> "-".join(s.split())
'Mary-had-a-Little-Lamb-its-fleece-as-white-as-snow.-And-everywhere
...-that-Mary-went-...-the-lamb-was-sure-to-go.'
SEE ALSO, `string.join()`,
`mx.TextTools.setsplit()`,
`mx.TextTools.charsplit()`,
`mx.TextTools.splitat()`,
`mx.TextTools.splitlines()`
string.splitfields(...)
Identical to `string.split()`.
"".splitlines([keepends=0])
This string method does not have an equivalent in the
[string] module. Return a list of lines in the string.
The optional argument 'keepends' determines whether line
break character(s) are included in the line strings.
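A short sketch of the effect of 'keepends', assuming current Python 3 (where the argument is usually written as 'True' rather than 1); the sample text mixes Unix and DOS line endings on purpose:

```python
text = "mary had\na little\r\nlamb\n"
lines = text.splitlines()        # line breaks dropped
kept = text.splitlines(True)     # line breaks kept on each line
by_split = text.split("\n")      # contrast: '\r' survives, extra '' at end
```

The contrast with `split("\n")` shows why '.splitlines()' is the safer choice for text of unknown origin.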
"".startswith(prefix [,start [,end]])
This string method does not have an equivalent in the
[string] module. Return a Boolean value indicating whether
the string begins with the prefix 'prefix'. If the optional
second argument 'start' is specified, only consider the
terminal substring after the offset 'start'. If the
optional third argument 'end' is given, only consider the
slice '[start:end]'.
SEE ALSO, `"".endswith()`, `string.find()`
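The optional arguments work on offsets into the whole string, as this minimal sketch shows (assuming current Python 3, where the method behaves the same way):

```python
s = "mary had a little lamb"
starts_mary = s.startswith("mary")
starts_had = s.startswith("had", 5)              # test begins at offset 5
starts_little = s.startswith("little", 11, 15)   # s[11:15] == 'litt' is too short
```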
string.strip(s=...)
"".strip([chars=string.whitespace])
Return a string with leading and trailing whitespace
characters removed. For Python 1.6+, use of a string
object method is stylistically preferred in many cases:
>>> import string
>>> s = """
... Mary had a little lamb \011"""
>>> string.strip(s)
'Mary had a little lamb'
>>> s.strip()
'Mary had a little lamb'
Python 2.3+ accepts the optional argument 'chars' to the
string object method. Any leading or trailing characters
that occur in the string 'chars' are removed (instead of
whitespace).
>>> s = "MARY had a LITTLE lamb STEW"
>>> s.strip("ABCDEFGHIJKLMNOPQRSTUVWXYZ") # strip caps
' had a LITTLE lamb '
SEE ALSO, `string.rstrip()`, `string.lstrip()`
string.swapcase(s=...)
"".swapcase()
Return a string with any uppercase letters converted to
lowercase and any lowercase letters converted to uppercase.
>>> import string
>>> string.swapcase("mary HAD a little lamb!")
'MARY had A LITTLE LAMB!'
For Python 1.6+, use of a string object method is
stylistically preferred in many cases:
>>> "Mary had a Little Lamb!".swapcase()
'MARY had A LITTLE LAMB!'
SEE ALSO, `string.upper()`, `string.lower()`
string.translate(s=..., table=... [,deletechars=""])
"".translate(table [,deletechars=""])
Return a string, based on 's', with 'deletechars' deleted
(if third argument is specified) and with any remaining
characters translated according to translation 'table'.
>>> import string
>>> tab = string.maketrans('ABC','abc')
>>> string.translate('MARY HAD a little LAMB', tab, 'Atl')
'MRY HD a ie LMb'
For Python 1.6+, use of a string object method is
stylistically preferred in many cases. However, if
`string.maketrans()` is used to create the translation
table, one will need to import the [string] module
anyway:
>>> 'MARY HAD a little LAMB'.translate(tab, 'Atl')
'MRY HD a ie LMb'
The `string.translate()` function is a -very- fast way to
modify a string. Setting up the translation table takes
some getting used to, but the resultant transformation is
much faster than a procedural technique such as:
>>> (new,frm,to,dlt) = ("",'ABC','abc','Alt')
>>> for c in 'MARY HAD a little LAMB':
...     if c not in dlt:
...         pos = frm.find(c)
...         if pos == -1: new += c
...         else: new += to[pos]
...
>>> new
'MRY HD a ie LMb'
SEE ALSO, `string.maketrans()`
string.upper(s=...)
"".upper()
Return a string with any lowercase letters converted to
uppercase.
>>> import string
>>> string.upper("mary HAD a little lamb!")
'MARY HAD A LITTLE LAMB!'
>>> string.upper("Mary had a Little Lamb!")
'MARY HAD A LITTLE LAMB!'
For Python 1.6+, use of a string object method is
stylistically preferred in many cases:
>>> "Mary had a Little Lamb!".upper()
'MARY HAD A LITTLE LAMB!'
SEE ALSO, `string.lower()`
string.zfill(s=..., width=...)
Return a string with 's' padded with leading zeros (but
not truncated) to occupy length 'width' (or more). If a
leading sign is present, it "floats" to the beginning of
the return value. In general, `string.zfill()` is
designed for alignment of numeric values, but no checking
is done that a string looks number-like.
>>> import string
>>> string.zfill("this", 20)
'0000000000000000this'
>>> string.zfill("-37", 20)
'-0000000000000000037'
>>> string.zfill("+3.7", 20)
'+00000000000000003.7'
Based on the example of `string.rjust()`, one might
expect a string object method `"".zfill()`. Early versions
of string objects lack such a method; it was added only in
later releases (Python 2.2.2+).
SEE ALSO, `string.rjust()`
TOPIC -- Strings as Files, and Files as Strings
--------------------------------------------------------------------
In many ways, strings and files do a similar job. Both provide a
storage container for an unlimited amount of (textual)
information that is directly structured only by linear position
of the bytes. A first inclination is to suppose that the
difference between files and strings is one of persistence--files
hang around when the current program is no longer running. But
that distinction is not really tenable. On the one hand, standard
Python modules like [shelve], [pickle], and [marshal]--and
third-party modules like [xml_pickle] and [ZODB]--provide simple
ways of making strings persist (but not thereby correspond in any
direct way to a filesystem). On the other hand, many files are
not particularly persistent: Special files like STDIN and STDOUT
under Unix-like systems exist only for program life; other
peculiar files like '/dev/cua0' and similar "device files" are
really just streams; and even files that live on transient memory
disks, or get deleted with program cleanup, are not very
persistent.
The real difference between files and strings in Python is no
more or less than the set of techniques available to operate on
them. File objects can do things like '.read()' and '.seek()' on
themselves. Notably, file objects have a concept of a "current
position" that emulates an imaginary "read-head" passing over the
physical storage media. Strings, on the other hand, can be sliced
and indexed--for example 'str[4:10]' or 'for c in str:'--and can
be processed with string object methods and by functions of
modules like [string] and [re]. Moreover, a number of
special-purpose Python objects act "file-like" without quite
being files; for example `gzip.open()` and `urllib.urlopen()`. Of
course, Python itself does not impose any strict condition for
just how "file-like" something has to be to work in a file-like
context. A programmer has to figure that out for each type of
object she wishes to apply techniques to (but most of the time
things "just work" right).
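As a hedged illustration of this "file-like" duck typing, the
sketch below hands two different objects to one function. It
assumes a current Python 3 interpreter, where the standard [io]
module supplies the string buffer; the name 'count_lines' is
invented for the example.

```python
import io
import os
import tempfile

def count_lines(fileobj):
    """Count lines in anything that offers a .readline() method."""
    n = 0
    while fileobj.readline():
        n += 1
    return n

text = "spam\neggs\ntoast\n"

# A string buffer: no filesystem involved
from_buffer = count_lines(io.StringIO(text))

# A true file holding the same content (scratch name from tempfile)
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "w") as fh:
    fh.write(text)
with open(path) as fh:
    from_file = count_lines(fh)
os.remove(path)

print(from_buffer, from_file)
```

The function never asks what kind of object it was given; it
only relies on '.readline()' returning an empty string at the end.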
Happily, Python provides some standard modules to make files
and strings easily interoperable.
=================================================================
MODULE -- mmap : Memory-mapped file support
=================================================================
The [mmap] module allows a programmer to create "memory-mapped"
file objects. These special [mmap] objects enable most of the
techniques you might apply to "true" file objects and
simultaneously most of the techniques one might apply to "true"
strings. Keep in mind the hinted caveat about "most," however:
Many [string] module functions are implemented using the
corresponding string object methods. Since a [mmap] object is
only somewhat "string-like," it basically only implements the
'.find()' method and those "magic" methods associated with
slicing and indexing. This is enough to support most string
object idioms.
When a string-like change is made to a [mmap] object, that change
is propagated to the underlying file, and the change is
persistent (assuming the underlying file is persistent, and that
the object called '.flush()' before destruction). [mmap] thereby
provides an efficient route to "persistent strings."
Some examples of working with memory-mapped file objects are
worth looking at:
>>> # Create a file with some test data
>>> open('test','w').write(' #'.join(map(str, range(1000))))
>>> fp = open('test','r+')
>>> import mmap
>>> mm = mmap.mmap(fp.fileno(),1000)
>>> len(mm)
1000
>>> mm[-20:]
'218 #219 #220 #221 #'
>>> import string # apply a string module method
>>> mm.seek(string.find(mm, '21'))
>>> mm.read(10)
'21 #22 #23'
>>> mm.read(10) # next ten bytes
' #24 #25 #'
>>> mm.find('21') # object method to find next occurrence
402
>>> try: string.rfind(mm, '21')
... except AttributeError: print "Unsupported string function"
...
Unsupported string function
>>> import re
>>> '/'.join(re.findall('..21..',mm)) # regexes work nicely
' #21 #/#121 #/ #210 / #212 / #214 / #216 / #218 /#221 #'
It is worth emphasizing that the bytes in a file on disk are in
fixed positions. You may use the `mmap.mmap.seek()` method
to write into different portions of a file, but you cannot
expand the file from the middle, only by adding to the end.
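A minimal sketch of what "fixed positions" means in practice,
assuming a current Python 3 interpreter (where mapped content is
'bytes' rather than a string) and a scratch file created with
[tempfile]. Writing in the middle overwrites bytes in place, and
a slice assignment of a different size is refused.

```python
import mmap
import os
import tempfile

# Build a small scratch file to map
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as fh:
    fh.write(b"AAAAABBBBBCCCCC")

with open(path, "r+b") as fp:
    mm = mmap.mmap(fp.fileno(), 0)
    mm.seek(5)             # position at the first 'B'
    mm.write(b"xx")        # overwrites two bytes; nothing shifts right
    middle_write = mm[:]   # same length as before the write
    try:
        mm[5:7] = b"yyyy"  # four bytes into a two-byte slot
        grew = True
    except (IndexError, ValueError):
        grew = False       # the mapping cannot grow from the middle
    mm.close()
os.remove(path)

print(middle_write, grew)
```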
CLASSES:
mmap.mmap(fileno, length [,tagname]) (Windows)
mmap.mmap(fileno, length [,flags=MAP_SHARED,
prot=PROT_READ|PROT_WRITE]) (Unix)
Create a new memory-mapped file object. 'fileno' is the
numeric file handle to base the mapping on. Generally this
number should be obtained using the '.fileno()' method of a
file object. 'length' specifies the length of the mapping.
Under Windows, the value 0 may be given for 'length' to
specify the current length of the file. If a 'length'
smaller than the current file is specified, only the
initial portion of the file will be mapped. If a 'length'
larger than the current file is specified, the file can be
extended with additional string content.
The underlying file for a memory-mapped file object must be
opened for updating, using the "+" mode modifier.
According to the official Python documentation for Python
2.1, a third argument 'tagname' may be specified. If
it is, multiple memory-maps against the same file are
created. In practice, however, each instance of
`mmap.mmap()` creates a new memory-map whether or not a
'tagname' is specified. In any case, this allows multiple
file-like updates to the same underlying file, generally at
different positions in the file.
>>> open('test','w').write(' #'.join([str(n) for n in range(1000)]))
>>> fp = open('test','r+')
>>> import mmap
>>> mm1 = mmap.mmap(fp.fileno(),1000)
>>> mm2 = mmap.mmap(fp.fileno(),1000)
>>> mm1.seek(500)
>>> mm1.read(10)
'122 #123 #'
>>> mm2.read(10)
'0 #1 #2 #3'
Under Unix, the third argument 'flags' may be MAP_PRIVATE
or MAP_SHARED. If MAP_SHARED is specified for 'flags', all
processes mapping the file will see the changes made to a
[mmap] object. Otherwise, the changes are restricted to
the current process. The fourth argument, 'prot', may be
used to disallow certain types of access by other processes
to the mapped file regions.
METHODS:
mmap.mmap.close()
Close the memory-mapped file object. Subsequent calls to
the other methods of the [mmap] object will raise an
exception. Under Windows, the behavior of a [mmap] object
after '.close()' is somewhat erratic, however. Note that
closing the memory-mapped file object is not the same as
closing the underlying file object. Closing the underlying
file will make the contents inaccessible, but closing the
memory-mapped file object will not affect the underlying
file object.
SEE ALSO, `FILE.close()`
mmap.mmap.find(sub [,pos])
Similar to `string.find()`. Return the index position of
the first occurrence of 'sub' in the [mmap] object. If the
optional second argument 'pos' is specified, the result is
the offset returned relative to 'pos'. Return -1 if no
occurrence is found:
>>> open('test','w').write(' #'.join([str(n) for n in range(1000)]))
>>> fp = open('test','r+')
>>> import mmap
>>> mm = mmap.mmap(fp.fileno(), 0)
>>> mm.find('21')
74
>>> mm.find('21',100)
-26
>>> mm.tell()
0
SEE ALSO, `mmap.mmap.seek()`, `string.find()`
mmap.mmap.flush([offset, size])
Write changes made in memory to the [mmap] object back to
disk. The first argument 'offset' and second argument 'size'
must either both be specified or both omitted. If 'offset'
and 'size' are specified, only the region starting at
'offset' of length 'size' will be written back to disk.
`mmap.mmap.flush()` is necessary to guarantee that changes
are written to disk; however, no guarantee is given that
changes -will not- be written to disk as part of normal
Python interpreter housekeeping. [mmap] should not be used
for systems with "cancelable" changes (since changes may
not be cancelable).
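The following is a sketch only, assuming a current Python 3
interpreter and a scratch file from [tempfile]. It makes a
string-like change through a slice, calls '.flush()', and then
reads the file back through an ordinary file object.

```python
import mmap
import os
import tempfile

fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as fh:
    fh.write(b"X" * 20)

with open(path, "r+b") as fp:
    mm = mmap.mmap(fp.fileno(), 0)
    mm[0:5] = b"HELLO"   # change made through the mapping
    mm.flush()           # request that the change reach the disk
    mm.close()

# Reopen the file normally to see what was stored
with open(path, "rb") as fh:
    on_disk = fh.read()
os.remove(path)

print(on_disk)
```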
SEE ALSO, `FILE.flush()`
mmap.mmap.move(target, source, length)
Copy a substring within a memory-mapped file object. The
length of the substring is the third argument 'length'. The
target location is the first argument 'target'. The
substring is copied from the position 'source'. It is
allowable to have the substring's original position overlap
its target range, but it must not go past the last position
of the [mmap] object.
>>> open('test','w').write(''.join([c*10 for c in 'ABCDE']))
>>> fp = open('test','r+')
>>> import mmap
>>> mm = mmap.mmap(fp.fileno(),0)
>>> mm[:]
'AAAAAAAAAABBBBBBBBBBCCCCCCCCCCDDDDDDDDDDEEEEEEEEEE'
>>> mm.move(40,0,5)
>>> mm[:]
'AAAAAAAAAABBBBBBBBBBCCCCCCCCCCDDDDDDDDDDAAAAAEEEEE'
mmap.mmap.read(num)
Return a string containing 'num' bytes, starting at the
current file position. The file position is moved to the
end of the read string. In contrast to the '.read()'
method of file objects, `mmap.mmap.read()` always requires
that a byte count be specified, which makes a memory-map
file object not fully substitutable for a file object when
data is read. However, the following is safe for both true
file objects and memory-mapped file objects:
>>> open('test','w').write(' #'.join([str(n) for n in range(1000)]))
>>> fp = open('test','r+')
>>> import mmap
>>> mm = mmap.mmap(fp.fileno(),0)
>>> def safe_readall(file):
... try:
... length = len(file)
... return file.read(length)
... except TypeError:
... return file.read()
...
>>> s1 = safe_readall(fp)
>>> s2 = safe_readall(mm)
>>> s1 == s2
1
SEE ALSO, `mmap.mmap.read_byte()`, `mmap.mmap.readline()`,
`mmap.mmap.write()`, `FILE.read()`
mmap.mmap.read_byte()
Return a one-byte string from the current file position
and advance the current position by one. Same as
'mmap.mmap.read(1)'.
SEE ALSO, `mmap.mmap.read()`, `mmap.mmap.readline()`
mmap.mmap.readline()
Return a string from the memory-mapped file object,
starting from the current file position and going to the
next newline character. Advance the current file position
by the amount read.
SEE ALSO, `mmap.mmap.read()`, `mmap.mmap.read_byte()`,
`FILE.readline()`
mmap.mmap.resize(newsize)
Change the size of a memory-mapped file object. This may
be used to expand the size of an underlying file or merely
to expand the area of a file that is memory-mapped. An
expanded file is padded with null bytes ('\000') unless
otherwise filled with content. As with other operations on
[mmap] objects, changes to the underlying file system may
not occur until a '.flush()' is performed.
SEE ALSO, `mmap.mmap.flush()`
mmap.mmap.seek(offset [,mode])
Change the current file position. If a second argument
'mode' is given, a different seek mode can be selected.
The default is 0, absolute file positioning. Mode 1 seeks
relative to the current file position. Mode 2 is relative
to the end of the memory-mapped file (which may be smaller
than the whole size of the underlying file). The first
argument 'offset' specifies the distance to move the
current file position--in mode 0 it should be positive, in
mode 2 it should be negative, in mode 1 the current
position can be moved either forward or backward.
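The three seek modes can be sketched as follows, assuming a
current Python 3 interpreter and a ten-byte scratch file.
Positions are checked with '.tell()' after every move.

```python
import mmap
import os
import tempfile

fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as fh:
    fh.write(b"0123456789")

with open(path, "r+b") as fp:
    mm = mmap.mmap(fp.fileno(), 0)
    positions = []
    mm.seek(4)        # mode 0 (default): absolute position
    positions.append(mm.tell())
    mm.seek(3, 1)     # mode 1: forward from the current position
    positions.append(mm.tell())
    mm.seek(-2, 1)    # mode 1: backward from the current position
    positions.append(mm.tell())
    mm.seek(-1, 2)    # mode 2: counted back from the end of the map
    positions.append(mm.tell())
    last_byte = mm.read(1)
    mm.close()
os.remove(path)

print(positions, last_byte)
```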
SEE ALSO, `FILE.seek()`
mmap.mmap.size()
Return the length of the underlying file. The size of the
actual memory-map may be smaller if less than the whole
file is mapped:
>>> open('test','w').write('X'*100)
>>> fp = open('test','r+')
>>> import mmap
>>> mm = mmap.mmap(fp.fileno(),50)
>>> mm.size()
100
>>> len(mm)
50
SEE ALSO, `len()`, `mmap.mmap.seek()`, `mmap.mmap.tell()`
mmap.mmap.tell()
Return the current file position.
>>> open('test','w').write('X'*100)
>>> fp = open('test','r+')
>>> import mmap
>>> mm = mmap.mmap(fp.fileno(), 0)
>>> mm.tell()
0
>>> mm.seek(20)
>>> mm.tell()
20
>>> mm.read(20)
'XXXXXXXXXXXXXXXXXXXX'
>>> mm.tell()
40
SEE ALSO, `FILE.tell()`, `mmap.mmap.seek()`
mmap.mmap.write(s)
Write 's' into the memory-mapped file object at the current
file position. The current file position is updated to the
position following the write. The method
`mmap.mmap.write()` is useful for functions that expect to
be passed a file-like object with a '.write()' method.
However, for new code, it is generally more natural to use
the string-like index and slice operations to write
contents. For example:
>>> open('test','w').write('X'*50)
>>> fp = open('test','r+')
>>> import mmap
>>> mm = mmap.mmap(fp.fileno(), 0)
>>> mm.write('AAAAA')
>>> mm.seek(10)
>>> mm.write('BBBBB')
>>> mm[30:35] = 'SSSSS'
>>> mm[:]
'AAAAAXXXXXBBBBBXXXXXXXXXXXXXXXSSSSSXXXXXXXXXXXXXXX'
>>> mm.tell()
15
SEE ALSO, `FILE.write()`, `mmap.mmap.read()`
mmap.mmap.write_byte(c)
Write a one-byte string to the current file position,
and advance the current position by one. Same as
'mmap.mmap.write(c)' where 'c' is a one-byte string.
SEE ALSO, `mmap.mmap.write()`
=================================================================
MODULE -- StringIO : File-like objects that read from or
write to a string buffer
=================================================================
MODULE -- cStringIO : Fast, but incomplete, StringIO
replacement
=================================================================
The [StringIO] and [cStringIO] modules allow a programmer to
create "memory files," that is, "string buffers." These special
[StringIO] objects enable most of the techniques you might apply
to "true" file objects, but without any connection to a
filesystem.
The most common use of string buffer objects is when some
existing techniques for working with byte-streams in files are to
be applied to strings that do not come from files. A string
buffer object behaves in a file-like manner and can "drop in" to
most functions that want file objects.
[cStringIO] is much faster than [StringIO] and should be used in
most cases. Both modules provide a 'StringIO' class whose
instances are the string buffer objects. `cStringIO.StringIO`
cannot be subclassed (and therefore cannot provide additional
methods), and it cannot handle Unicode strings. One rarely needs
to subclass [StringIO], but the absence of Unicode support in
[cStringIO] could be a problem for many developers. As well,
a [cStringIO] buffer initialized with a string does not support
write operations, which makes its string buffers less general
(the effect of a write against an in-memory file can be
accomplished by normal string operations).
A string buffer object may be initialized with a string (or
Unicode for [StringIO]) argument. If so, that is the initial
content of the buffer. Below are examples of usage (including
Unicode handling):
>>> from cStringIO import StringIO as CSIO
>>> from StringIO import StringIO as SIO
>>> alef, omega = unichr(1488), unichr(969)
>>> sentence = "In set theory, the Greek "+omega+" represents the \n"+\
... "ordinal limit of the integers, while the Hebrew \n"+\
... alef+" represents their cardinality."
>>> sio = SIO(sentence)
>>> try:
... csio = CSIO(sentence)
... print "New string buffer from raw string"
... except TypeError:
... csio = CSIO(sentence.encode('utf-8'))
... print "New string buffer from ENCODED string"
...
New string buffer from ENCODED string
>>> sio.getvalue() == unicode(csio.getvalue(),'utf-8')
1
>>> try:
... sio.getvalue() == csio.getvalue()
... except UnicodeError:
... print "Cannot even compare Unicode with string, in general"
...
Cannot even compare Unicode with string, in general
>>> lines = csio.readlines()
>>> len(lines)
3
>>> sio.seek(0)
>>> print sio.readline().encode('utf-8'),
In set theory, the Greek ω represents the ordinal
>>> sio.tell(), csio.tell()
(51, 124)
CONSTANTS:
cStringIO.InputType
The type of a `cStringIO.StringIO` instance that has been
opened in "read" mode. All `StringIO.StringIO` instances
are simply InstanceType.
SEE ALSO, `cStringIO.StringIO`
cStringIO.OutputType
The type of `cStringIO.StringIO` instance that has been
opened in "write" mode (actually read/write). All
`StringIO.StringIO` instances are simply InstanceType.
SEE ALSO, `cStringIO.StringIO`
CLASSES:
StringIO.StringIO([buf=...])
cStringIO.StringIO([buf])
Create a new string buffer. If the first argument 'buf' is
specified, the buffer is initialized with that string content.
If the [cStringIO] module is used, the presence of the 'buf'
argument determines whether write access to the buffer is
enabled. A `cStringIO.StringIO` buffer with write access
must be initialized with no argument, otherwise it becomes
read-only. A `StringIO.StringIO` buffer, however, is
always read/write.
METHODS:
StringIO.StringIO.close()
cStringIO.StringIO.close()
Close the string buffer. No access is permitted after close.
SEE ALSO, `FILE.close()`
StringIO.StringIO.flush()
cStringIO.StringIO.flush()
Compatibility method for file-like behavior. Data in a string
buffer is already in memory, so there is no need to finalize a
write to disk.
SEE ALSO, `FILE.flush()`
StringIO.StringIO.getvalue()
cStringIO.StringIO.getvalue()
Return the entire string held by the string buffer. Does
not affect the current file position. Basically, this is
the way you convert back from a string buffer to a string.
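A brief sketch of the point about file position, assuming a
current Python 3 interpreter, where the comparable class is
'io.StringIO' rather than the [StringIO] module described here:

```python
import io

buf = io.StringIO("spam and eggs")
first = buf.read(4)      # reading advances the position to 4
before = buf.tell()
whole = buf.getvalue()   # the entire content, whatever the position
after = buf.tell()       # unchanged by .getvalue()
rest = buf.read()        # reading resumes where it left off

print(first, before, whole, after, rest)
```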
StringIO.StringIO.isatty()
cStringIO.StringIO.isatty()
Return 0. Compatibility method for file-like behavior.
SEE ALSO, `FILE.isatty()`
StringIO.StringIO.read([num])
cStringIO.StringIO.read([num])
If the first argument 'num' is specified, return a string
containing the next 'num' characters. If 'num' characters
are not available, return as many as possible. If 'num' is
not specified, return all the characters from current file
position to end of string buffer. Advance the current file
position by the amount read.
SEE ALSO, `FILE.read()`, `mmap.mmap.read()`,
`StringIO.StringIO.readline()`
StringIO.StringIO.readline([length=...])
cStringIO.StringIO.readline([length])
Return a string from the string buffer, starting from the
current file position and going to the next newline
character. Advance the current file position by the amount
read.
SEE ALSO, `mmap.mmap.readline()`,
`StringIO.StringIO.read()`,
`StringIO.StringIO.readlines()`,
`FILE.readline()`
StringIO.StringIO.readlines([sizehint=...])
cStringIO.StringIO.readlines([sizehint])
Return a list of strings from the string buffer. Each
list element consists of a single line, including the
trailing newline character(s). If an argument 'sizehint'
is specified, only read approximately 'sizehint' characters
worth of lines (full lines will always be read).
SEE ALSO, `StringIO.StringIO.readline()`,
`FILE.readlines()`
cStringIO.StringIO.reset()
Sets the current file position to the beginning of the
string buffer. Same as 'cStringIO.StringIO.seek(0)'.
SEE ALSO, `StringIO.StringIO.seek()`
StringIO.StringIO.seek(offset [,mode=0])
cStringIO.StringIO.seek(offset [,mode])
Change the current file position. If the second argument
'mode' is given, a different seek mode can be selected.
The default is 0, absolute file positioning. Mode 1 seeks
relative to the current file position. Mode 2 is relative
to the end of the string buffer. The first argument
'offset' specifies the distance to move the current file
position--in mode 0 it should be positive, in mode 2 it
should be negative, in mode 1 the current position can be
moved either forward or backward.
SEE ALSO, `FILE.seek()`, `mmap.mmap.seek()`
StringIO.StringIO.tell()
cStringIO.StringIO.tell()
Return the current file position in the string buffer.
SEE ALSO, `StringIO.StringIO.seek()`
StringIO.StringIO.truncate([len=0])
cStringIO.StringIO.truncate([len])
Reduce the length of the string buffer to the first
argument 'len' characters. Truncate can only reduce
characters later than the current file position (an initial
'cStringIO.StringIO.reset()' can be used to assure
truncation from the beginning).
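A hedged sketch of truncation, written against 'io.StringIO' in
a current Python 3 interpreter. The modules described here may
differ in detail; in particular, the behavior when no argument
is given is a property of the [io] class as used below.

```python
import io

buf = io.StringIO("0123456789")
buf.seek(2)

buf.truncate(6)                 # keep only the first six characters
cut_at_size = buf.getvalue()

buf.truncate()                  # no argument: cut at the current position
cut_at_position = buf.getvalue()

print(cut_at_size, cut_at_position)
```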
SEE ALSO, `StringIO.StringIO.seek()`,
`cStringIO.StringIO.reset()`,
`StringIO.StringIO.close()`
StringIO.StringIO.write(s=...)
cStringIO.StringIO.write(s)
Write the first argument 's' into the string buffer at the
current file position. The current file position is
updated to the position following the write.
SEE ALSO, `StringIO.StringIO.writelines()`,
`mmap.mmap.write()`, `StringIO.StringIO.read()`,
`FILE.write()`
StringIO.StringIO.writelines(list=...)
cStringIO.StringIO.writelines(list)
Write each element of 'list' into the string buffer at the
current file position. The current file position is
updated to the position following the write. For the
[cStringIO] method, 'list' must be an actual list. For the
[StringIO] method, other sequence types are allowed. To be
safe, it is best to coerce an argument into an actual list
first. In either case, 'list' must contain only strings,
or a 'TypeError' will occur.
Contrary to what might be expected from the method name,
`StringIO.StringIO.writelines()` never inserts newline
characters. For the list elements actually to occupy
separate lines in the string buffer, each element string
must already have a newline terminator. Consider the
following variants on writing a list to a string buffer:
>>> from StringIO import StringIO
>>> sio = StringIO()
>>> lst = [c*5 for c in 'ABC']
>>> sio.writelines(lst)
>>> sio.write(''.join(lst))
>>> sio.write('\n'.join(lst))
>>> print sio.getvalue()
AAAAABBBBBCCCCCAAAAABBBBBCCCCCAAAAA
BBBBB
CCCCC
SEE ALSO, `FILE.writelines()`, `StringIO.StringIO.write()`
TOPIC -- Converting Between Binary and ASCII
--------------------------------------------------------------------
The Python standard library provides several modules for
converting between binary data and 7-bit ASCII. At the low level,
[binascii] is a C extension to produce fast string conversions.
At a high level, [base64], [binhex], [quopri], and [uu] provide
file-oriented wrappers to the facilities in [binascii].
=================================================================
MODULE -- base64 : Convert to/from base64 encoding (RFC1521)
=================================================================
The [base64] module is a wrapper around the functions
`binascii.a2b_base64()` and `binascii.b2a_base64()`. As well
as providing a file-based interface on top of the underlying
string conversions, [base64] handles the chunking of binary
files into base64 line blocks and provides for the direct
encoding of arbitrary input strings. Unlike [uu], [base64]
adds no content headers to encoded data; MIME standards for
headers and message-wrapping are handled by other modules that
utilize [base64]. Base64 encoding is specified in RFC1521.
FUNCTIONS:
base64.encode(input=..., output=...)
Encode the contents of the first argument 'input' to the
second argument 'output'. Arguments 'input' and 'output'
should be file-like objects; 'input' must be readable and
'output' must be writable.
base64.encodestring(s=...)
Return the base64 encoding of the string passed in the
first argument 's'.
base64.decode(input=..., output=...)
Decode the contents of the first argument 'input' to the
second argument 'output'. Arguments 'input' and 'output'
should be file-like objects; 'input' must be readable and
'output' must be writable.
base64.decodestring(s=...)
Return the decoding of the base64-encoded string passed in
the first argument 's'.
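The string-based and file-based interfaces can be compared in a
short sketch. It assumes a current Python 3 interpreter, where
the string functions are spelled 'encodebytes()' and
'decodebytes()' and operate on 'bytes'; an in-memory
'io.BytesIO' buffer stands in for the file-like arguments.

```python
import base64
import io

raw = bytes(range(256)) * 2       # 512 bytes of arbitrary binary data

# String-style interface
encoded = base64.encodebytes(raw)
round_trip = base64.decodebytes(encoded)

# File-style interface: readable input, writable output
src, dst = io.BytesIO(raw), io.BytesIO()
base64.encode(src, dst)
file_encoded = dst.getvalue()

# Both interfaces chunk the output into short lines
longest = max(len(line) for line in file_encoded.splitlines())

print(round_trip == raw, file_encoded == encoded, longest)
```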
SEE ALSO, [email], `rfc822`, `mimetools`, [mimetypes],
`MimeWriter`, `mimify`, [binascii], [quopri]
=================================================================
MODULE -- binascii : Convert between binary data and ASCII
=================================================================
The [binascii] module is a C implementation of a number of
styles of ASCII encoding of binary data. Each function in the
[binascii] module takes either encoded ASCII or raw binary
strings as an argument, and returns the string result of
converting back or forth. Some restrictions apply to the
length of strings passed to some functions in the module (for
encodings that operate on specific block sizes).
FUNCTIONS:
binascii.a2b_base64(s)
Return the decoded version of a base64-encoded string.
A string consisting of one or more encoding blocks should
be passed as the argument 's'.
binascii.a2b_hex(s)
Return the decoded version of a hexadecimal-encoded string.
A string consisting of an even number of hexadecimal
digits should be passed as the argument 's'.
binascii.a2b_hqx(s)
Return the decoded version of a binhex-encoded string.
A string containing a complete number of encoded binary
bytes should be passed as the argument 's'.
binascii.a2b_qp(s [,header=0])
Return the decoded version of a quoted printable string.
A string containing a complete number of encoded binary
bytes should be passed as the argument 's'. If the
optional argument 'header' is specified, underscores will
be decoded as spaces. New to Python 2.2.
binascii.a2b_uu(s)
Return the decoded version of a UUencoded string. A string
consisting of exactly one encoding block should be passed
as the argument 's' (for a full block, 62 bytes input, 45
bytes returned).
binascii.b2a_base64(s)
Return the base64 encoding of a binary string (including
the newline after block). A binary string no longer than
57 bytes should be passed as the argument 's'.
binascii.b2a_hex(s)
Return the hexadecimal encoding of a binary string. A
binary string of any length should be passed as the
argument 's'.
binascii.b2a_hqx(s)
Return the binhex4 encoding of a binary string. A
binary string of any length should be passed as the
argument 's'. Run-length compression of 's' is not
performed by this function (use `binascii.rlecode_hqx()`
first, if needed).
binascii.b2a_qp(s [,quotetabs=0 [,istext=1 [,header=0]]])
Return the quoted printable encoding of a binary string. A
binary string of any length should be passed as the argument
's'. The optional argument 'quotetabs' specifies whether
to escape spaces and tabs; 'istext' specifies -not- to
escape newlines; 'header' specifies whether to encode spaces
as underscores (and escape underscores). New to Python 2.2.
binascii.b2a_uu(s)
Return the UUencoding of a binary string (including
the initial block specifier--"M" for full blocks--and
newline after block). A binary string no longer than 45
bytes should be passed as the argument 's'.
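Since `binascii.b2a_uu()` accepts at most 45 bytes at a time,
longer data has to be chunked by the caller. One way to do so is
sketched below, assuming a current Python 3 interpreter; the
helper name 'uu_lines' is made up for the example.

```python
import binascii

def uu_lines(data, blocksize=45):
    """UUencode 'data' one block (at most 45 input bytes) per line."""
    return [binascii.b2a_uu(data[i:i + blocksize])
            for i in range(0, len(data), blocksize)]

data = bytes(range(100))          # 100 bytes: two full blocks plus 10
lines = uu_lines(data)

# Decode block by block and reassemble
decoded = b"".join(binascii.a2b_uu(line) for line in lines)

print(len(lines), lines[0][:1], len(lines[0]), decoded == data)
```

A full block comes back as a 62-byte line that starts with "M";
the short final block carries a different length specifier.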
binascii.crc32(s [,crc])
Return the CRC32 checksum of the first argument 's'. If
the second argument 'crc' is specified, it will be used as
an initial checksum. This allows partial computation of a
checksum and continuation. For example:
>>> import binascii
>>> crc = binascii.crc32('spam')
>>> binascii.crc32(' and eggs', crc)
739139840
>>> binascii.crc32('spam and eggs')
739139840
binascii.crc_hqx(s, crc)
Return the binhex4 checksum of the first argument 's',
using initial checksum value in second argument. This
allows partial computation of a checksum and continuation.
For example:
>>> import binascii
>>> binascii.crc_hqx('spam and eggs', 0)
17918
>>> crc = binascii.crc_hqx('spam', 0)
>>> binascii.crc_hqx(' and eggs', crc)
17918
SEE ALSO, `binascii.crc32`
binascii.hexlify(s)
Identical to `binascii.b2a_hex()`.
binascii.rlecode_hqx(s)
Return the binhex4 run-length encoding (RLE) of first
argument 's'. Under this RLE technique, '0x90' is used as
an indicator byte. Independent of the binhex4 standard,
this is a poor choice of precompression for encoded
strings.
SEE ALSO, `zlib.compress()`
binascii.rledecode_hqx(s)
Return the expansion of a binhex4 run-length encoded
string.
binascii.unhexlify(s)
Identical to `binascii.a2b_hex()`
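A short round trip through the hexadecimal functions, as a
sketch under the assumption of a current Python 3 interpreter
(where arguments and results are 'bytes'):

```python
import binascii

raw = b"\x00\xffspam"
hexed = binascii.hexlify(raw)       # two hex digits per input byte
back = binascii.unhexlify(hexed)

# An odd number of hex digits cannot be decoded
try:
    binascii.unhexlify(b"abc")
    odd_ok = True
except binascii.Error:
    odd_ok = False

print(hexed, back == raw, odd_ok)
```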
EXCEPTIONS:
binascii.Error
Generic exception that should only result from programming
errors.
binascii.Incomplete
Exception raised when a data block is incomplete. Usually
this results from programming errors in reading blocks, but
it could indicate data or channel corruption.
SEE ALSO, [base64], [binhex], [uu]
=================================================================
MODULE -- binhex : Encode and decode binhex4 files
=================================================================
The [binhex] module is a wrapper around the functions
`binascii.a2b_hqx()`, `binascii.b2a_hqx()`,
`binascii.rlecode_hqx()`, `binascii.rledecode_hqx()`, and
`binascii.crc_hqx()`. As well as providing a file-based
interface on top of the underlying string conversions, [binhex]
handles run-length encoding of encoded files and attaches the
needed header and footer information. Under MacOS, the
resource fork of a file is encoded along with the data fork
(not applicable under other platforms).
FUNCTIONS:
binhex.binhex(inp=..., out=...)
Encode the contents of the first argument 'inp' to the
second argument 'out'. Argument 'inp' is a filename;
'out' may be either a filename or a file-like object.
However, a `cStringIO.StringIO` object is not "file-like"
enough since it will be closed after the conversion--and
therefore, its value lost. You could override the
'.close()' method in a subclass of `StringIO.StringIO` to
solve this limitation.
binhex.hexbin(inp=... [,out=...])
Decode the contents of the first argument to an output
file. If the second argument 'out' is specified, it will
be used as the output filename, otherwise the filename
will be taken from the binhex header. The argument 'inp'
may be either a filename or a file-like object.
CLASSES:
A number of internal classes are used by [binhex]. They are
not documented here, but can be examined in
'$PYTHONHOME/lib/binhex.py' if desired (it is unlikely readers
will need to do this).
SEE ALSO, [binascii]
=================================================================
MODULE -- quopri : Convert to/from quoted printable encoding (RFC1521)
=================================================================
The [quopri] module is a wrapper around the functions
`binascii.a2b_qp()` and `binascii.b2a_qp()`. The module
[quopri] provides the same functions as [base64]. Unlike [uu],
[quopri] adds no content headers to encoded data; MIME standards for
headers and message wrapping are handled by other modules that
utilize [quopri]. Quoted printable encoding is specified in
RFC1521.
FUNCTIONS:
quopri.encode(input, output, quotetabs)
Encode the contents of the first argument 'input' to the
second argument 'output'. Arguments 'input' and 'output'
should be file-like objects; 'input' must be readable and
'output' must be writable. If 'quotetabs' is a true
value, escape tabs and spaces.
quopri.encodestring(s [,quotetabs=0])
Return the quoted printable encoding of the string passed
in the first argument 's'. If 'quotetabs' is a true value,
escape tabs and spaces.
quopri.decode(input=..., output=... [,header=0])
Decode the contents of the first argument 'input' to the
second argument 'output'. Arguments 'input' and 'output'
should be file-like objects; 'input' must be readable and
'output' must be writable. If 'header' is a true value,
decode underscores as spaces.
quopri.decodestring(s [,header=0])
Return the decoding of the quoted printable string passed
in the first argument 's'. If 'header' is a true value,
decode underscores as spaces.
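The string functions of [quopri] can be tried out in a few
lines. This is a sketch that assumes a current Python 3
interpreter, where the functions take and return 'bytes'; the
sample text is arbitrary.

```python
import quopri

raw = b"caf\xe9 = coffee"           # one non-ASCII byte, one literal '='
encoded = quopri.encodestring(raw)  # both get escaped as =XX
decoded = quopri.decodestring(encoded)

# With 'header' true, underscores decode as spaces
header_text = quopri.decodestring(b"two_words", header=True)

print(encoded, decoded == raw, header_text)
```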
SEE ALSO, [email], `rfc822`, `mimetools`, [mimetypes],
`MimeWriter`, `mimify`, [binascii], [base64]
=================================================================
MODULE -- uu : UUencode and UUdecode files
=================================================================
The [uu] module is a wrapper around the functions
`binascii.a2b_uu()` and `binascii.b2a_uu()`. As well as
providing a file-based interface on top of the underlying
string conversions, [uu] handles the chunking of binary files
into UUencoded line blocks and attaches the needed header and
footer.
FUNCTIONS:
uu.encode(in, out [,name=... [,mode=0666]])
Encode the contents of the first argument 'in' to the
second argument 'out'. Arguments 'in' and 'out' should be
file objects, but filenames are also accepted (the latter
is deprecated). The special filename "-" can be used to
specify STDIN or STDOUT, as appropriate. When file objects
are passed as arguments, 'in' must be readable and 'out'
must be writable. The third argument 'name' can be used to
specify the filename that appears in the UUencoding header;
by default it is the name of 'in'. The fourth argument
'mode' is the octal filemode to store in the UUencoding
header.
uu.decode(in [,out_file=... [,mode=...]])
Decode the contents of the first argument 'in' to an output
file. If the second argument 'out_file' is specified, it
will be used as the output file; otherwise, the filename
will be taken from the UUencoding header. Arguments 'in'
and 'out_file' should be file objects, but filenames are
also accepted (the latter is deprecated). If the third
argument 'mode' is specified (and if 'out_file' is either
unspecified or is a filename), open the created file in
mode 'mode'.
SEE ALSO, [binascii]
TOPIC -- Cryptography
--------------------------------------------------------------------
Python does not come with any standard and general cryptography
modules. The few included capabilities are fairly narrow in
purpose and limited in scope. The capabilities in the standard
library consist of several cryptographic hashes and one weak
symmetrical encryption algorithm. A quick survey of cryptographic
techniques shows what capabilities are absent from the standard
library:
*Symmetrical Encryption:* Any technique by which a plaintext
message M is "encrypted" with a key K to produce a cyphertext C.
Application of K--or some K' easily derivable from K--to C is
called "decryption" and produces as output M. The standard module
[rotor] provides a form of symmetrical encryption.
*Cryptographic Hash:* Any technique by which a short "hash" H is
produced from a plaintext message M that has several additional
properties: (1) Given only H, it is difficult to obtain any M'
such that the cryptographic hash of M' is H; (2) Given two
plaintext messages M and M', there is a very low probability that
the cryptographic hashes of M and M' are the same. Sometimes a
third property is included: (3) Given M, its cryptographic hash
H, and another hash H', examining the relationship between H and
H' does not make it easier to find an M' whose hash is H'. The
standard modules [crypt], [md5], and [sha] provide forms of
cryptographic hashes.
*Asymmetrical Encryption:* Also called "public-key cryptography."
Any technique by which a pair of keys K{pub} and K{priv} can be
generated that have several properties. The algorithm for an
asymmetrical encryption technique will be called "P(M,K)" in the
following. (1) For any plaintext message M, M equals
P(P(M,K{pub}),K{priv}). (2) Given only a public-key K{pub}, it is
difficult to obtain a private-key K{priv} that assures the
equality in (1). (3) Given only P(M,K{pub}), it is difficult to
obtain M. In general, in an asymmetrical encryption system, a
user generates K{pub} and K{priv}, then releases K{pub} to other
users but retains K{priv} as a secret. There is no support for
asymmetrical encryption in the standard library.
*Digital Signatures:* Digital signatures are really just
"public-keys in reverse." In many cases, the same underlying
algorithm is used for each. A digital signature is any technique
by which a pair of keys K{ver} and K{sig} can be generated that
have several properties. The algorithm for a digital signature
will be called "S(M,K)" in the following. (1) For any message M, M
equals S(S(M,K{sig}),K{ver}). (2) Given only a verification key
K{ver}, it is difficult to obtain a signature key K{sig} that
assures the equality in (1). (3) Given only S(M,K{sig}), it is
difficult to find any C' such that S(C',K{ver}) is a plausible
message (in other words, the signature shows it is not a
forgery). In general, in a digital signature system, a user
generates K{ver} and K{sig}, then releases K{ver} to other users
but retains K{sig} as a secret. There is no support for digital
signatures in the standard library.
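The defining equalities for asymmetrical encryption and digital
signatures can be made concrete with a toy. The sketch below
assumes textbook RSA with tiny primes, which is this example's
choice and not something the standard library provides; the names
'apply_key', 'K_pub', and 'K_priv' are illustrative only, and
numbers this small offer no security whatsoever.

```python
#---------- toy_rsa.py ----------#
# Toy illustration only: textbook RSA with tiny primes (NOT secure)
p, q = 61, 53
n = p * q                      # modulus shared by both keys
phi = (p - 1) * (q - 1)
e = 17                         # public exponent, coprime to phi
d = 2753                       # private exponent: (e * d) % phi == 1

K_pub, K_priv = (e, n), (d, n)

def apply_key(m, key):
    "The P(M,K) of the text: modular exponentiation under key K"
    exponent, modulus = key
    return pow(m, exponent, modulus)

M = 65                             # a "message" smaller than n
C = apply_key(M, K_pub)            # anyone may encrypt with K_pub
recovered = apply_key(C, K_priv)   # only the K_priv holder decrypts

# "Public-keys in reverse": transform with the secret key, then
# check with the released one
signature = apply_key(M, K_priv)
verified = apply_key(signature, K_pub)
```

Both 'recovered' and 'verified' should come back equal to M, which
is all that property (1) asks for in each case. Properties (2) and
(3) are exactly what tiny numbers fail to provide.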
-*-
Those outlined are the most important cryptographic techniques.
More detailed general introductions to cryptology and
cryptography can be found at the author's Web site.
A first tutorial is _Introduction to Cryptology Concepts I_:
Further material is in _Introduction to Cryptology Concepts II_:
And more advanced material is in _Intermediate Cryptology:
Specialized Protocols_:
A number of third-party modules have been created to handle
cryptographic tasks; a good guide to these third-party tools is
the Vaults of Parnassus Encryption/Encoding index at
. Only the
tools in the standard library will be covered here
specifically, since all the third-party tools are somewhat far
afield of the topic of text processing as such. Moreover,
third-party tools often rely on additional non-Python
libraries, which will not be present on most platforms; and
these tools will not necessarily be maintained as new Python
versions introduce changes.
The most important third-party modules are listed below. These
are modules that the author believes are likely to be
maintained and that provide access to a wide range of
cryptographic algorithms.
mxCrypto
amkCrypto
Marc-Andre Lemburg and Andrew Kuchling--both valuable
contributors of many Python modules--have played a game of
leapfrog with each other by releasing [mxCrypto] and
[amkCrypto], respectively. Each release of either module
builds on the work of the other, providing compatible
interfaces and overlapping source code. Whatever is newest
at the time you read this is the best bet. Current
information on both should be obtainable from:
Python Cryptography
Andrew Kuchling, who has provided a great deal of excellent
Python documentation, documents these cryptography modules
at:
M2Crypto
The [mxCrypto] and [amkCrypto] modules are most readily
available for Unix-like platforms. A similar range of
cryptographic capabilities for a Windows platform is
available in Ng Pheng Siong's [M2Crypto]. Information and
documentation can be found at:
fcrypt
Carey Evans has created [fcrypt], which is a pure-Python,
single-module replacement for the standard library's
[crypt] module. While probably orders-of-magnitude slower
than a C implementation, [fcrypt] will run anywhere that
Python does (and speed is rarely an issue for this
functionality). [fcrypt] may be obtained at:
=================================================================
MODULE -- crypt : Create and verify Unix-style passwords
=================================================================
The 'crypt()' function is a frequently used, but somewhat
antiquated, password creation/verification tool. Under
Unix-like systems, 'crypt()' is contained in system libraries
and may be called from wrapper functions in languages like
Python. 'crypt()' is a form of cryptographic hash based on the
Data Encryption Standard (DES). The hash produced by 'crypt()'
is based on an 8-byte key and a 2-byte "salt." The output of
'crypt()' is produced by repeated encryption of a constant
string, using the user key as a DES key and the salt to
perturb the encryption in one of 4,096 ways. Both the key and
the salt are restricted to alphanumerics plus dot and slash.
By using a cryptographic hash, passwords may be stored in a
relatively insecure location. An imposter cannot easily
produce a false password that will hash to the same value as
the one stored in the password file, even given access to the
password file. The salt is used to make "dictionary attacks"
more difficult. If an imposter has access to the password
file, she might try applying 'crypt()' to a candidate password
and compare the result to every entry in the password file.
Without a salt, the chances of matching -some- encrypted
password would be higher. A salt, which should be chosen at
random, reduces the chance of such a lucky match by a factor of
4,096.
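Choosing the random salt is easy to get right. The following is a
minimal sketch, assuming Python 3.6+ (for the [secrets] module);
the names 'SALT_CHARS' and 'make_salt' are hypothetical helpers,
not part of [crypt].

```python
#---------- make_salt.py ----------#
# Sketch: choose a random two-character salt for crypt()-style hashing
import secrets
import string

# Alphanumerics plus dot and slash: 64 symbols in all
SALT_CHARS = string.ascii_letters + string.digits + "./"

def make_salt(length=2):
    "Return a random salt drawn from the crypt() alphabet"
    return "".join(secrets.choice(SALT_CHARS) for _ in range(length))

# Two salt characters allow 64 * 64 distinct perturbations
NUM_SALTS = len(SALT_CHARS) ** 2
salt = make_salt()
```

The value of 'NUM_SALTS' is where the figure of 4,096 comes from.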
The [crypt] module is only installed on some Python systems
(even only some Unix systems). Moreover, the module, if
installed, relies on an underlying system library. For a
portable approach to password creation, the third-party
[fcrypt] module provides a pure-Python reimplementation.
FUNCTIONS:
crypt.crypt(passwd, salt)
Return an ASCII 13-byte encrypted password. The first
argument 'passwd' must be a string up to eight characters
in length (extra characters are truncated and do not
affect the result). The second argument 'salt' must be a
string up to two characters in length (extra characters are
truncated). The value of 'salt' forms the first two
characters of the result.
>>> from crypt import crypt
>>> crypt('mypassword','XY')
'XY5XuULXk4pcs'
>>> crypt('mypasswo','XY')
'XY5XuULXk4pcs'
>>> crypt('mypassword...more.characters','XY')
'XY5XuULXk4pcs'
>>> crypt('mypasswo','AB')
'AB06lnfYxWIKg'
>>> crypt('diffpass','AB')
'ABlO5BopaFYNs'
SEE ALSO, [fcrypt], [md5], [sha]
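Because the salt is stored as the first two characters of the
result, verifying a password needs nothing beyond the stored hash
itself. A minimal sketch follows; 'check_passwd' is a hypothetical
helper, and 'crypt_func' is assumed to be any callable with the
same signature as `crypt.crypt()` (the [crypt] module itself is not
present in every Python installation).

```python
#---------- check_passwd.py ----------#
# Sketch: verify a candidate password against a stored crypt() hash
def check_passwd(candidate, stored, crypt_func):
    """Return True if 'candidate' hashes to 'stored'.

    'crypt_func' is any callable with the crypt.crypt(passwd, salt)
    signature, e.g., crypt.crypt or a compatible replacement.
    """
    salt = stored[:2]          # the salt is the first two characters
    return crypt_func(candidate, salt) == stored
```

Given the session above, a call along the lines of
'check_passwd("mypassword", "XY5XuULXk4pcs", crypt.crypt)' would be
expected to succeed wherever [crypt] is available.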
=================================================================
MODULE -- md5 : Create MD5 message digests
=================================================================
RSA Data Security, Inc.'s MD5 cryptographic hash is a popular
algorithm that is codified by RFC1321. Like [sha], and unlike
[crypt], [md5] allows one to find the cryptographic hash of
arbitrary strings (Unicode strings may not be hashed, however).
Absent any other considerations--such as compatibility with other
programs--Secure Hash Algorithm (SHA) is currently considered a
better algorithm than MD5, and the [sha] module should be used
for cryptographic hashes. The operation of [md5] objects is
similar to `binascii.crc32()` hashes in that the final hash value
may be built progressively from partial concatenated strings. The
MD5 algorithm produces a 128-bit hash.
CONSTANTS:
md5.MD5Type
The type of an `md5.new` instance.
CLASSES:
md5.new([s])
Create an [md5] object. If the first argument 's' is
specified, initialize the MD5 digest buffer with the
initial string 's'. An MD5 hash can be computed in a
single line with:
>>> import md5
>>> md5.new('Mary had a little lamb').hexdigest()
'e946adb45d4299def2071880d30136d4'
md5.md5([s])
Identical to `md5.new`.
METHODS:
md5.copy()
Return a new [md5] object that is identical to the current
state of the current object. Different terminal strings
can be concatenated to the clone objects after they are
copied. For example:
>>> import md5
>>> m = md5.new('spam and eggs')
>>> m.digest()
'\xb5\x81f\x0c\xff\x17\xe7\x8c\x84\xc3\xa8J\xd0.g\x85'
>>> m2 = m.copy()
>>> m2.digest()
'\xb5\x81f\x0c\xff\x17\xe7\x8c\x84\xc3\xa8J\xd0.g\x85'
>>> m.update(' are tasty')
>>> m2.update(' are wretched')
>>> m.digest()
'*\x94\xa2\xc5\xceq\x96\xef&\x1a\xc9#\xac98\x16'
>>> m2.digest()
'h\x8c\xfam\xe3\xb0\x90\xe8\x0e\xcb\xbf\xb3\xa7N\xe6\xbc'
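Readers on Python 3 will not find an [md5] module; its role was
taken over by [hashlib]. The following sketch repeats the copy
technique under that assumption. It uses `hashlib.md5()` and byte
strings, neither of which appears in the session above.

```python
#---------- hashlib_copy.py ----------#
# Sketch for Python 3: hashlib.md5 in place of md5.new
import hashlib

m = hashlib.md5(b'spam and eggs')     # bytes, not text, are hashed
m2 = m.copy()                         # clone the shared prefix state
same_prefix = (m.digest() == m2.digest())

m.update(b' are tasty')
m2.update(b' are wretched')
diverged = (m.digest() != m2.digest())

# The clone matches a hash computed from scratch on the full string
scratch = hashlib.md5(b'spam and eggs are wretched')
```

Copying pays off when many messages share a long common prefix,
since the prefix only needs to be hashed once.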
md5.digest()
Return the 128-bit digest of the current state of the [md5]
object as a 16-byte string. Each byte will contain a full
8-bit range of possible values.
>>> import md5 # Python 2.1+
>>> m = md5.new('spam and eggs')
>>> m.digest()
'\xb5\x81f\x0c\xff\x17\xe7\x8c\x84\xc3\xa8J\xd0.g\x85'
>>> import md5 # Python <= 2.0
>>> m = md5.new('spam and eggs')
>>> m.digest()
'\265\201f\014\377\027\347\214\204\303\250J\320.g\205'
md5.hexdigest()
Return the 128-bit digest of the current state of the [md5]
object as a 32-byte hexadecimal-encoded string. Each byte
will contain only values in `string.hexdigits`. Each pair
of bytes represents 8-bits of hash, and this format may be
transmitted over 7-bit ASCII channels like email.
>>> import md5
>>> m = md5.new('spam and eggs')
>>> m.hexdigest()
'b581660cff17e78c84c3a84ad02e6785'
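The two digest forms carry the same information. A short sketch,
assuming Python 3 and its [hashlib] module as the stand-in for
[md5], shows one converted into the other with [binascii]:

```python
#---------- digest_forms.py ----------#
import binascii
import hashlib

m = hashlib.md5(b'spam and eggs')
raw = m.digest()          # 16 bytes, any value 0-255
hexed = m.hexdigest()     # 32 characters from 0-9a-f

# The hex form is just the raw form, each byte written as two digits
from_raw = binascii.hexlify(raw).decode('ascii')
back_to_raw = binascii.unhexlify(hexed)
```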
md5.update(s)
Concatenate additional strings to the [md5] object.
Current hash state is adjusted accordingly. The number of
concatenation steps that go into an MD5 hash does not
affect the final hash, only the actual string that would
result from concatenating each part in a single string.
However, for large strings that are determined
incrementally, it may be more practical to call
`md5.update()` numerous times. For example:
>>> import md5
>>> m1 = md5.new('spam and eggs')
>>> m2 = md5.new('spam')
>>> m2.update(' and eggs')
>>> m3 = md5.new('spam')
>>> m3.update(' and ')
>>> m3.update('eggs')
>>> m1.hexdigest()
'b581660cff17e78c84c3a84ad02e6785'
>>> m2.hexdigest()
'b581660cff17e78c84c3a84ad02e6785'
>>> m3.hexdigest()
'b581660cff17e78c84c3a84ad02e6785'
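The incremental style is most useful for data too large to hold
in memory at once. The following is a sketch only, assuming Python
3 with [hashlib] in place of [md5]; 'md5_of_stream' is a
hypothetical helper name.

```python
#---------- md5_chunks.py ----------#
import hashlib
import io

def md5_of_stream(fp, blocksize=4096):
    "Return the hex MD5 of file-like object 'fp', read in blocks"
    m = hashlib.md5()
    while True:
        block = fp.read(blocksize)
        if not block:          # empty read means end of data
            break
        m.update(block)
    return m.hexdigest()

data = b'spam and eggs' * 10000
streamed = md5_of_stream(io.BytesIO(data), blocksize=1000)
at_once = hashlib.md5(data).hexdigest()
```

The values of 'streamed' and 'at_once' agree, whatever block size
is chosen.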
SEE ALSO, [sha], [crypt], `binascii.crc32()`
=================================================================
MODULE -- rotor : Perform Enigma-like encryption and decryption
=================================================================
The [rotor] module is a bit of a curiosity in the Python standard
library. The symmetric encryption performed by [rotor] is similar
to that performed by the extremely historically interesting and
important Enigma algorithm. Given Alan Turing's famous role not
just in inventing the theory of computability, but also in
cracking German encryption during WWII, there is a nice literary
quality to the inclusion of [rotor] in Python. However, [rotor]
should not be mistaken for a robust modern encryption algorithm.
Bruce Schneier has commented that there are two types of
encryption algorithms: those that will stop your little sister
from reading your messages, and those that will stop major
governments and powerful organizations from reading your messages.
[rotor] is in the first category--albeit allowing for rather
bright little sisters. But [rotor] will not help much against
TLAs (three letter agencies). On the other hand, there is nothing
else in the Python standard library that performs actual
military-grade encryption, either.
CLASSES:
rotor.newrotor(key [,numrotors])
Return a [rotor] object with rotor permutations and
positions based on the first argument 'key'. If the second
argument 'numrotors' is specified, a number of rotors other
than the default of 6 can be used (more is stronger). A
rotor encryption can be computed in a single line with:
>>> import rotor
>>> rotor.newrotor('mypassword').encrypt('Mary had a lamb')
'\x10\xef\xf1\x1e\xeaor\xe9\xf7\xe5\xad,r\xc6\x9f'
Object style encryption and decryption is performed like
the following:
>>> import rotor
>>> C = rotor.newrotor('pass2').encrypt('Mary had a little lamb')
>>> r1 = rotor.newrotor('mypassword')
>>> C2 = r1.encrypt('Mary had a little lamb')
>>> r1.decrypt(C2)
'Mary had a little lamb'
>>> r1.decrypt(C) # Let's try it
'\217R$\217/sE\311\330~#\310\342\200\025F\221\245\263\036\220O'
>>> r1.setkey('pass2')
>>> r1.decrypt(C) # Let's try it
'Mary had a little lamb'
METHODS:
rotor.decrypt(s)
Return a decrypted version of cyphertext string 's'. Prior
to decryption, rotors are set to their initial positions.
rotor.decryptmore(s)
Return a decrypted version of cyphertext string 's'. Prior
to decryption, rotors are left in their current positions.
rotor.encrypt(s)
Return an encrypted version of plaintext string 's'. Prior
to encryption, rotors are set to their initial positions.
rotor.encryptmore(s)
Return an encrypted version of plaintext string 's'. Prior
to encryption, rotors are left in their current positions.
rotor.setkey(key)
Set a new key for a [rotor] object.
=================================================================
MODULE -- sha : Create SHA message digests
=================================================================
The National Institute of Standards and Technology's (NIST's)
Secure Hash Algorithm is the best of the well-known cryptographic
hashes for most purposes. Like [md5], and unlike [crypt], [sha] allows
one to find the cryptographic hash of arbitrary strings (Unicode
strings may not be hashed, however). Absent any other
considerations--such as compatibility with other programs--SHA is
currently considered a better algorithm than MD5, and the [sha]
module should be used for cryptographic hashes. The operation of
[sha] objects is similar to `binascii.crc32()` hashes in that the
final hash value may be built progressively from partial
concatenated strings. The SHA algorithm produces a 160-bit hash.
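The restriction on Unicode strings carries over to later Python
versions in a stricter form. The sketch below assumes Python 3,
where [hashlib] stands in for [sha] and every text string is
Unicode, so an explicit encoding step comes first.

```python
#---------- sha_text.py ----------#
import hashlib

text = 'Mary had a little lamb'

try:
    hashlib.sha1(text)                 # text strings are rejected
    rejected = False
except TypeError:
    rejected = True

# Pick an encoding explicitly; different encodings give different hashes
utf8_hash = hashlib.sha1(text.encode('utf-8')).hexdigest()
utf16_hash = hashlib.sha1(text.encode('utf-16')).hexdigest()
```

Two parties comparing hashes of "the same" text therefore need to
agree on the encoding as well as on the algorithm.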
CLASSES:
sha.new([s])
Create an [sha] object. If the first argument 's' is
specified, initialize the SHA digest buffer with the
initial string 's'. An SHA hash can be computed in a
single line with:
>>> import sha
>>> sha.new('Mary had a little lamb').hexdigest()
'bac9388d0498fb378e528d35abd05792291af182'
sha.sha([s])
Identical to `sha.new`.
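In Python 3 the constructors for both hashes live in [hashlib],
which can also select an algorithm by name. The following sketch
rests on that assumption; 'hex_hash' is a hypothetical helper.

```python
#---------- hash_sizes.py ----------#
import hashlib

def hex_hash(data, algorithm='sha1'):
    "Hash bytes 'data' with a named algorithm"
    return hashlib.new(algorithm, data).hexdigest()

# Digest sizes in bits for the two algorithms discussed here
sizes = {}
for name in ('md5', 'sha1'):
    sizes[name] = hashlib.new(name).digest_size * 8
```

Selecting by name makes it simple to stay compatible with another
program that insists on MD5 while defaulting to SHA elsewhere.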
METHODS:
sha.copy()
Return a new [sha] object that is identical to the current
state of the current object. Different terminal strings
can be concatenated to the clone objects after they are
copied. For example:
>>> import sha
>>> s = sha.new('spam and eggs')
>>> s.digest()
'\276\207\224\213\255\375x\024\245b\036C\322\017\2528 @\017\246'
>>> s2 = s.copy()
>>> s2.digest()
'\276\207\224\213\255\375x\024\245b\036C\322\017\2528 @\017\246'
>>> s.update(' are tasty')
>>> s2.update(' are wretched')
>>> s.digest()
'\013^C\366\253?I\323\206nt\2443\251\227\204-kr6'
>>> s2.digest()
'\013\210\237\216\014\3337X\333\221h&+c\345\007\367\326\274\321'
sha.digest()
Return the 160-bit digest of the current state of the [sha]
object as a 20-byte string. Each byte will contain a full
8-bit range of possible values.
>>> import sha # Python 2.1+
>>> s = sha.new('spam and eggs')
>>> s.digest()
'\xbe\x87\x94\x8b\xad\xfdx\x14\xa5b\x1eC\xd2\x0f\xaa8 @\x0f\xa6'
>>> import sha # Python <= 2.0
>>> s = sha.new('spam and eggs')
>>> s.digest()
'\276\207\224\213\255\375x\024\245b\036C\322\017\2528 @\017\246'
sha.hexdigest()
Return the 160-bit digest of the current state of the [sha]
object as a 40-byte hexadecimal-encoded string. Each byte
will contain only values in `string.hexdigits`. Each pair of
bytes represents 8-bits of hash, and this format may be
transmitted over 7-bit ASCII channels like email.
>>> import sha
>>> s = sha.new('spam and eggs')
>>> s.hexdigest()
'be87948badfd7814a5621e43d20faa3820400fa6'
sha.update(s)
Concatenate additional strings to the [sha] object.
Current hash state is adjusted accordingly. The number of
concatenation steps that go into an SHA hash does not
affect the final hash, only the actual string that would
result from concatenating each part in a single string.
However, for large strings that are determined
incrementally, it may be more practical to call
`sha.update()` numerous times. For example:
>>> import sha
>>> s1 = sha.sha('spam and eggs')
>>> s2 = sha.sha('spam')
>>> s2.update(' and eggs')
>>> s3 = sha.sha('spam')
>>> s3.update(' and ')
>>> s3.update('eggs')
>>> s1.hexdigest()
'be87948badfd7814a5621e43d20faa3820400fa6'
>>> s2.hexdigest()
'be87948badfd7814a5621e43d20faa3820400fa6'
>>> s3.hexdigest()
'be87948badfd7814a5621e43d20faa3820400fa6'
SEE ALSO, [md5], [crypt], `binascii.crc32()`
TOPIC -- Compression
--------------------------------------------------------------------
Over the history of computers, a large number of data compression
formats have been invented, mostly as variants on Lempel-Ziv and
Huffman techniques. Compression is useful for all sorts of data
streams, but file-level archive formats have been the most widely
used and known application. Under MS-DOS and Windows we have seen
ARC, PAK, ZOO, LHA, ARJ, CAB, RAR, and other formats--but the ZIP
format has become the most widespread variant. Under Unix-like
systems, 'compress' (.Z) mostly gave way to 'gzip' (GZ); 'gzip'
is still the most popular format on these systems, but 'bzip2'
(BZ2) generally obtains better compression rates. Under MacOS,
the most popular format is SIT. Other platforms have additional
variants on archive formats, but ZIP--and to a lesser extent
GZ--are widely supported on a number of platforms.
The Python standard library includes support for several styles
of compression. The [zlib] module performs low-level compression
of raw string data and has no concept of a file. [zlib] is itself
called by the high-level modules below for its compression
services.
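A string-level round trip shows what "no concept of a file" means
in practice. This is a minimal sketch, assuming Python 3, where
the data passed to [zlib] must be byte strings:

```python
#---------- zlib_roundtrip.py ----------#
import zlib

raw = b'spam and eggs ' * 1000
packed = zlib.compress(raw)           # bytes in, bytes out; no file
unpacked = zlib.decompress(packed)
ratio = len(packed) / len(raw)        # fraction of the original size
```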
The modules [gzip] and [zipfile] provide file-level interfaces to
compressed archives. However, a notable difference in the
operation of [gzip] and [zipfile] arises out of a difference in
the underlying GZ and ZIP formats. 'gzip' (GZ) operates
exclusively on single files--leaving the work of concatenating
collections of files to tools like 'tar'. One frequently
encounters (especially on Unix-like systems) files like
'foo.tar.gz' or 'foo.tgz' that are produced by first applying
'tar' to a collection of files, then applying 'gzip' to the
result. ZIP, however, handles both the compression and archiving
aspects in a single tool and format. As a consequence, [gzip] is
able to create file-like objects based directly on the compressed
contents of a GZ file. [zipfile] needs to provide more specialized
methods for navigating archive contents and for working with
individual compressed file images therein.
Also see Appendix B (A Data Compression Primer).
=================================================================
MODULE -- gzip : Functions that read and write gzipped files
=================================================================
The [gzip] module allows the treatment of the compressed data
inside 'gzip' compressed files directly in a file-like manner.
Uncompressed data can be read out, and compressed data written
back in, all without a caller knowing or caring that the file
is a GZ-compressed file. A simple example illustrates this:
#---------- gzip_file.py ----------#
# Treat a GZ as "just another file"
import gzip, glob
print "Size of data in files:"
for fname in glob.glob('*'):
    try:
        if fname[-3:] == '.gz':
            s = gzip.open(fname).read()
        else:
            s = open(fname).read()
        print ' ',fname,'-',len(s),'bytes'
    except IOError:
        print 'Skipping',fname   # e.g., a directory or unreadable file
The module [gzip] is a wrapper around [zlib], with the latter
performing the actual compression and decompression tasks. In
many respects, [gzip] is similar to [mmap] and [StringIO] in
emulating and/or wrapping a file object.
SEE ALSO, [mmap], [StringIO], [cStringIO]
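Writing works in the same file-like way as reading. The sketch
below assumes Python 3 (byte strings and the [tempfile] module are
this example's choices), and the filename 'sample.txt.gz' is made
up for illustration.

```python
#---------- gzip_roundtrip.py ----------#
import gzip
import os
import tempfile

fname = os.path.join(tempfile.mkdtemp(), 'sample.txt.gz')
lines = [b'line one\n', b'line two\n', b'line three\n']

fp = gzip.open(fname, 'wb')       # written through the compressor
for line in lines:
    fp.write(line)
fp.close()

fp = gzip.open(fname, 'rb')       # read back as if uncompressed
recovered = fp.readlines()
fp.close()

# On disk, the file starts with the two-byte GZ magic number
with open(fname, 'rb') as rawfile:
    magic = rawfile.read(2)
```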
CLASSES:
gzip.GzipFile([filename=... [,mode="rb" [,compresslevel=9 [,fileobj=...]]]])
Create a [gzip] file-like object. Such an object supports
most file object operations, with the exception of
'.seek()' and '.tell()'. Either the first argument
'filename' or the fourth argument 'fileobj' should be
specified (likely by argument name, especially if fourth
argument 'fileobj').
The second argument 'mode' takes the mode of 'fileobj' if
specified, otherwise it defaults to 'rb' ('r', 'rb', 'a',
'ab', 'w', or 'wb' may be specified with the same meaning
as with `FILE.open()` objects). The third argument
'compresslevel' specifies the level of compression. The
default is the highest level, 9; an integer down to 1 may
be selected for less compression but faster operation
(compression level of a read file comes from the file
itself, however).
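The 'fileobj' and 'compresslevel' arguments can be tried out
without touching the disk. The following is a sketch under the
assumption of Python 3, with `io.BytesIO` playing the part of the
underlying file object; 'gz_bytes' is a hypothetical helper.

```python
#---------- gzip_fileobj.py ----------#
import gzip
import io

data = b'spam and eggs ' * 5000

def gz_bytes(data, level):
    "Return 'data' as a GZ byte stream compressed at 'level'"
    buf = io.BytesIO()
    gz = gzip.GzipFile(fileobj=buf, mode='wb', compresslevel=level)
    gz.write(data)
    gz.close()                # finishes the stream; 'buf' stays open
    return buf.getvalue()

fast = gz_bytes(data, 1)
best = gz_bytes(data, 9)

# No level is given when reading; the stream itself is enough
restored = gzip.GzipFile(fileobj=io.BytesIO(fast), mode='rb').read()
```

Comparing 'len(fast)' with 'len(best)' on representative data is a
reasonable way to decide whether the slower setting is worthwhile.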
gzip.open(filename=... [,mode='rb' [,compresslevel=9]])
Same as `gzip.GzipFile` but with extra arguments omitted.
A GZ file object opened with `gzip.open` is always opened
by name, not by underlying file object.
METHODS AND ATTRIBUTES:
gzip.close()
Close the [gzip] object. No access is permitted after
close. If the object was opened by file object, the
underlying file object is not closed, only the [gzip]
interface to the file.
SEE ALSO, `FILE.close()`
gzip.flush()
Write outstanding data from memory to disk.
SEE ALSO, `FILE.flush()`
gzip.isatty()
Return 0. Compatibility method for file-like behavior.
SEE ALSO, `FILE.isatty()`
gzip.myfileobj
Attribute holding the underlying file object.
gzip.read([num])
If the first argument 'num' is specified, return a string
containing the next 'num' characters. If 'num' characters
are not available, return as many as possible. If 'num' is
not specified, return all the characters from the current file
position to the end of the file. Advance the current file
position by the amount read.
SEE ALSO, `FILE.read()`
gzip.readline([length])
Return a string from the [gzip] object, starting from the
current file position and going to the next newline
character. The argument 'length' limits the read if specified.
Advance the current file position by the amount read.
SEE ALSO, `FILE.readline()`
gzip.readlines([sizehint=...])
Return a list of strings from the [gzip] object. Each
list element consists of a single line, including the
trailing newline character(s). If an argument 'sizehint'
is specified, read only approximately 'sizehint' characters
worth of lines (full lines will always be read).
SEE ALSO, `FILE.readlines()`
gzip.write(s)
Write the first argument 's' into the [gzip] object at the
current file position. The current file position is
updated to the position following the write.
SEE ALSO, `FILE.write()`
gzip.writelines(list)
Write each element of 'list' into the [gzip] object at the
current file position. The current file position is
updated to the position following the write. Most sequence
types are allowed, but 'list' must contain only strings, or
a 'TypeError' will occur.
Contrary to what might be expected from the method name,
`gzip.writelines()` never inserts newline characters. For
the list elements actually to occupy separate lines in the
file, each element string must already have a
newline terminator. See `StringIO.StringIO.writelines()`
for an example.
SEE ALSO, `FILE.writelines()`, `StringIO.StringIO.writelines()`
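The newline behavior is easy to confirm. A small sketch, assuming
Python 3 and an in-memory `io.BytesIO` buffer in place of a disk
file ('gz_roundtrip' is a hypothetical helper):

```python
#---------- gzip_writelines.py ----------#
import gzip
import io

words = [b'spam', b'eggs', b'toast']

def gz_roundtrip(chunks):
    "Write 'chunks' with .writelines(), then read everything back"
    buf = io.BytesIO()
    gz = gzip.GzipFile(fileobj=buf, mode='wb')
    gz.writelines(chunks)
    gz.close()
    packed = io.BytesIO(buf.getvalue())
    return gzip.GzipFile(fileobj=packed, mode='rb').read()

run_together = gz_roundtrip(words)
separate = gz_roundtrip([w + b'\n' for w in words])
```

Only the second call produces three lines; the first yields the
words run together with nothing between them.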
SEE ALSO, [zlib], [zipfile]
=================================================================
MODULE -- zipfile : Read and write ZIP files
=================================================================
The [zipfile] module enables a variety of operations on ZIP
files and is compatible with archives created by applications
such as PKZip, Info-Zip, and WinZip. Since the ZIP format
allows inclusion of multiple file images within a single
archive, the [zipfile] module does not behave in a directly file-like
manner as [gzip] does. Nonetheless, it is possible to view the
contents of an archive, add new file images to one, create a
new ZIP archive, or manipulate the contents and directory
information of a ZIP file.
An initial example of working with the [zipfile] module gives a
feel for its usage.
>>> for name in 'ABC':
...     open(name,'w').write(name*1000)
...
>>> import zipfile
>>> z = zipfile.ZipFile('new.zip','w',zipfile.ZIP_DEFLATED) # new archv
>>> z.write('A') # write files to archive
>>> z.write('B','B.newname',zipfile.ZIP_STORED)
>>> z.write('C','C.newname')
>>> z.close() # close the written archive
>>> z = zipfile.ZipFile('new.zip') # reopen archive in read mode
>>> z.testzip() # 'None' returned means OK
>>> z.namelist() # What's in it?
['A', 'B.newname', 'C.newname']
>>> z.printdir() # details
File Name      Modified               Size
A              2001-07-18 21:39:36    1000
B.newname      2001-07-18 21:39:36    1000
C.newname      2001-07-18 21:39:36    1000
>>> A = z.getinfo('A') # bind ZipInfo object
>>> B = z.getinfo('B.newname') # bind ZipInfo object
>>> A.compress_size
11
>>> B.compress_size
1000
>>> z.read(A.filename)[:40] # Check what's in A
'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA'
>>> z.read(B.filename)[:40] # Check what's in B
'BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB'
>>> # For comparison, see what Info-Zip reports on created archive
>>> import os
>>> print os.popen('unzip -v new.zip').read()
Archive:  new.zip
 Length  Method   Size  Ratio    Date    Time    CRC-32    Name
 ------  ------   ----  -----    ----    ----    ------    ----
   1000  Defl:N     11   99%   07-18-01  21:39  51a02e01   A
   1000  Stored   1000    0%   07-18-01  21:39  7d9c564d   B.newname
   1000  Defl:N     11   99%   07-18-01  21:39  66778189   C.newname
 ------          -----   ---                               -------
   3000           1022   66%                               3 files
The module [zipfile] relies on [zlib], with the latter
performing the actual compression and decompression tasks.
CONSTANTS:
Several string constants ([struct] formats) are used to
recognize signature identifiers in the ZIP format. These
constants are not normally used directly by end-users of
[zipfile].
#*----- zipfile constants -----#
zipfile.stringCentralDir = 'PK\x01\x02'
zipfile.stringEndArchive = 'PK\x05\x06'
zipfile.stringFileHeader = 'PK\x03\x04'
zipfile.structCentralDir = '<4s4B4H3l5H2l'
zipfile.structEndArchive = '<4s4H2lH'
zipfile.structFileHeader = '<4s2B4H3l2H'
Symbolic names for the two supported compression methods are
also defined.
#*----- zipfile constants -----#
zipfile.ZIP_STORED = 0
zipfile.ZIP_DEFLATED = 8
FUNCTIONS:
zipfile.is_zipfile(filename=...)
Check if the argument 'filename' is a valid ZIP archive.
Archives with appended comments are not recognized as valid
archives. Return 1 if valid, None otherwise. This
function does not guarantee archive is fully intact, but it
does provide a sanity check on the file type.
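A quick sketch of the sanity check follows. It assumes Python 3,
where the function returns a true or false value and where
`zipfile.ZipFile.writestr()` also accepts a plain archive name in
place of a `zipfile.ZipInfo` object; the filenames are made up for
the example.

```python
#---------- check_zip.py ----------#
import os
import tempfile
import zipfile

tmpdir = tempfile.mkdtemp()
good = os.path.join(tmpdir, 'good.zip')
bad = os.path.join(tmpdir, 'bad.zip')

z = zipfile.ZipFile(good, 'w')
z.writestr('hello.txt', 'hello world')   # name given directly
z.close()

with open(bad, 'w') as fp:               # a ZIP in name only
    fp.write('This is just text, whatever the extension says')

results = [bool(zipfile.is_zipfile(name)) for name in (good, bad)]
```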
CLASSES:
zipfile.PyZipFile(pathname)
Create a `zipfile.ZipFile` object that has the extra method
`zipfile.ZipFile.writepy()`. This extra method allows you
to recursively add all '*.py[oc]' files to an archive.
This class is not general purpose, but a special feature to
aid [distutils].
zipfile.ZipFile(file=... [,mode='r' [,compression=ZIP_STORED]])
Create a new `zipfile.ZipFile` object. This object is used
for management of a ZIP archive. The first argument 'file'
must be specified and is simply the filename of the
archive to be manipulated. The second argument 'mode' may
have one of three string values: 'r' to open the archive
in read-only mode; 'w' to truncate the file and create
a new archive; 'a' to read an existing archive and add to
it. The third argument 'compression' indicates the
compression method--ZIP_DEFLATED requires that [zlib] and
the zlib system library be present.
zipfile.ZipInfo()
Create a new `zipfile.ZipInfo` object. This object
contains information about an individual archived filename
and its file image. Normally, one will not directly
instantiate `zipfile.ZipInfo` but only look at the
`zipfile.ZipInfo` objects that are returned by methods like
`zipfile.ZipFile.infolist()` and `zipfile.ZipFile.getinfo()`,
or held in `zipfile.ZipFile.NameToInfo`. However, in special
cases like `zipfile.ZipFile.writestr()`, it is useful to
create a `zipfile.ZipInfo` directly.
METHODS AND ATTRIBUTES:
zipfile.ZipFile.close()
Close the `zipfile.ZipFile` object, and flush any changes
made to it. An object must be explicitly closed to perform
updates.
zipfile.ZipFile.getinfo(name=...)
Return the `zipfile.ZipInfo` object corresponding to the
filename 'name'. If 'name' is not in the ZIP archive, a
'KeyError' is raised.
zipfile.ZipFile.infolist()
Return a list of `zipfile.ZipInfo` objects contained in the
`zipfile.ZipFile` object. The return value is simply a
list of instances of the same type. If the filename
within the archive is known, `zipfile.ZipFile.getinfo()` is
a better method to use. For enumerating over all archived
files, however, `zipfile.ZipFile.infolist()` provides a
nice sequence.
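The difference between enumerating and looking up by name can be
sketched as follows. The example assumes Python 3, where
`zipfile.ZipFile` also accepts a file-like object (here an
in-memory `io.BytesIO`) and `zipfile.ZipFile.writestr()` accepts a
plain archive name; neither assumption comes from the
descriptions above.

```python
#---------- zip_listing.py ----------#
import io
import zipfile

# Build a small archive in memory so the sketch is self-contained
buf = io.BytesIO()
z = zipfile.ZipFile(buf, 'w', zipfile.ZIP_DEFLATED)
for name in 'ABC':
    z.writestr(name, name * 1000)
z.close()

z = zipfile.ZipFile(buf)              # reopen for reading
listing = [(info.filename, info.file_size, info.compress_size)
           for info in z.infolist()]
one = z.getinfo('B')                  # direct lookup by known name
try:
    z.getinfo('no-such-file')
    missing = False
except KeyError:                      # unknown names raise KeyError
    missing = True
z.close()
```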
zipfile.ZipFile.namelist()
Return a list of the filenames of all the archived files
(including nested relative directories).
zipfile.ZipFile.printdir()
Print to STDOUT a pretty summary of archived files and
information about them. The results are similar to running
Info-Zip's 'unzip' with the '-l' option.
zipfile.ZipFile.read(name=...)
Return the contents of the archived file with filename
'name'.
zipfile.ZipFile.testzip()
Test the integrity of the current archive. Return the
filename of the first `zipfile.ZipInfo` object with
corruption. If everything is valid, return None.
zipfile.ZipFile.write(filename=... [,arcname=... [,compress_type=...]])
Add the file 'filename' to the `zipfile.ZipFile` object. If
the second argument 'arcname' is specified, use 'arcname' as
the stored filename (otherwise, use 'filename' itself). If
the third argument 'compress_type' is specified, use the
indicated compression method. The current archive must be
opened in 'w' or 'a' mode.
zipfile.ZipFile.writestr(zinfo=..., bytes=...)
Write the data contained in the second argument 'bytes' to
the `zipfile.ZipFile` object. Directory meta-information
must be contained in attributes of the first argument
'zinfo' (a filename, date, and time should be included;
other information is optional). The current archive must
be opened in 'w' or 'a' mode.
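Writing data that never existed as a file on disk might look like
the following sketch. It assumes Python 3, where the
`zipfile.ZipInfo` constructor accepts 'filename' and 'date_time'
arguments and the archive may live in an `io.BytesIO` buffer; the
archived name 'notes/generated.txt' is invented for the example.

```python
#---------- zip_writestr.py ----------#
import io
import zipfile

buf = io.BytesIO()
z = zipfile.ZipFile(buf, 'w')

# Directory meta-information travels in the ZipInfo object
zinfo = zipfile.ZipInfo(filename='notes/generated.txt',
                        date_time=(2001, 7, 18, 21, 39, 36))
zinfo.compress_type = zipfile.ZIP_DEFLATED
zinfo.comment = b'created without a file on disk'
z.writestr(zinfo, 'spam and eggs\n' * 100)
z.close()

z = zipfile.ZipFile(buf)              # reopen and inspect
stored = z.getinfo('notes/generated.txt')
content = z.read('notes/generated.txt')
z.close()
```

The attributes set on 'zinfo' before the write are the ones that
come back on 'stored' afterward.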
zipfile.ZipFile.NameToInfo
Dictionary that maps filenames in archive to corresponding
`zipfile.ZipInfo` objects. The method
`zipfile.ZipFile.getinfo()` is simply a wrapper for a
dictionary lookup in this attribute.
zipfile.ZipFile.compression
Compression type currently in effect for new
`zipfile.ZipFile.write()` operations. Modify with due
caution (most likely not at all after initialization).
zipfile.ZipFile.debug = 0
Attribute for level of debugging information sent to
STDOUT. Values range from the default 0 (no output) to 3
(verbose). May be modified.
zipfile.ZipFile.filelist
List of `zipfile.ZipInfo` objects contained in the
`zipfile.ZipFile` object. The method
`zipfile.ZipFile.infolist()` is simply a wrapper to
retrieve this attribute. Modify with due caution (most
likely not at all).
zipfile.ZipFile.filename
Filename of the `zipfile.ZipFile` object. DO NOT modify!
zipfile.ZipFile.fp
Underlying file object for the `zipfile.ZipFile` object.
DO NOT modify!
zipfile.ZipFile.mode
Access mode of current `zipfile.ZipFile` object. DO NOT
modify!
zipfile.ZipFile.start_dir
Position of start of central directory. DO NOT modify!
zipfile.ZipInfo.CRC
Hash value of this archived file. DO NOT modify!
zipfile.ZipInfo.comment
Comment attached to this archived file. Modify with due
caution (e.g., for use with `zipfile.ZipFile.writestr()`).
zipfile.ZipInfo.compress_size
Size of the compressed data of this archived file. DO NOT
modify!
zipfile.ZipInfo.compress_type
Compression type used with this archived file. Modify with
due caution (e.g., for use with `zipfile.ZipFile.writestr()`).
zipfile.ZipInfo.create_system
System that created this archived file. Modify with due
caution (e.g., for use with `zipfile.ZipFile.writestr()`).
zipfile.ZipInfo.create_version
PKZip version that created the archive. Modify with due
caution (e.g., for use with `zipfile.ZipFile.writestr()`).
zipfile.ZipInfo.date_time
Timestamp of this archived file. Modify with due caution
(e.g., for use with `zipfile.ZipFile.writestr()`).
zipfile.ZipInfo.external_attr
File attribute of archived file when extracted.
zipfile.ZipInfo.extract_version
PKZip version needed to extract the archive. Modify with
due caution (e.g., for use with `zipfile.ZipFile.writestr()`).
zipfile.ZipInfo.file_offset
Byte offset to start of file data. DO NOT modify!
zipfile.ZipInfo.file_size
Size of the uncompressed data in the archived file. DO NOT
modify!
zipfile.ZipInfo.filename
Filename of archived file. Modify with due caution (e.g.,
for use with `zipfile.ZipFile.writestr()`).
zipfile.ZipInfo.header_offset
Byte offset to file header of the archived file. DO NOT
modify!
zipfile.ZipInfo.volume
Volume number of the archived file. DO NOT modify!
EXCEPTIONS:
zipfile.error
Exception that is raised when a corrupt ZIP file is
processed.
zipfile.BadZipFile
Alias for `zipfile.error`.
SEE ALSO, [zlib], [gzip]
=================================================================
MODULE -- zlib : Compress and decompress with zlib library
=================================================================
[zlib] is the underlying compression engine for all Python
standard library compression modules. Moreover, [zlib] is
extremely useful in itself for compression and decompression of
data that does not necessarily live in files (or where data
does not map directly to files, even if it winds up in them
indirectly). The Python [zlib] module relies on the
availability of the zlib system library.
There are two basic modes of operation for [zlib]. In the
simplest mode, one can simply pass an uncompressed string to
`zlib.compress()` and have the compressed version returned.
Using `zlib.decompress()` is symmetrical. In a more
complicated mode, one can create compression or decompression
objects that are able to receive incremental raw or compressed
byte-streams, and return partial results based on what they have
seen so far. This mode of operation is similar to the way one
uses `sha.sha.update()`, `md5.md5.update()`,
`rotor.encryptmore()`, or `binascii.crc32()` (albeit for a
different purpose from each of those). For large byte-streams
that are determined incrementally, it may be more practical to utilize
compression/decompression objects than it would be to
compress/decompress an entire string at once (for example, if
the input or result is bound to a slow channel).
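As a minimal sketch of the simpler mode, the round trip below assumes a
Python 3 interpreter, where [zlib] operates on byte strings ('bytes')
rather than on the plain strings used in this chapter. The sample data
is made up for illustration:

```python
import zlib

# Illustrative sample data; Python 3 requires bytes, not str
raw = b"spam and eggs " * 200

# One-shot mode: the whole input goes in, the whole result comes out
packed = zlib.compress(raw)                           # default level
smallest = zlib.compress(raw, zlib.Z_BEST_COMPRESSION)
fastest = zlib.compress(raw, zlib.Z_BEST_SPEED)

# Decompression is symmetrical, whatever level was used to compress
restored = [zlib.decompress(z) for z in (packed, smallest, fastest)]
print(len(raw), len(packed), len(smallest), len(fastest))
```

The printed sizes depend on the installed zlib library, so none are
quoted here.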
CONSTANTS:
zlib.ZLIB_VERSION
The installed zlib system library version.
zlib.Z_BEST_COMPRESSION = 9
Highest compression level.
zlib.Z_BEST_SPEED = 1
Fastest compression level.
zlib.Z_HUFFMAN_ONLY = 2
Compression strategy (rather than a compression level) that
uses Huffman codes, but not Lempel-Ziv.
FUNCTIONS:
zlib.adler32(s [,crc])
Return the Adler-32 checksum of the first argument 's'.
If the second argument 'crc' is specified, it will be used
as an initial checksum. This allows partial computation
of a checksum and continuation. An Adler-32 checksum can
be computed much more quickly than a CRC32 checksum.
Unlike [md5] or [sha], an Adler-32 checksum is not
sufficient for cryptographic hashes, but merely for
detection of accidental corruption of data.
SEE ALSO, `zlib.crc32()`, [md5], [sha]
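The continuation behavior of the 'crc' argument can be sketched as
follows. This assumes Python 3 (byte strings, unsigned checksum
values); the two message parts are arbitrary sample data. The same
pattern applies to `zlib.crc32()`:

```python
import binascii
import zlib

part1, part2 = b"Mary had a ", b"little lamb"

# Checksum of the whole message computed in one call
whole = zlib.adler32(part1 + part2)

# Same checksum computed piecewise: the running value from the
# first call seeds the second call
running = zlib.adler32(part1)
running = zlib.adler32(part2, running)

# CRC32 continues the same way, and agrees with binascii.crc32()
crc_running = zlib.crc32(part2, zlib.crc32(part1))
crc_whole = binascii.crc32(part1 + part2)
print(whole == running, crc_running == crc_whole)
```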
zlib.compress(s [,level])
Return the zlib compressed version of the string in the
first argument 's'. If the second argument 'level' is
specified, the compression technique can be fine-tuned.
The compression level ranges from 1 to 9 and may also be
specified using symbolic constants such as
Z_BEST_COMPRESSION and Z_BEST_SPEED. The default value for
'level' is 6 and is usually the desired compression level
(usually within a few percent of the speed of
Z_BEST_SPEED and within a few percent of the size of
Z_BEST_COMPRESSION).
SEE ALSO, `zlib.decompress()`, `zlib.compressobj`
zlib.crc32(s [,crc])
Return the CRC32 checksum of the first argument 's'. If
the second argument 'crc' is specified, it will be used as
an initial checksum. This allows partial computation of a
checksum and continuation. Unlike [md5] or [sha], a
CRC32 checksum is not sufficient for cryptographic hashes,
but merely for detection of accidental corruption of data.
Identical to `binascii.crc32()` (example appears there).
SEE ALSO, `binascii.crc32()`, `zlib.adler32()`, [md5],
[sha]
zlib.decompress(s [,winsize [,buffsize]])
Return the decompressed version of the zlib compressed
string in the first argument 's'. If the second argument
'winsize' is specified, it determines the base 2 logarithm
of the history buffer size. The default 'winsize' is 15.
If the third argument 'buffsize' is specified, it
determines the size of the decompression buffer. The
default 'buffsize' is 16384, but more is dynamically
allocated if needed. One rarely needs to use 'winsize'
and 'buffsize' values other than the defaults.
SEE ALSO, `zlib.compress()`, `zlib.decompressobj`
CLASS FACTORIES:
[zlib] does not define true classes that can be specialized.
`zlib.compressobj()` and `zlib.decompressobj()` are actually
factory-functions rather than classes. That is, they return
instance objects, just as classes do, but they do not have
unbound data and methods. For most users, the difference is not
important: To get a `zlib.compressobj` or `zlib.decompressobj`
object, you just call that factory-function in the same manner
you would a class object.
zlib.compressobj([level])
Create a compression object. A compression object is able
to incrementally compress new strings that are fed to it
while maintaining the seeded symbol table from previously
compressed byte-streams. If argument 'level' is specified,
the compression technique can be fine-tuned. The
compression-level ranges from 1 to 9. The default value
for 'level' is 6 and is usually the desired compression
level.
SEE ALSO, `zlib.compress()`, `zlib.decompressobj()`
zlib.decompressobj([winsize])
Create a decompression object. A decompression object is
able to incrementally decompress new strings that are
fed to it while maintaining the seeded symbol table from
previously decompressed byte-streams. If the argument
'winsize' is specified, it determines the base 2 logarithm
of the history buffer size. The default 'winsize' is 15.
SEE ALSO, `zlib.decompress()`, `zlib.compressobj()`
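One way to use the two factory-functions together is sketched below,
assuming Python 3 and byte strings. The helper names
'compress_chunks' and 'decompress_chunks' are hypothetical, chosen
only for this illustration:

```python
import zlib

def compress_chunks(chunks, level=6):
    "Yield compressed pieces for an iterable of bytes chunks"
    com = zlib.compressobj(level)
    for chunk in chunks:
        piece = com.compress(chunk)
        if piece:               # often empty while data is buffered
            yield piece
    yield com.flush()           # emit whatever is still buffered

def decompress_chunks(pieces):
    "Yield decompressed data for an iterable of compressed pieces"
    decom = zlib.decompressobj()
    for piece in pieces:
        data = decom.decompress(piece)
        if data:
            yield data
    tail = decom.flush()
    if tail:
        yield tail

chunks = [b"first chunk, ", b"second chunk, ", b"third chunk"]
result = b"".join(decompress_chunks(compress_chunks(chunks)))
print(result)
```

Because both helpers are generators, neither the raw nor the
compressed data ever needs to be held in memory all at once.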
METHODS AND ATTRIBUTES:
zlib.compressobj.compress(s)
Add more data to the compression object. If the symbol table
becomes full, compressed data is returned, otherwise an
empty string. All returned output from each repeated call
to `zlib.compressobj.compress()` should be concatenated to
a decompression byte-stream (either a string or a
decompression object). The example below, if run in a
directory with some files, lets one examine the buffering
behavior of compression objects:
#---------- zlib_objs.py ----------#
# Demonstrate compression object streams
import zlib, glob
decom = zlib.decompressobj()
com = zlib.compressobj()
for file in glob.glob('*'):
    s = open(file, 'rb').read()   # read raw bytes of each file
    c = com.compress(s)
    print 'COMPRESSED:', len(c), 'bytes out'
    d = decom.decompress(c)
    print 'DECOMPRESS:', len(d), 'bytes out'
    print 'UNUSED DATA:', len(decom.unused_data), 'bytes'
    raw_input('-- %s (%s bytes) --' % (file, `len(s)`))
f = com.flush()
m = decom.decompress(f)
print 'DECOMPRESS:', len(m), 'bytes out'
print 'UNUSED DATA:', len(decom.unused_data), 'bytes'
SEE ALSO, `zlib.compressobj.flush()`,
`zlib.decompressobj.decompress()`,
`zlib.compress()`
zlib.compressobj.flush([mode])
Flush any buffered data from the compression object. As in
the example in `zlib.compressobj.compress()`, the output of
a `zlib.compressobj.flush()` should be concatenated to the
same decompression byte-stream as `zlib.compressobj.compress()`
calls are. If the first argument 'mode' is left empty, or
the default Z_FINISH is specified, the compression object
cannot be used further, and one should `del` it.
Otherwise, if Z_SYNC_FLUSH or Z_FULL_FLUSH are specified,
the compression object can still be used, but some
uncompressed data may not be recovered by the decompression
object.
SEE ALSO, `zlib.compress()`, `zlib.compressobj.compress()`
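A minimal sketch of flushing without finishing, assuming Python 3 and
byte strings. In this small case the decompression object recovers
everything sent before the Z_SYNC_FLUSH; the message contents are
made up:

```python
import zlib

com = zlib.compressobj()
decom = zlib.decompressobj()

# Z_SYNC_FLUSH pushes out everything buffered so far, and the
# compression object stays usable afterward
out1 = com.compress(b"first message;") + com.flush(zlib.Z_SYNC_FLUSH)
seen1 = decom.decompress(out1)

# The default flush mode (Z_FINISH) ends the compressed stream
out2 = com.compress(b"second message") + com.flush()
seen2 = decom.decompress(out2)
print(seen1, seen2)
```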
zlib.decompressobj.unused_data
As indicated, `zlib.decompressobj.unused_data` is an
instance attribute rather than a method. It holds any bytes
that were passed to the decompression object but that lie
past the end of the compressed stream. Normally, the
output of a compression object forms a complete compressed
stream, and this instance attribute remains an empty
string. Compressed data that is merely incomplete (for
example, data received in pieces over a channel) is not
placed here; it is buffered internally and decompressed as
later `zlib.decompressobj.decompress()` calls supply the rest.
SEE ALSO, `zlib.decompress()`,
`zlib.decompressobj.decompress()`
zlib.decompressobj.decompress(s)
Return the decompressed data that may be derived from the
current decompression object state and the argument 's'
data passed in. If 's' extends past the end of the
compressed stream, the remainder is left in
`zlib.decompressobj.unused_data`.
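The sketch below shows what lands in the 'unused_data' attribute in a
current Python 3 interpreter, as far as I can tell from its behavior:
bytes that follow the end of a complete compressed stream. The
payload and trailer are arbitrary sample data:

```python
import zlib

# A complete compressed stream followed by unrelated trailing bytes
stream = zlib.compress(b"payload") + b"TRAILER"

decom = zlib.decompressobj()
data = decom.decompress(stream)

print(data)               # the decompressed payload
print(decom.unused_data)  # bytes found after the end of the stream
print(decom.eof)          # true once the end of the stream was seen
```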
zlib.decompressobj.flush()
Return the decompressed data from any bytes buffered by
the decompression object. After this call, the
decompression object cannot be used further, and you
should `del` it.
EXCEPTIONS:
zlib.error
Exception that is raised by compression or decompression
errors.
SEE ALSO, [gzip], [zipfile]
TOPIC -- Unicode
--------------------------------------------------------------------
Note that Appendix C (Understanding Unicode) also discusses
Unicode issues.
Unicode is an enhanced set of character entities, well beyond
the basic 128 characters defined in ASCII encoding and the
codepage-specific national language sets that contain 128
characters each. The full Unicode character set--evolving
continuously, but with a large number of codepoints already
fixed--can contain over a million distinct characters.
This allows the representation of a large number of national
character sets within a unified encoding space, even the large
character sets of Chinese-Japanese-Korean (CJK) alphabets.
Although Unicode defines a unique codepoint for each distinct
character in its range, there are numerous -encodings- that
correspond to each character. The encoding called 'UTF-8'
defines ASCII characters as single bytes with standard ASCII
values. However, for non-ASCII characters, a variable number
of bytes (up to 4 under the current standard; the original
definition allowed up to 6) are used to encode characters, with
the "escape" to Unicode being indicated by high-bit values in
initial bytes of multibyte sequences. 'UTF-16' is similar,
but uses either 2 or 4 bytes to encode each character (but
never just 1). 'UTF-32' is a format that uses a fixed 4-byte
value for each Unicode character. 'UTF-32', however, is not
currently supported by Python.
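The byte counts described above can be checked with a short sketch.
It assumes a Python 3 interpreter, which (unlike the Python versions
this chapter describes) does include a 'utf-32' codec. The helper
name 'encoded_sizes' is hypothetical:

```python
# Three sample characters: ASCII, Hebrew alef, and a musical symbol
# that lies outside the 16-bit range
samples = ["A", "\u05d0", "\U0001d11e"]

def encoded_sizes(char):
    "Return the byte length of one character in three encodings"
    # The -le variants omit the "endian" lead bytes
    return tuple(len(char.encode(enc))
                 for enc in ("utf-8", "utf-16-le", "utf-32-le"))

for char in samples:
    print("U+%04X" % ord(char), encoded_sizes(char))
```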
Native Unicode support was added to Python 2.0. On the face of
it, it is a happy situation that Python supports Unicode--it
brings the world closer to multinational language support in
computer applications. But in practice, you have to be careful
when working with Unicode, because it is all too easy to
encounter glitches like the one below:
>>> import unicodedata
>>> alef, omega = unichr(1488), unichr(969)
>>> unicodedata.name(alef)
'HEBREW LETTER ALEF'
>>> print alef
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: ASCII encoding error: ordinal not in range(128)
>>> print chr(170)
ª
>>> if alef == chr(170): print "Hebrew is Roman diacritic"
...
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: ASCII decoding error: ordinal not in range(128)
A Unicode string that is composed of only ASCII characters,
however, is considered equal (but not identical) to a Python
string of the same characters.
>>> u"spam" == "spam"
1
>>> u"spam" is "spam"
0
>>> "spam" is "spam" # string interning is not guaranteed
1
>>> u"spam" is u"spam" # unicode interning not guaranteed
1
Still, the care you take should not discourage you from working
with multilanguage strings, as Unicode enables. It is really
amazingly powerful to be able to do so. As one says of a talking
dog: It is not that he speaks so -well-, but that he speaks at
all.
=================================================================
Built-In Unicode Functions/Methods
=================================================================
The Unicode string method `u"".encode()` and the built-in
function `unicode()` are inverse operations. The Unicode
string method returns a plain string with the 8-bit bytes
needed to represent it (using the specified or default
encoding). The built-in `unicode()` takes one of these encoded
strings, and produces the Unicode object represented by the
encoding. Specifically, suppose we define the function:
>>> chk_eq = lambda u,enc: u == unicode(u.encode(enc),enc)
The call `chk_eq(u,enc)` should return 1 for every value of
'u' and 'enc'--as long as 'enc' is a valid encoding name and 'u'
is capable of being represented in that encoding.
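For comparison, here is a sketch of the same round-trip check in
Python 3, where every string is Unicode and the inverse of
'.encode()' is the '.decode()' method of byte strings rather than a
`unicode()` built-in. The sample text and the list of encodings are
chosen only for illustration:

```python
def chk_eq(u, enc):
    "Encode then decode, and compare with the original string"
    return u == u.encode(enc).decode(enc)

text = "A\u05d0"      # 'A' followed by HEBREW LETTER ALEF
results = [chk_eq(text, enc)
           for enc in ("utf-8", "utf-16", "utf-16-le", "unicode-escape")]
print(results)
```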
The set of encodings supported for both built-ins is listed
below. Additional encodings may be registered using the
[codecs] module. Each encoding is indicated by the string that
names it, and the case of the string is normalized before
comparison (case-insensitive naming of encodings):
ascii, us-ascii
Encode using 7-bit ASCII.
base64
Encode Unicode strings using the base64 4-to-3 encoding
format.
latin-1, iso-8859-1
Encode using common European accent characters in high-bit
values of 8-bit bytes. Latin-1 characters' `ord()` values
are identical to their Unicode codepoints.
quopri
Encode in quoted printable format.
rot13
Not really a Unicode encoding, but "rotate 13 chars" is
included with Python 2.2+ as an example and convenience.
utf-7
Encode using variable byte-length encoding that is
restricted to 7-bit ASCII octets. As with 'utf-8', ASCII
characters encode themselves.
utf-8
Encode using variable byte-length encoding that preserves
ASCII value bytes.
utf-16
Encoding using 2/4 byte encoding. Include "endian" lead
bytes (platform-specific selection).
utf-16-le
Encoding using 2/4 byte encoding. Assume "little
endian," and do not prepend "endian" indicator bytes.
utf-16-be
Encoding using 2/4 byte encoding. Assume "big endian,"
and do not prepend "endian" indicator bytes.
unicode-escape
Encode using Python-style Unicode string constants
('u"\uXXXX"').
raw-unicode-escape
Encode using Python-style Unicode raw string constants
('ur"\uXXXX"').
The error modes for both built-ins are listed below. Errors in
encoding transformations may be handled in any of several ways:
strict
Raise 'UnicodeError' for all encoding or decoding errors.
Default handling.
ignore
Skip all invalid characters.
replace
Replace invalid characters with '?' (string target) or
'u"\ufffd"' (Unicode target).
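The three error modes can be compared side by side. The sketch below
assumes Python 3 syntax (byte strings for encoded data, '.decode()'
in place of `unicode()`); the sample values are arbitrary:

```python
text = "A\u05d0"          # the alef cannot be represented in ASCII

ignored = text.encode("ascii", "ignore")      # drop the alef
replaced = text.encode("ascii", "replace")    # substitute '?'

try:
    text.encode("ascii")                      # 'strict' is the default
    strict_failed = False
except UnicodeError:
    strict_failed = True

# Decoding direction: byte 0xff is not valid ASCII
decoded = b"A\xff".decode("ascii", "replace")
print(ignored, replaced, strict_failed, repr(decoded))
```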
u"".encode([enc [,errmode]])
"".encode([enc [,errmode]])
Return an encoded string representation of a Unicode string
(or of a plain string). The representation is in the style
of encoding 'enc' (or system default). This string is
suitable for writing to a file or stream that other
applications will treat as Unicode data. Examples show
several encodings:
>>> alef = unichr(1488)
>>> s = 'A'+alef
>>> s
u'A\u05d0'
>>> s.encode('unicode-escape')
'A\\u05d0'
>>> s.encode('utf-8')
'A\xd7\x90'
>>> s.encode('utf-16')
'\xff\xfeA\x00\xd0\x05'
>>> s.encode('utf-16-le')
'A\x00\xd0\x05'
>>> s.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: ASCII encoding error: ordinal not in range(128)
>>> s.encode('ascii','ignore')
'A'
unicode(s [,enc [,errmode]])
Return a Unicode string object corresponding to the encoded
string passed in the first argument 's'. The string 's'
might be a string that is read from another Unicode-aware
application. The representation is treated as conforming
to the style of the encoding 'enc' if the second argument
is specified, or system default otherwise (usually 'ascii').
Errors can be handled in the default 'strict' style or in
a style specified in the third argument 'errmode'.
unichr(cp)
Return a Unicode string object containing the single
Unicode character whose integer codepoint is passed in the
argument 'cp'.
=================================================================
MODULE -- codecs : Python Codec Registry, API, and helpers
=================================================================
The [codecs] module contains a lot of sophisticated
functionality to get at the internals of Python's Unicode
handling. Most of those capabilities are at a lower level than
programmers who are just interested in text processing need to
worry about. The documentation of this module, therefore, will
break slightly with the style of most of the documentation and
present only two very useful wrapper functions within the
[codecs] module.
codecs.open(filename=... [,mode='rb' [,encoding=... [,errors='strict'
[,buffering=1]]]])
This wrapper function provides a simple and direct means of
opening a Unicode file, and treating its contents directly
as Unicode. In contrast, the contents of a file opened with
the built-in `open()` function are written and read as
strings; reading or writing Unicode data to such a file
involves multiple passes through `u"".encode()` and `unicode()`.
The first argument 'filename' specifies the name of the
file to access. If the second argument 'mode' is
specified, the read/write mode can be selected. These
arguments work identically to those used by `open()`. If
the third argument 'encoding' is specified, this encoding
will be used to interpret the file (an incorrect encoding
will probably result in a 'UnicodeError'). Error handling
may be modified by specifying the fourth argument 'errors'
(the options are the same as with the built-in `unicode()`
function). A fifth argument 'buffering' may be specified
to use a specific buffer size (on platforms that support
this).
An example of usage clarifies the difference between
`codecs.open()` and the built-in `open()`:
>>> import codecs
>>> alef = unichr(1488)
>>> open('unicode_test','wb').write(('A'+alef).encode('utf-8'))
>>> open('unicode_test').read() # Read as plain string
'A\xd7\x90'
>>> # Now read directly as Unicode
>>> codecs.open('unicode_test', encoding='utf-8').read()
u'A\u05d0'
Data written back to a file opened with `codecs.open()`
should likewise be Unicode data.
SEE ALSO, `open()`
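Writing through the wrapper can be sketched as well. The version
below assumes Python 3, and uses a temporary directory so that it
does not disturb any existing file named 'unicode_test':

```python
import codecs
import os
import tempfile

text = "A\u05d0"

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "unicode_test")

    # Write text; the wrapper encodes it to UTF-8 bytes on the way out
    with codecs.open(path, "w", encoding="utf-8") as fh:
        fh.write(text)

    # The raw bytes on disk, read without any decoding
    with open(path, "rb") as fh:
        raw = fh.read()

    # Read back through the wrapper, which decodes to text
    with codecs.open(path, encoding="utf-8") as fh:
        decoded = fh.read()

print(repr(raw), repr(decoded))
```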
codecs.EncodedFile(file=..., data_encoding=... [,file_encoding=...
[,errors='strict']])
This function allows an already opened file to be wrapped
inside an "encoding translation" layer. The mode and
buffering are taken from the underlying file. By
specifying a second argument 'data_encoding' and a third
argument 'file_encoding', it is possible to generate
strings in one encoding within an application, then write
them directly into the appropriate file encoding. As with
`codecs.open()` and `unicode()`, an error handling style
may be specified with the fourth argument 'errors'.
The most likely purpose for `codecs.EncodedFile()` is where
an application is likely to receive byte-streams from
multiple sources, encoded according to multiple Unicode
encodings. By wrapping file objects (or file-like objects)
in an encoding translation layer, the strings coming in one
encoding can be transparently written to an output in the
format the output expects. An example clarifies:
>>> import codecs
>>> alef = unichr(1488)
>>> open('unicode_test','wb').write(('A'+alef).encode('utf-8'))
>>> fp = open('unicode_test','rb+')
>>> fp.read() # Plain string w/ two-byte UTF-8 char in it
'A\xd7\x90'
>>> utf16_writer = codecs.EncodedFile(fp,'utf-16','utf-8')
>>> ascii_writer = codecs.EncodedFile(fp,'ascii','utf-8')
>>> utf16_writer.tell() # Wrapper keeps same current position
3
>>> s = alef.encode('utf-16')
>>> s # Plain string as UTF-16 encoding
'\xff\xfe\xd0\x05'
>>> utf16_writer.write(s)
>>> ascii_writer.write('XYZ')
>>> fp.close() # File should be UTF-8 encoded
>>> open('unicode_test').read()
'A\xd7\x90\xd7\x90XYZ'
SEE ALSO, `codecs.open()`
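A smaller sketch of the same translation layer, assuming Python 3. An
in-memory 'io.BytesIO' object stands in for a file opened in binary
mode, which keeps the example from touching the filesystem:

```python
import codecs
import io

backing = io.BytesIO()    # stands in for a file opened in binary mode

# The application hands over UTF-16 data; the file stores UTF-8
recoder = codecs.EncodedFile(backing, "utf-16", "utf-8")
recoder.write("A\u05d0".encode("utf-16"))

stored = backing.getvalue()
print(repr(stored))
```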
=================================================================
MODULE -- unicodedata : Database of Unicode characters
=================================================================
The module [unicodedata] is a database of Unicode character
entities. Most of the functions in [unicodedata] take as an
argument one Unicode character and return some information about
the character contained in a plain (non-Unicode) string. The
function of [unicodedata] is essentially informational, rather
than transformational. Of course, an application might make
decisions about the transformations performed based on the
information returned by [unicodedata]. The short utility below
provides all the information available for any Unicode
codepoint:
#------------------ unichr_info.py ----------------------#
# Return all the information [unicodedata] has
# about the single unicode character whose codepoint
# is specified as a command-line argument.
# Arg may be any expression evaluating to an integer
from unicodedata import *
import sys
char = unichr(eval(sys.argv[1]))
print 'bidirectional', bidirectional(char)
print 'category ', category(char)
print 'combining ', combining(char)
print 'decimal ', decimal(char,0)
print 'decomposition', decomposition(char)
print 'digit ', digit(char,0)
print 'mirrored ', mirrored(char)
print 'name ', name(char,'NOT DEFINED')
print 'numeric ', numeric(char,0)
try: print 'lookup ', `lookup(name(char))`
except: print "Cannot lookup"
The usage of 'unichr_info.py' is illustrated below by the runs
with two possible arguments:
#*--------------- Using unichr_info.py ------------------#
% python unichr_info.py 1488
bidirectional R
category Lo
combining 0
decimal 0
decomposition
digit 0
mirrored 0
name HEBREW LETTER ALEF
numeric 0
lookup u'\u05d0'
% python unichr_info.py ord('1')
bidirectional EN
category Nd
combining 0
decimal 1
decomposition
digit 1
mirrored 0
name DIGIT ONE
numeric 1.0
lookup u'1'
For additional information on current Unicode character
codepoints and attributes, consult:
FUNCTIONS:
unicodedata.bidirectional(unichr)
Return the bidirectional characteristic of the character
specified in the argument 'unichr'. Possible values are
AL, AN, B, BN, CS, EN, ES, ET, L, LRE, LRO, NSM, ON, PDF, R,
RLE, RLO, S, and WS. Consult the URL above for details on
these. Particularly notable values are L (left-to-right), R
(right-to-left), and WS (whitespace).
unicodedata.category(unichr)
Return the category of the character specified in the
argument 'unichr'. Possible values are Cc, Cf, Cn, Ll,
Lm, Lo, Lt, Lu, Mc, Me, Mn, Nd, Nl, No, Pc, Pd, Pe, Pf,
Pi, Po, Ps, Sc, Sk , Sm, So, Zl, Zp, and Zs. The first
(capital) letter indicates L (letter), M (mark), N
(number), P (punctuation), S (symbol), Z (separator), or
C (other). The second letter is generally mnemonic within
the major category of the first letter. Consult the URL
above for details.
unicodedata.combining(unichr)
Return the numeric combining class of the character
specified in the argument 'unichr'. These include values
such as 218 (below left) or 210 (right attached). Consult
the URL above for details.
unicodedata.decimal(unichr [,default])
Return the numeric decimal value assigned to the character
specified in the argument 'unichr'. If the second argument
'default' is specified, return that if no value is assigned
(otherwise raise 'ValueError').
unicodedata.decomposition(unichr)
Return the decomposition mapping of the character specified
in the argument 'unichr', or empty string if none exists.
Consult the URL above for details. An example shows that
some characters may be broken into component characters:
>>> from unicodedata import *
>>> name(unichr(190))
'VULGAR FRACTION THREE QUARTERS'
>>> decomposition(unichr(190))
'<fraction> 0033 2044 0034'
>>> name(unichr(0x33)), name(unichr(0x2044)), name(unichr(0x34))
('DIGIT THREE', 'FRACTION SLASH', 'DIGIT FOUR')
unicodedata.digit(unichr [,default])
Return the numeric digit value assigned to the character
specified in the argument 'unichr'. If the second argument
'default' is specified, return that if no value is assigned
(otherwise raise 'ValueError').
unicodedata.lookup(name)
Return the Unicode character with the name specified in
the first argument 'name'. Matches must be exact, and
'KeyError' is raised if no match is found. For example:
>>> from unicodedata import *
>>> lookup('GREEK SMALL LETTER ETA')
u'\u03b7'
>>> lookup('ETA')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
KeyError: undefined character name
SEE ALSO, `unicodedata.name()`
unicodedata.mirrored(unichr)
Return 1 if the character specified in the argument
'unichr' is a mirrored character in bidirectional text.
Return 0 otherwise.
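unicodedata.name(unichr [,default])
Return the name of the character specified in the argument
'unichr'. Names are in all caps and have a regular form
by descending category importance. If the second argument
'default' is specified, return that if no name is defined
(otherwise raise 'ValueError'). Consult the URL above
for details.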
SEE ALSO, `unicodedata.lookup()`
unicodedata.numeric(unichr [,default])
Return the floating point numeric value assigned to the
character specified in the argument 'unichr'. If the
second argument 'default' is specified, return that if no
value is assigned (otherwise raise 'ValueError').
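The functions above can be gathered into one helper, sketched here
for Python 3 (where `chr()` takes the place of `unichr()`). The name
'char_info' is hypothetical, and 'None' is used as the default where
a character has no value assigned:

```python
import unicodedata

def char_info(char):
    "Collect what unicodedata reports about a single character"
    return {
        "bidirectional": unicodedata.bidirectional(char),
        "category": unicodedata.category(char),
        "combining": unicodedata.combining(char),
        "decimal": unicodedata.decimal(char, None),
        "decomposition": unicodedata.decomposition(char),
        "digit": unicodedata.digit(char, None),
        "mirrored": unicodedata.mirrored(char),
        "name": unicodedata.name(char, "NOT DEFINED"),
        "numeric": unicodedata.numeric(char, None),
    }

alef = char_info(chr(1488))
one = char_info("1")
print(alef["name"], alef["category"], alef["bidirectional"])
print(one["name"], one["decimal"], one["numeric"])
```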
SECTION 3 -- Solving Problems
------------------------------------------------------------------------
EXERCISE: Many ways to take out the garbage
--------------------------------------------------------------------
DISCUSSION:
Recall, if you will, the dictum in "The Zen of Python" that
"There should be one--and preferably only one--obvious way to
do it." As with most dictums, the real world sometimes fails
our ideals. Also as with most dictums, this is not necessarily
such a bad thing.
A discussion on the newsgroup '' in 2001 posed
an apparently rather simple problem. The immediate problem was
that one might encounter telephone numbers with a variety of
dividers and delimiters inside them. For example, '(123)
456-7890', '123-456-7890', or '123/456-7890' might all represent
the same telephone number, and all forms might be encountered in
textual data sources (such as ones entered by users of a
free-form entry field). For purposes of this problem, the
canonical form of this number should be '1234567890'.
The problem mentioned here can be generalized in some natural
ways: Maybe we are interested in only some of the characters
within a longer text field (in this case, the digits), and the
rest is simply filler. So the general problem is how to
extract the content out from the filler.
The first and "obvious" approach might be a procedural loop
through the initial string. One version of this approach might
look like:
>>> s = '(123)/456-7890'
>>> result = ''
>>> for c in s:
...     if c in '0123456789':
...         result = result + c
...
>>> result
'1234567890'
This first approach works fine, but it might seem a bit bulky for
what is, after all, basically a single action. And it might also
seem odd that you need to loop through character-by-character
rather than just transform the whole string.
One possibly simpler approach is to use a regular expression. For
readers who have skipped to the next chapter, or who know regular
expressions already, this approach seems obvious:
>>> import re
>>> s = '(123)/456-7890'
>>> re.sub(r'\D', '', s)
'1234567890'
The actual work done (excluding defining the initial string and
importing the [re] module) is just one short expression. Good
enough, but one catch with regular expressions is that they are
frequently far slower than basic string operations. This makes
no difference for the tiny example presented, but for
processing megabytes, it could start to matter.
Using a functional style of programming is one way to express
the "filter" in question rather tersely, and perhaps more
efficiently. For example:
>>> s = '(123)/456-7890'
>>> filter(lambda c:c.isdigit(), s)
'1234567890'
We also get something short, without needing to use regular
expressions. Here is another technique that utilizes string
object methods and list comprehensions, and also pins some hopes
on the great efficiency of Python dictionaries:
>>> isdigit = {'0':1,'1':1,'2':1,'3':1,'4':1,
... '5':1,'6':1,'7':1,'8':1,'9':1}.has_key
>>> ''.join([x for x in s if isdigit(x)])
'1234567890'
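For readers working in Python 3, the four techniques might be
rendered as sketched below. The function names are hypothetical, and
two details differ from the sessions above: `filter()` returns an
iterator rather than a string, and dictionaries no longer have a
'.has_key()' method:

```python
import re

s = "(123)/456-7890"

def by_loop(text):
    "Procedural loop, building the result one character at a time"
    result = ""
    for c in text:
        if c in "0123456789":
            result = result + c
    return result

def by_regex(text):
    "Regular expression that deletes every non-digit"
    return re.sub(r"\D", "", text)

def by_filter(text):
    "Functional style; join the iterator that filter() returns"
    return "".join(filter(lambda c: c.isdigit(), text))

# Membership test bound to a dictionary of the ten digits
is_digit = dict.fromkeys("0123456789", 1).__contains__

def by_dict(text):
    "List comprehension with a dictionary lookup per character"
    return "".join([c for c in text if is_digit(c)])

results = [f(s) for f in (by_loop, by_regex, by_filter, by_dict)]
print(results)
```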
QUESTIONS:
1. Which content extraction technique seems most natural to
you? Which would you prefer to use? Explain why.
2. What intuitions do you have about the performance of these
different techniques, if applied to large data sets? Are
there differences in comparative efficiency of techniques
between operating on one single large string input and
operating on a large number of small string inputs?
3. Construct a program to verify or refute your intuitions
about performance of the constructs.
4. Can you think of ways of combining these techniques to
maximize efficiency? Are there any other techniques available
that might be even better (hint: think about what
`string.translate()` does)? Construct a faster technique,
and demonstrate its efficiency.
5. Are there reasons other than raw processing speed to prefer
some of these techniques over others? Explain these reasons,
if they exist.
EXERCISE: Making sure things are what they should be
--------------------------------------------------------------------
DISCUSSION:
The concept of a "digital signature" was introduced in Section
2.2.4. As was mentioned, the Python standard library does not
include (directly) any support for digital signatures. One way to
characterize a digital signature is as some information that
-proves- or -verifies- that some other information really is what
it purports to be. But this characterization actually applies to
a broader set of things than just digital signatures. In
cryptology literature one is accustomed to talk about the "threat
model" a crypto-system defends against. Let us look at a few.
Data may be altered by malicious tampering, but it may also be
altered by packet loss, storage-media errors, or by program
errors. The threat of accidental damage to data is the easiest
threat to defend against. The standard technique is to use a
hash of the correct data and send that also. The receiver of
the data can simply calculate the hash of the data
herself--using the same algorithm--and compare it with the
hash sent. A very simple utility like the one below does this:
#---------- crc32.py ----------#
# Calculate CRC32 hash of input files or STDIN
# Incremental read for large input sources
# Usage: python crc32.py [file1 [file2 [...]]]
# or: python crc32.py < STDIN
import binascii
import fileinput
filelist = []
crc = binascii.crc32('')
for line in fileinput.input():
    if fileinput.isfirstline():
        if fileinput.isstdin():
            filelist.append('STDIN')
        else:
            filelist.append(fileinput.filename())
    crc = binascii.crc32(line,crc)
print 'Files:', ' '.join(filelist)
print 'CRC32:', crc
A slightly faster version could use `zlib.adler32()` instead of
`binascii.crc32`. The chance that a randomly corrupted file would
have the right CRC32 hash is approximately (2**-32)--unlikely
enough not to worry about most times.
A CRC32 hash, however, is far too weak to be used
cryptographically. While random data error will almost surely not
create a chance hash collision, a malicious tamperer-- Mallory,
in crypto-parlance--can find one relatively easily. Specifically,
suppose the true message is M, Mallory can find an M' such that
CRC32(M) equals CRC32(M'). Moreover, even imposing the condition
that M' appears plausible as a message to the receiver does not
make Mallory's tasks particularly difficult.
To thwart fraudulent messages, it is necessary to use a
cryptographically strong hash, such as [SHA] or [MD5]. A
utility that does so is almost the same as the one above:
#---------- sha.py ----------#
# Calculate SHA hash of input files or STDIN
# Usage: python sha.py [file1 [file2 [...]]]
# or: python sha.py < STDIN
import sha, fileinput, os, sys
filelist = []
sha = sha.sha()
for line in fileinput.input():
    if fileinput.isfirstline():
        if fileinput.isstdin():
            filelist.append('STDIN')
        else:
            filelist.append(fileinput.filename())
    sha.update(line[:-1]+os.linesep) # same as binary read
sys.stderr.write('Files: '+' '.join(filelist)+'\nSHA: ')
print sha.hexdigest()
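In later Python versions the [sha] and [md5] modules were folded into
[hashlib]. A sketch of the same incremental hashing for Python 3
follows; the helper name 'sha1_of_stream' is hypothetical, and an
in-memory stream stands in for a file opened in 'rb' mode:

```python
import hashlib
import io

def sha1_of_stream(stream, blocksize=65536):
    "Return the hex SHA-1 digest of a binary file-like object"
    digest = hashlib.sha1()
    while True:
        block = stream.read(blocksize)
        if not block:
            break
        digest.update(block)      # incremental, like sha.update()
    return digest.hexdigest()

# The tiny blocksize forces several update() calls
sample = b"Mary had a little lamb\n"
hexhash = sha1_of_stream(io.BytesIO(sample), 8)
print(hexhash)
```

Reading fixed-size binary blocks, rather than lines, sidesteps the
line-ending adjustment that 'sha.py' needs.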
An SHA or MD5 hash cannot be forged practically, but if our
threat model includes a malicious tamperer, we need to worry
about whether the hash itself is authentic. Mallory, our
tamperer, can produce a false SHA hash that matches her false
message. With CRC32 hashes, a very common procedure is to attach
the hash to the data message itself--for example, as the first or
last line of the data file, or within some wrapper lines. This is
called an "in band" or "in channel" transmission. One alternative
is "out of band" or "off channel" transmission of cryptographic
hashes. For example, a set of cryptographic hashes matching data
files could be placed on a Web page. Merely transmitting the hash
off channel does not guarantee security, but it does require
Mallory to attack both channels effectively.
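The in-band approach described above can be sketched in a few lines. This assumes Python 3 and a convention, invented here for illustration, that the checksum travels as eight hexadecimal digits on a final line. The names `attach_crc` and `verify_crc` are hypothetical.

```python
import zlib

def attach_crc(data):
    """Return data with its CRC32 appended as a final hex line."""
    return data + b'%08x\n' % zlib.crc32(data)

def verify_crc(wrapped):
    """Split off the final line; return (body, checksum_matches)."""
    # Last 9 bytes are eight hex digits plus a newline
    body, claimed = wrapped[:-9], wrapped[-9:-1]
    return body, claimed == b'%08x' % zlib.crc32(body)

original = b'Mary had a little lamb\n'
wrapped = attach_crc(original)
body, ok = verify_crc(wrapped)

# Simulate corruption of the message in transit
damaged = wrapped.replace(b'lamb', b'goat')
_, damaged_ok = verify_crc(damaged)
```

This catches accidental damage, but it does nothing against Mallory, who can simply recompute the final line after altering the body.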
By using encryption, it is possible to transmit a secured hash
in channel. The key here is to encrypt the hash and attach
that encrypted version. If some identifying information is
appended to the hash before the encryption, that information can
be recovered on decryption to prove identity. Otherwise, one could
simply include both the hash and its encrypted version. For the
encryption of the hash, an asymmetrical encryption algorithm is
ideal; however, with the Python standard library, the best we
can do is to use the (weak) symmetrical encryption in [rotor].
For example, we could use the utility below:
#---------- hash_rotor.py ----------#
#!/usr/bin/env python
# Encrypt hash on STDIN using sys.argv[1] as password
import rotor, sys, binascii
cipher = rotor.newrotor(sys.argv[1])
hexhash = sys.stdin.read()[:-1] # no newline
print hexhash
hash = binascii.unhexlify(hexhash)
sys.stderr.write('Encryption: ')
print binascii.hexlify(cipher.encrypt(hash))
The utilities could then be used like:
#*-------- hash_rotor at work --------#
% cat mary.txt
Mary had a little lamb
% python sha.py mary.txt | hash_rotor.py mypassword >> mary.txt
Files: mary.txt
SHA: Encryption:
% cat mary.txt
Mary had a little lamb
c49bf9a7840f6c07ab00b164413d7958e0945941
63a9d3a2f4493d957397178354f21915cb36f8f8
The penultimate line of the file now has its SHA hash, and the
last line has an encryption of the hash. The password used will
somehow need to be transmitted securely for the receiver to
validate the appended document (obviously, the whole system makes
more sense with longer and more proprietary documents than in the
example).
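The [rotor] module is not available in current Python versions, so the utility above will not run there. A related technique that the standard library does still support is a keyed hash (HMAC), which mixes the shared password into the hash itself rather than encrypting the hash afterward. The sketch below swaps that in. It is a different technique from the one in 'hash_rotor.py', it assumes Python 3, and the names `sign` and `verify` are hypothetical.

```python
import hashlib
import hmac

def sign(message, password):
    """Hex tag that depends on both the message and the password."""
    return hmac.new(password, message, hashlib.sha256).hexdigest()

def verify(message, password, tag):
    """True if tag matches; comparison runs in constant time."""
    return hmac.compare_digest(sign(message, password), tag)

message = b'Mary had a little lamb\n'
password = b'mypassword'
tag = sign(message, password)

good = verify(message, password, tag)
tampered = verify(b'Mary had a little goat\n', password, tag)
wrong_key = verify(message, b'guess', tag)
```

As with the rotor version, the receiver must obtain the password over some secure channel. Without it, Mallory cannot produce a tag that matches an altered message.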
QUESTIONS:
1. How would you wrap up the suggestions in the small
utilities above into a more robust and complete
"digital_signatures.py" utility or module? What concerns
would come into a completed utility?
2. Why is CRC32 not suitable for cryptographic purposes? What
sets SHA and MD5 apart (you should not need to know the
details of the algorithm for this answer)? Why is
uniformity of coverage of hash results important for any
hash algorithm?
3. Explain in your own words why hashes serve to verify
documents. If you were actually the malicious attacker in
the scenarios above, how would you go about interfering
with the crypto-systems outlined here? What lines of
attack are left open by the system you sketched out or
programmed in (1)?
4. If messages are subject to corruptions, including
accidental corruption, so are hashes. The short length of
hashes may make problems in them less likely, but not
impossible. How might you enhance the document verification
systems above to detect corruption within a hash itself?
How might you allow more accurate targeting of corrupt
versus intact portions of a large document (it may be
desirable to recover as much as possible from a corrupt
document)?
5. Advanced: The RSA public-key algorithm is actually quite
simple; it just involves some modulo exponentiation
operations and some large primes. An explanation can be
found, among other places, in the author's -Introduction
to Cryptology Concepts II-.
Try implementing an RSA public-key algorithm in Python, and
use this to enrich the digital signature system you
developed above.
EXERCISE: Finding needles in haystacks (full-text indexing)
--------------------------------------------------------------------
DISCUSSION:
Many texts you deal with are loosely structured and prose-like,
rather than composed of well-ordered records. For documents of
that sort, a very frequent question you want answered is, "What
is (or isn't) in the documents?"--at a more general level than
the semantic richness you might obtain by actually -reading- the
documents. In particular, you often want to check a large
collection of documents to determine the (comparatively) small
subset of them that are relevant to a given area of interest.
A certain category of questions about document collections has
nothing much to do with text processing. For example, to locate
all the files modified within a certain time period, and having a
certain file size, some basic use of the [os.path] module
suffices. Below is a sample utility to do such a search, which
includes some typical argument parsing and help screens. The
search itself is only a few lines of code:
#---------- findfile1.py ----------#
# Find files matching date and size
_usage = """
Usage:
python findfile1.py [-start=days_ago] [-end=days_ago]
[-small=min_size] [-large=max_size] [pattern]
Example:
python findfile1.py -start=10 -end=5 -small=1000 -large=5000 *.txt
"""
import os.path
import time
import glob
import sys
def parseargs(args):
    """Somewhat flexible argument parser for multiple platforms.
    Switches can start with - or /, keywords can end with = or :.
    No error checking for bad arguments is performed, however.
    """
    now = time.time()
    secs_in_day = 60*60*24
    start = 0 # start of epoch
    end = time.time() # right now
    small = 0 # empty files
    large = sys.maxint # max file size
    pat = '*' # match all
    for arg in args:
        if arg[0] in '-/':
            if arg[1:6]=='start': start = now-(secs_in_day*int(arg[7:]))
            elif arg[1:4]=='end': end = now-(secs_in_day*int(arg[5:]))
            elif arg[1:6]=='small': small = int(arg[7:])
            elif arg[1:6]=='large': large = int(arg[7:])
            elif arg[1] in 'h?': print _usage
        else:
            pat = arg
    return (start,end,small,large,pat)
if __name__ == '__main__':
    if len(sys.argv) > 1:
        (start,end,small,large,pat) = parseargs(sys.argv[1:])
        for fname in glob.glob(pat):
            if not os.path.isfile(fname):
                continue # don't check directories
            modtime = os.path.getmtime(fname)
            size = os.path.getsize(fname)
            if small <= size <= large and start <= modtime <= end:
                print time.ctime(modtime),'%8d '%size,fname
    else: print _usage
What about searching for text inside files? The `string.find()`
function is good for locating contents quickly and could be
used to search files for contents. But for large document
collections, hits may be common. To make sense of search
results, ranking the results by number of hits can help. The
utility below performs a match-accuracy ranking (for brevity,
without the argument parsing of 'findfile1.py'):
#---------- findfile2.py ----------#
# Find files that contain a word
_usage = "Usage: python findfile2.py word"
import os.path
import glob
import sys
if len(sys.argv) == 2:
    search_word = sys.argv[1]
    results = []
    for fname in glob.glob('*'):
        if os.path.isfile(fname): # don't check directories
            text = open(fname).read()
            fsize = len(text)
            hits = text.count(search_word)
            density = (fsize > 0) and float(hits)/(fsize)
            if density > 0: # consider when density==0
                results.append((density,fname))
    results.sort()
    results.reverse()
    print 'RANKING FILENAME'
    print '------- --------------------------'
    for match in results:
        print '%6d '%int(match[0]*1000000), match[1]
else:
    print _usage
Variations on these are, of course, possible. But generally
you could build pretty sophisticated searches and rankings by
adding new search options incrementally to 'findfile2.py'. For
example, adding some regular expression options could give the
utility capabilities similar to the 'grep' utility.
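A minimal sketch of one such variation follows: the same density ranking, but counting regular expression matches instead of literal substrings. It assumes Python 3 and works on in-memory (name, text) pairs rather than files, to keep it self-contained. The names `density` and `rank` are hypothetical.

```python
import re

def density(text, pattern, flags=0):
    """Matches of a regular expression per character of text."""
    if not text:
        return 0.0
    return len(re.findall(pattern, text, flags)) / len(text)

def rank(documents, pattern, flags=0):
    """Sort (name, text) pairs by match density, best first."""
    scored = [(density(text, pattern, flags), name)
              for name, text in documents]
    # Drop documents with no matches at all
    return sorted([pair for pair in scored if pair[0] > 0], reverse=True)

docs = {
    'a.txt': 'python python',
    'b.txt': 'a long text mentioning Python once only',
    'c.txt': 'nothing relevant here',
}
ranking = rank(docs.items(), r'\bpython\b', re.IGNORECASE)
names = [name for score, name in ranking]
```

The word-boundary anchors and the case-insensitive flag are the kind of options a literal `count()` cannot express.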
The place where a word search program like the one above falls
terribly short is in speed of locating documents in -very-
large document collections. Even something as fast, and well
optimized, as 'grep' simply takes a while to search a lot of
source text. Fortunately, it is possible to -shortcut- this
search time, as well as add some additional capabilities.
A technique for rapid searching is to perform a generic search
just once (or periodically) and create an index--i.e.,
database--of those generic search results. Performing a later
search need not -really- search contents, but only check the
abstracted and structured index of possible searches. The utility
'indexer.py' is a functional example of such a computed search
index. The most current version may be downloaded from the
book's Web site.
The utility 'indexer.py' allows very rapid searching for the
simultaneous occurrence of multiple words within a file. For
example, one might want to locate all the document files (or
other text sources, such as VARCHAR database fields) that
contain the words 'Python', 'index', and 'search'. Supposing
there are many thousands of candidate documents, searching them
on an ad hoc basis could be slow. But 'indexer.py' creates a
comparatively compact collection of persistent dictionaries
that provide answers to such inquiries.
The full source code to 'indexer.py' is worth reading, but most
of it deals with a variety of persistence mechanisms and with an
object-oriented programming (OOP) framework for reuse. The
underlying idea is simple, however. Create three dictionaries
based on scanning a collection of documents:
#*---------- Index dictionaries ----------#
*Indexer.fileids: fileid --> filename
*Indexer.files: filename --> (fileid, wordcount)
*Indexer.words: word --> {fileid1:occurs, fileid2:occurs, ...}
The essential mapping is '*Indexer.words'. For each word, what
files does it occur in and how often? The mappings
'*Indexer.fileids' and '*Indexer.files' are ancillary. The
first just allows shorter numeric aliases to be used instead of
long filenames in the '*Indexer.words' mapping (a performance
boost and storage saver). The second, '*Indexer.files', also
holds a total wordcount for each file. This allows a ranking
of the importance of different matches. The thought is that a
megabyte file with ten occurrences of 'Python' is less focused
on the topic of Python than is a kilobyte file with the same
ten occurrences.
Both generating and utilizing the mappings above are
straightforward. To search multiple words, one simply needs
the intersection of the results of several values
of the '*Indexer.words' dictionary, one value for each word
key. Generating the mappings involves incrementing counts in
the nested dictionary of '*Indexer.words', but is not
complicated.
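The generation and search steps described above can be sketched as follows. This is not the code of 'indexer.py'. It is a minimal illustration, assuming Python 3, in-memory documents, and a deliberately crude definition of a word (runs of ASCII letters, lowercased). The names `build_index` and `search` are hypothetical.

```python
import re

def build_index(documents):
    """Build the three mappings from a {filename: text} dictionary."""
    fileids, files, words = {}, {}, {}
    for fileid, (fname, text) in enumerate(sorted(documents.items())):
        tokens = re.findall(r'[a-z]+', text.lower())  # crude "words"
        fileids[fileid] = fname
        files[fname] = (fileid, len(tokens))
        for token in tokens:
            occurs = words.setdefault(token, {})
            occurs[fileid] = occurs.get(fileid, 0) + 1
    return fileids, files, words

def search(index, *wordlist):
    """Filenames containing every word, most focused first."""
    fileids, files, words = index
    entries = [words.get(w.lower(), {}) for w in wordlist]
    if not entries:
        return []
    common = set(entries[0])
    for entry in entries[1:]:
        common &= set(entry)          # intersect the fileid sets
    ranked = []
    for fileid in common:
        fname = fileids[fileid]
        hits = sum(entry[fileid] for entry in entries)
        ranked.append((hits / files[fname][1], fname))
    return [fname for score, fname in sorted(ranked, reverse=True)]

docs = {
    'a.txt': 'Python search index python',
    'b.txt': 'Python is a language used for many things besides search',
    'c.txt': 'index cards',
}
index = build_index(docs)
both = search(index, 'Python', 'search')
all_three = search(index, 'Python', 'index', 'search')
```

Dividing the hit count by the file's total word count is one simple way to express the intuition that a short file with the same number of occurrences is more focused on the topic.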
QUESTIONS:
1. One of the most significant--and surprisingly
subtle--concerns in generating useful word indexes is
figuring out just what a "word" is. What considerations
would you bring to determine word identities? How might
you handle capitalization? Punctuation? Whitespace? How
might you disallow binary strings that are not "real"
words? Try performing word-identification tests against
real-world documents. How successful were you?
2. Could other data structures be used to store word index
information than those proposed above? If other data
structures are used, what efficiency (speed) advantages or
disadvantages do you expect to encounter? Are there other
data structures that would allow for additional search
capabilities than the multiword search of 'indexer.py'?
If so, what other indexed search capabilities would have
the most practical benefit?
3. Consider adding integrity guarantees to index results.
What if an index falls out of synchronization with the
underlying documents? How might you address referential
integrity? Hint: consider `binascii.crc32`, [sha], and
[md5]. What changes to the data structures would be needed
for integrity checks? Implement such an improvement.
4. The utility 'indexer.py' has some ad hoc exclusions of
nontextual files from inclusion in an index, based simply
on some file extensions. How might one perform accurate
exclusion of nontextual data? What does it mean for a
document to contain text? Try writing a utility
'istextual.py' that will identify text and nontext
real-world documents. Does it work to your satisfaction?
5. Advanced: 'indexer.py' implements several different
persistence mechanisms. What mechanisms other than those
implemented might you use? Benchmark your mechanism.
Does it do better than 'SlicedZPickleIndexer' (the best
variant included in both speed and space)?