CHARMING PYTHON #3
My first web-based filtering proxy
David Mertz, Ph.D.
Flutist, Gnosis Software, Inc.
June 2000
This article examines Txt2Html, a public-domain working
project created by the author to illustrate Python
programming techniques. 'Txt2Html' is a "web-based filtering
proxy"--a program that reads web-based documents for the
user, then presents a modified page to the user's browser.
In order to accomplish these tasks, 'Txt2Html' runs as a CGI
program, queries outside web resources, and makes use of
regular-expressions; each of these general-purpose sub-tasks
is explained, clarified, and demonstrated in this article.
WHAT IS PYTHON?
------------------------------------------------------------------------
Python is a freely available, very-high-level, interpreted
language developed by Guido van Rossum. It combines a clear
syntax with powerful (but optional) object-oriented semantics.
Python is available for almost every computer platform you
might find yourself working on, and is highly portable between
platforms.
INTRODUCTION TO 'Txt2Html'
------------------------------------------------------------------------
In the course of writing articles in this series, the author
faced a quandry about the best format to write in.
Wordprocessor formats are proprietary, and conversions between
formats tend to be imperfect and troublesome (and they bind
one to proprietary tools; contrary to an open-source spirit).
HTML is fairly neutral--and is probably the form you are
reading this article in--but it also adds tags that are easy to
mistype (or commits one to an HTML-enhanced editor). DocBook
is an interesting XML format that can be converted to many
target formats, and has the right semantics for technical
articles (or books); but like HTML, there are lots of tags to
worry about during the writing process. LaTeX is great for
sophisticated typography; but lots of tags again, and these
articles don't need typographic sophistication.
For real ease of composition--and especially for platform and
tool neutrality--plain ASCII just cannot be beat. Beyond
completely plain text, however, the internet (especially
Usenet) has prompted the development of an informal standard of
"smart ASCII" documents. "Smart ASCII" adds just a little bit
of extra semantic content and context in ways that look
"natural" in text displays. Emails, newsgroup posts, FAQs,
project READMEs, and other electronic documents often include a
few typographic/semantic elements like asterisks around
emphasized words, underscores surrounding titles, vertical and
horizontal whitespace to describe textual relations, selective
ALLCAPS, and a few other tidbits. Project Guttenburg is a
wonderful effort that put quite a bit of thought into its own
consideration of formats, and decided on "smart ASCII" as the
best choice for preserving and distributng great books for a
long time. Even if these articles won't live as such literary
classics, the decision was made to write them as "smart ASCII",
and automate any conversions to other formats with handy Python
scripts.
'Txt2Html' started out as a simple file converter, as the name
suggests. But the internet posed several obvious enhancements
to the tool. Since many of the documents one might want to
view in an "html-ized" form live somewhere at the end of
http: or ftp: links, the tool should really handle such remote
documents straightforwardly (without the need for a
download/convert/view cycle). And since the target of the
conversion is HTML after all, what we would generally want to
do with the target is view it in a web-browser. Putting these
things together, it emerged that 'Txt2Html' should be a
"web-based filtering proxy." Fancy words there, maybe even
"fully buzzword compliant." What it amounts to is the idea that
a program might read a web-page (or other resource) on your
behalf, massage the contents in some way, then present you with
something that is *better* than the original page (at least for
some particular purpose). A good example of such a tool is the
Babelfish translation service. After running a URL through
Babelfish, you see a webpage that looks pretty much like the
original one, but has the pleasant feature of having words in
a language you can read instead of in a language you do not
understand. In a way, all the search-engines that present
little synopses of the pages they find for a search do the same
thing. But those search-engines (by design) take a lot more
liberty with the formatting and appearance of a target page,
and leave out a lot more. 'Txt2Html' is certainly a lot less
ambitious than Babelfish is; but conceptually, both do largely
the same thing. See the resources for some more examples, some
rather humorous.
Best of all, 'Txt2Html' uses a number of programming techniques
that are common to a lot of different web-oriented uses of
Python. This article will introduce those techniques, and give
some pointers on coding techniques and the scope of some Python
modules. Note: the actual module in 'Txt2Html' is called
[dmTxt2Html] to avoid conflict with the naming of a module
written by someone else.
USING THE [cgi] MODULE
------------------------------------------------------------------------
Python's [cgi] module--in the standard distribution--is a
godsend for anyone developing "Common Gateway Interface"
applications in Python. You could create CGI's without it, but
you wouldn't want to.
Most typically, one interacts with CGI applications by means of
an HTML form. One fills out some values in the form that
specify details of the action to be performed, then call on the
CGI to perform its action using your specifications. For
example, the 'Txt2Html' documentation uses this example for a
calling HTML form (the one generated by 'Txt2Html' itself is a
bit more complicated, and may change; but the example will work
perfectly well, even from within your own web pages):
#-------------- HTML form to call 'Txt2Html' ------------#
You may include many input fields within an HTML form, and the
fields can be of a number of different types (text, checkboxes,
picklists, radio buttons, etc.). Any good book on HTML can
help a beginner with creating custom HTML forms. The main
thing to remember here is that each field has a 'name'
attribute, and that name is used later to refer to the field in
our CGI script. Another detail worth knowing about is that
forms can have one of two 'method' attributes: "get" and
"post". The basic difference is that the "get" method encodes
all the details of how the form was filled out right into the
URL generated (by including a bunch of "URL encoded" stuff
after the URL indicated in the 'action' attribute). Using this
method makes it easier for a user to save a specific query for
later reuse. Then again, if you do not want users to save
queries, use the "post" method.
The Python script that gets called by the above form does an
'import cgi' to make sorting out its calling form easy. One
thing this module does is hide any details of the difference
between "get" and "post" methods from the CGI script. By the
time the call is made, this is not a detail the CGI creator
needs to worry about. The main thing done by the CGI module is
to treat all the fields in the calling HTML form in a
dictionary-like fashion. What you get is not *quite* a
Python dictionary, but it is close enough to be easy to work
with:
#-------------- Using Python [cgi] module ---------------#
import cgi, sys
cfg_dict = {'target': ''}
sys.stderr = sys.stdout
form = cgi.FieldStorage()
if form.has_key('source'):
cfg_dict['source'] = form['source'].value
There are a couple little details to notice in the above few
lines. One trick we do is to set 'sys.stderr = sys.stdout'.
By doing this, if our script encounters an untrapped error, the
traceback will display back to the client browser. This can
save a lot of time in debugging a CGI application. But it
might not be what you want users to see (or it might, if they
are likely to report problem details to you). Next, we read
the HTML form values into the dictionary-like 'form' instance.
Much like a true Python dictionary, 'form' has a '.has_key()'
method. However, unlike a Python dictionary, to actually pull
off the value within a key, we have to look at the '.value'
attribute for the key.
From here, we have everything in the HTML form in plain Python
variables, and we can handle them as in any other Python
program.
USING THE [urllib] MODULE
------------------------------------------------------------------------
Like most things Python, [urllib] makes a whole bunch of
complicated things happen in an obvious and simple way. The
'urlopen()' function in [urllib] treats any remote
resource--whether http:, ftp:, or even gopher:-- just like it
was a local file. Once you grab a remote (pseudo-)filehandle
using 'urlopen()', you can do everything you would with the
filehandle of a local (read-only) file:
#----------- Using Python [urllib] module --------------#
from urllib import urlopen
import string
source = cfg_dict['source']
if source == '':
fhin = sys.stdin
else:
try:
fhin = urlopen(source)
except:
ErrReport(source+' could not be opened!', cfg_dict)
return
doc = ''
for line in fhin.readlines(): # Need to normalize line endings!
doc = doc+string.rstrip(line)+'\n'
One minor problem that this author has encountered is that
depending on the end-of-line convention used on the platform
that produced the resource and on your own platform, some odd
things can happen to the resulting text (this appears to be a
bug in [urllib]). The cure for this problems is to perform
the little '.readlines()' loop in the above code. Doing this
gives you a string that has the right end-of-line conventions
for the platform you are running on, regardless of what the
source resource looked like (within reason, presumably).
USING THE [re] MODULE
------------------------------------------------------------------------
There is certainly a lot more to regular expressions than can
fit into this article. A widely read reference book on the
topic is listed under Resources. The [re] module is fairly
widely used in 'Txt2Html' to identify various textual patterns
in the source texts. A moderately complex example is worth
looking at.
#------------ Using Python [re] module -----------------#
import re
def URLify(txt):
txt = re.sub('((?:http|ftp|gopher|file)://(?:[^ \n\r<\)]+))(\s)',
'\\1\\2', txt)
return txt
'URLify()' is a nice little function to do pretty much what it
says. If something that looks like a URL is encountered in the
"smart ASCII" file, it is converted into an actual hotlink to
that same URL within the HTML output. Let us look at what the
're.sub()' is doing. First, in broadest terms, the function's
purpose is to "match what is in the first pattern, then replace
it with the second pattern, using the third argument as the
string to operate on." Good enough, not much different from
'string.replace()' in those terms.
The first pattern has several elements. Notice the parentheses
first: the highest level consists of two pairs: a complicated
bunch of stuff followed by '(\s)'. Sets of parentheses match
"subexpressions" that can potentially make up part of the
replacement pattern. The second subexpression, '(\s)' just
means "match any whitespace character (and let us refer back to
the particular type of whitespace it was). So let's look at
the first subexpression.
Python regular expressions have a couple tricks of their own.
One such trick is the '?:' operator at the beginning of a
subexpression. This means "match a subpattern, but don't
include the match in the back-references." So let us examine
the subexpression
'((?:http|ftp|gopher|file)://(?:[^ \n\r<\)]+))'.
First notice that this subexpression is itself composed of two
child subexpressions, with some stuff in the middle that is not
part of any child subexpression. However, each of the children
starts with '?:', which means that they get matched, but don't
count for reference purposes. The first of these
"non-reference" child subexpressions just says "match something
that looks like 'http' *or* that looks like 'ftp' *or* ...".
Next we get the short string '://' which means to match
anything that looks exactly like it (simple, huh?). Finally, we
get the second child subexpression, which other than the "don't
refer" operator consists of some stuff in square brackets, and
a plus sign.
In regular expressions, square brackets just mean "match any
character in the brackets." However, if the first character is
a caret (^), the meaning is reversed, and it means "match
anything *not* in the next characters." So we are looking for
stuff that is *not* a space, CR, LF, "<" or ")" (notice also
that characters that have special meaning to regular
expressions can be "escaped" by having a "\" in front of them).
The plus sign at the end means "match one or more of the last
thing" (asterisk is for "zero or more", and question-mark is
for "zero or one").
This regular expression has a bunch to digest, but if you walk
through it a few times, you can see that this is what a URL has
to look like.
Next is the replacement chunk. This is simpler. The parts
that look like '\\1' and '\\2' (or '\\3', '\\4', etc., if we
needed them) are those "back references" discussed. They mean
"the pattern matched by the first (second) subexpression of the
match expression. All the rest of the stuff in the replacement
chunk just is what it is: some characters that are easily
recognized as HTML codes. One little thing that is a bit
subtle is that we bother to match '\\2'--which looking above is
just a whitespace character. One might ask, "why bother? why
not just insert a space as a literal?" Fair question, and we do
not really *need* to do what we did for HTML. But
aesthetically, it is better to let the HTML output stay as
much as possible like the source text file was before our our
HTML markup. Particularly, let us keep the line-breaks as
line-breaks, and spaces as spaces (and tabs as tabs).
RESOURCES
------------------------------------------------------------------------
To obtain or use Txt2Html, just point at it on the author's
website. The navigation bar attached to the top of all proxied
pages (by default) will include a link to download all the
source files.
http://gnosis.cx/cgi-bin/txt2html.cgi
This article as "smart ASCII" text.
http://gnosis.cx/publish/programming/charming_python_3.txt
Project Guttenburg Homepage.
http://somewhere.or.another/guttenburg/
Ka-Ping Yee's list and discussion of what he calls "mediators"
(and why he doesn't think "proxy" covers it).
http://www.lfw.org/ping/mediator.html
Babelfish Homepage. Translate web pages from one language to
another.
http://babel.altavista.com/
The Malkovich web-based filtering proxy (there is no point
trying to explain, you just have to see this one!).
http://www.lfw.org/jminc/
Strictly for those over 18, a funny web-based filtering proxy
can be found here.
http://pornolize.com/
The Anonymizer: a genuinely useful filtering proxy for folks
who want to browse the web without leaking information about
their personal identity (not just cookies, but also IP address,
browser version, OS, or other information potentially
correlated with identity).
http://www.anonymizer.com/
Friedl, Jeffrey E. F., _Mastering Regular Expressions_,
O'Reilly, Cambridge, MA 1997 is a fairly standard and
definitive reference on RegEx's.
ABOUT THE AUTHOR
------------------------------------------------------------------------
{Picture of Author: http://gnosis.cx/cgi-bin/img_dqm.cgi}
David Mertz is not really quite sure if the open source
movement is a dinner party. But he would certainly like to
think that proprietary intellectual property is a paper tiger.
His own ideas, certainly, want to be free. David may be
reached at mertz@gnosis.cx; his life pored over at
http://gnosis.cx/publish/. Suggestions and recommendations on
this, past, or future, columns are welcomed.