David Mertz, Ph.D.
Flutist, Gnosis Software, Inc.
June 2000
This article examines Txt2Html, a public-domain working project created by the author to illustrate Python programming techniques. Txt2Html is a "web-based filtering proxy"--a program that reads web-based documents for the user, then presents a modified page to the user's browser. In order to accomplish these tasks, Txt2Html runs as a CGI program, queries outside web resources, and makes use of regular expressions; each of these general-purpose sub-tasks is explained, clarified, and demonstrated in this article.
Python is a freely available, very-high-level, interpreted language developed by Guido van Rossum. It combines a clear syntax with powerful (but optional) object-oriented semantics. Python is available for almost every computer platform you might find yourself working on, and is highly portable between platforms.
In the course of writing articles in this series, the author faced a quandary about the best format to write in. Word processor formats are proprietary, and conversions between formats tend to be imperfect and troublesome (and they bind one to proprietary tools, contrary to an open-source spirit). HTML is fairly neutral--and is probably the form you are reading this article in--but it also adds tags that are easy to mistype (or commits one to an HTML-enhanced editor). DocBook is an interesting XML format that can be converted to many target formats, and has the right semantics for technical articles (or books); but like HTML, there are lots of tags to worry about during the writing process. LaTeX is great for sophisticated typography; but lots of tags again, and these articles don't need typographic sophistication.
For real ease of composition--and especially for platform and tool neutrality--plain ASCII just cannot be beat. Beyond completely plain text, however, the internet (especially Usenet) has prompted the development of an informal standard of "smart ASCII" documents. "Smart ASCII" adds just a little bit of extra semantic content and context in ways that look "natural" in text displays. Emails, newsgroup posts, FAQs, project READMEs, and other electronic documents often include a few typographic/semantic elements like asterisks around emphasized words, underscores surrounding titles, vertical and horizontal whitespace to describe textual relations, selective ALLCAPS, and a few other tidbits. Project Gutenberg is a wonderful effort that put quite a bit of thought into its own consideration of formats, and decided on "smart ASCII" as the best choice for preserving and distributing great books for a long time. Even if these articles won't live as such literary classics, the decision was made to write them as "smart ASCII", and automate any conversions to other formats with handy Python scripts.
Txt2Html started out as a simple file converter, as the name
suggests. But the internet posed several obvious enhancements
to the tool. Since many of the documents one might want to
view in an "html-ized" form live somewhere at the end of
http: or ftp: links, the tool should really handle such remote
documents straightforwardly (without the need for a
download/convert/view cycle). And since the target of the
conversion is HTML after all, what we would generally want to
do with the target is view it in a web-browser. Putting these
things together, it emerged that Txt2Html should be a
"web-based filtering proxy." Fancy words there, maybe even
"fully buzzword compliant." What it amounts to is the idea that
a program might read a web-page (or other resource) on your
behalf, massage the contents in some way, then present you with
something that is better than the original page (at least for
some particular purpose). A good example of such a tool is the
Babelfish translation service. After running a URL through
Babelfish, you see a webpage that looks pretty much like the
original one, but has the pleasant feature of having words in
a language you can read instead of in a language you do not
understand. In a way, all the search-engines that present
little synopses of the pages they find for a search do the same
thing. But those search-engines (by design) take a lot more
liberty with the formatting and appearance of a target page,
and leave out a lot more. Txt2Html is certainly a lot less
ambitious than Babelfish is; but conceptually, both do largely
the same thing. See the resources for some more examples, some
rather humorous.
Best of all, Txt2Html uses a number of programming techniques
that are common to a lot of different web-oriented uses of
Python. This article will introduce those techniques, and give
some pointers on coding techniques and the scope of some Python
modules. Note: the actual module in Txt2Html goes by a
different name, to avoid conflict with the naming of a module
written by someone else.
Python's cgi module--in the standard distribution--is a
godsend for anyone developing "Common Gateway Interface"
applications in Python. You could create CGIs without it, but
you wouldn't want to.
Most typically, one interacts with CGI applications by means of
an HTML form. One fills out some values in the form that
specify details of the action to be performed, then calls on the
CGI to perform its action using those specifications. For
example, the Txt2Html documentation uses this example for a
calling HTML form (the one generated by Txt2Html itself is a
bit more complicated, and may change; but the example will work
perfectly well, even from within your own web pages):
    <form method="get" action="http://gnosis.cx/cgi-bin/txt2html.cgi">
    URL: <input type="text" name="source" size=40>
    <input type="submit" name="go" value="Display!">
    </form>
You may include many input fields within an HTML form, and the
fields can be of a number of different types (text, checkboxes,
picklists, radio buttons, etc.). Any good book on HTML can
help a beginner with creating custom HTML forms. The main
thing to remember here is that each field has a name
attribute, and that name is used later to refer to the field in
our CGI script. Another detail worth knowing about is that
forms can have one of two method attributes: "get" and
"post". The basic difference is that the "get" method encodes
all the details of how the form was filled out right into the
URL generated (by including a bunch of "URL encoded" stuff
after the URL indicated in the action attribute). Using this
method makes it easier for a user to save a specific query for
later reuse. Then again, if you do not want users to save
queries, use the "post" method.
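To make the "get" encoding concrete, here is a minimal sketch in present-day Python 3 (the article's own code targets the Python of 2000). It builds the URL a browser would request for a form like the one above, then decodes it again. The document address on example.com is made up for illustration; urlencode(), urlsplit(), and parse_qs() are standard-library functions in urllib.parse.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# The form's action attribute, as given in the article.
action = "http://gnosis.cx/cgi-bin/txt2html.cgi"

# Field names come from the form; the source value is a made-up example.
fields = {"source": "http://example.com/readme.txt", "go": "Display!"}

# A "get" form appends the URL-encoded fields after a question mark.
query_url = action + "?" + urlencode(fields)
print(query_url)

# Decoding the query string recovers the original field values.
decoded = parse_qs(urlsplit(query_url).query)
print(decoded["source"][0])
```

Because everything needed to repeat the request sits in query_url, a user can bookmark it, which is exactly the convenience (or drawback) of the "get" method.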
The Python script that gets called by the above form does an
import cgi to make sorting out its calling form easy. One
thing this module does is hide any details of the difference
between "get" and "post" methods from the CGI script. By the
time the call is made, this is not a detail the CGI creator
needs to worry about. The main thing done by the cgi module is
to treat all the fields in the calling HTML form in a
dictionary-like fashion. What you get is not quite a
Python dictionary, but it is close enough to be easy to work
with:
    import cgi, sys
    cfg_dict = {'target': '<STDOUT>'}
    sys.stderr = sys.stdout
    form = cgi.FieldStorage()
    if form.has_key('source'):
        cfg_dict['source'] = form['source'].value
There are a couple little details to notice in the above few
lines. One trick we do is to set sys.stderr = sys.stdout.
By doing this, if our script encounters an untrapped error, the
traceback will display back to the client browser. This can
save a lot of time in debugging a CGI application. But it
might not be what you want users to see (or it might, if they
are likely to report problem details to you). Next, we read
the HTML form values into the dictionary-like form object.
Much like a true Python dictionary, form has a .has_key()
method. However, unlike a Python dictionary, to actually pull
off the value within a key, we have to look at the .value
attribute for the key.
From here, we have everything in the HTML form in plain Python variables, and we can handle them as in any other Python program.
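In current Python 3 the cgi module is no longer available (it was removed from the standard library in version 3.13), so the same idea might be sketched with urllib.parse instead. This is a swapped-in technique, not the article's code: the helper name read_config() is hypothetical, and it only handles "get" requests, whose fields a CGI script receives in the QUERY_STRING environment variable.

```python
import os
from urllib.parse import parse_qs

def read_config(environ=None):
    """Collect form fields into a plain dictionary (hypothetical helper)."""
    if environ is None:
        environ = os.environ
    cfg_dict = {"target": "<STDOUT>"}
    # parse_qs maps each field name to a list of values.
    fields = parse_qs(environ.get("QUERY_STRING", ""))
    if "source" in fields:
        # Take the first value, much as form['source'].value did.
        cfg_dict["source"] = fields["source"][0]
    return cfg_dict

# Simulate the environment a web server would set up for the script.
fake_env = {"QUERY_STRING": "source=http%3A%2F%2Fexample.com%2Fa.txt&go=Display%21"}
print(read_config(fake_env))
```

Passing the environment in as an argument is only a convenience for trying the function outside a web server; a real CGI script would rely on os.environ.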
Like most things Python, urllib makes a whole bunch of
complicated things happen in an obvious and simple way. The
urlopen() function in urllib treats any remote
resource--whether http:, ftp:, or even gopher:--just as if it
were a local file. Once you grab a remote (pseudo-)filehandle
using urlopen(), you can do everything you would with the
filehandle of a local (read-only) file:
    from urllib import urlopen
    import string
    source = cfg_dict['source']
    if source == '<STDIN>':
        fhin = sys.stdin
    else:
        try:
            fhin = urlopen(source)
        except:
            ErrReport(source+' could not be opened!', cfg_dict)
            return
    doc = ''
    for line in fhin.readlines():   # Need to normalize line endings!
        doc = doc+string.rstrip(line)+'\n'
One minor problem that this author has encountered is that
depending on the end-of-line convention used on the platform
that produced the resource and on your own platform, some odd
things can happen to the resulting text (this appears to be a
bug in urllib). The cure for this problem is to perform
the little .readlines() loop in the above code. Doing this
gives you a string that has the right end-of-line conventions
for the platform you are running on, regardless of what the
source resource looked like (within reason, presumably).
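A rough Python 3 counterpart of the fetch-and-normalize step might look like the following. It is a sketch under assumptions rather than the Txt2Html code: urlopen() now lives in urllib.request and returns bytes, so the text must be decoded, and UTF-8 is assumed here; the function names normalize_lines() and read_source() are illustrative.

```python
import sys
from urllib.request import urlopen

def normalize_lines(lines):
    # Strip each line's trailing whitespace (including any CR or LF)
    # and end every line with a single newline.
    return "".join(line.rstrip() + "\n" for line in lines)

def read_source(source, encoding="utf-8"):
    # Read from standard input or from a URL, as the article's code does.
    if source == "<STDIN>":
        return normalize_lines(sys.stdin)
    with urlopen(source) as fhin:
        text = fhin.read().decode(encoding, errors="replace")
    # splitlines() recognizes CR, LF, and CR+LF line endings.
    return normalize_lines(text.splitlines())

# Mixed line endings collapse to a single convention.
print(repr(normalize_lines(["one\r\n", "two\r", "three\n"])))
```

As in the original loop, trailing spaces and tabs on each line are removed along with the line ending.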
There is certainly a lot more to regular expressions than can
fit into this article. A widely read reference book on the
topic is listed under Resources. The re module is fairly
widely used in Txt2Html to identify various textual patterns
in the source texts. A moderately complex example is worth
looking at.
    import re
    def URLify(txt):
        txt = re.sub('((?:http|ftp|gopher|file)://(?:[^ \n\r<\)]+))(\s)',
                     '<a href="\\1">\\1</a>\\2', txt)
        return txt
URLify() is a nice little function to do pretty much what it
says. If something that looks like a URL is encountered in the
"smart ASCII" file, it is converted into an actual hotlink to
that same URL within the HTML output. Let us look at what the
re.sub() call is doing. First, in broadest terms, the function's
purpose is to "match what is in the first pattern, then replace
it with the second pattern, using the third argument as the
string to operate on." Good enough, not much different from
an ordinary string replacement in those terms.
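For readers on Python 3, roughly the same function can be written with raw strings, which avoid doubling the backslashes. The name urlify() and the sample sentence are illustrative, not taken from Txt2Html. Note that, as in the original pattern, a URL is only recognized when a whitespace character follows it.

```python
import re

# Same pattern as in the article, written as a raw string.
URL_PATTERN = re.compile(r'((?:http|ftp|gopher|file)://(?:[^ \n\r<\)]+))(\s)')

def urlify(txt):
    # \1 is the URL itself, \2 the whitespace character that followed it.
    return URL_PATTERN.sub(r'<a href="\1">\1</a>\2', txt)

linked = urlify("See http://gnosis.cx/publish/ for more.\n")
print(linked)
```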
The first pattern has several elements. Notice the parentheses
first: the highest level consists of two pairs: a complicated
bunch of stuff followed by (\s). Sets of parentheses match
"subexpressions" that can potentially make up part of the
replacement pattern. The second subexpression, (\s),
means "match any whitespace character (and let us refer back to
the particular type of whitespace it was)." So let's look at
the first subexpression.
Python regular expressions have a couple tricks of their own.
One such trick is the ?: operator at the beginning of a
subexpression. This means "match a subpattern, but don't
include the match in the back-references." So let us examine
the subexpression:
    ((?:http|ftp|gopher|file)://(?:[^ \n\r<\)]+))
First notice that this subexpression is itself composed of two
child subexpressions, with some stuff in the middle that is not
part of any child subexpression. However, each of the children
starts with ?:, which means that they get matched, but don't
count for reference purposes. The first of these
"non-reference" child subexpressions just says "match something
that looks like http or that looks like ftp or ...".
Next we get the short string :// which means to match
anything that looks exactly like it (simple, huh?). Finally, we
get the second child subexpression, which other than the "don't
refer" operator consists of some stuff in square brackets, and
a plus sign.
In regular expressions, square brackets just mean "match any character in the brackets." However, if the first character is a caret (^), the meaning is reversed, and it means "match anything not in the next characters." So we are looking for stuff that is not a space, CR, LF, "<" or ")" (notice also that characters that have special meaning to regular expressions can be "escaped" by having a "\" in front of them). The plus sign at the end means "match one or more of the last thing" (asterisk is for "zero or more", and question-mark is for "zero or one").
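A few throwaway Python 3 calls illustrate character classes and the three quantifiers just described. The sample strings are arbitrary; re.findall() returns every non-overlapping match in order.

```python
import re

# A character class matches any one of the listed characters...
in_class = re.findall(r'[abc]', 'cabbage')
# ...and a leading caret inverts it.
not_in_class = re.findall(r'[^abc]', 'cabbage')

# Quantifiers apply to the item just before them.
one_or_more = re.findall(r'ab+', 'a ab abb')    # plus sign
zero_or_more = re.findall(r'ab*', 'a ab abb')   # asterisk
zero_or_one = re.findall(r'ab?', 'a ab abb')    # question mark

print(in_class, not_in_class)
print(one_or_more, zero_or_more, zero_or_one)
```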
This regular expression has a bunch to digest, but if you walk through it a few times, you can see that this is what a URL has to look like.
Next is the replacement chunk. This is simpler. The parts
that look like \\1 and \\2 (or \\3, \\4, etc., if we
needed them) are those "back references" discussed. They mean
"the pattern matched by the first (second) subexpression of the
match expression." All the rest of the stuff in the replacement
chunk just is what it is: some characters that are easily
recognized as HTML codes. One little thing that is a bit
subtle is that we bother to match \\2--which looking above is
just a whitespace character. One might ask, "why bother? why
not just insert a space as a literal?" Fair question, and we do
not really need to do what we did for HTML. But
aesthetically, it is better to let the HTML output stay as
much as possible like the source text file was before our
HTML markup. Particularly, let us keep the line-breaks as
line-breaks, and spaces as spaces (and tabs as tabs).
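The difference between reusing the matched whitespace and inserting a literal space can be checked directly. The following Python 3 sketch uses invented sample text and variable names; only the pattern comes from the article.

```python
import re

PATTERN = r'((?:http|ftp|gopher|file)://(?:[^ \n\r<\)]+))(\s)'
text = "first http://a.example/x\nsecond ftp://b.example/y\tthird"

# Back-reference \2 puts back whatever whitespace followed the URL.
kept = re.sub(PATTERN, r'<a href="\1">\1</a>\2', text)

# A literal space would flatten the line-break and the tab.
flattened = re.sub(PATTERN, r'<a href="\1">\1</a> ', text)

print(repr(kept))
print(repr(flattened))
```

Both results render the same in a browser, since HTML collapses whitespace, but only the first keeps the layout of the source text intact.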
To obtain or use Txt2Html, just point at it on the author's website. The navigation bar attached to the top of all proxied pages (by default) will include a link to download all the source files.
This article as "smart ASCII" text.
Project Gutenberg Homepage.
Ka-Ping Yee's list and discussion of what he calls "mediators" (and why he doesn't think "proxy" covers it).
Babelfish Homepage. Translate web pages from one language to another.
The Malkovich web-based filtering proxy (there is no point trying to explain, you just have to see this one!).
Strictly for those over 18, a funny web-based filtering proxy can be found here.
The Anonymizer: a genuinely useful filtering proxy for folks who want to browse the web without leaking information about their personal identity (not just cookies, but also IP address, browser version, OS, or other information potentially correlated with identity).
Friedl, Jeffrey E. F., Mastering Regular Expressions, O'Reilly, Cambridge, MA 1997 is a fairly standard and definitive reference on RegEx's.
David Mertz is not really quite sure if the open source
movement is a dinner party. But he would certainly like to
think that proprietary intellectual property is a paper tiger.
His own ideas, certainly, want to be free. David may be
reached at [email protected]; his life pored over at
http://gnosis.cx/publish/. Suggestions and recommendations on
this, past, or future columns are welcomed.