David Mertz, Ph.D.
President, Gnosis Software, Inc.
May 2000
A major element of getting started on working with XML in Python is sorting out the comparative capabilities of all the available modules. This article provides brief descriptions of the most popular and useful XML-related Python modules, and points the reader to resources where she can download and read more about individual modules. A focus of this article is to help the reader determine which modules are most appropriate for a specific task at hand.
Python is a freely available, very-high-level, interpreted language developed by Guido van Rossum. It combines a clear syntax with powerful (but optional) object-oriented semantics. Python is available for almost every computer platform you might find yourself working on, and has strong portability between platforms.
XML is a simplified dialect of the Standard Generalized Markup Language (SGML). Many readers will be most familiar with SGML via one particular document type, HTML. XML documents are similar to HTML in being composed of text interspersed with and structured by markup tags in angle-brackets. But XML encompasses many systems of tags that allow XML documents to be used for many purposes: magazine articles and user documentation, files of structured data (like CSV or EDI files), messages for interprocess communication between programs, architectural diagrams (like CAD formats), and many other purposes. A set of tags can be created to capture any sort of structured information one might want to represent, which is why XML is growing in popularity as a common standard for representing diverse information.
Python is, in many ways, an ideal language for working with XML documents. Like Perl, REBOL, REXX, and TCL it is a flexible scripting language with powerful text manipulation capabilities. Moreover, more than most types of text files (or streams), XML documents typically encode rich and complex data structures. The familiar "read some lines and compare them to some regular expressions" style of text processing is generally not well-suited to adequately parsing and processing XML. Python, fortunately (and more so than most other languages), has both straightforward ways of dealing with complex data structures (usually with classes and attributes), and a range of XML-related modules to aid in parsing, processing, and generating XML.
One general concept to keep in mind about XML is that XML documents can be processed in either a validating or non-validating fashion. In the former type of processing, it is necessary to read a "Document Type Definition" (DTD) prior to reading an XML document it applies to. The processing in this case will evaluate not just the simple syntactic rules for XML documents in general, but also the specific grammatical constraints of the DTD. In many cases, non-validating processing is adequate (and generally both faster to run, and easier to program)--we trust the document creator to follow the rules of the document domain. Most modules discussed below are non-validating; descriptions will indicate where validation options exist.
The Vaults of Parnassus (http://www.vex.net/parnassus/) has become the standard means of finding Python resources of late. All of the modules discussed below can be found at that site (via links to the respective module owner's sites). In particular, the PyXML distribution can be found as both a tarball and as a Win32 installer in the Vaults.
Much--or most--of the effort of maintaining a range of XML tools for Python is performed by members of the XML-SIG. As with other Python Special Interest Groups, the XML-SIG maintains a mailing list, list archive, helpful references, documentation, a standard packaging, and other resources. Probably the best place to start after reading the summaries in this article is with the XML-SIG web pages.
Of specific interest for this article, the XML-SIG maintains the PyXML distribution. This package contains many of the modules discussed in this article, some "getting started" documentation, some demonstration code, and whatever else the XML-SIG might decide to throw into the distribution. A given package may not always contain the "bleeding edge" version of each individual module or tool, but downloading the PyXML distribution is a good place to start. You can always add any modules that are not included, or any new versions of included modules, later (and many of the modules that are not included themselves rely on services provided by the PyXML distribution).
"Out of the box," Python 1.5.x comes with the module xmllib
.
Python 1.6 is likely to incorporate more of the XML-SIG's
efforts, but that version is still in alpha. xmllib
is a
non-validating and low-level parser. The way xmllib
works is
by the application programmer overriding the class XMLParser,
and providing methods to handle document elements, such as
specific or generic tags, or character entities.
As an example of xmllib
in action, the PyXML distribution
includes a DTD called quotations.dtd
and a document called
sample.xml
of this DTD (see Resources for an archive of files
mentioned in this article). The below code will display the
first few lines of each quotation in sample.xml
, and produce
very simple ASCII indicators of unknown tags and entities. The
parsed text is handled as a sequential stream, and any
accumulators used are the programmer's responsibility (such as
the string of characters (#PCDATA) within a tag, or a
list/dictionary of tags encountered).
import xmllib, string class QuotationParser(xmllib.XMLParser): """Crude xmllib extractor for quotations.dtd document""" def __init__(self): xmllib.XMLParser.__init__(self) self.thisquote = '' # quotation accumulator def handle_data(self, data): self.thisquote = self.thisquote + data def syntax_error(self, message): pass def start_quotations(self, attrs): # top level tag print '--- Begin Document ---' def start_quotation(self, attrs): print 'QUOTATION:' def end_quotation(self): print string.join(string.split(self.thisquote[:230]))+'...', print '('+str(len(self.thisquote))+' bytes)\n' self.thisquote = '' def unknown_starttag(self, tag, attrs): self.thisquote = self.thisquote + '{' def unknown_endtag(self, tag): self.thisquote = self.thisquote + '}' def unknown_charref(self, ref): self.thisquote = self.thisquote + '?' def unknown_entityref(self, ref): self.thisquote = self.thisquote + '#' if __name__ == '__main__': parser = QuotationParser() for c in open("sample.xml").read(): parser.feed(c) parser.close()
Several additional parsing modules with varying capabilities
are included in the PyXML distribution. These all aim to
provide some improvement over the base xmllib
module.
pyexpat
is a wrapper for the GPL'd XML Parser Toolkit
expat
. expat
in turn is a library written in C that is
meant to be available from any language that wants to utilize
it. expat
is non-validating, and should be much faster than
a native Python parser. sgmlop
is similar in purpose to
pyexpat
. It is also non-validating, and also written in C.
pyexpat
is available as a MacOS binary, and sgmlop
is
available as a Win32 binary; but if you need a different
platform than these, you will need to use a C compiler to
build the modules for your own platform.
xmlproc
is a python native parser, which performs nearly
complete validation. If you need a validating parser,
xmlproc
is currently your only choice in Python. As well,
xmlproc
provides a variety of high-level and experimental
interfaces that other parsers do not.
If you decide to use the Simple API for XML (SAX)--which you
should for anything sophisticated, since most other tools are
built on top of it--much of the work of sorting through parsers
can be done for you. In the PyXML distribution,
xml.sax.drivers
contains thin wrappers for a number of
parsers, including all those discussed, with names of the form
drv_*.py
. However, generally you will access the drivers by
a higher level SAX facility that will automatically choose the
"best" parser available on the system where run:
from xml.sax.saxext import * parser = XMLParserFactory.make_parser()
These lines will select a parser for you (including xmllib
,
as a fallback). You may also select a specific parser by
passing an argument in; but for portability--and also for
upward compatibility with an even better parser yet to come--it
is probably best to let make_parser()
do the work for you.
We have mentioned above that SAX can automatically choose a parser to use; but just what is SAX? A good answer is:
SAX (Simple API for XML) is a common parser interface for XML parsers. It allows application writers to write applications that use XML parsers, but are independent of which parser is actually used. (Think of it as JDBC for XML.)" (Lars Marius Garshol, SAX for Python, see Resources)
SAX--like the parser modules it provides an API for--is
essentially a sequential processor of an XML document. You use
it in a manner largely similar to the xmllib
example, but
with a somewhat higher-level of abstraction. Instead of
defining a parser class, an application programmer defines a
handler
class that is registered with whatever parser is
used. Four SAX interfaces must be defined (each with several
methods): DocumentHandler, DTDHandler, EntityResolver and
ErrorHandler. Base classes of all of these are provided, but
in most cases it is easiest to inherit from HandlerBase
,
which itself inherits from all four interfaces. You can
override whatever you wish to. Some code will help illustrate
this; the sample performs the same task as the xmllib
example.
import string from xml.sax import saxlib, saxexts class QuotationHandler(saxlib.HandlerBase): """Crude sax extractor for quotations.dtd document""" def __init__(self): self.in_quote = 0 self.thisquote = '' def startDocument(self): print '--- Begin Document ---' def startElement(self, name, attrs): if name == 'quotation': print 'QUOTATION:' self.in_quote = 1 else: self.thisquote = self.thisquote + '{' def endElement(self, name): if name == 'quotation': print string.join(string.split(self.thisquote[:230]))+'...', print '('+str(len(self.thisquote))+' bytes)\n' self.thisquote = '' self.in_quote = 0 else: self.thisquote = self.thisquote + '}' def characters(self, ch, start, length): if self.in_quote: self.thisquote = self.thisquote + ch[start:start+length] if __name__ == '__main__': parser = saxexts.XMLParserFactory.make_parser() handler = QuotationHandler() parser.setDocumentHandler(handler) parser.parseFile(open("sample.xml")) parser.close()
Two small things to notice about the example in contrast to
xmllib
are: the parseFile()
/ parse()
methods handle a whole
stream/string so there is no need to create a loop to feed the
parser; relatedly, characters()
is fed chunks of data whose
size and position with the passed string are indicated by
arguments. Don't make any assumptions about what the ch
variable will as passed to characters()
.
DOM is a very-high-level tree-based representation of an XML document. The model is not specific to Python, but is a common XML model (see Resources for further information). Python's DOM package is built upon SAX, and is included in the PyXML distribution. Length contraints prevent code samples in this article, but an excellent general description is given in the XML-SIG's "Python/XML HOWTO":
The Document Object Model specifies a tree-based representation for an XML document. A top-level Document instance is the root of the tree, and has a single child which is the top-level Element instance; this Element has children nodes representing the content and any sub-elements, which may have further children, and so forth. Functions are defined which let you traverse the resulting tree any way you like, access element and attribute values, insert and delete nodes, and convert the tree back into XML.
The DOM is useful for modifying XML documents, because you can create a DOM tree, modify it by adding new nodes and moving subtrees around, and then produce a new XML document as output. You can also construct a DOM tree yourself, and convert it to XML; this is often a more flexible way of producing XML output than simply writing <tag1>...</tag1> to a file.
The pyxie
module is built on top of the PyXML distribution
from the XML-SIG, and provides additional high-level interfaces
to an XML document. pyxie
does two basic things: it
transforms XML documents to a more easily parsed line-oriented
format; and it provides methods to treat an XML document as a
walkable tree. The line-oriented PYX format used by pyxie
is
language-independent, and tools are available for several
languages. In general, a PYX representation of a document is
much easier to process using familiar line-oriented
text-processing tools like grep, sed, awk, bash, perl--or
standard python modules like, string
and re
--than is its
XML representation. Depending on what is downstream, a
transformation from XML to PYX might save a lot of work.
pyxie
's concept of treating an XML document like a tree is
similar to the ideas in DOM. Since the DOM standard is gaining
widespread support across a number of programming languages, it
will probably make sense for most programmers to focus on that
standard rather than on pyxie
if tree-representation of XML
documents is a requirement.
The too generically--and perhaps a bit wrongly--named XML
Parser
is a somewhat older tool to check the syntacticality
and well-formedness of an XML document (but not to validate
against a DTD). One extra utility class implements a bit of
fuzziness in the checking to get HTML documents to pass (even
without having all the closing tags XML requires). The range
of applicability of this module is not as broad as those in the
PyXML distribution. But it is easy to get up-and-running with
XML Parser if your requirement is just to verify some XML
documents. The module will check an XML document on STDIN if
run from the command line without even bothering to import it
into your program. You can't get much easier than that.
Like other high-level tools, xml_objects
is built on top of
SAX. The purpose of xml_objects
is to transform an XML
document into a two dimensional grid representation that can
more easily be stored in a relational database.
"Python 101" A general first introduction to Python:
http://www-4.ibm.com/software/developer/library/python101.html
Other introductions, from the Python web site:
http://www.python.org/doc/Intros.html
The Python Special Interest Group on XML:
http://www.python.org/sigs/xml-sig/
SAX for Python Home Page:
http://www.stud.ifi.uio.no/~lmariusg/download/python/xml/saxlib.html
Other Python Special Interest Groups:
http://www.python.org/sigs/
The Vaults of Parnassus (Python code/tool repository) XML page:
http://www.vex.net/parnassus/apyllo.py?i=2678626
Pyxie Home Page:
http://www.pyxie.org
Files used and mentioned in this article:
http://gnosis.cx/download/charming_python_1.zip
"Processing XML with Perl" A good article giving brief descriptions of XML modules available for Perl (similar overview to that in this article, for another popular P--- language):
http://www.xml.com/pub/2000/04/05/feature/index.html
There must be some enthymetic necessity to David Mertz writing a column on Python. Like the Monty crew, whose phonorecordings he imbibed as a teenager, he wound up with graduate degrees in philosophy. Now that he writes computer programs for a living--and writes about writing computer programs--a certain symmetry is served by writing such in and about Python. David may be reached at [email protected]; his life pored over at http://gnosis.cx/publish/.