Charming Python (Special Installment)
Revisiting XML Tools for Python
David Mertz, Ph.D.
Ugly American, Gnosis Software, Inc.
May 2001
The first two installments of my Charming Python column provided an overview of working with XML in Python. However, in the year since those columns were written, the state of XML tools for Python has advanced quite a bit. Unfortunately, most of these advances have not been backwards compatible. This special installment revisits my initial discussion of XML tools, and provides up-to-date code samples.
Introduction
Python is, in many ways, an ideal language for working with XML documents. Like Perl, REBOL, REXX, and Tcl, it is a flexible scripting language with powerful text manipulation capabilities. Moreover, more than most types of text files (or streams), XML documents typically encode rich and complex data structures. The familiar "read some lines and compare them to some regular expressions" style of text processing is generally not well suited to parsing and processing XML. Python, fortunately (and more so than most other languages), has both straightforward ways of dealing with complex data structures (usually with classes and attributes), and a range of XML-related modules to aid in parsing, processing, and generating XML.
Much of the effort of maintaining a range of XML tools for Python is performed by members of the XML-SIG. As with other Python Special Interest Groups, the XML-SIG maintains a mailing list, list archive, helpful references, documentation, a standard packaging, and other resources.
Starting with Python 2.0, Python includes most of the XML-SIG project in its standard distribution. Some "bleeding-edge" features might be contained in the latest XML-SIG package that are not in a standard Python distribution. But for the vast majority of purposes--including the discussion in this article--the XML support in Python 2.0 will be what you are interested in. Fortunately, Python 2.0+ has advanced quite a way past the rudimentary support provided by xmllib in earlier Python versions. Nowadays, Python users have a healthy choice of DOM, SAX and expat techniques for handling XML (all of these will be recognized by XML developers who have used other programming languages).
Module: xmllib
xmllib is a non-validating and low-level parser. The way xmllib works is by the application programmer overriding the class XMLParser, and providing methods to handle document elements, such as specific or generic tags, or character entities. The use of xmllib is unchanged in Python 2.0+ from that in Python 1.5x; in most cases you will be better off with a SAX technique, which is also stream-oriented, but is more standard across languages and developers.
The examples in this article will be the same files used in the original column: a DTD called quotations.dtd and a document called sample.xml that conforms to this DTD (see Resources for an archive of files mentioned in this article). The code below will display the first few lines of each quotation in sample.xml, and produce very simple ASCII indicators of unknown tags and entities. The parsed text is handled as a sequential stream, and any accumulators used are the programmer's responsibility (such as the string of characters (#PCDATA) within a tag, or a list/dictionary of tags encountered).
File: try_xmllib.py
    import xmllib, string

    class QuotationParser(xmllib.XMLParser):
        """Crude xmllib extractor for quotations.dtd document"""

        def __init__(self):
            xmllib.XMLParser.__init__(self)
            self.thisquote = ''             # quotation accumulator

        def handle_data(self, data):
            self.thisquote = self.thisquote + data

        def syntax_error(self, message):
            pass

        def start_quotations(self, attrs):  # top level tag
            print '--- Begin Document ---'

        def start_quotation(self, attrs):
            print 'QUOTATION:'

        def end_quotation(self):
            print string.join(string.split(self.thisquote[:230]))+'...',
            print '('+str(len(self.thisquote))+' bytes)\n'
            self.thisquote = ''

        def unknown_starttag(self, tag, attrs):
            self.thisquote = self.thisquote + '{'

        def unknown_endtag(self, tag):
            self.thisquote = self.thisquote + '}'

        def unknown_charref(self, ref):
            self.thisquote = self.thisquote + '?'

        def unknown_entityref(self, ref):
            self.thisquote = self.thisquote + '#'

    if __name__ == '__main__':
        parser = QuotationParser()
        for c in open("sample.xml").read():
            parser.feed(c)
        parser.close()
Validation
One reason you might want to look beyond the standard XML support is if you need to perform validation along with your parsing. Unfortunately, the standard Python 2.0 XML package does not contain a validating parser.
xmlproc is a Python-native parser that performs nearly complete validation. If you need a validating parser, xmlproc is currently your only choice in Python. In addition, xmlproc provides a variety of high-level and experimental interfaces that other parsers do not.
Choosing A Parser
If you decide to use the Simple API for XML (SAX)--which you should for anything sophisticated, since most other tools are built on top of it--much of the work of sorting through parsers can be done for you. The module xml.sax contains a facility for automatically selecting the "best" parser. With a standard Python 2.0 installation, the only parser to choose from is expat, which is a speedy extension, written in C. However, it is possible to install another parser into $PYTHONLIB/xml/parsers and have it available for selection.
Setting up a parser is a simple matter:
Python lines for selecting best parser
    import xml.sax
    parser = xml.sax.make_parser()
You may also select a specific parser by passing an argument in; but for portability--and also for upward compatibility with an even better parser yet to come--it is probably best to let make_parser() do the work for you.
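For readers who do want to name a parser, here is a minimal sketch of passing that argument. It assumes a current Python 3 interpreter (not the Python 2.0 discussed in this article) and names xml.sax.expatreader, the standard library's driver module for expat; make_parser() tries the listed modules first and falls back to its defaults if none can be loaded.

```python
import xml.sax

# Let the library pick whatever it considers the best available parser.
default_parser = xml.sax.make_parser()

# Ask for a specific driver module first; the built-in defaults are
# still tried afterwards if the named module cannot be imported.
expat_parser = xml.sax.make_parser(["xml.sax.expatreader"])

# Show which driver module each parser object actually came from.
print(type(default_parser).__module__)
print(type(expat_parser).__module__)
```

Either object is used the same way afterwards, which is the portability argument for leaving the choice to make_parser().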
It is possible to import xml.parsers.expat directly. If you do this, you get a few special techniques that the SAX interface does not provide. In this sense, xml.parsers.expat is a bit "lower level" than SAX. But the SAX techniques are quite standard, and quite good for stream-oriented processing; much of the time SAX is just the right level to work with. The raw speed differences are likely to be minimal, since the make_parser() function already manages to get the performance expat offers for general cases.
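To give a feel for that lower level, here is a small sketch of driving xml.parsers.expat directly. It assumes a current Python 3 interpreter; the callback names, the events list, and the sample string are all made up for illustration. Instead of a handler class, plain functions are assigned to attributes of the parser object.

```python
import xml.parsers.expat

events = []  # accumulator filled in by the callbacks below

def start_element(name, attrs):
    events.append(("start", name, dict(attrs)))

def end_element(name):
    events.append(("end", name))

def char_data(data):
    # Ignore whitespace-only chunks; keep the rest, trimmed.
    if data.strip():
        events.append(("text", data.strip()))

parser = xml.parsers.expat.ParserCreate()
parser.StartElementHandler = start_element
parser.EndElementHandler = end_element
parser.CharacterDataHandler = char_data

# The second argument marks this as the final chunk of input.
parser.Parse('<quotation source="anon">Hello <i>XML</i></quotation>', True)

for event in events:
    print(event)
```

As with xmllib, any accumulation of state between callbacks is left entirely to the programmer.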
What Is SAX?
By way of background, just what is SAX? A good answer is:
"SAX (Simple API for XML) is a common parser interface for XML parsers. It allows application writers to write applications that use XML parsers, but are independent of which parser is actually used. (Think of it as JDBC for XML.)" (Lars Marius Garshol, SAX for Python, see Resources)
SAX--like the parser modules it provides an API for--is essentially a sequential processor of an XML document. You use it in a manner largely similar to the xmllib example, but with a somewhat higher level of abstraction. Instead of defining a parser class, an application programmer defines a handler class that is registered with whatever parser is used. Four SAX interfaces must be defined (each with several methods): ContentHandler, DTDHandler, EntityResolver and ErrorHandler. Creating a parser also attaches default interfaces unless overridden. Here is some code that performs the same task as the xmllib example.
File: try_sax.py
    "Simple SAX example, updated for Python 2.0+"
    import string
    import xml.sax
    from xml.sax.handler import *

    class QuotationHandler(ContentHandler):
        """Crude extractor for quotations.dtd compliant XML document"""

        def __init__(self):
            self.in_quote = 0
            self.thisquote = ''

        def startDocument(self):
            print '--- Begin Document ---'

        def startElement(self, name, attrs):
            if name == 'quotation':
                print 'QUOTATION:'
                self.in_quote = 1
            else:
                self.thisquote = self.thisquote + '{'

        def endElement(self, name):
            if name == 'quotation':
                print string.join(string.split(self.thisquote[:230]))+'...',
                print '('+str(len(self.thisquote))+' bytes)\n'
                self.thisquote = ''
                self.in_quote = 0
            else:
                self.thisquote = self.thisquote + '}'

        def characters(self, ch):
            if self.in_quote:
                self.thisquote = self.thisquote + ch

    if __name__ == '__main__':
        parser = xml.sax.make_parser()
        handler = QuotationHandler()
        parser.setContentHandler(handler)
        parser.parse("sample.xml")
Two small things to notice about the example in contrast to xmllib are: the .parse() method handles a whole stream/string, so there is no need to create a loop to feed the parser; and .parse() is also flexible enough to accept a filename, a file object, or almost any file-like object (something that has a .read() method).
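The file-like case can be sketched as follows. This assumes a current Python 3 interpreter; the QuoteCollector class and the inline document are hypothetical stand-ins for the handler and sample.xml used above. An in-memory io.BytesIO object takes the place of a file on disk.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler

class QuoteCollector(ContentHandler):
    """Collect the text of each <quotation> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_quote = False
        self.current = []   # text chunks of the quotation being read
        self.quotes = []    # finished quotations, whitespace-normalized

    def startElement(self, name, attrs):
        if name == "quotation":
            self.in_quote = True
            self.current = []

    def endElement(self, name):
        if name == "quotation":
            self.quotes.append(" ".join("".join(self.current).split()))
            self.in_quote = False

    def characters(self, content):
        if self.in_quote:
            self.current.append(content)

# Stand-in for sample.xml; any object with a .read() method will do.
document = io.BytesIO(
    b"<quotations>"
    b"<quotation>First   quote</quotation>"
    b"<quotation>Second <i>quote</i></quotation>"
    b"</quotations>"
)

handler = QuoteCollector()
parser = xml.sax.make_parser()
parser.setContentHandler(handler)
parser.parse(document)      # a file-like object instead of a filename

print(handler.quotes)
```

The handler never learns where the bytes came from, so the same class works unchanged with a filename or an open file.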
Package: DOM
DOM is a very-high-level tree-based representation of an XML document. The model is not specific to Python, but is a common XML model (see Resources for further information). Python's DOM package is built upon SAX, and is included in Python 2.0's standard XML support. Length constraints prevent extended code samples in this article, but an excellent general description is given in the XML-SIG's "Python/XML HOWTO":
The Document Object Model specifies a tree-based representation for an XML document. A top-level Document instance is the root of the tree, and has a single child which is the top-level Element instance; this Element has children nodes representing the content and any sub-elements, which may have further children, and so forth. Functions are defined which let you traverse the resulting tree any way you like, access element and attribute values, insert and delete nodes, and convert the tree back into XML.
The DOM is useful for modifying XML documents, because you can create a DOM tree, modify it by adding new nodes and moving subtrees around, and then produce a new XML document as output. You can also construct a DOM tree yourself, and convert it to XML; this is often a more flexible way of producing XML output than simply writing <tag1>...</tag1> to a file.
The syntax of using the module xml.dom has changed a bit since my earlier columns. The implementation of DOM that comes with Python 2.0 is called xml.dom.minidom, and provides a lightweight and small-footprint version of DOM. Obviously, there are a few experimental features of the full XML-SIG's DOM left out of xml.dom.minidom, but nothing most people will notice.
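The HOWTO passage quoted above mentions constructing a DOM tree yourself and converting it to XML. A minimal sketch of that idea with xml.dom.minidom follows; it assumes a current Python 3 interpreter, and the "source" attribute and its value are invented for illustration rather than taken from quotations.dtd.

```python
from xml.dom.minidom import Document

doc = Document()
root = doc.createElement("quotations")
doc.appendChild(root)

quote = doc.createElement("quotation")
quote.setAttribute("source", "anonymous")   # hypothetical attribute
quote.appendChild(doc.createTextNode("Less is more."))
root.appendChild(quote)

# Serialize just the element tree (this skips the XML declaration).
print(root.toxml())
```

Nodes are always created through the Document object and only then attached to a parent, which is what makes moving subtrees around straightforward.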
Generating a DOM object is simple; just use:
Create a Python DOM object from an XML file
    from xml.dom.minidom import parse, parseString
    dom1 = parse('mydata.xml')   # parse an XML file by name
Working with a DOM object is a fairly straightforward OOP-style affair. However, one tends to encounter a lot of list-like attributes in the hierarchy, which are not immediately easy to distinguish (except by enumeration in loops). For example, this is an average snippet of DOM Python code:
Iterate through a Python DOM node object
    for node in dom_node.childNodes:
        if node.nodeName == '#text':       # PCDATA is a kind of node,
            PCDATA = node.nodeValue        # but not a new subtag
        elif node.nodeName == 'spam':
            spam_node_list.append(node)    # Create list of <spam> nodes
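A self-contained variant of the same loop may help in trying it out. This sketch assumes a current Python 3 interpreter; the input string and the <menu>, <spam> and <eggs> element names are invented. It tests nodeType against TEXT_NODE, which picks out the same nodes whose nodeName is '#text'.

```python
from xml.dom.minidom import parseString

# Hypothetical input; the element names are made up for this example.
dom = parseString("<menu>intro<spam>one</spam><eggs/><spam>two</spam></menu>")
dom_node = dom.documentElement

text_chunks = []      # PCDATA found directly under <menu>
spam_node_list = []   # <spam> child elements, in document order

for node in dom_node.childNodes:
    if node.nodeType == node.TEXT_NODE:
        text_chunks.append(node.nodeValue)
    elif node.nodeName == "spam":
        spam_node_list.append(node)

print(text_chunks)
print([n.firstChild.nodeValue for n in spam_node_list])
```

Note that childNodes mixes text nodes and element nodes in one list, which is the "not immediately easy to distinguish" situation described above.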
The Python standard documentation contains some more detailed DOM examples. The earlier column's examples of working with DOM objects still point in the right direction, but some method and attribute names have changed since then, so take a look at the Python documentation.
Module: pyxie
The pyxie module is built on top of Python's standard XML support, and provides additional high-level interfaces to an XML document. pyxie does two basic things: it transforms XML documents to a more easily parsed line-oriented format; and it provides methods to treat an XML document as a walkable tree. The line-oriented PYX format used by pyxie is language-independent, and tools are available for several languages. In general, a PYX representation of a document is much easier to process using familiar line-oriented text-processing tools like grep, sed, awk, bash, perl--or standard Python modules like string and re--than is its XML representation. Depending on what is downstream, a transformation from XML to PYX might save a lot of work.
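To make the line-oriented idea concrete, here is a rough sketch of an XML-to-PYX style conversion built on the standard SAX support. It is not pyxie's own code or API. It assumes a current Python 3 interpreter and the usual PYX conventions as I understand them: a leading "(" for a start tag, ")" for an end tag, "A" for an attribute, and "-" for character data. The PyxWriter class and xml_to_pyx function are hypothetical names.

```python
import xml.sax
from xml.sax.handler import ContentHandler

class PyxWriter(ContentHandler):
    """Record one PYX-style line per parsing event (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def startElement(self, name, attrs):
        self.lines.append("(" + name)
        for attr_name in attrs.getNames():
            self.lines.append("A%s %s" % (attr_name, attrs.getValue(attr_name)))

    def endElement(self, name):
        self.lines.append(")" + name)

    def characters(self, content):
        # Keep each record on one line by escaping backslashes and newlines.
        escaped = content.replace("\\", "\\\\").replace("\n", "\\n")
        self.lines.append("-" + escaped)

def xml_to_pyx(xml_bytes):
    """Return the PYX-style lines for a small XML document given as bytes."""
    handler = PyxWriter()
    xml.sax.parseString(xml_bytes, handler)
    return handler.lines

lines = xml_to_pyx(b'<quotation source="anon">Hi <i>there</i></quotation>')
for line in lines:
    print(line)
```

Because every record starts with a one-character type marker, something like grep '^(' on the output is enough to list every start tag in a document.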
pyxie's concept of treating an XML document like a tree is similar to the ideas in DOM. Since the DOM standard is gaining widespread support across a number of programming languages, it will probably make sense for most programmers to focus on that standard rather than on pyxie if tree-representation of XML documents is a requirement.
More Modules: xml_pickle and xml_objectify
I have produced my own set of high-level modules for dealing with XML, called xml_pickle and xml_objectify. I have also written enough about these elsewhere (see Resources) that there is no need to go into a lot of details here. But these modules are often very useful when you want to "think in Python" rather than "think in XML." xml_objectify especially hides almost all the traces of XML itself from a Python programmer, and lets her work with perfectly "native" Python objects within a program. The actual XML data format that underlies things is abstracted almost to the point of invisibility. Likewise, xml_pickle lets a Python programmer start out with "native" Python objects whose data comes from any source, and dump (serialize) them into an XML format that other users might want downstream.
Resources
The best place to start for detailed documentation of Python 2.0+'s modules for handling XML is below. Take a look for all the packages whose namespace begins with xml:
http://python.org/doc/current/lib/markup.html
The Python Special Interest Group on XML:
http://www.python.org/sigs/xml-sig/
Other Python Special Interest Groups:
http://www.python.org/sigs/
The Vaults of Parnassus (Python code/tool repository) XML page:
http://www.vex.net/parnassus/apyllo.py?i=2678626
Pyxie Home Page:
http://www.pyxie.org
An updated discussion of xml_pickle and xml_objectify can be found in XML Matters #11: Lessons in Open Source and Common Sense:
http://gnosis.cx/publish/programming/xml_matters_11.html
Files used and mentioned in this article:
http://gnosis.cx/download/charming_python_1r.zip
About The Author
David, feeling that a foolish consistency is the hobgoblin of little minds, strives for it in all his writing. David may be reached at [email protected]; his life pored over at http://gnosis.cx/publish/.