CHARMING PYTHON (Special Installment)
Revisiting XML Tools for Python
David Mertz, Ph.D.
Ugly American, Gnosis Software, Inc.
May 2001
The first two installments of my _Charming Python_ column
provided an overview of working with XML in Python. However,
in the year since those columns were written, the state of
XML tools for Python has advanced quite a bit.
Unfortunately, most of these advances have not been backwards
compatible. This special installment revisits my initial
discussion of XML tools, and provides up-to-date code samples.
INTRODUCTION
------------------------------------------------------------------------
Python is, in many ways, an ideal language for working with XML
documents. Like Perl, REBOL, REXX, and TCL it is a flexible
scripting language with powerful text manipulation
capabilities. Moreover, more than most types of text files (or
streams), XML documents typically encode rich and complex data
structures. The familiar "read some lines and compare them to
some regular expressions" style of text processing is generally
not well-suited to adequately parsing and processing XML.
Python, fortunately (and more so than most other languages),
has both straightforward ways of dealing with complex data
structures (usually with classes and attributes), and a range
of XML-related modules to aid in parsing, processing, and
generating XML.
Much of the effort of maintaining a range of XML tools for
Python is performed by members of the XML-SIG. As with other
Python Special Interest Groups, the XML-SIG maintains a mailing
list, list archive, helpful references, documentation, a
standard packaging, and other resources.
Starting with Python 2.0, Python includes most of the XML-SIG
project in its standard distribution. Some "bleeding-edge"
features might be contained in the latest XML-SIG package that
are not in a standard Python distribution. But for the vast
majority of purposes--including the discussion in this
article--the XML support in Python 2.0 will be what you are
interested in. Fortunately, Python 2.0+ has advanced quite a
way past the rudimentary support provided by [xmllib] in
earlier Python versions. Nowadays, Python users have a healthy
choice of 'DOM', 'SAX' and 'expat' techniques for handling XML
(all of these will be recognized by XML developers who have
used other programming languages).
MODULE: XMLLIB
------------------------------------------------------------------------
[xmllib] is a non-validating and low-level parser. The way
[xmllib] works is by the application programmer overriding the
class XMLParser, and providing methods to handle document
elements, such as specific or generic tags, or character
entities. The use of [xmllib] is unchanged in Python 2.0+ from
that in Python 1.5x; in most cases you will be better off
with a SAX technique, which is also stream-oriented, but is
more standard across languages and developers.
The examples in this article use the same files as the
original column: a DTD called 'quotations.dtd' and a document
conforming to that DTD called 'sample.xml' (see Resources for an
archive of files mentioned in this article). The code below will
display the first few lines of each quotation in 'sample.xml',
and produce very simple ASCII indicators of unknown tags and
entities. The parsed text is handled as a sequential stream,
and any accumulators used are the programmer's responsibility
(such as the string of characters (#PCDATA) within a tag, or a
list/dictionary of tags encountered).
#--------------- File: try_xmllib.py -------------------#
import xmllib, string
class QuotationParser(xmllib.XMLParser):
    """Crude xmllib extractor for quotations.dtd document"""
    def __init__(self):
        xmllib.XMLParser.__init__(self)
        self.thisquote = ''             # quotation accumulator
    def handle_data(self, data):
        self.thisquote = self.thisquote + data
    def syntax_error(self, message):
        pass
    def start_quotations(self, attrs):  # top level tag
        print '--- Begin Document ---'
    def start_quotation(self, attrs):
        print 'QUOTATION:'
    def end_quotation(self):
        print string.join(string.split(self.thisquote[:230]))+'...',
        print '('+str(len(self.thisquote))+' bytes)\n'
        self.thisquote = ''
    def unknown_starttag(self, tag, attrs):
        self.thisquote = self.thisquote + '{'
    def unknown_endtag(self, tag):
        self.thisquote = self.thisquote + '}'
    def unknown_charref(self, ref):
        self.thisquote = self.thisquote + '?'
    def unknown_entityref(self, ref):
        self.thisquote = self.thisquote + '#'
if __name__ == '__main__':
    parser = QuotationParser()
    for c in open("sample.xml").read():
        parser.feed(c)
    parser.close()
VALIDATION
------------------------------------------------------------------------
One reason you might want to look beyond the standard XML
support is if you need to perform validation along with your
parsing. Unfortunately, the standard Python 2.0 XML package
does not contain a validating parser.
[xmlproc] is a Python-native parser that performs nearly
complete validation. If you need a validating parser,
[xmlproc] is currently your only choice in Python. As well,
[xmlproc] provides a variety of high-level and experimental
interfaces that other parsers do not.
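The distinction between checking well-formedness and validating
can be seen with the standard parser alone. What follows is a
minimal sketch, assuming a current Python 3 interpreter (this
article otherwise targets Python 2.0); the two tiny documents and
the helper 'parses_ok()' are made up for illustration:

```python
#------- Sketch: well-formed is not the same as valid -------#
import xml.sax
from xml.sax.handler import ContentHandler

# Hypothetical document: the internal DTD says <a> must contain <b>,
# but the body contains <c> instead. Well-formed, yet not valid.
INVALID_BUT_WELL_FORMED = (
    b'<!DOCTYPE a [<!ELEMENT a (b)><!ELEMENT b EMPTY>]><a><c/></a>')

# Hypothetical document with a mismatched tag: not even well-formed.
NOT_WELL_FORMED = b'<a><c></a>'

def parses_ok(data):
    """Return True if the default (non-validating) parser accepts data."""
    try:
        xml.sax.parseString(data, ContentHandler())
        return True
    except xml.sax.SAXParseException:
        return False

print(parses_ok(INVALID_BUT_WELL_FORMED))
print(parses_ok(NOT_WELL_FORMED))
```

The first document is accepted even though it breaks its own DTD;
only the second, which is not well-formed, is rejected. Catching
the first kind of problem is the job of a validating parser.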
CHOOSING A PARSER
------------------------------------------------------------------------
If you decide to use the Simple API for XML (SAX)--which you
should for anything sophisticated, since most other tools are
built on top of it--much of the work of sorting through parsers
can be done for you. The module [xml.sax] contains a facility
for automatically selecting the "best" parser. With a standard
Python 2.0 installation, the only parser to choose from is
[expat], which is a speedy extension, written in C. However,
it is possible to install another parser into
'$PYTHONLIB/xml/parsers' and have it available for selection.
Setting up a parser is a simple matter:
#------- Python lines for selecting best parser --------#
import xml.sax
parser = xml.sax.make_parser()
You may also select a specific parser by passing an argument
in; but for portability--and also for upward compatibility with
an even better parser yet to come--it is probably best to let
'make_parser()' do the work for you.
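For comparison, both styles might look like the sketch below. It
assumes a current Python 3 standard library, where the expat
driver lives in the module 'xml.sax.expatreader'; the variable
names are only illustrative:

```python
#------- Sketch: default versus explicit parser choice -------#
import xml.sax
from xml.sax.xmlreader import XMLReader

# Let the library pick whatever driver it considers best.
default_parser = xml.sax.make_parser()

# Ask for a specific driver by module name; make_parser() tries each
# listed module in order, then falls back to its built-in defaults.
expat_parser = xml.sax.make_parser(['xml.sax.expatreader'])

# Either way the result is a SAX XMLReader, so downstream code is the same.
for p in (default_parser, expat_parser):
    print(type(p).__name__, isinstance(p, XMLReader))
```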
It is possible to import [xml.parsers.expat] directly. If you
do this, you get a few special techniques that the SAX
interface does not provide. In this sense, [xml.parsers.expat]
is a bit "lower level" than SAX. But the SAX techniques are
quite standard, and quite good for stream-oriented processing;
much of the time SAX is just the right level to work with. The
raw speed differences are likely to be minimal, since the
'make_parser()' function already manages to get the performance
'expat' offers for general cases.
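To give a feel for the lower-level style, here is a small sketch
of driving [xml.parsers.expat] directly. It assumes Python 3; the
inline document is a made-up stand-in for 'sample.xml', and the
callback names are arbitrary:

```python
#------- Sketch: using xml.parsers.expat directly -------#
import xml.parsers.expat

# Hypothetical stand-in for 'sample.xml'; not the article's real data.
SAMPLE = ('<quotations><quotation source="A. Nonymous">'
          'Hello, <i>XML</i>!</quotation></quotations>')

events = []          # accumulator for the callbacks below

def start(name, attrs):
    events.append(('start', name, dict(attrs)))

def end(name):
    events.append(('end', name))

def chars(data):
    events.append(('chars', data))

# Handlers are plain functions assigned as attributes of the parser,
# rather than methods of a registered handler object as in SAX.
parser = xml.parsers.expat.ParserCreate()
parser.StartElementHandler = start
parser.EndElementHandler = end
parser.CharacterDataHandler = chars
parser.Parse(SAMPLE, True)     # True marks this as the final chunk

for event in events:
    print(event)
```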
WHAT IS SAX?
------------------------------------------------------------------------
By way of background, just what is SAX? A good answer is:
"SAX (Simple API for XML) is a common parser interface for XML
parsers. It allows application writers to write applications
that use XML parsers, but are independent of which parser is
actually used. (Think of it as JDBC for XML.)" (Lars Marius
Garshol, SAX for Python, see Resources)
SAX--like the parser modules it provides an API for--is
essentially a sequential processor of an XML document. You use
it in a manner largely similar to the [xmllib] example, but
with a somewhat higher level of abstraction. Instead of
defining a parser class, an application programmer defines a
'handler' class that is registered with whatever parser is
used. Four SAX handler interfaces exist (each with several
methods): ContentHandler (named DocumentHandler in the older
SAX 1 API), DTDHandler, EntityResolver and ErrorHandler.
Creating a parser attaches a default implementation of each, so
you override only the ones you need. Here is some code that
performs the same task as the [xmllib] example.
#----------------- File: try_sax.py --------------------#
"Simple SAX example, updated for Python 2.0+"
import string
import xml.sax
from xml.sax.handler import *
class QuotationHandler(ContentHandler):
    """Crude extractor for quotations.dtd compliant XML document"""
    def __init__(self):
        self.in_quote = 0
        self.thisquote = ''
    def startDocument(self):
        print '--- Begin Document ---'
    def startElement(self, name, attrs):
        if name == 'quotation':
            print 'QUOTATION:'
            self.in_quote = 1
        else:
            self.thisquote = self.thisquote + '{'
    def endElement(self, name):
        if name == 'quotation':
            print string.join(string.split(self.thisquote[:230]))+'...',
            print '('+str(len(self.thisquote))+' bytes)\n'
            self.thisquote = ''
            self.in_quote = 0
        else:
            self.thisquote = self.thisquote + '}'
    def characters(self, ch):
        if self.in_quote:
            self.thisquote = self.thisquote + ch
if __name__ == '__main__':
    parser = xml.sax.make_parser()
    handler = QuotationHandler()
    parser.setContentHandler(handler)
    parser.parse("sample.xml")
Two small things to notice about this example, in contrast to
the [xmllib] one: the '.parse()' method consumes the whole input
in one call, so there is no need to write a loop that feeds the
parser; and '.parse()' is flexible enough to accept a filename,
a file object, or almost any file-like object (something that
has a '.read()' method).
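The file-like case is easy to try without any file on disk. The
sketch below assumes Python 3 and uses an in-memory byte stream;
the class 'QuoteCollector' and the sample text are hypothetical,
a variation on the handler above that stores quotations in a list
instead of printing them:

```python
#------- Sketch: SAX parsing from an in-memory stream -------#
import io
import xml.sax
from xml.sax.handler import ContentHandler

class QuoteCollector(ContentHandler):
    """Collect the whitespace-normalized text of each <quotation>."""
    def __init__(self):
        super().__init__()
        self.in_quote = False
        self.parts = []       # text chunks of the current quotation
        self.quotes = []      # finished quotations
    def startElement(self, name, attrs):
        if name == 'quotation':
            self.in_quote = True
            self.parts = []
    def endElement(self, name):
        if name == 'quotation':
            self.quotes.append(' '.join(''.join(self.parts).split()))
            self.in_quote = False
    def characters(self, content):
        if self.in_quote:
            self.parts.append(content)

# Hypothetical stand-in for 'sample.xml'.
SAMPLE = b"""<quotations>
  <quotation>First   quote,
     on two lines.</quotation>
  <quotation>Second <em>quote</em>.</quotation>
</quotations>"""

handler = QuoteCollector()
parser = xml.sax.make_parser()
parser.setContentHandler(handler)
parser.parse(io.BytesIO(SAMPLE))    # an object with a .read() method
print(handler.quotes)
```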
PACKAGE: DOM
------------------------------------------------------------------------
DOM is a very-high-level tree-based representation of an XML
document. The model is not specific to Python, but is a common
XML model (see Resources for further information). Python's
DOM package is built upon SAX, and is included in Python 2.0's
standard XML support. Length constraints prevent detailed code
samples in this article, but an excellent general description
is given in the XML-SIG's "Python/XML HOWTO":
The Document Object Model specifies a tree-based
representation for an XML document. A top-level Document
instance is the root of the tree, and has a single child
which is the top-level Element instance; this Element has
children nodes representing the content and any sub-elements,
which may have further children, and so forth. Functions are
defined which let you traverse the resulting tree any way you
like, access element and attribute values, insert and delete
nodes, and convert the tree back into XML.
The DOM is useful for modifying XML documents, because you
can create a DOM tree, modify it by adding new nodes and
moving subtrees around, and then produce a new XML document
as output. You can also construct a DOM tree yourself, and
convert it to XML; this is often a more flexible way of
producing XML output than simply writing ... to
a file.
The syntax of using the module [xml.dom] has changed a bit
since my earlier columns. The implementation of DOM that comes
with Python 2.0 is called [xml.dom.minidom], and provides a
lightweight and small-footprint version of DOM. Obviously,
there are a few experimental features of the full XML-SIG's DOM
left out of [xml.dom.minidom], but nothing most people will
notice.
Generating a DOM object is simple to accomplish; just use:
#------ Create a Python DOM object from an XML file -----#
from xml.dom.minidom import parse, parseString
dom1 = parse('mydata.xml') # parse an XML file by name
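The companion function 'parseString()' does the same job for XML
that is already in memory. A short sketch, assuming Python 3 and
a made-up document (not the contents of 'mydata.xml'):

```python
#------- Sketch: a DOM object from a string -------#
from xml.dom.minidom import parseString

# Hypothetical document text, used only for illustration.
dom2 = parseString(
    '<quotations><quotation source="A. Nonymous">Hi</quotation></quotations>')

root = dom2.documentElement                     # the top-level Element
first = root.getElementsByTagName('quotation')[0]
print(root.tagName, first.getAttribute('source'), first.firstChild.data)
```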
Working with a DOM object is a fairly straightforward OOP-style
affair. However, one tends to encounter a lot of list-like
attributes in the hierarchy, which are not immediately easy to
distinguish (except by enumeration in loops). For example,
this is an average snippet of DOM Python code:
#------- Iterate through a Python DOM node object -------#
for node in dom_node.childNodes:
    if node.nodeName == '#text':     # PCDATA is a kind of node,
        PCDATA = node.nodeValue      # but not a new subtag
    elif node.nodeName == 'spam':
        spam_node_list.append(node)  # Create list of nodes
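A self-contained variant of that loop, which can be run as is, is
sketched here. It assumes Python 3; the input document and the
list 'text_parts' are invented for the example, and it tests
'nodeType' rather than comparing 'nodeName' to '#text', which
amounts to the same check for text nodes:

```python
#------- Sketch: separating text nodes from element nodes -------#
from xml.dom.minidom import parseString

# Hypothetical input with mixed content.
dom = parseString(
    '<eggs>some <spam>one</spam> text <ham/> and <spam>two</spam></eggs>')
dom_node = dom.documentElement

text_parts = []         # PCDATA found directly under <eggs>
spam_node_list = []     # child elements named 'spam'
for node in dom_node.childNodes:
    if node.nodeType == node.TEXT_NODE:       # nodeName is '#text' here
        text_parts.append(node.nodeValue)
    elif node.nodeName == 'spam':
        spam_node_list.append(node)

print(''.join(text_parts).split())
print([n.firstChild.data for n in spam_node_list])
```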
The Python standard documentation contains some more detailed
DOM examples. The earlier column's examples of working with
DOM objects still point in the right direction, but some
method and attribute names have changed since then, so take a
look at the Python documentation.
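The other direction mentioned in the HOWTO passage, constructing
a tree yourself and converting it to XML, might be sketched as
follows. This assumes the Python 3 version of [xml.dom.minidom];
the element names follow the quotations example and the attribute
value is made up:

```python
#------- Sketch: build a DOM tree and write it out as XML -------#
from xml.dom.minidom import Document

doc = Document()
root = doc.createElement('quotations')
doc.appendChild(root)

quote = doc.createElement('quotation')
quote.setAttribute('source', 'A. Nonymous')     # made-up attribute value
quote.appendChild(doc.createTextNode('Hello, DOM'))
root.appendChild(quote)

xml_text = root.toxml()      # serialize just this subtree
print(xml_text)
```

Calling '.toxml()' on the Document itself rather than on the root
element would also emit the XML declaration.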
MODULE: PYXIE
------------------------------------------------------------------------
The [pyxie] module is built on top of Python's standard XML
support, and provides additional high-level interfaces to an
XML document. [pyxie] does two basic things: it transforms
XML documents to a more easily parsed line-oriented format; and
it provides methods to treat an XML document as a walkable
tree. The line-oriented PYX format used by [pyxie] is
language-independent, and tools are available for several
languages. In general, a PYX representation of a document is
much easier to process using familiar line-oriented
text-processing tools like grep, sed, awk, bash, perl--or
standard Python modules like [string] and [re]--than is its
XML representation. Depending on what is downstream, a
transformation from XML to PYX might save a lot of work.
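To make the line-oriented idea concrete, here is a toy converter
that emits PYX-style lines from SAX events: '(' for a start tag,
'A' for an attribute, '-' for character data and ')' for an end
tag. It is a sketch under the assumption of Python 3, it is not
the [pyxie] module's own code or API, and the class name
'PYXWriter' is hypothetical:

```python
#------- Sketch: emitting PYX-style lines from SAX events -------#
import xml.sax
from xml.sax.handler import ContentHandler

class PYXWriter(ContentHandler):
    """Toy XML-to-PYX converter; not the pyxie module itself."""
    def __init__(self):
        super().__init__()
        self.lines = []
    def startElement(self, name, attrs):
        self.lines.append('(' + name)
        for key in attrs.getNames():
            self.lines.append('A%s %s' % (key, attrs.getValue(key)))
    def endElement(self, name):
        self.lines.append(')' + name)
    def characters(self, content):
        # Keep one record per line by escaping embedded newlines.
        self.lines.append('-' + content.replace('\n', '\\n'))

handler = PYXWriter()
xml.sax.parseString(
    b'<quotation source="A. Nonymous">Hello, <i>PYX</i></quotation>',
    handler)
pyx = '\n'.join(handler.lines)
print(pyx)

# Line-oriented tools now apply directly, e.g. list every start tag:
print([line[1:] for line in pyx.splitlines() if line.startswith('(')])
```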
[pyxie]'s concept of treating an XML document like a tree is
similar to the ideas in DOM. Since the DOM standard is gaining
widespread support across a number of programming languages, it
will probably make sense for most programmers to focus on that
standard rather than on [pyxie] if tree-representation of XML
documents is a requirement.
MORE MODULES: [xml_pickle] AND [xml_objectify]
------------------------------------------------------------------------
I have produced my own set of high-level modules for dealing
with XML, called [xml_pickle] and [xml_objectify]. I have also
written enough about these elsewhere (see Resources) that there
is no need to go into a lot of details here. But these modules
are often very useful when you want to "think in Python" rather
than "think in XML." [xml_objectify] especially hides almost
all the traces of XML itself from a Python programmer, and lets
her work with perfectly "native" Python objects within a
program. The actual XML data format that underlies things is
abstracted almost to the point of invisibility. Likewise,
[xml_pickle] lets a Python programmer start out with "native"
Python objects whose data comes from any source, and dump
(serialize) them into an XML format that other users might want
downstream.
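The flavor of "thinking in Python" can be suggested with a toy
built only on the standard library. To be clear, this is not the
[xml_objectify] API or its implementation; 'toy_objectify()' is a
hypothetical helper, written for Python 3, that merely shows what
attribute-style access to XML data feels like:

```python
#------- Sketch: XML data as "native" attribute access -------#
from types import SimpleNamespace
from xml.dom.minidom import parseString

def toy_objectify(element):
    """Hypothetical helper: turn a DOM element into a plain object."""
    obj = SimpleNamespace()
    for name, value in element.attributes.items():
        setattr(obj, name, value)             # XML attributes
    text = []
    for child in element.childNodes:
        if child.nodeType == child.TEXT_NODE:
            text.append(child.data)
        elif child.nodeType == child.ELEMENT_NODE:
            # Child elements collect in a list named after their tag.
            vars(obj).setdefault(child.tagName, []).append(
                toy_objectify(child))
    obj.text = ''.join(text).strip()          # character data, if any
    return obj

# Made-up document following the quotations example.
doc = parseString(
    '<quotations><quotation source="A. Nonymous">Hi</quotation>'
    '<quotation source="Someone Else">Bye</quotation></quotations>')
quotations = toy_objectify(doc.documentElement)
print([(q.source, q.text) for q in quotations.quotation])
```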
RESOURCES
------------------------------------------------------------------------
The best place to start for detailed documentation of Python
2.0+'s modules for handling XML is below. Take a look for all
the packages whose namespace begins with 'xml':
http://python.org/doc/current/lib/markup.html
The Python Special Interest Group on XML:
http://www.python.org/sigs/xml-sig/
Other Python Special Interest Groups:
http://www.python.org/sigs/
The Vaults of Parnassus (Python code/tool repository) XML page:
http://www.vex.net/parnassus/apyllo.py?i=2678626
Pyxie Home Page:
http://www.pyxie.org
An updated discussion of [xml_pickle] and [xml_objectify] can
be found in _XML Matters #11: Lessons in Open Source and
Common Sense_ :
http://gnosis.cx/publish/programming/xml_matters_11.html
Files used and mentioned in this article:
http://gnosis.cx/download/charming_python_1r.zip
ABOUT THE AUTHOR
------------------------------------------------------------------------
{Picture of Author: http://gnosis.cx/cgi-bin/img_dqm.cgi}
David, feeling that a foolish consistency is the hobgoblin of
little minds, strives for it in all his writing. David may be
reached at mertz@gnosis.cx; his life pored over at
http://gnosis.cx/publish/.