CHARMING PYTHON (Special Installment) -- Revisiting XML Tools for Python --

David Mertz, Ph.D.
Ugly American, Gnosis Software, Inc.
May 2001

The first two installments of my Charming Python column provided an overview of working with XML in Python. However, in the year since those columns were written, the state of XML tools for Python has advanced quite a bit. Unfortunately, most of these advances have not been backwards compatible. This special installment revisits my initial discussion of XML tools and provides up-to-date code samples.

Introduction

Python is, in many ways, an ideal language for working with XML documents. Like Perl, REBOL, REXX, and TCL, it is a flexible scripting language with powerful text manipulation capabilities. Moreover, more than most types of text files (or streams), XML documents typically encode rich and complex data structures. The familiar "read some lines and compare them to some regular expressions" style of text processing is generally not well suited to parsing and processing XML. Python, fortunately (and more so than most other languages), has both straightforward ways of dealing with complex data structures (usually with classes and attributes), and a range of XML-related modules to aid in parsing, processing, and generating XML.

Much of the effort of maintaining a range of XML tools for Python is performed by members of the XML-SIG. As with other Python Special Interest Groups, the XML-SIG maintains a mailing list, list archive, helpful references, documentation, a standard packaging, and other resources.

Starting with Python 2.0, Python includes most of the XML-SIG project in its standard distribution. Some "bleeding-edge" features might be contained in the latest XML-SIG package that are not in a standard Python distribution. But for the vast majority of purposes--including the discussion in this article--the XML support in Python 2.0 will be what you are interested in. Fortunately, Python 2.0+ has advanced quite a way past the rudimentary support provided by xmllib in earlier Python versions. Nowadays, Python users have a healthy choice of DOM, SAX and expat techniques for handling XML (all of these will be recognized by XML developers who have used other programming languages).

Module: xmllib

xmllib is a non-validating, low-level parser. An application programmer uses xmllib by subclassing XMLParser and providing methods to handle document elements, such as specific or generic tags, or character entities. The use of xmllib is unchanged in Python 2.0+ from that in Python 1.5.x; in most cases you will be better off with a SAX technique, which is also stream-oriented but is more standard across languages and developers.

The examples in this article will be the same files used in the original column: a DTD called quotations.dtd and a document called sample.xml that conforms to this DTD (see Resources for an archive of files mentioned in this article). The code below will display the first few lines of each quotation in sample.xml, and produce very simple ASCII indicators of unknown tags and entities. The parsed text is handled as a sequential stream, and any accumulators used are the programmer's responsibility (such as the string of characters (#PCDATA) within a tag, or a list/dictionary of tags encountered).

File: try_xmllib.py

    import xmllib, string

    class QuotationParser(xmllib.XMLParser):
        """Crude xmllib extractor for quotations.dtd document"""
        def __init__(self):
            xmllib.XMLParser.__init__(self)
            self.thisquote = ''             # quotation accumulator
        def handle_data(self, data):
            self.thisquote = self.thisquote + data
        def syntax_error(self, message):
            pass
        def start_quotations(self, attrs):  # top level tag
            print '--- Begin Document ---'
        def start_quotation(self, attrs):
            print 'QUOTATION:'
        def end_quotation(self):
            print string.join(string.split(self.thisquote[:230]))+'...',
            print '('+str(len(self.thisquote))+' bytes)\n'
            self.thisquote = ''
        def unknown_starttag(self, tag, attrs):
            self.thisquote = self.thisquote + '{'
        def unknown_endtag(self, tag):
            self.thisquote = self.thisquote + '}'
        def unknown_charref(self, ref):
            self.thisquote = self.thisquote + '?'
        def unknown_entityref(self, ref):
            self.thisquote = self.thisquote + '#'

    if __name__ == '__main__':
        parser = QuotationParser()
        for c in open("sample.xml").read():
            parser.feed(c)
        parser.close()

Validation

One reason you might want to look beyond the standard XML support is if you need to perform validation along with your parsing. Unfortunately, the standard Python 2.0 XML package does not contain a validating parser.

xmlproc is a native Python parser that performs nearly complete validation. If you need a validating parser, xmlproc is currently your only choice in Python. xmlproc also provides a variety of high-level and experimental interfaces that other parsers do not.

Choosing A Parser

If you decide to use the Simple API for XML (SAX)--which you should for anything sophisticated, since most other tools are built on top of it--much of the work of sorting through parsers can be done for you. The module xml.sax contains a facility for automatically selecting the "best" parser. With a standard Python 2.0 installation, the only parser to choose from is expat, which is a speedy extension, written in C. However, it is possible to install another parser into $PYTHONLIB/xml/parsers and have it available for selection. Setting up a parser is a simple matter:

Python lines for selecting best parser

    import xml.sax
    parser = xml.sax.make_parser()

You may also select a specific parser by passing an argument in; but for portability--and also for upward compatibility with an even better parser yet to come--it is probably best to let make_parser() do the work for you.
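As a minimal sketch of both options, the lines below create one parser by default selection and one by naming a driver module, then report which class was chosen. The sketch assumes a current Python 3 interpreter rather than the Python 2.0 discussed here, and describe_parser is a hypothetical helper name, not part of xml.sax.

```python
import xml.sax
from xml.sax.xmlreader import XMLReader


def describe_parser(parser):
    """Return 'module.ClassName' for a SAX parser (hypothetical helper)."""
    cls = parser.__class__
    return "%s.%s" % (cls.__module__, cls.__name__)


# Let xml.sax pick whatever parser it considers best.
default_parser = xml.sax.make_parser()

# Or list preferred driver modules; make_parser() tries these first and
# falls back to its built-in defaults if none of them can be loaded.
explicit_parser = xml.sax.make_parser(["xml.sax.expatreader"])

if __name__ == "__main__":
    print(describe_parser(default_parser))
    print(describe_parser(explicit_parser))
```

Code that only talks to the returned object through the SAX interface should keep working whichever driver ends up being selected.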

It is possible to import xml.parsers.expat directly. If you do this, you get a few special techniques that the SAX interface does not provide. In this sense, xml.parsers.expat is a bit "lower level" than SAX. But the SAX techniques are quite standard, and quite good for stream-oriented processing; much of the time SAX is just the right level to work with. The raw speed differences are likely to be minimal, since the make_parser() function already manages to get the performance expat offers for general cases.
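For comparison, here is a rough sketch of driving xml.parsers.expat directly, again assuming a current Python 3 interpreter. Instead of registering a handler object, you assign callback functions to attributes of the parser. The events list, the callback names, and the inline sample string are illustrative choices, not anything from the article's archive.

```python
import xml.parsers.expat

events = []  # (kind, value) tuples collected in document order


def start_element(name, attrs):
    events.append(("start", name))


def end_element(name):
    events.append(("end", name))


def char_data(data):
    if data.strip():  # skip whitespace-only text between tags
        events.append(("text", data))


parser = xml.parsers.expat.ParserCreate()
parser.StartElementHandler = start_element
parser.EndElementHandler = end_element
parser.CharacterDataHandler = char_data

# The second argument marks this string as the final chunk of input.
parser.Parse("<quotations><quotation>Hello</quotation></quotations>", True)

if __name__ == "__main__":
    for kind, value in events:
        print(kind, value)
```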

What Is SAX?

By way of background, just what is SAX? A good answer is:

"SAX (Simple API for XML) is a common parser interface for XML parsers. It allows application writers to write applications that use XML parsers, but are independent of which parser is actually used. (Think of it as JDBC for XML.)" (Lars Marius Garshol, SAX for Python, see Resources)

SAX--like the parser modules it provides an API for--is essentially a sequential processor of an XML document. You use it in a manner largely similar to the xmllib example, but with a somewhat higher level of abstraction. Instead of defining a parser class, an application programmer defines a handler class that is registered with whatever parser is used. SAX defines four handler interfaces (each with several methods): ContentHandler (named DocumentHandler in the original SAX), DTDHandler, EntityResolver and ErrorHandler. Creating a parser attaches default versions of these interfaces, so you only need to define the ones you want to override. Here is some code that performs the same task as the xmllib example.

File: try_sax.py

    "Simple SAX example, updated for Python 2.0+"
    import string
    import xml.sax
    from xml.sax.handler import *

    class QuotationHandler(ContentHandler):
        """Crude extractor for quotations.dtd compliant XML document"""
        def __init__(self):
            self.in_quote = 0
            self.thisquote = ''
        def startDocument(self):
            print '--- Begin Document ---'
        def startElement(self, name, attrs):
            if name == 'quotation':
                print 'QUOTATION:'
                self.in_quote = 1
            else:
                self.thisquote = self.thisquote + '{'
        def endElement(self, name):
            if name == 'quotation':
                print string.join(string.split(self.thisquote[:230]))+'...',
                print '('+str(len(self.thisquote))+' bytes)\n'
                self.thisquote = ''
                self.in_quote = 0
            else:
                self.thisquote = self.thisquote + '}'
        def characters(self, ch):
            if self.in_quote:
                self.thisquote = self.thisquote + ch

    if __name__ == '__main__':
        parser = xml.sax.make_parser()
        handler = QuotationHandler()
        parser.setContentHandler(handler)
        parser.parse("sample.xml")

Two small things to notice about the example in contrast to xmllib are: the .parse() method handles a whole stream/string, so there is no need to create a loop to feed the parser; .parse() is also flexible enough to accept either a filename, a file object, or almost any file-like object (something that has a .read() method).
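To illustrate the file-like case, the sketch below hands .parse() an in-memory byte stream instead of a filename. It assumes a current Python 3 interpreter; the QuoteCollector class and the SAMPLE data are made up for illustration and stand in for the handler and sample.xml used above. Unlike the listing above, it only adds brace markers for tags nested inside a quotation.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler

# Made-up sample data standing in for sample.xml.
SAMPLE = b"""<quotations>
  <quotation>Brevity is the <em>soul</em> of wit.</quotation>
  <quotation>Less is more.</quotation>
</quotations>"""


class QuoteCollector(ContentHandler):
    """Collect the text of each <quotation>, marking nested tags with braces."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.in_quote = False
        self.parts = []    # accumulator for the current quotation
        self.quotes = []   # finished quotations

    def startElement(self, name, attrs):
        if name == "quotation":
            self.in_quote = True
            self.parts = []
        elif self.in_quote:
            self.parts.append("{")

    def endElement(self, name):
        if name == "quotation":
            # Normalize runs of whitespace, as the listing above does.
            self.quotes.append(" ".join("".join(self.parts).split()))
            self.in_quote = False
        elif self.in_quote:
            self.parts.append("}")

    def characters(self, content):
        if self.in_quote:
            self.parts.append(content)


handler = QuoteCollector()
parser = xml.sax.make_parser()
parser.setContentHandler(handler)
parser.parse(io.BytesIO(SAMPLE))   # any object with a .read() method works

if __name__ == "__main__":
    for quote in handler.quotes:
        print(quote)
```

Collecting results on the handler, rather than printing inside the callbacks, keeps the accumulator logic in one place and makes the handler easier to reuse.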

Package: DOM

DOM is a very-high-level tree-based representation of an XML document. The model is not specific to Python, but is a common XML model (see Resources for further information). Python's DOM package is built upon SAX, and is included in Python 2.0's standard XML support. Length constraints prevent extended code samples in this article, but an excellent general description is given in the XML-SIG's "Python/XML HOWTO":

The Document Object Model specifies a tree-based representation for an XML document. A top-level Document instance is the root of the tree, and has a single child which is the top-level Element instance; this Element has children nodes representing the content and any sub-elements, which may have further children, and so forth. Functions are defined which let you traverse the resulting tree any way you like, access element and attribute values, insert and delete nodes, and convert the tree back into XML.

The DOM is useful for modifying XML documents, because you can create a DOM tree, modify it by adding new nodes and moving subtrees around, and then produce a new XML document as output. You can also construct a DOM tree yourself, and convert it to XML; this is often a more flexible way of producing XML output than simply writing <tag1>...</tag1> to a file.

The syntax of using the module xml.dom has changed a bit since my earlier columns. The implementation of DOM that comes with Python 2.0 is called xml.dom.minidom, and provides a lightweight and small-footprint version of DOM. Obviously, there are a few experimental features of the full XML-SIG's DOM left out of xml.dom.minidom, but nothing most people will notice.
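The HOWTO's point about constructing a tree yourself and converting it to XML can be sketched with xml.dom.minidom as follows. This assumes a current Python 3 interpreter, and the element names, attribute, and text are invented for illustration.

```python
from xml.dom.minidom import Document

doc = Document()

# Build the tree top-down: a <quotations> root holding one <quotation>.
root = doc.createElement("quotations")
doc.appendChild(root)

quote = doc.createElement("quotation")
quote.setAttribute("source", "anonymous")
quote.appendChild(doc.createTextNode("Less is more."))
root.appendChild(quote)

# Serialize the root element (and everything below it) back to XML text.
xml_text = root.toxml()

if __name__ == "__main__":
    print(xml_text)
```

Because the tree does the serializing, nesting and attribute quoting are handled for you, which is the flexibility the HOWTO contrasts with writing tags to a file by hand.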

Generating a DOM object is simple; just use:

Create a Python DOM object from an XML file

    from xml.dom.minidom import parse, parseString
    dom1 = parse('mydata.xml')    # parse an XML file by name

Working with a DOM object is a fairly straightforward OOP-style affair. However, one tends to encounter a lot of list-like attributes in the hierarchy, which are not immediately easy to distinguish (except by enumeration in loops). For example, this is a typical snippet of DOM Python code:

Iterate through a Python DOM node object

    for node in dom_node.childNodes:
        if node.nodeName == '#text':        # PCDATA is a kind of node,
            PCDATA = node.nodeValue         # but not a new subtag
        elif node.nodeName == 'spam':
            spam_node_list.append(node)     # Create list of <spam> nodes

The Python standard documentation contains some more detailed DOM examples. The earlier column's examples of working with DOM objects still point in the right direction, but some method and attribute names have changed since then, so take a look at the Python documentation.
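As a self-contained version of that kind of loop, the sketch below parses a small made-up document with parseString() and sorts the children of its root element. It assumes a current Python 3 interpreter; the document content and the variable names are illustrative only.

```python
from xml.dom.minidom import parseString

# Made-up document; <menu>, <spam> and <eggs> are placeholder element names.
dom = parseString("<menu>text<spam>one</spam><eggs/><spam>two</spam></menu>")
dom_node = dom.documentElement

pcdata = []          # text found directly under <menu>
spam_node_list = []  # <spam> child elements

for node in dom_node.childNodes:
    if node.nodeType == node.TEXT_NODE:     # same nodes as nodeName '#text'
        pcdata.append(node.nodeValue)
    elif node.nodeName == "spam":
        spam_node_list.append(node)

# Each <spam> element holds a single text child in this sample.
spam_text = [n.firstChild.nodeValue for n in spam_node_list]

# getElementsByTagName() finds matching descendants without an explicit loop.
all_spam = dom_node.getElementsByTagName("spam")

if __name__ == "__main__":
    print(pcdata, spam_text, len(all_spam))
```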

Module: pyxie

The pyxie module is built on top of Python's standard XML support, and provides additional high-level interfaces to an XML document. pyxie does two basic things: it transforms XML documents to a more easily parsed line-oriented format; and it provides methods to treat an XML document as a walkable tree. The line-oriented PYX format used by pyxie is language-independent, and tools are available for several languages. In general, a PYX representation of a document is much easier to process using familiar line-oriented text-processing tools like grep, sed, awk, bash, perl--or standard Python modules like string and re--than is its XML representation. Depending on what is downstream, a transformation from XML to PYX might save a lot of work.
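To give a feel for what a line-oriented rendering looks like, here is a rough sketch that turns SAX events into PYX-style lines. It does not use pyxie itself; it assumes the usual PYX conventions (a leading "(" for a start tag, "A" for an attribute, "-" for character data, ")" for an end tag) and a current Python 3 interpreter. PyxWriter and xml_to_pyx are hypothetical names, and the sample string is made up.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class PyxWriter(ContentHandler):
    """Emit one PYX-style line per SAX event (a sketch, not pyxie's own code)."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.lines = []

    def startElement(self, name, attrs):
        self.lines.append("(" + name)
        for key in sorted(attrs.keys()):      # attributes follow the start tag
            self.lines.append("A%s %s" % (key, attrs[key]))

    def endElement(self, name):
        self.lines.append(")" + name)

    def characters(self, content):
        # Keep each event on one line by escaping backslashes and newlines.
        escaped = content.replace("\\", "\\\\").replace("\n", "\\n")
        self.lines.append("-" + escaped)


def xml_to_pyx(xml_bytes):
    """Parse XML bytes and return the list of PYX-style lines (hypothetical)."""
    handler = PyxWriter()
    xml.sax.parseString(xml_bytes, handler)
    return handler.lines


pyx_lines = xml_to_pyx(b'<quotation source="anon">Less is <em>more</em>.</quotation>')

# Line-oriented processing is now trivial, e.g. listing every start tag:
start_tags = [line[1:] for line in pyx_lines if line.startswith("(")]

if __name__ == "__main__":
    print("\n".join(pyx_lines))
```

Once the document is in this form, the first character of each line says what kind of event it is, so a filter like the start_tags line above needs no XML awareness at all.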

pyxie's concept of treating an XML document like a tree is similar to the ideas in DOM. Since the DOM standard is gaining widespread support across a number of programming languages, it will probably make sense for most programmers to focus on that standard rather than on pyxie if tree-representation of XML documents is a requirement.

More Modules: xml_pickle And xml_objectify

I have produced my own set of high-level modules for dealing with XML, called xml_pickle and xml_objectify. I have also written enough about these elsewhere (see Resources) that there is no need to go into a lot of details here. But these modules are often very useful when you want to "think in Python" rather than "think in XML." xml_objectify especially hides almost all the traces of XML itself from a Python programmer, and lets her work with perfectly "native" Python objects within a program. The actual XML data format that underlies things is abstracted almost to the point of invisibility. Likewise, xml_pickle lets a Python programmer start out with "native" Python objects whose data comes from any source, and dump (serialize) them into an XML format that other users might want downstream.

Resources

The best place to start for detailed documentation of Python 2.0+'s modules for handling XML is below. Take a look for all the packages whose namespace begins with xml:

http://python.org/doc/current/lib/markup.html

The Python Special Interest Group on XML:

http://www.python.org/sigs/xml-sig/

Other Python Special Interest Groups:

http://www.python.org/sigs/

The Vaults of Parnassus (Python code/tool repository) XML page:

http://www.vex.net/parnassus/apyllo.py?i=2678626

Pyxie Home Page:

http://www.pyxie.org

An updated discussion of xml_pickle and xml_objectify can be found in XML Matters #11: Lessons in Open Source and Common Sense:

http://gnosis.cx/publish/programming/xml_matters_11.html

Files used and mentioned in this article:

http://gnosis.cx/download/charming_python_1r.zip

About The Author

David, feeling that a foolish consistency is the hobgoblin of little minds, strives for it in all his writing. David may be reached at [email protected]; his life pored over at http://gnosis.cx/publish/.