CHARMING PYTHON (Special Installment)
Revisiting XML Tools for Python

David Mertz, Ph.D.
Ugly American, Gnosis Software, Inc.
May 2001

The first two installments of my _Charming Python_ column provided an overview of working with XML in Python. However, in the year since those columns were written, the state of XML-tools for Python has advanced quite a bit. Unfortunately, most of these advances have not been backwards compatible. This special installment revisits my initial discussion of XML tools, and provides up-to-date code samples.

INTRODUCTION
------------------------------------------------------------------------

Python is, in many ways, an ideal language for working with XML documents. Like Perl, REBOL, REXX, and TCL it is a flexible scripting language with powerful text manipulation capabilities. Moreover, more than most types of text files (or streams), XML documents typically encode rich and complex data structures. The familiar "read some lines and compare them to some regular expressions" style of text processing is generally not well-suited to adequately parsing and processing XML. Python, fortunately (and more so than most other languages), has both straightforward ways of dealing with complex data structures (usually with classes and attributes), and a range of XML-related modules to aid in parsing, processing, and generating XML.

Much of the effort of maintaining a range of XML tools for Python is performed by members of the XML-SIG. As with other Python Special Interest Groups, the XML-SIG maintains a mailing list, list archive, helpful references, documentation, a standard packaging, and other resources. Starting with Python 2.0, Python includes most of the XML-SIG project in its standard distribution. Some "bleeding-edge" features might be contained in the latest XML-SIG package that are not in a standard Python distribution.
But for the vast majority of purposes--including the discussion in this article--the XML support in Python 2.0 will be what you are interested in.

Fortunately, Python 2.0+ has advanced quite a way past the rudimentary support provided by [xmllib] in earlier Python versions. Nowadays, Python users have a healthy choice of 'DOM', 'SAX' and 'expat' techniques for handling XML (all of these will be recognized by XML developers who have used other programming languages).

MODULE: XMLLIB
------------------------------------------------------------------------

[xmllib] is a non-validating and low-level parser. To use [xmllib], the application programmer subclasses XMLParser and provides methods to handle document elements, such as specific or generic tags, or character entities. The use of [xmllib] is unchanged in Python 2.0+ from that in Python 1.5.x; in most cases you will be better off with a SAX technique, which is also stream-oriented, but is more standard across languages and developers.

The examples in this article will be the same files used in the original column: a DTD called 'quotations.dtd' and a document called 'sample.xml' that conforms to this DTD (see Resources for an archive of files mentioned in this article). The code below will display the first few lines of each quotation in 'sample.xml', and produce very simple ASCII indicators of unknown tags and entities. The parsed text is handled as a sequential stream, and any accumulators used are the programmer's responsibility (such as the string of characters (#PCDATA) within a tag, or a list/dictionary of tags encountered).

    #--------------- File: try_xmllib.py -------------------#
    import xmllib, string

    class QuotationParser(xmllib.XMLParser):
        """Crude xmllib extractor for quotations.dtd document"""

        def __init__(self):
            xmllib.XMLParser.__init__(self)
            self.thisquote = ''             # quotation accumulator

        def handle_data(self, data):
            self.thisquote = self.thisquote + data

        def syntax_error(self, message):
            pass

        def start_quotations(self, attrs):  # top level tag
            print '--- Begin Document ---'

        def start_quotation(self, attrs):
            print 'QUOTATION:'

        def end_quotation(self):
            print string.join(string.split(self.thisquote[:230]))+'...',
            print '('+str(len(self.thisquote))+' bytes)\n'
            self.thisquote = ''

        def unknown_starttag(self, tag, attrs):
            self.thisquote = self.thisquote + '{'

        def unknown_endtag(self, tag):
            self.thisquote = self.thisquote + '}'

        def unknown_charref(self, ref):
            self.thisquote = self.thisquote + '?'

        def unknown_entityref(self, ref):
            self.thisquote = self.thisquote + '#'

    if __name__ == '__main__':
        parser = QuotationParser()
        for c in open("sample.xml").read():
            parser.feed(c)
        parser.close()

VALIDATION
------------------------------------------------------------------------

One reason you might want to look beyond the standard XML support is if you need to perform validation along with your parsing. Unfortunately, the standard Python 2.0 XML package does not contain a validating parser.

[xmlproc] is a Python-native parser, which performs nearly complete validation. If you need a validating parser, [xmlproc] is currently your only choice in Python. As well, [xmlproc] provides a variety of high-level and experimental interfaces that other parsers do not.

CHOOSING A PARSER
------------------------------------------------------------------------

If you decide to use the Simple API for XML (SAX)--which you should for anything sophisticated, since most other tools are built on top of it--much of the work of sorting through parsers can be done for you.
The module [xml.sax] contains a facility for automatically selecting the "best" parser. With a standard Python 2.0 installation, the only parser to choose from is [expat], which is a speedy extension, written in C. However, it is possible to install another parser into '$PYTHONLIB/xml/parsers' and have it available for selection. Setting up a parser is a simple matter:

    #------- Python lines for selecting best parser --------#
    import xml.sax
    parser = xml.sax.make_parser()

You may also select a specific parser by passing an argument in; but for portability--and also for upward compatibility with an even better parser yet to come--it is probably best to let 'make_parser()' do the work for you.

It is possible to import [xml.parsers.expat] directly. If you do this, you get a few special techniques that the SAX interface does not provide. In this sense, [xml.parsers.expat] is a bit "lower level" than SAX. But the SAX techniques are quite standard, and quite good for stream-oriented processing; much of the time SAX is just the right level to work with. The raw speed differences are likely to be minimal, since the 'make_parser()' function already manages to get the performance 'expat' offers for general cases.

WHAT IS SAX?
------------------------------------------------------------------------

By way of background, just what is SAX? A good answer is:

    "SAX (Simple API for XML) is a common parser interface for XML parsers. It allows application writers to write applications that use XML parsers, but are independent of which parser is actually used. (Think of it as JDBC for XML.)" (Lars Marius Garshol, SAX for Python, see Resources)

SAX--like the parser modules it provides an API for--is essentially a sequential processor of an XML document. You use it in a manner largely similar to the [xmllib] example, but with a somewhat higher level of abstraction.
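
Before moving on to the handler classes, here is a rough sketch of the lower-level alternative mentioned under CHOOSING A PARSER: driving [xml.parsers.expat] directly. It is only an illustration, written to run on a present-day Python 3 interpreter rather than the Python 2.0 of the other listings; the function name 'count_elements' and the sample document are made up for the example.

```python
#------- Sketch: using xml.parsers.expat directly -------#
import xml.parsers.expat

def count_elements(xml_bytes):
    """Return a dict mapping each tag name to its number of occurrences.

    'count_elements' is a hypothetical helper for this sketch only.
    """
    counts = {}

    def start_element(name, attrs):
        # expat calls this once for every opening tag it encounters
        counts[name] = counts.get(name, 0) + 1

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start_element
    parser.Parse(xml_bytes, True)       # True: this is the final chunk
    return counts

if __name__ == '__main__':
    # Invented stand-in for a document like 'sample.xml'
    sample = (b'<quotations><quotation>One</quotation>'
              b'<quotation>Two</quotation></quotations>')
    print(count_elements(sample))
```

Notice that the callbacks here are plain functions attached as attributes of the parser object itself; SAX organizes the same kind of callbacks differently.
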
Instead of defining a parser class, an application programmer defines a 'handler' class that is registered with whatever parser is used. SAX defines four handler interfaces (each with several methods): ContentHandler (called DocumentHandler in SAX 1), DTDHandler, EntityResolver and ErrorHandler. Creating a parser attaches a default for each of these unless you override it. Here is some code that performs the same task as the [xmllib] example.

    #----------------- File: try_sax.py --------------------#
    "Simple SAX example, updated for Python 2.0+"
    import string
    import xml.sax
    from xml.sax.handler import *

    class QuotationHandler(ContentHandler):
        """Crude extractor for quotations.dtd compliant XML document"""

        def __init__(self):
            self.in_quote = 0
            self.thisquote = ''

        def startDocument(self):
            print '--- Begin Document ---'

        def startElement(self, name, attrs):
            if name == 'quotation':
                print 'QUOTATION:'
                self.in_quote = 1
            else:
                self.thisquote = self.thisquote + '{'

        def endElement(self, name):
            if name == 'quotation':
                print string.join(string.split(self.thisquote[:230]))+'...',
                print '('+str(len(self.thisquote))+' bytes)\n'
                self.thisquote = ''
                self.in_quote = 0
            else:
                self.thisquote = self.thisquote + '}'

        def characters(self, ch):
            if self.in_quote:
                self.thisquote = self.thisquote + ch

    if __name__ == '__main__':
        parser = xml.sax.make_parser()
        handler = QuotationHandler()
        parser.setContentHandler(handler)
        parser.parse("sample.xml")

Two small things to notice about the example in contrast to [xmllib] are: the '.parse()' method handles a whole stream, so there is no need to create a loop to feed the parser; and '.parse()' is also flexible enough to accept a filename, a file object, or almost any file-like object (something that has a '.read()' method).

PACKAGE: DOM
------------------------------------------------------------------------

DOM is a very-high-level tree-based representation of an XML document. The model is not specific to Python, but is a common XML model (see Resources for further information).
Python's DOM package is built upon SAX, and is included in Python 2.0's standard XML support. Length constraints prevent extended code samples in this article, but an excellent general description is given in the XML-SIG's "Python/XML HOWTO":

    The Document Object Model specifies a tree-based representation for an XML document. A top-level Document instance is the root of the tree, and has a single child which is the top-level Element instance; this Element has children nodes representing the content and any sub-elements, which may have further children, and so forth. Functions are defined which let you traverse the resulting tree any way you like, access element and attribute values, insert and delete nodes, and convert the tree back into XML.

    The DOM is useful for modifying XML documents, because you can create a DOM tree, modify it by adding new nodes and moving subtrees around, and then produce a new XML document as output. You can also construct a DOM tree yourself, and convert it to XML; this is often a more flexible way of producing XML output than simply writing ... to a file.

The syntax of using the module [xml.dom] has changed a bit since my earlier columns. The implementation of DOM that comes with Python 2.0 is called [xml.dom.minidom], and provides a lightweight and small-footprint version of DOM. Naturally, a few experimental features of the XML-SIG's full DOM are left out of [xml.dom.minidom], but nothing most people will notice.

Generating a DOM object is simple; just use:

    #------ Create a Python DOM object from an XML file -----#
    from xml.dom.minidom import parse, parseString
    dom1 = parse('mydata.xml')    # parse an XML file by name

Working with a DOM object is a fairly straightforward OOP-style affair. However, one tends to encounter a lot of list-like attributes in the hierarchy, which are not immediately easy to distinguish (except by enumeration in loops).
For example, this is a typical snippet of DOM Python code:

    #------- Iterate through a Python DOM node object -------#
    spam_node_list = []
    for node in dom_node.childNodes:
        if node.nodeName == '#text':     # PCDATA is a kind of node,
            PCDATA = node.nodeValue      # but not a new subtag
        elif node.nodeName == 'spam':
            spam_node_list.append(node)  # Create list of nodes

The Python standard documentation contains some more detailed DOM examples. The earlier column's examples of working with DOM objects still point in the right direction, but some method and attribute names have changed since then, so take a look at the Python documentation.

MODULE: PYXIE
------------------------------------------------------------------------

The [pyxie] module is built on top of Python's standard XML support, and provides additional high-level interfaces to an XML document. [pyxie] does two basic things: it transforms XML documents to a more easily parsed line-oriented format; and it provides methods to treat an XML document as a walkable tree.

The line-oriented PYX format used by [pyxie] is language-independent, and tools are available for several languages. In general, a PYX representation of a document is much easier to process using familiar line-oriented text-processing tools like grep, sed, awk, bash, perl--or standard Python modules like [string] and [re]--than is its XML representation. Depending on what is downstream, a transformation from XML to PYX might save a lot of work.

[pyxie]'s concept of treating an XML document like a tree is similar to the ideas in DOM. Since the DOM standard is gaining widespread support across a number of programming languages, it will probably make sense for most programmers to focus on that standard rather than on [pyxie] if tree-representation of XML documents is a requirement.
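
To give a feel for what "line-oriented" means here, the sketch below emits PYX-style lines from SAX events. It does not use [pyxie] itself, and it only approximates the notation as commonly described: '(' opens a tag, 'A' introduces an attribute, '-' carries character data, and ')' closes a tag. The names 'PYXHandler' and 'xml_to_pyx' are invented for the illustration, and the code is written to run on a present-day Python 3 interpreter rather than the Python 2.0 of the other listings.

```python
#------- Sketch: PYX-style lines from SAX events --------#
import xml.sax
from xml.sax.handler import ContentHandler

class PYXHandler(ContentHandler):
    """Collect SAX events as PYX-style lines.

    Hypothetical helper for this sketch; not part of the pyxie module.
    """
    def __init__(self):
        ContentHandler.__init__(self)
        self.lines = []

    def startElement(self, name, attrs):
        self.lines.append('(' + name)
        for key in sorted(attrs.keys()):
            self.lines.append('A' + key + ' ' + attrs[key])

    def endElement(self, name):
        self.lines.append(')' + name)

    def characters(self, content):
        # One event per line, so backslashes and newlines in data are escaped
        escaped = content.replace('\\', '\\\\').replace('\n', '\\n')
        self.lines.append('-' + escaped)

def xml_to_pyx(xml_bytes):
    """Parse a bytes string of XML and return a list of PYX-style lines."""
    handler = PYXHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler.lines

if __name__ == '__main__':
    # Invented sample; the real 'sample.xml' follows quotations.dtd
    doc = b'<quotation source="Kant"><b>Sapere</b> aude</quotation>'
    for line in xml_to_pyx(doc):
        print(line)
```

Because every event sits on its own line with a one-character prefix, a tool like grep can pick out all the opening tags with a pattern as simple as '^(' and never needs to understand XML syntax.
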

MORE MODULES: [xml_pickle] AND [xml_objectify]
------------------------------------------------------------------------

I have produced my own set of high-level modules for dealing with XML, called [xml_pickle] and [xml_objectify]. I have written enough about these elsewhere (see Resources) that there is no need to go into much detail here. But these modules are often very useful when you want to "think in Python" rather than "think in XML." [xml_objectify] especially hides almost all the traces of XML itself from a Python programmer, and lets her work with perfectly "native" Python objects within a program. The actual XML data format that underlies things is abstracted almost to the point of invisibility. Likewise, [xml_pickle] lets a Python programmer start out with "native" Python objects whose data comes from any source, and dump (serialize) them into an XML format that other users might want downstream.

RESOURCES
------------------------------------------------------------------------

The best place to start for detailed documentation of Python 2.0+'s modules for handling XML is below.
Take a look for all the packages whose namespace begins with 'xml':

    http://python.org/doc/current/lib/markup.html

The Python Special Interest Group on XML:

    http://www.python.org/sigs/xml-sig/

Other Python Special Interest Groups:

    http://www.python.org/sigs/

The Vaults of Parnassus (Python code/tool repository) XML page:

    http://www.vex.net/parnassus/apyllo.py?i=2678626

Pyxie Home Page:

    http://www.pyxie.org

An updated discussion of [xml_pickle] and [xml_objectify] can be found in _XML Matters #11: Lessons in Open Source and Common Sense_ :

    http://gnosis.cx/publish/programming/xml_matters_11.html

Files used and mentioned in this article:

    http://gnosis.cx/download/charming_python_1r.zip

ABOUT THE AUTHOR
------------------------------------------------------------------------

{Picture of Author: http://gnosis.cx/cgi-bin/img_dqm.cgi}

David, feeling that a foolish consistency is the hobgoblin of little minds, strives for it in all his writing. David may be reached at mertz@gnosis.cx; his life pored over at http://gnosis.cx/publish/.