David Mertz, Ph.D.
Transformer, Gnosis Software, Inc.
XML is a simplified dialect of the Standard Generalized Markup Language (SGML). Many readers will be most familiar with SGML via one particular document type, HTML. XML documents are similar to HTML in being composed of text interspersed with and structured by markup tags in angle-brackets. But XML encompasses many systems of tags that allow XML documents to be used for many purposes: magazine articles and user documentation, files of structured data (like CSV or EDI files), messages for interprocess communication between programs, architectural diagrams (like CAD formats), and many other purposes. A set of tags can be created to capture any sort of structured information one might want to represent, which is why XML is growing in popularity as a common standard for representing diverse information.
Python is a freely available, very-high-level, interpreted language developed by Guido van Rossum. It combines a clear syntax with powerful (but optional) object-oriented semantics. Python is available for almost every computer platform you might find yourself working on, and has strong portability between platforms.
XML Matters #1 introduced the author's project for
creating more seamless and natural integration between XML and
Python. Modules and packages such as
xml.dom provide ways of handling XML documents
that are common in the XML community. Tools such as these are
extremely similar to corresponding modules and libraries for
other programming languages. In fact, the organization of many
such modules is a direct result of standards created by
(language-neutral) XML standards bodies. Consequently, one
thing all the abovementioned modules have in common is that
they implement very XML-oriented ways of thinking about
documents and objects; in many cases this conceptual framework
feels like it is tacked on to Python, rather than an integral
part of it.
Using Python implementations of general XML protocols has many uses. Standards like DOM are easily portable between programming languages; and programmers of one language can easily pick up and work with DOM-oriented code written in another language. However, there are also times when a Python programmer prefers to code in ways that feel much more like "normal" Python. The Project discussed in these columns is simply to provide a set of "Pythonic" modules for working with XML documents.
As a result of those asymmetries that exist between XML and
Python, the Project--at least initially--contains two separate
modules: one for representing arbitrary Python objects in XML
xml_pickle); a second one (
xml_objectify) for "native"
representation of XML documents as Python objects. This
article will address the latter module.
-- SIDEBAR: --------------------------------------------
The XML-SIG distribution is changed fairly frequently while it is in beta versions. The changes in turn are extremely likely to affect the functioning of
xml_objectify. Therefore, an XML-SIG version known to be compatible with
xml_objectifymay be found at:
The first URL is the Windows self-installer, the latter is simply an archive of those files to be unpacked under $PYTHONPATH/xml.
Whenever the XML-SIG distribution reaches a release version and/or when the XML package is a part of an official Python release, the current
xml_objectifyshould be updated to work with the official release. The most current
xml_objectifyshould always be available at:
The usage of
xml_objectify is extremely simple, and is well
documented in module docstring comments. Let us take a quick
from xml_objectify import XML_Objectify xml_obj = XML_Objectify('address.xml') py_obj = xml_obj.make_instance()
There are two steps involved in creating a "native" Python
object from a generic XML document (use
handling of special
PyObjects.DTD format documents). First
you want to create an intermediate DOM-like factory object.
Second, you may generate (one or more) Python object instances
XML_Objectify instance. There is no reason you
could not do both steps on the same line, as:
py_obj = XML_Objectify('address.xml').make_instance()
Of course, in this latter case, the factory object does not
stay around to produce more "native" objects, and
._dom data member (which contains a full
DOM instance) is cleared.
For comparison, creation of a DOM object need be no more difficult in Python:
from xml.dom.utils import FileReader dom_obj = FileReader().readXml(open('address.xml'))
FileReader().readXml() requires an actual file object, while
XML_Objectify() may accept either a file object or a plain
filename, but in either case, the object creation is a two line
The difference between using the
xml_objectify module and the
xml.dom package is in the type of object one winds up with.
A Python DOM object is a genuine Python object, but its
attributes and methods do not correspond to the data and
structure of the original XML document in as straightforward a
way as with the
XML_Objectify object. For example, to access
the same XML attribute in the sample document, you have a
print py_obj.person.address.city print dom_obj.get_childNodes().get_childNodes().\ get_childNodes().get_attributes()['city'].value print dom_obj._node.children.children.children.\ attributes['city'].children.value
The basic organization of a DOM tree is a strict ordered tree
of nodes. It is not hard to enumerate over these nodes, but it
is quite cumbersome to refer to specific ones. Making matters
worse is that some nodes are whitespace text nodes and
processing instruction nodes--which you rarely care about--and
finding the subtags in the node list is mostly trial-and-error.
In the above example both access to the "native" attributes
.children) and the DOM-style methods (e.g.
.get_childNodes()) are used in different
On the other hand, the first
"addressbook", corresponding to the root element of the XML
document (and every attribute that might contain more than
simple text is an instance of a class named according to the
XML tag defining it).
-WHY NOT JUST USE
xml_objectify makes wide use of DOM internally. In fact,
XML_Objectify instance contains a
that is a DOM tree for the XML document opened (but the
instance created by
.make_instance does not contain any DOM,
and is the class type of the root tag). The problem with DOM
is that it is just to hard to use, and the syntax is too
obscure. The above examples illustrate this. Python "native"
objects are much easier to program with.
xml_objectify, you can take advantage of all your
existing generic functions. The function
included in the module is a sample generic function. This
function produces a "pretty" recursive representation of
any Python object. By representing your XML documents as
"native" Python documents, you can get a lot of reuse out of
existing functions that deal with Python objects in abstract
ways. Of course, a DOM object still is a Python object of
sorts, but as in the above usage example, its attributes are
mostly a bunch of nested
.children lists; and these are not
all that helpful semantically (try printing a DOM object with
the provided generic function).
-TRICKS WITH CLASS BEHAVIOR-
A subtle trick done by
xml_objectify is that it will only
dynamically define a class for an attribute value if that class
has not already been defined. What this accomplishes is that
it lets you define classes with complex behavior and attributes
in which to pour specific XML document contents. For example,
if the class
person is pre-defined with various methods
.__init__() method if needed), each "person" in
the XML addressbook imported in the above sample code will have
whatever behaviors it has been given, including methods that
operate on the data poured into the instance. Of course, if a
class is not pre-defined prior to 'XML_Objectify()'ing a
document, the class is just a container for the attributes
defined in the actual XML.
Some XML tags are block-level, but some are character-level.
The most natural Python representation--at least to the
author's mind--is different in the two cases. Block-level tags
are the norm; each block-level subtag is easily represented by
an attribute of the parent tag that is named after the subtag,
and the value of the subtag-attribute is a new Python object
also of a type named after the subtag. For example, a
<person> might have an
is nice to Pythonically refer to these as
However, when the contents of a tag are a mixture of some
text data and some markup of that data (often typographic in
nature), the subtags really are not something the parent tag
has. For example, a
misc_info object does not really
ital attributes in the above hierarchical way. So how
should we represent some XML like?
<misc-info>One of the <ital>most</ital> talented actresses on TV.</misc-info>
The approach of
xml_objectify is to add a special attribute
._XML to objects/tags that seem to contain marked-up
character data. This attribute contains the literal XML inside
a tag, if the programmer wants it. The
function, for example, will display this literal XML instead of
recursive attributes if the
._XML attribute exists for a
given nested object. However, the standard recursive
subtag-object creation is still performed, so the programming
requirement can look at whatever attributes and structures are
-NATIVE PYTHON OBJECTS CONTAIN ROOT DOCUMENT ONLY-
Many XML documents contain processing instructions and/or
comments in addition to their tags and character data contents.
However, the Python "native" object created by the
.make_instance() method of an
XML_Objectify object contains
only the contents of the document root tag. Furthermore, XML
comments are ignored; only tag attributes and character data is
If you keep around the original
xml_obj in the first example above), you can access its
.processing_instruction attribute, or even its
attribute to look at what was left out of the "native" Python
-ATTRIBUTE TYPE SIMPLIFICATION-
All XML attributes are converted to Python object attributes of
string type. No effort is currently made to represent XML
enumerated types, or even to represent numeric types for
attributes. Such capabilities might be added to later
versions, but these would generally require the presence of a
xml_objectify does not assume.
-SUBTAGS ATTRIBUTES ARE EITHER LISTS OR INSTANCES-
XML subtags are represented by either Python attributes of
object type or by lists of such objects (depending on whether
there are one or several such subtags of the same type). The
decision whether to use a list of objects as an attribute value
is decided simply by whether a particular tag contains
multiple subtags of the same type. For example, in the
address.xml sample, some person's contact-info
includes one home-phone, some includes zero, and some includes
several. Corresponding to this, some
will have no
.home_phone attribute at all, some will have a
.home_phone attribute containing a
home_phone object, and
some will have a
.home_phone attribute containing a list of
home_phone objects. It would be possible to impose more
order if a DTD was used, but the author believes this kind of
dynamism is appropriate to most types of Python programming.
-PYTHON NAMESPACE RESTRICTIONS-
The Python namespace is smaller than the XML namespace.
Therefore, sometimes XML names (of either tags or attributes)
have to be modified. The specific transformation made is
changing dashes, colons, and the pound/hash mark, into
underscores. Further namespace collision is not avoided. For
example, if your XML document has tags,
<spam:eggs> and <spam#eggs>,
will create Python objects that do not correctly represent your
XML document. Or maybe the module will outright crash, or
maybe it will break your data and fry your machine. Probably
not the latter problems, but the current version of
xml_objectify simply does not take this namespace collision
into account. In real-life, it will rarely create any
problems, since you probably do not have XML documents with
those conflicting tags.
-NO EXPORT BACK TO XML-
Initially, the author considered including a capability for
converting Python "native" objects back to XML documents with
the same structure as those read-in. That goal was not
included in the current version because there are many
implementation issues that are not easy to resolve. Basically,
xml_objectify deliberately throws out some information in XML
documents in order to produce far friendlier Python objects.
However, once information is gone, the best you can do is guess
at what it was. One principle type of information that is lost
is about order. Python attributes do not have any
predetermined order among them, but XML tags and attributes
might be required to occur in specific sequence. Or even where
XML tags are not required to occur in specific order, the
order might be semantically important (in the case of repeated
common subtags, Python lists maintain order). In order to
convert back to XML, we would either need to choose arbitrary
orders, or somehow tuck away order information within the
"native" Python object (which might make it feel less
One option that would make it possible to go much of the way
towards reconstructing dropped information in the Python
objects produced by
.make_instance() would be to enforce a
DTD in writing back to XML. Even if this additional work was
performed, questions would still exist about how to handle
attributes added, deleted, or modified at Python runtime
(modifying a Python object could produce something that was not
conformant to the DTD of the original XML document). However,
any capability added will be added to later version of
xml_objectify (and probably only if a specific need arises
Charming Python #1: An Introduction to XML Tools for Python
Charming Python #2: A Closer Look at Python's
XML Matters #1: About
The Python Special Interest Group on XML:
The World Wide Web Consortium's DOM page.
The DOM Level 1 Recommendation.
Files used and mentioned in this article:
David Mertz cannot fool all of the people all of the time, but sometimes he wishes he could. David may be reached at [email protected]; his life pored over athttp://gnosis.cx/publish/. Suggestions and recommendations on this, past, or future, columns are welcomed.