XML Programming Paradigms (Part One)

Object Oriented Programming with the Document Object Model

David Mertz, Ph.D. <[email protected]>
Gnosis Software, Inc. <http://gnosis.cx/publish/>
November 2001

Several very distinct conceptual models exist for manipulating XML documents. This series will look at each model. The Document Object Model is a World Wide Web Consortium Recommendation for an object oriented approach to XML programming. This article gives a primer on how DOM represents XML trees conceptually. Source code examples of using DOM are provided, along with pointers to further resources.

About This Series

As XML has developed into a widely used data format, a number of programming models have arisen for manipulating XML documents. Some of these models--or "paradigms"--have been enshrined as standards, while others remain only informally specified (but equally widely used nonetheless). In a general way, the several models available for manipulating XML documents closely mirror the underlying approaches and techniques that programmers in different traditions bring to the task of working with XML. It is worth noticing that "models" are at a higher level of abstraction than a particular programming language; most of the models discussed in this series are associated with APIs that have been implemented in multiple programming languages.

In part, the richness of available XML programming models simply allows programmers and projects to work in the ways that are most comfortable and familiar to them. In many ways, there is overlap--at least in achievable outcomes--between all the XML programming models. However, different models also carry with them specific pros and cons in the context of XML manipulation, and these might urge the use of particular models for particular projects. This series of five articles aims to provide readers with an overview of the costs, benefits, and motivations for all of the major approaches to programmatic manipulation of XML documents ("manipulation" here should be understood also to mean "using XML to drive or communicate other application processes").

The current article, Part 1, discusses the Document Object Model (DOM), which is a W3C Recommendation. DOM is an Object Oriented Programming (OOP) approach to the manipulation of XML documents. Part 2 of this series will address the Simple API for XML (SAX), which is an event-driven and procedural style of XML programming. Part 3 will look at XSLT, which brings a declarative programming style to transformations of XML documents. In Part 4, we will see the application of full-fledged Functional Programming (FP) techniques to XML manipulation--these in some ways unify the earlier models (but are less commonly used). The final installment, Part 5, will look briefly at a number of tools and techniques that did not quite fit into the previous discussion, but that readers would do well to be aware of.

DOM's Conceptual Framework

DOM represents an XML document as a Node, which contains, hierarchically, zero or more "child" Node's. What is OOP about a particular DOM tree is its containment tree, not any inheritance tree (although there is also an inheritance structure to the specialized Node types). In fact, every such descendant Node has pretty much the same structure as the Document itself. Conceptually, every Node is a fairly ordinary in-memory object from the point of view of an object-oriented programming language. A Node is just an object which has a few API-specified attributes/data-members, and a few methods/member-functions (the terminology varies between languages).

There are two wrinkles to the system to address immediately. First, there are actually two sorts of "children" that a Node might have. Some types of Node's have children, while others do not. One type of child is an ELEMENT_NODE's subelements; the other is its XML attributes. Both types of children are specialized Node's themselves. Subelements are found in the DOM Node data-member childNodes; XML attributes are found in the data-member attributes.

This brings us to the second wrinkle. As well as Node objects, DOM has two types of collection objects. Ordered collections of Node objects live in a NodeList object. Unordered collections of named Node objects live in a NamedNodeMap. The value of the childNodes data-member, for example, is a NodeList (several Node methods also return a NodeList). The value of the attributes data-member, in contrast, is a NamedNodeMap.
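As a minimal sketch of the two collection types, the following uses Python's standard xml.dom.minidom. The inline document and the variable names are only illustrative, and the code assumes a current Python 3 interpreter rather than the one used for the sessions later in this article.

```python
from xml.dom import minidom

# Illustrative document, parsed from a string instead of a file
doc = minidom.parseString(
    '<Spam flavor="pork" size="8oz"><Eggs/><MoreSpam/></Spam>'
)
root = doc.documentElement

# childNodes is an ordered NodeList: members are reached by position
children = root.childNodes
child_names = [children.item(i).nodeName for i in range(children.length)]

# attributes is a NamedNodeMap: members are reached by name
attrs = root.attributes
flavor = attrs.getNamedItem("flavor").nodeValue

print(type(children).__name__, child_names)
print(type(attrs).__name__, flavor)
```

Positional access suits childNodes because document order matters there; lookup by name suits attributes because XML attributes have no defined order.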

Node collections have a nice property: they are "live" connections to the underlying DOM Document. A list or map is not just a snapshot of an XML document at a moment in time, but is the actual collection of Node objects that fulfill some property (like living under a particular parent). The W3C DOM FAQ emphasizes this:

NodeList, although it resembles an array or vector (it has a length attribute, and you can access the members of the list via an integer index), is not an array. Think of it instead as another way of looking at the DOM's document tree. If that tree changes--if something inserts or appends or removes Nodes--the NodeList will be automatically adjusted at the same time. The result is that a NodeList is always an accurate representation of the getChildNodes or getElementsByTagName results as if you had just issued that call, so there is no need to refresh the NodeList to pick up changes to the underlying document.

The overall picture here is that a DOM object represents an entire XML document, but it is possible to provide "handles" or "pointers" or "proxies" into particular parts or samples of the structure for convenience.
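The "live" behavior can be illustrated with a short sketch, again assuming Python 3 and xml.dom.minidom. It shows only the childNodes collection; how strictly other collections (such as the result of getElementsByTagName) follow the "live" rule varies between implementations, so treat this as an illustration and not a guarantee.

```python
from xml.dom import minidom

doc = minidom.parseString("<Spam><Eggs/></Spam>")
root = doc.documentElement

# Fetch the NodeList once and keep a reference to it
children = root.childNodes
length_before = children.length

# Change the tree without fetching childNodes again
root.appendChild(doc.createElement("MoreSpam"))
length_after = children.length

print(length_before, length_after)
```

The second length reflects the appended element even though the children variable was never reassigned.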

Types Of Nodes

Nodes are lower level and more numerous than one might expect at first. Basically, everything is a Node, but they come in different types. In terms of OOP inheritance, all the specific types descend from the abstract class Node. Each concrete node is one of the following types: ELEMENT_NODE, ATTRIBUTE_NODE, TEXT_NODE, CDATA_SECTION_NODE, ENTITY_NODE, PROCESSING_INSTRUCTION_NODE, COMMENT_NODE, DOCUMENT_NODE, DOCUMENT_TYPE_NODE, NOTATION_NODE. These types are fairly self-explanatory, and the uppercase names indicated are usually defined as constants in a particular DOM implementation (import syntax and name qualification will depend on the programming language).

The top Node of any DOM representation of an XML document is a DOCUMENT_NODE. This Document node will contain zero or more PROCESSING_INSTRUCTION_NODE's and COMMENT_NODE's, and exactly one ELEMENT_NODE. The one ELEMENT_NODE is the root element of the XML document. The root element, in turn, may contain nodes of the various other types--and perhaps children of the root will contain nodes of the various types.
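To make the node types concrete, here is a small sketch that reports the type of each top-level node in a document. It assumes Python 3 and xml.dom.minidom; the TYPE_NAMES lookup table is a helper invented for this illustration and is not part of DOM.

```python
from xml.dom import Node, minidom

# Hypothetical helper: map numeric nodeType values back to constant names
TYPE_NAMES = {
    getattr(Node, name): name for name in dir(Node) if name.endswith("_NODE")
}

doc = minidom.parseString(
    '<?xml version="1.0"?>'
    "<!-- Document Comment -->"
    '<?xml-stylesheet href="test.css" type="text/css"?>'
    '<Spam flavor="pork">Ode to Spam</Spam>'
)

top_type = TYPE_NAMES[doc.nodeType]
child_types = [TYPE_NAMES[node.nodeType] for node in doc.childNodes]
root_child_types = [
    TYPE_NAMES[node.nodeType] for node in doc.documentElement.childNodes
]

print(top_type)
print(child_types)
print(root_child_types)
```

Whether a comment shows up among the Document's children depends on the parser; a current minidom keeps it, which is why COMMENT_NODE appears in the list here.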

TEXT_NODE's in particular are a little surprising in their behavior. On the one hand, one might expect the text of an element to simply be a string-type data-member of an ELEMENT_NODE. In some ways that would be more convenient. But instead, the textual content inside an XML element is contained in its own TEXT_NODE. And being a type of Node, a TEXT_NODE is not itself a string, but is instead a DOM object whose attribute nodeValue has a string in it. Its nodeName, for what it's worth, contains the string #text.

But the thing that is most surprising about TEXT_NODE's is that they are not guaranteed to contain all the text inside a particular element, even if there is nothing other than text in the element. A TEXT_NODE is simply promised to contain some text--the entire textual content of an ELEMENT_NODE might be split between any number of TEXT_NODE's. Fortunately, the DOM method normalize() can be used to transform a DOM tree, and minimize the number of TEXT_NODE's contained (and more importantly, make each such node of maximal relevant length).
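A parser decides for itself how to split text, so the sketch below forces the situation by appending two adjacent text nodes by hand and then calling normalize(). It assumes Python 3 and xml.dom.minidom, and the element name is only illustrative.

```python
from xml.dom import minidom

doc = minidom.parseString("<Eggs/>")
eggs = doc.documentElement

# Deliberately split the element's text across two adjacent TEXT_NODE's
eggs.appendChild(doc.createTextNode("Some text "))
eggs.appendChild(doc.createTextNode("about eggs."))
count_before = eggs.childNodes.length

# normalize() merges adjacent text nodes throughout the subtree
eggs.normalize()
count_after = eggs.childNodes.length
merged_text = eggs.firstChild.nodeValue

print(count_before, count_after, repr(merged_text))
```

Code that reads only the first text child of an element is safe after normalize(), provided the element contains nothing but text.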

DOM At Work

If you are like me, it is hard to process the abstract descriptions of DOM without some concrete code to look at. To show readers what is going on, I will use some commands pasted from a Python interactive shell. DOM implementations exist in Java, C++, ECMAScript, Perl, Ruby, and many other programming languages. A nice thing about DOM is that the code usually looks almost the same across languages. But Python has two advantages for presentation: (1) it has an interactive shell to try variations on commands (what the user types is preceded by >>>); (2) its code is particularly concise, and resembles pseudo-code. In general, in this series I will make efforts to avoid using constructs which are idiomatic to one particular programming language (so Python programmers may find that there is a more "Pythonic" way of implementing my examples). In the code samples, those elements that are generic DOM (and common to all languages) are marked in red.

For this simple demonstration, I created the following trivial XML document:

test.xml

<?xml version="1.0"?>
<!DOCTYPE Spam SYSTEM "spam.dtd" >
<!-- Document Comment -->
<?xml-stylesheet href="test.css" type="text/css"?>
<Spam flavor="pork" size="8oz">
  <Eggs>Some text about eggs.</Eggs>
  <MoreSpam>Ode to Spam</MoreSpam>
</Spam>

Let's work with the XML document a bit:

Python interactive shell: DOMify XML

>>> from xml.dom import minidom
>>> dom = minidom.parse('test.xml')
>>> print dom.attributes
None
>>> for node in dom.childNodes: print node
...
<xml.dom.minidom.ProcessingInstruction instance at 0x938ec>
<DOM Element: Spam at 1644588>
>>> print dom.documentElement
<DOM Element: Spam at 1644588>
>>> dom.documentElement.isSameNode(dom.childNodes[0])
0
>>> dom.documentElement.isSameNode(dom.childNodes[1])
1

Our start shows off several DOM features. After the requisite imports, we generally create a DOM object by parsing an XML document using a parse() or parseString() method. The latter allows XML documents to come from a source other than a file. Neither of these methods is actually part of the W3C standard, but both are present in almost every implementation. We also might have created a new DOM object from scratch, but starting with an existing XML document is more common.
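Both routes to a DOM object can be sketched side by side. The following assumes Python 3 and xml.dom.minidom; getDOMImplementation() is minidom's way of reaching the W3C DOMImplementation interface, and the document content is only illustrative.

```python
from xml.dom import minidom

# Route 1: parse XML that comes from a string rather than a file
parsed = minidom.parseString(
    '<Spam flavor="pork"><Eggs>Some text about eggs.</Eggs></Spam>'
)

# Route 2: build a comparable document from scratch with factory methods
impl = minidom.getDOMImplementation()
built = impl.createDocument(None, "Spam", None)
root = built.documentElement
root.setAttribute("flavor", "pork")
eggs = built.createElement("Eggs")
eggs.appendChild(built.createTextNode("Some text about eggs."))
root.appendChild(eggs)

print(parsed.documentElement.toxml())
print(root.toxml())
```

The createDocument, createElement, createTextNode, setAttribute, and appendChild calls are standard DOM; parseString and toxml are conveniences of this particular implementation.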

When we check, we find that a Document node has no attributes; it does, however, have some childNodes. One of the childNodes is the same as the documentElement (as is demonstrated). In our example, another child of Document is a processing instruction. In this particular DOM implementation--and in a number of others--comment nodes are ignored by the parser (and do not appear in the childNodes). While this limitation is not usually a problem, since programmatic content should not be in comments, it is something to be aware of.

Let us start working with the root node of the document, which (along with its descendants) is what we are usually concerned with. Any descendant element will behave almost exactly the same way the root node does.

Python interactive shell: XML attributes

>>> Spam = dom.documentElement
>>> Spam.getAttribute('flavor')
u'pork'
>>> Spam.attributes
<xml.dom.minidom.NamedNodeMap instance at 0x141b0c>
>>> Spam.attributes.length
2
>>> i = 0
>>> while i < Spam.attributes.length:
...     item = Spam.attributes.item(i)
...     print item.name, item.nodeValue
...     i += 1
...
size 8oz
flavor pork

The attributes are interesting in their behavior. If you happen to know that an ELEMENT_NODE has a specific attribute, it is easy to get its value with a method call. But if you are not sure what XML attributes are present, you have to use a slightly more roundabout technique of first looking at the NamedNodeMap data-member length, then iterating through the attributes by index. Guessing at an XML attribute that might be present is not really reliable--this Python DOM implementation returns an empty string if the attribute does not exist. But this behavior is flawed, since the return value cannot distinguish the cases below (vis-a-vis the color attribute):

Absent versus empty XML attributes

<Spam color="" />
<Spam />
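A possible way around the ambiguity, sketched under the assumption of a current Python 3 xml.dom.minidom, is to ask whether the attribute exists before asking for its value. The hasAttribute() and getAttributeNode() methods are standard DOM calls; whether they were available in the implementation used for the session above is not something this sketch claims.

```python
from xml.dom import minidom

empty = minidom.parseString('<Spam color="" />').documentElement
absent = minidom.parseString("<Spam />").documentElement

# getAttribute() gives the same answer for both elements
same_value = empty.getAttribute("color") == absent.getAttribute("color") == ""

# hasAttribute() tells the two cases apart
empty_has = empty.hasAttribute("color")
absent_has = absent.hasAttribute("color")

# getAttributeNode() returns an Attr node, or None when nothing is there
absent_node = absent.getAttributeNode("color")

print(same_value, empty_has, absent_has, absent_node)
```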

To round out the brief examples, let us take a look at TEXT_NODE objects, which are usually our main interest in the end. First, let's see the non-normalized and normalized forms:

Python interactive shell: normalized TEXT_NODE's

>>> for node in Spam.childNodes: print node
...
<DOM Text node "\n">
<DOM Text node "  ">
<DOM Element: Eggs at 1510572>
<DOM Text node "\n">
<DOM Text node "  ">
<DOM Element: MoreSpam at 1513292>
<DOM Text node "\n">
>>> Spam.normalize()
>>> for node in Spam.childNodes: print node
...
<DOM Text node "\n  ">
<DOM Element: Eggs at 1510572>
<DOM Text node "\n  ">
<DOM Element: MoreSpam at 1513292>
<DOM Text node "\n">

To follow, let us see a typical usage of the contents in a TEXT_NODE:

Python interactive shell: extracting TEXT_NODE's

>>> firstEggs = Spam.getElementsByTagName('Eggs')[0]
>>> print firstEggs.childNodes[0].nodeValue
Some text about eggs.
>>> print firstEggs.childNodes[0].nodeName
#text
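Because an element's text may be spread over several TEXT_NODE's, reading only childNodes[0] can miss part of it. One defensive pattern is a small helper that joins every text child. The sketch below assumes Python 3 and xml.dom.minidom; get_text is a name made up for this illustration, not a DOM method.

```python
from xml.dom import Node, minidom


def get_text(element):
    """Concatenate the values of all TEXT_NODE children of `element`."""
    return "".join(
        child.nodeValue
        for child in element.childNodes
        if child.nodeType == Node.TEXT_NODE
    )


doc = minidom.parseString(
    "<Spam><Eggs>Some text about eggs.</Eggs>"
    "<MoreSpam>Ode to Spam</MoreSpam></Spam>"
)

# Collect the text of each subelement of the root, keyed by tag name
texts = {
    node.tagName: get_text(node)
    for node in doc.documentElement.childNodes
    if node.nodeType == Node.ELEMENT_NODE
}

print(texts)
```

The helper looks only at direct children, so text nested inside subelements is deliberately left out.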

The Good And The Bad Of DOM

The strong point of DOM is that it provides an OOP framework for manipulating XML documents that will be familiar to programmers in many object oriented languages. The method and attribute names suggest a particular affinity with Java, but the model is common to OOP thinking. Moreover, although this short article has not touched on it in any detail, DOM provides a strong set of Node methods for filtering and modifying nodes and collections.

The weak point of DOM is that it is very poorly suited for handling large XML documents. For the trivial case we have presented there is no issue, but DOM is extremely memory hungry. A DOM representation of an XML document is likely to be several times the size of the underlying document (each Node needs an object, and carries a variety of instance data). While DOM readers that operate incrementally upon file-based XML documents are in experimental stages, by far the rule is to read in an entire XML document at one time in order to create a DOM object. OOP programmers are often lured by the familiarity of DOM techniques, and consequently make decisions that use resources inefficiently. In some ways, the SAX approach that the next installment will address answers the limits of DOM (but introduces its own set of limitations).

Resources

The official word on everything DOM is at the World Wide Web Consortium's (W3C) website. Many links to further resources can be found there also:

http://www.w3.org/DOM/

One particularly useful source of hints at the W3C's DOM site is the DOM FAQ:

http://www.w3.org/DOM/faq.html

A good starting point for the conceptual framework of DOM is Jonathan Robie's "What is the Document Object Model?":

http://www.w3.org/TR/2000/WD-DOM-Level-1-20000929/introduction.html