XML Programming Paradigms (Part Five)

Miscellaneous special approaches to XML processing.


David Mertz, Ph.D. <[email protected]>
Gnosis Software, Inc. <http://gnosis.cx/publish/>
January 2002

This series looks at several distinct conceptual models for manipulating XML documents. This final article, however, serves as a reflection on and critique of programmers' assumption that XML manipulation needs any overarching conceptual model. Rather than settle on a particular paradigm for thinking about XML and provide APIs for it in many programming languages, some tools take the opposite approach. That is, these tools start with the paradigms--and, even more so, the idioms--of a particular programming language, then try to figure out the most natural way to manipulate XML within those idioms.

About This Series

As XML has developed into a widely used data format, a number of programming models have arisen for manipulating XML documents. Some of these models--or "paradigms"--have been enshrined as standards, while others remain only informally specified (but equally widely used nonetheless). In a general way, the several models available for manipulating XML documents closely mirror the underlying approaches and techniques that programmers in different traditions bring to the task of working with XML. It is worth noticing that "models" are at a higher level of abstraction than a particular programming language; most of the models discussed in this series are associated with APIs that have been implemented in multiple programming languages.

In part, the richness of available XML programming models simply allows programmers and projects to work in the ways that are most comfortable and familiar to them. In many ways there is overlap--at least in achievable outcomes--between all the XML programming models. However, different models also carry with them specific pros and cons in the context of XML manipulation, and these might urge the use of particular models for particular projects. This series of five articles aims to provide readers with an overview of the costs, benefits, and motivations for all of the major approaches to programmatic manipulation of XML documents (manipulation here should also be understood to mean "using XML to drive or communicate with other application processes").

The first article, Part 1, discussed the OOP-style Document Object Model (DOM), which is a W3C Recommendation. Part 2 discussed the Simple API for XML (SAX) and similar event-driven and procedural styles of XML programming. Part 3 covered eXtensible Stylesheet Language Transformations (XSLT). Part 4 addressed functional programming (FP) approaches to XML processing, with examples from the Haskell HaXml library. This final installment, Part 5, looks briefly at a number of tools and techniques that did not quite fit into the previous discussion. What these tools have in common is that their structure and usage have a lot more to do with the programming languages used to manipulate XML than with the inherent structure of XML documents.

When Standards Get in the Way

The techniques presented in the previous installments of this series have many conceptual differences between them. But at a meta-conceptual level, all the techniques share a surprisingly close affinity. That is, even though DOM is pure OOP, SAX is event-driven and procedural, and XSLT is pure declarative programming, developing the paradigm for each one followed the same path. For all of them, someone started with a question like "How should I think about XML documents, and about the transformations one can make upon them?" In each case the answer was a particular programming paradigm. Actually implementing the abstract API implied by the paradigm choice came later when libraries were written for numerous programming languages.

DOM, SAX, and XSLT each have their own sort of conceptual purity. Perhaps the developers of each were most familiar or most comfortable with the corresponding paradigm; or, more likely, these developers simply wanted to develop APIs that flowed from underlying conceptual models. The drawback of each of these techniques is that it needs to be shoehorned into particular programming languages. The APIs, and the prescribed styles of programming around them, only fit concrete programming languages in a rough way. The data structures of DOM, for example, can certainly be represented in many programming languages, but they are not really the most idiomatic way of handling data in any of them. XSLT, as was described in the previous installment, is its own complete--albeit special-purpose--language rather than an API. But inasmuch as XSLT transformations are often defined and executed inside other programs, they have a "bolted-on" feel (you need this whole other language living inside your main application language). SAX, just because it does the least, also creates the least "impedance mismatch" when called from various programming languages.

Some library developers have decided to approach XML manipulation from a different direction. Rather than start out worrying about what the XML format suggests as a paradigm, these developers begin by thinking about which native idioms of a given programming language most naturally express the contents of an XML document. A set of data structures that are built in, or at least ubiquitous in usage, is chosen to hold all the bits and pieces of XML, and the language's native flow constructs are chosen to describe the manipulation process. This alternative approach eschews paradigms and focuses on practicality and simplicity. Naturally, one drawback to this practical approach is that every library and technique is wholly unlike every other, so much less knowledge and practice transfers between different languages, libraries, and techniques. It is a drawback often worth living with.

One thing left out of the above summary is the subject of the fourth installment of this series. The Haskell library HaXml--as well as the other functional programming language libraries indicated in the Resources of that installment--have already taken the main steps towards a programming-language focus. In a sense, HaXml would fit well in this installment. However, given the special strengths of functional programming for XML processing--type safety, complex data types, higher-order functions, and the general stark contrast with imperative languages--FP deserved its own installment.

In the rest of this installment, we take a look at a handful of different "native-oriented" XML libraries in a half-dozen programming languages. Other libraries probably exist, but this gives readers a feel for the terrain.

A Line-Oriented XML

The PYX format is a line-oriented representation of XML documents that is derived from the SGML ESIS format. PYX is not itself XML, but it is able to represent all the information within an XML document in a manner that is easier to handle using familiar text processing tools. Moreover, PYX documents can themselves be transformed back into XML as needed. It is worth noting that PYX documents are approximately the same size as the corresponding XML versions (sometimes a little larger, sometimes a little smaller); so storage and transmission considerations do not significantly enter into the transformation between XML and PYX.

The PYX format is extremely simple to describe and understand. The first character on each line identifies the content-type of the line. Content does not directly span lines, although successive lines might contain the same content-type. In the case of tag attributes, the attribute name and value are simply separated by a space, without use of extra quotes. The prefix characters are:

    (  start-tag     
    )  end-tag      
    A  attribute    
    -  character data    
    ?  processing instruction 
    

The motivation for PYX is the wide usage, convenience, and familiarity of line-oriented text processing tools and techniques. The GNU textutils, for example, include tools like wc, tail, head, and uniq; other familiar text processing tools are grep, sed, awk, and, in a more sophisticated way, perl and other scripting languages. These types of tools generally both expect newline-delimited records and rely on regular expression patterns to identify parts of texts. Neither expectation is a good match for XML, as it happens.
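
To make the format concrete, the short sketch below shows roughly how PYX output can be produced from an XML document using Python's standard xml.sax module. This is only an illustration--the file name test.xml is an assumption, and the real xmln tool (written in C) handles details this toy version ignores.

Generating PYX with Python's xml.sax (illustrative sketch)

import sys
from xml.sax import parse
from xml.sax.handler import ContentHandler

class PYXHandler(ContentHandler):
    # Emit one PYX line per parse event, using the prefix characters above
    def startElement(self, name, attrs):
        sys.stdout.write('(%s\n' % name)
        for key in attrs.getNames():
            sys.stdout.write('A%s %s\n' % (key, attrs[key]))
    def endElement(self, name):
        sys.stdout.write(')%s\n' % name)
    def characters(self, content):
        # literal newlines are escaped so content never spans lines
        sys.stdout.write('-%s\n' % content.replace('\n', '\\n'))
    def processingInstruction(self, target, data):
        sys.stdout.write('?%s %s\n' % (target, data))

parse('test.xml', PYXHandler())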

Let us take a look at PYX in action. PYX libraries exist for several programming languages, but much of the time it is most useful simply to use the command-line tools xmln and xmlv. The first is a non-validating transformation tool; the second adds validation against a DTD.

XML and PYX versions of a document

[PYX]# cat test.xml
<?xml version="1.0"?>
<!DOCTYPE Spam SYSTEM "spam.dtd" >
<!-- Document Comment -->
<?xml-stylesheet href="test.css" type="text/css"?>
<Spam flavor="pork" size="8oz">
<Eggs>Some text about eggs.</Eggs>
<MoreSpam>Ode to Spam (spam="smoked-pork")</MoreSpam>
</Spam>

[PYX]# ./xmln test.xml
?xml-stylesheet href="test.css" type="text/css"
(Spam
Aflavor pork
Asize 8oz
-\n
-
(Eggs
-Some text about eggs.
)Eggs
-\n
-
(MoreSpam
-Ode to Spam (spam="smoked-pork")
)MoreSpam
-\n
)Spam

One should notice that the transformation loses the DOCTYPE declaration and the comment in the original XML document. For many purposes, this is not important (parsers often discard this information also). The PYX format, in contrast to the XML format, allows one to easily pose a variety of ad hoc questions about a document. For example: what are all the attribute values in the sample document? Using PYX, we can simply ask:

An ad hoc query using PYX format

[PYX]# ./xmln test.xml | grep "^A" | awk '{print $2}'
pork
8oz

Getting this answer out of the original XML is a huge challenge. Either one needs to create a whole program that calls a parser and looks for tag attribute dictionaries, or one needs to come up with quite a complex regular expression to find the information of interest (left as an exercise for readers). Complicating matters further, the content of the <MoreSpam> element contains something that looks a lot like a tag attribute, but is not.
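
For comparison, here is roughly what that "whole program" approach looks like: a small Python sketch using the standard xml.sax module (my own illustration, independent of the PYX tools). Because attribute values only show up in start-tag events, the decoy text inside <MoreSpam> is never even examined.

A parser-based version of the same ad hoc query (illustrative sketch)

from xml.sax import parse
from xml.sax.handler import ContentHandler

class AttributeValues(ContentHandler):
    # Print the value of every attribute on every start tag
    def startElement(self, name, attrs):
        for key in attrs.getNames():
            print attrs[key]

parse('test.xml', AttributeValues())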

Sean McGrath's article (see Resources) has additional similar examples.

Going Native

The API methods of DOM give one access to a certain data structure that represents an XML document. The problem is that this data structure bears very little resemblance to the built-in data types of programming languages. A number of libraries have made the move to "native" representations of XML documents. But in my opinion, none has done it as thoroughly as my own Python xml_objectify module.

When xml_objectify is used to read in an XML document, what one gets is a very simple Python object whose object attributes correspond to the subelements and attributes of the root document element. The only difference between subelements and tag attributes is whether they contain further objects or plain text. Testing the type of thing a (Python) attribute contains is sufficient for determining whether it started out as a subelement or an XML attribute (but in Python data terms, the difference is not really that important--it usually reflects a fairly arbitrary XML design choice).

The earlier articles listed in the Resources go into the design and limitations of xml_objectify, but an example of its usage illustrates its strength. Remember first how one might use DOM to look at data in an XML document:

Python DOM access to XML data structures

from xml.dom import minidom
dom = minidom.parse('test.xml')
print dom
print 'flavor='+dom.childNodes[1]._attrs['flavor'].nodeValue
print 'PCDATA='+dom.childNodes[1].childNodes[5].childNodes[0].nodeValue

This produces something like:

Python DOM output from data structures

% python dom.py
<xml.dom.minidom.Document instance at 0x8aa0c>
flavor=pork
PCDATA=Ode to Spam (spam="smoked-pork")

In contrast, xml_objectify lets a user refer to the XML document data in much more intuitive, and Pythonic, ways:

xml_objectify access to XML data structures

from xml_objectify import XML_Objectify
py_obj = XML_Objectify('test.xml').make_instance()
print py_obj
print type(py_obj.flavor), 'flavor=' + py_obj.flavor
print type(py_obj.MoreSpam), 'PCDATA=' + py_obj.MoreSpam.PCDATA

The result is similar to the DOM version:

xml_objectify output from data structures

% python xo.py
<xml_objectify._XO_Spam instance at 0xe626c>
<type 'unicode'> flavor=pork
<type 'instance'> PCDATA=Ode to Spam (spam="smoked-pork")

There's More Than One Way To Do It

In the culture of the Perl programming language, programmers hold to a motto: "there's more than one way to do it." This slogan is well enough known that it is usually just abbreviated as TMTOWTDI. As one would expect from this motto, Perl developers have come up with quite a few different ways of handling XML. While Perl modules exist to support standards like DOM and SAX, most Perl programmers prefer modules that embody the Perlish cardinal virtues: laziness, impatience, and hubris. Perl programmers are the types of folks who think they can do it better, faster, and with less work than conformance with rigid and complex standards allows.

What most of the Perl modules for XML manipulation have in common is that they convert XML documents into data structures that are more idiomatic Perl. API calls like the ones in DOM or SAX are eschewed in favor of standard data structures--generally arrays and hashes. In keeping with the Perl motto, several different modules often do almost, but not quite, the same thing. Let us look first at the example of XML::Grove. As I have suggested, XML::Grove takes an XML document and creates a representation of it in terms of Perl hashes, accessed with normal Perl syntax. A short example lets us extract almost the same information as in the PYX example above:

XML::Grove script to print root attribute values

use XML::Grove;
use XML::Grove::Builder;
use XML::Parser::PerlSAX;
$grove_builder = XML::Grove::Builder->new;
$parser = XML::Parser::PerlSAX->new ( Handler => $grove_builder );
$document = $parser->parse ( Source => { SystemId => 'test.xml' } );

# this particular document has a PI and a comment before the body
$root = $document->{Contents}[2]; # name the document root element
$attrs = $root->{Attributes};     # name the root's attributes
foreach $attr (keys %$attrs)      # print the attribute values
{ print $attrs->{$attr} . "\n" };

This usage is fairly straightforward, or at least rather Perlish. The result is:

Perl XML::Grove attribute query

[PERL]# perl grove.pl
pork
8oz

The XML::Parser module is quite versatile, and allows several "styles" of handling XML documents. Some of these styles are SAX-like insofar as they utilize callbacks during serial processing of a document, but their structure is much more Perl-native than using the full SAX API. The style that best illustrates how far XML::Parser has "gone native," however, is the "Tree" style. In fact, this style has an even more Perlish feel than XML::Grove does:

XML::Parser "tree" style native data structure

use XML::Parser;
$tree_parse = new XML::Parser(Style => 'Tree');
$tree = $tree_parse->parsefile('test.xml');

use Data::Dumper;
($Data::Dumper::Indent,$Data::Dumper::Terse) = (1,1);
print Dumper($tree);

The result of this program shows all its Perl roots, and has very little that is XML-specific about it:

Perl XML::Parser tree data structure result

[PERL]# perl parser.pl
[
  'Spam',
  [
    {
      'flavor' => 'pork',
      'size' => '8oz'
    },
    0,
    '
',
    'Eggs',
    [
      {},
      0,
      'Some text about eggs.'
    ],
    0,
    '
',
    'MoreSpam',
    [
      {},
      0,
      'Ode to Spam (spam="smoked-pork")'
    ],
    0,
    '
'
  ]
]
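
Incidentally, the same "arrays and hashes all the way down" idea translates directly into other scripting languages. The snippet below is my own Python sketch of an equivalent nested structure--it is not the output of any particular library--together with a short recursive query over it.

A Python analogue of the "Tree" style structure (illustrative sketch)

# Each element is a tag name followed by a list whose first item is the
# attribute dictionary; the remaining items alternate between 0 (marking
# text) or a child tag name, and the corresponding text or child list.
tree = ['Spam', [{'flavor': 'pork', 'size': '8oz'},
                 0, '\n',
                 'Eggs', [{}, 0, 'Some text about eggs.'],
                 0, '\n',
                 'MoreSpam', [{}, 0, 'Ode to Spam (spam="smoked-pork")'],
                 0, '\n']]

def attribute_values(content):
    # collect this element's attribute values, then recurse into children
    values = list(content[0].values())
    for i in range(1, len(content), 2):
        if content[i] != 0:                 # a child element, not text
            values = values + attribute_values(content[i + 1])
    return values

print attribute_values(tree[1])             # e.g. ['pork', '8oz']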

Even More Native Libraries

By this point, the idea of using programming language native data structures has probably sunk in. A couple more libraries for doing this are worth mentioning briefly.

The REXML library is a very well-thought-out library for the Ruby programming language. Much like XML::Parser for Perl, REXML operates in multiple modes. The stream parser works in a way similar to SAX, but with a more Ruby-oriented syntax. The tree mode is the most interesting. Basically, this mode is quite similar to the data representation one gets with xml_objectify or with XML::Parser's "Tree" style. One advantage the REXML library has is its integration of an XPath-like region specifier syntax. Combined with the rest of Ruby's concise syntax, one can be extremely expressive without DOM-style contortions. For example, the REXML tutorial shows these lines:

REXML tree mode parsing and data structure

require "rexml/document"
include REXML # don't have to prefix everything with REXML::
doc = Document.new File.new("mydoc.xml")
doc.elements.each("inventory/section") { |element| puts element.attributes["name"] }
# -> health
# -> food

Java gets in on the native game as well. Even though DOM was itself largely styled around Java, the programming-language-neutral methods of DOM are still unnecessarily complex to work with (even in Java). JDOM is a more Java-native approach to XML processing. Let's just look at the JDOM mission statement to make the point:

There is no compelling reason for a Java API to manipulate XML to be complex, tricky, unintuitive, or a pain in the neck. JDOM is both Java-centric and Java-optimized. It behaves like Java, it uses Java collections, it is completely natural API for current Java developers, and it provides a low-cost entry point for using XML.
While JDOM interoperates well with existing standards such as the Simple API for XML (SAX) and the Document Object Model (DOM), it is not an abstraction layer or enhancement to those APIs. Rather, it seeks to provide a robust, light-weight means of reading and writing XML data without the complex and memory-consumptive options that current API offerings provide.

What more could I write?

Wrapup

There is one more class of native-to-XML libraries that I should mention briefly. All the above tools were ways of making XML documents look like the native data structures of a favorite programming language (this is even true of PYX, where the relevant tools prefer to see newline-delimited records). But one can go in the other direction also. For various reasons, one sometimes wants to create serialized representations of native in-memory objects. Many programming languages have internal--and often binary--formats for representing their objects. But a number of developers have decided that XML, as the "universal interchange format," would make a good serialization format.

Some of the XML serializers I know about are (see Resources for links to each):

    xml_pickle       Python
    Data::DumpXML    Perl
    JSX              Java
    XMarshall        Ruby

What all these libraries do is very similar. Unfortunately though, as of right now, the XML dialect produced by each serializer is slightly different from the others. A unification might allow a lightweight means of exchanging objects between programming languages.
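
Although each library's output dialect differs, the underlying idea is simple. The toy Python sketch below gives the flavor of object-to-XML serialization; it is not the dialect of any of the libraries listed above, and real serializers also record type information, handle nesting and cycles, and so on.

A toy object-to-XML serializer (illustrative sketch)

from xml.sax.saxutils import escape

class Spam:
    def __init__(self):
        self.flavor = 'pork'
        self.size = '8oz'
        self.eggs = 'Some text about eggs.'

def to_xml(obj, name):
    # dump each instance attribute as a child element of <name>
    lines = ['<%s>' % name]
    for attr, value in vars(obj).items():
        lines.append('  <%s>%s</%s>' % (attr, escape(str(value)), attr))
    lines.append('</%s>' % name)
    return '\n'.join(lines)

print to_xml(Spam(), 'Spam')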

Resources

The home page for Pyxie (the Python PYX library, and also C versions of the xmlv and xmln tools) is hosted by Sourceforge:

http://pyxie.sourceforge.net/

An introduction to the PYX format written by Sean McGrath can be found at:

http://www.xml.com/pub/a/2000/03/15/feature/index.html

A Perl library for working with (and converting to/from) PYX can be found at:

http://search.cpan.org/search?dist=XML-PYX

My articles describing xml_objectify and xml_pickle can be found on IBM developerWorks at the following URLs:

http://www-106.ibm.com/developerworks/library/xml-matters1/index.html
http://www-106.ibm.com/developerworks/library/xml-matters2/index.html
http://www-106.ibm.com/developerworks/library/x-matters11.html

A good starting point for understanding all the diverse XML modules for Perl is the Perl XML FAQ:

http://www.perlxml.com/faq/perl-xml-faq.html

More specifics on Perl XML modules can be found at the Perl-XML Module List:

http://www.perlxml.com/modules/perl-xml-modules.html

A more up-to-date, but slightly less annotated, summary of modules can be found by searching CPAN with the query:

http://search.cpan.org/search?mode=module&query=XML

The CPAN documentation for the XML::Grove module can be found at:

http://search.cpan.org/doc/KMACLEOD/XML-Grove-0.46alpha/lib/XML/Grove.pm

The documentation for the XML::Parser module can be found in the distribution, and also at:

http://search.cpan.org/doc/TWEGNER/XML-Parser-2.30-bin56Mac/Parser.pm

The CPAN copies of the documentation for the Data::DumpXML and Data::DumpXML::Parser modules can be found at:

http://search.cpan.org/doc/GAAS/Data-DumpXML-1.03/DumpXML.pm

and:

http://search.cpan.org/doc/GAAS/Data-DumpXML-1.03/DumpXML/Parser.pm

The homepage for the Ruby REXML library is at:

http://www.germane-software.com/~ser/software/rexml/

The tutorial for REXML is well-written, and provides a great introduction to the library:

http://www.germane-software.com/~ser/Software/rexml/tutorial.html

The homepage for the Java JDOM library is at:

http://www.jdom.org/

JSX is one of several Java XML serialization packages. It can be found at:

http://www.csse.monash.edu.au/~bren/JSX/

The Ruby XMarshall module is another version of XML serialization. It can be found at:

http://www.goto.info.waseda.ac.jp/~fukusima/ruby/xmarshal.rb

Quite a bit of Ruby XML information can be found at:

http://www.rubyxml.com/

My article "XML-RPC as object model" compares XML-RPC with xml_pickle :

http://www-106.ibm.com/developerworks/library/x-matters15.html