David Mertz, Ph.D. <[email protected];
Gnosis Software, Inc. <http://gnosis.cx/publish/>
January 2002
This series looks at several distinct conceptual models for manipulating XML documents. But this final article serves as a reflection on, and critique of, programmers' assumptions that XML manipulation should have any conceptual model at all. Rather than finding a particular paradigm for thinking about XML and providing APIs for many programming languages, some tools take the opposite approach. That is, these latter tools start with the paradigms and, even more so, the idioms of a particular programming language, then try to figure out the most natural way to manipulate XML within these idioms.
As XML has developed into a widely used data format, a number of programming models have arisen for manipulating XML documents. Some of these models--or "paradigms"--have been enshrined as standards, while others remain only informally specified (but equally widely used nonetheless). In a general way, the several models available for manipulating XML documents closely mirror the underlying approaches and techniques that programmers in different traditions bring to the task of working with XML. It is worth noticing that "models" are at a higher level of abstraction than a particular programming language; most of the models discussed in this series are associated with APIs that have been implemented in multiple programming languages.
In part, the richness of available XML programming models simply allows programmers and projects to work in the ways that are most comfortable and familiar to them. In many ways, there is overlap--at least in achievable outcomes--between all the XML programming models. However, different models also carry with them specific pros and cons in the context of XML manipulation, and these might urge the use of particular models for particular projects. This series of five articles aims to provide readers with an overview of the costs, benefits, and motivations for all of the major approaches to programmatic manipulation of XML documents ("manipulation" here should also be understood to mean "using XML to drive or communicate with other application processes").
The first article, Part 1, discussed the OOP-style Document Object Model (DOM), which is a W3C Recommendation. Part 2 discussed the Simple API for XML (SAX) and similar event-driven and procedural styles of XML programming. Part 3 covered eXtensible Stylesheet Language Transformations (XSLT). Part 4 addressed functional programming (FP) approaches to XML processing, with examples from the Haskell HaXml library. This final installment, Part 5, looks briefly at a number of tools and techniques that did not quite fit into the previous discussion. What these tools have in common is that their structure and usage have a lot more to do with the programming languages used to manipulate XML than with the inherent structure of XML.
The techniques presented in the previous installments of this series have many conceptual differences between them. But at a meta-conceptual level, all the techniques share a surprisingly close affinity. That is, even though DOM is pure OOP, SAX is event-driven and procedural, and XSLT is pure declarative programming, developing the paradigm for each one followed the same path. For all of them, someone started with a question like "How should I think about XML documents, and about the transformations one can make upon them?" In each case the answer was a particular programming paradigm. Actually implementing the abstract API implied by the paradigm choice came later when libraries were written for numerous programming languages.
DOM, SAX, and XSLT each have their own sort of conceptual purity. Perhaps the developers of each were most familiar or most comfortable with the corresponding paradigm; or more likely these developers merely wanted to develop APIs that flowed from underlying conceptual models. The drawback of each of these techniques is that they need to be shoehorned into particular programming languages. The APIs, and the styles of programming prescribed around these APIs, only fit concrete programming languages in a rough way. The data structures of DOM, for example, can certainly be represented in many programming languages, but they are not really the most idiomatic way of using data in any of them. XSLT, as was described in Part 3, is its own complete--albeit special--language, rather than really an API. But inasmuch as XSLT transformations are often defined and executed inside other programs, they have a "bolted-on" feel to them (you need this whole other language, living inside your main application language). SAX, just because it does the least, also creates the least "impedance mismatch" when called from various programming languages.
Some library developers have decided to approach XML manipulation from a different direction. Rather than start out worrying about what the XML format suggests as a paradigm, these latter developers start by thinking about which native idioms of a given programming language most naturally express the contents of an XML document. A set of built-in, or at least ubiquitously used, data structures is chosen to hold all the bits and pieces of XML, and the language-native flow constructs are chosen to describe the manipulation process. This alternative approach eschews paradigms, and focuses on practicality and simplicity. Naturally, one drawback to this practical approach is that every library and technique is wholly unlike every other, so much less knowledge and practice transfers between different languages, libraries, and techniques. It is a drawback often worth living with.
One thing left out of the above summary is the subject of the fourth installment of this series. The Haskell library HaXml--as well as the other functional programming language libraries indicated in the Resources of that installment--have already taken the main steps towards a programming-language focus. In a sense, HaXml would fit well in this installment. However, given the special strengths of functional programming for XML processing--type safety, complex data types, higher-order functions, and the general stark contrast with imperative languages--FP deserved its own installment.
In the rest of this installment, we take a look at a handful of different "native-oriented" XML libraries in a half-dozen programming languages. Other libraries probably exist, but this gives readers a feel for the terrain.
The PYX format is a line-oriented representation of XML documents that is derived from the SGML ESIS format. PYX is not itself XML, but it is able to represent all the information within an XML document in a manner that is easier to handle using familiar text processing tools. Moreover, PYX documents can themselves be transformed back into XML as needed. It is worth noting that PYX documents are approximately the same size as the corresponding XML versions (sometimes a little larger, sometimes a little smaller); so storage and transmission considerations do not significantly enter into the transformation between XML and PYX.
The PYX format is extremely simple to describe and understand. The first character on each line identifies the content-type of the line. Content does not directly span lines, although successive lines might contain the same content-type. In the case of tag attributes, the attribute name and value are simply separated by a space, without use of extra quotes. The prefix characters are:
    (   start-tag
    )   end-tag
    A   attribute
    -   character data
    ?   processing instruction
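The prefix scheme is simple enough that a converter can be sketched in a few lines of Python. The following is only a minimal illustration built on the standard library's expat bindings, not the implementation of the xmln tool; the function name xml_to_pyx, the sample document, and the exact escaping rules (backslash and newline) are assumptions made for the example.

```python
import xml.parsers.expat


def xml_to_pyx(xml_text):
    """Return a list of PYX-style lines for an XML string (illustrative sketch)."""
    lines = []

    def escape(text):
        # Assumption: backslashes and newlines are escaped so that each
        # record stays on a single physical line.
        return text.replace("\\", "\\\\").replace("\n", "\\n")

    def start(name, attrs):
        lines.append("(" + name)
        # Attribute name and value are separated by one space, without quotes
        for key, value in attrs.items():
            lines.append("A" + key + " " + escape(value))

    def end(name):
        lines.append(")" + name)

    def chars(data):
        lines.append("-" + escape(data))

    def instruction(target, data):
        lines.append("?" + target + " " + data)

    parser = xml.parsers.expat.ParserCreate()
    parser.buffer_text = True  # deliver contiguous character data in one call
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.ProcessingInstructionHandler = instruction
    parser.Parse(xml_text, True)
    return lines


# Hypothetical sample document, used only for this illustration
sample = '<Spam flavor="pork"><Eggs>Some text</Eggs><?cook fry?></Spam>'
print("\n".join(xml_to_pyx(sample)))
```

Because no handler is registered for comments or the DOCTYPE declaration, this sketch simply ignores them; the element, attribute, text, and processing-instruction events map one-to-one onto the five prefixes.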
The motivation for PYX is the wide usage, convenience, and familiarity of line-oriented text processing tools and techniques. The GNU textutils, for example, include tools like wc, tail, head, and uniq; other familiar text processing tools are grep, sed, awk, and, in a more sophisticated way, perl and other scripting languages. These types of tools generally both expect newline-delimited records and rely on regular expression patterns to identify parts of texts. Neither of the expectations is a good match for XML, as it happens.
Let us take a look at PYX in action. PYX libraries exist for several programming languages, but much of the time it is most useful simply to use the command line tools xmln and xmlv. The first is a non-validating transformation tool; the second adds validation against a DTD.
[PYX]# cat test.xml |
One should notice that the transformation loses the DOCTYPE declaration and the comment in the original XML document. For many purposes, this is not important (parsers often discard this information also). The PYX format, in contrast to the XML format, allows one to easily pose a variety of ad hoc questions about a document. For example: what are all the attribute values in the sample document? Using PYX, we can simply ask:
[PYX]# ./xmln test.xml | grep "^A" | awk '{print $2}' |
Getting this answer out of the original XML is a huge challenge. Either one needs to create a whole program that calls a parser and looks for tag attribute dictionaries, or one needs to come up with a quite complex regular expression that will find the information of interest (left as an exercise for readers). Complicating things is the content of the <MoreSpam> element, which contains something that looks a lot like a tag attribute, but is not.
Sean McGrath's article (see Resources) has additional similar examples.
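The grep and awk pipeline above has a direct counterpart in a scripting language, which is part of the appeal of a line-oriented format. Here is a minimal Python sketch of the same query; the function name and the sample PYX records are hypothetical, chosen only to show that character data resembling an attribute is not matched.

```python
def pyx_attribute_values(pyx_lines):
    """Yield the value of every attribute record in an iterable of PYX lines."""
    for line in pyx_lines:
        line = line.rstrip("\n")
        if line.startswith("A"):
            # An attribute record is "A<name> <value>"; split at the first space only
            _name, _sep, value = line[1:].partition(" ")
            yield value


# Hypothetical PYX records, for illustration only
sample_pyx = [
    "(Spam",
    "Aflavor pork",
    "Asize large can",
    "(MoreSpam",
    '-flavor="beef" only looks like an attribute',
    ")MoreSpam",
    ")Spam",
]
print(list(pyx_attribute_values(sample_pyx)))
```

One small design difference is worth noting: awk '{print $2}' prints only the second whitespace-delimited field, so an attribute value that contains spaces is cut at its first word, whereas splitting once on the first space keeps the whole value.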
The API methods of DOM give one access to a certain data structure that represents an XML document. The problem is that this data structure is very little like the built-in data types of programming languages. A number of libraries have made the move to "native" versions of XML documents. But in my opinion, none has done it as thoroughly as my own Python xml_objectify module.
When xml_objectify is used to read in an XML document, what one gets is a very simple Python object, whose object attributes correspond to the subelements and attributes of the root document element. The only difference between subelements and tag attributes is whether they contain further objects or plain text. Testing the type of thing a (Python) attribute contains is sufficient for determining whether it started out as a subelement or an XML attribute (but in Python data terms, the difference is not really that important--it usually reflects a fairly arbitrary XML design choice).
The earlier articles listed in the Resources go into the design and limitations of xml_objectify, but an example of its usage illustrates its strength. Remember first how one might use DOM to look at data in an XML document:
from xml.dom import minidom |
This produces something like:
% python dom.py |
In contrast, xml_objectify lets a user refer to the XML document data in much more intuitive, and Pythonic, ways:
xml_objectify access to XML data structures
from xml_objectify import XML_Objectify |
The result is similar to the DOM version:
xml_objectify output from data structures
% python xo.py |
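For readers who want to see the general idea without installing anything, the mapping from XML to plain objects can be sketched with the standard library alone. This is not how xml_objectify is implemented and does not use its API; the Node class, the objectify function, the text attribute name, and the sample document are all hypothetical names chosen for this illustration.

```python
import xml.etree.ElementTree as ET


class Node:
    """Plain container: XML attributes and subelements become Python attributes."""


def objectify(element):
    node = Node()
    # XML attributes become plain string attributes
    for name, value in element.attrib.items():
        setattr(node, name, value)
    # Subelements become nested Node objects; repeated tags are collected in a list
    for child in element:
        value = objectify(child)
        existing = getattr(node, child.tag, None)
        if existing is None:
            setattr(node, child.tag, value)
        elif isinstance(existing, list):
            existing.append(value)
        else:
            setattr(node, child.tag, [existing, value])
    # Character content is stored under a hypothetical attribute name
    node.text = (element.text or "").strip()
    return node


# Hypothetical sample document, for illustration only
doc = objectify(ET.fromstring(
    '<Spam flavor="pork"><Eggs>scrambled</Eggs>'
    '<Toast>rye</Toast><Toast>wheat</Toast></Spam>'
))
print(doc.flavor)                     # came from an XML attribute (a string)
print(doc.Eggs.text)                  # came from a subelement (a Node)
print([t.text for t in doc.Toast])    # repeated subelement (a list of Nodes)
```

As the text says, checking the Python type is enough to tell the two origins apart: here an XML attribute arrives as a string, while a subelement arrives as a Node or a list of Nodes.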
In the culture of the Perl programming language, there is a motto held by programmers: "there's more than one way to do it." This slogan is well enough known that it is usually just abbreviated as TMTOWTDI. As one would expect from this motto, Perl developers have come up with quite a few different ways of handling XML. While there do exist Perl modules to support standards like DOM and SAX, most Perl programmers prefer modules that embody the Perlish cardinal virtues: laziness, hubris, and impatience. Perl programmers are the types of folks who think they can do it better, faster, and with less work than conformance with rigid and complex standards allows.
What most of the Perl modules for XML manipulation have in common is that they convert XML documents into data structures that are more idiomatic Perl. API calls like one might find in DOM or SAX are eschewed in favor of standard data structures--generally arrays and hashes. In keeping with the Perl motto, several different modules often do almost, but not quite, the same thing. Let us look first at the example of XML::Grove. As I have suggested, XML::Grove takes an XML document and creates a representation of it in terms of Perl hashes, accessed with normal Perl syntax. A short example lets us extract almost the same information as in the PYX example above:
XML::Grove script to print root attribute values
use XML::Grove; |
This usage is fairly straightforward, or at least rather Perlish. The result is:
XML::Grove attribute query
[PERL]# perl grove.pl |
The XML::Parser module is quite versatile, and allows several "styles" of handling XML documents. Some of these styles are SAX-like insofar as they utilize callbacks from serial processing of a document, although the structure is much more Perl-native than the full SAX API. The style that best illustrates how far XML::Parser has "gone native", however, is the "tree" style. In fact, this style has an even more Perlish feel than XML::Grove does:
XML::Parser "tree" style native data structure
use XML::Parser; |
The result of this program shows all its Perl roots, and has very little XML-specific about it:
XML::Parser tree data structure result
[PERL]# perl parser.pl |
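To make the shape of such a structure concrete for readers who do not use Perl, here is a rough Python analogue. It follows the layout that the XML::Parser documentation describes for its Tree style, as I understand it: each node is a tag/content pair, the content array begins with a dictionary of attributes, and text appears under the pseudo-tag 0. The function name to_tree and the sample document are hypothetical, and the sketch uses Python lists and dicts in place of Perl arrays and hashes.

```python
import xml.etree.ElementTree as ET


def to_tree(element):
    """Return [tag, content] built from nothing but lists, dicts, and strings."""
    # The first item of the content list is always the attribute dictionary
    content = [dict(element.attrib)]
    if element.text:
        content += [0, element.text]       # text is stored under pseudo-tag 0
    for child in element:
        content += to_tree(child)          # appends the child's tag, then its content
        if child.tail:
            content += [0, child.tail]     # text that follows the child element
    return [element.tag, content]


# Hypothetical sample document, for illustration only
tree = to_tree(ET.fromstring('<Spam flavor="pork">Some <Eggs>fried</Eggs> text</Spam>'))
print(tree)
```

Nothing in the result is XML-specific: it can be walked, sliced, and printed with the same constructs one would use for any other nested list.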
By this point, the idea of using programming language native data structures has probably sunk in. A couple more libraries for doing this are worth mentioning briefly.
The REXML library is a very well thought out library for the Ruby programming language. Much like XML::Parser for Perl, REXML operates in multiple modes. The stream parser works in a way similar to SAX, but with a more Ruby-oriented syntax. The tree mode is the most interesting. Basically, this mode is quite similar to the data representation one gets with xml_objectify or the XML::Parser "Tree" style. One advantage the REXML library has is its integration of an XPath-like region specifier syntax. Combined with the rest of Ruby's concise syntax, one can be extremely expressive without DOM-style contortions. For example, the REXML tutorial shows these lines:
require "rexml/document" |
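The combination of a native tree with path expressions is not unique to Ruby. As a point of comparison only (this is not REXML, and it is not taken from the REXML tutorial), Python's standard xml.etree.ElementTree module supports a limited subset of XPath in its findall method. The document below is a hypothetical example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, for illustration only
doc = ET.fromstring(
    "<inventory>"
    "<section name='health'><item upc='123'>Soap</item></section>"
    "<section name='food'><item upc='456'>Spam</item><item upc='789'>Eggs</item></section>"
    "</inventory>"
)

# Path expression: every <item> under the <section> whose name attribute is 'food'
food_items = [item.text for item in doc.findall("./section[@name='food']/item")]
print(food_items)

# Ordinary iteration plus dictionary-style attribute access
upcs = [item.get("upc") for item in doc.iter("item")]
print(upcs)
```

The path string selects a region of the document, and everything after that is plain list processing in the host language.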
Java gets in the native game also. Even though DOM was itself largely styled around Java, the programming language neutral methods of DOM are still unnecessarily complex to work with (even in Java). JDOM is a more Java-native version of XML processing. Let's just look at the JDOM mission statement to make the point:
There is no compelling reason for a Java API to manipulate XML to be complex, tricky, unintuitive, or a pain in the neck. JDOM is both Java-centric and Java-optimized. It behaves like Java, it uses Java collections, it is completely natural API for current Java developers, and it provides a low-cost entry point for using XML.
While JDOM interoperates well with existing standards such as the Simple API for XML (SAX) and the Document Object Model (DOM), it is not an abstraction layer or enhancement to those APIs. Rather, it seeks to provide a robust, light-weight means of reading and writing XML data without the complex and memory-consumptive options that current API offerings provide.
What more could I write?
There is one more class of native-to-XML libraries that I should mention briefly. All the above tools were ways of making XML documents look like the native data structures of a favorite programming language (this is even true of PYX, where some tools prefer to see newline-delimited records). But one can go in the other direction also. For various reasons, one sometimes wants to create serialized representations of native in-memory objects. Many programming languages have internal--and often binary--formats for representing their objects. But a number of developers have decided that XML, as the "universal interchange format", would make a good serialization format.
Some of the XML serializers I know about are:
    The Python xml_pickle module.
    The Perl Data::DumpXML and Data::DumpXML::Parser modules.
    The Ruby XMarshall module (which seems little maintained, however).
What all these libraries do is very similar. Unfortunately though, as of right now, the XML dialect of each serializer is slightly different from the others. A unification might allow a lightweight means of exchanging objects between programming languages.
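What such a serializer does can be sketched in a few lines. The following Python round trip is only an illustration of the general technique; it does not reproduce the XML dialect of xml_pickle or of any other library named above, and the element names (obj, entry, item) and the type attribute are invented for the example. It handles only string-keyed dictionaries, lists, tuples, strings, integers, and floats.

```python
import xml.etree.ElementTree as ET


def to_xml(value, tag="obj"):
    """Serialize nested dicts, lists, tuples, strings, and numbers to an Element."""
    element = ET.Element(tag, type=type(value).__name__)
    if isinstance(value, dict):
        for key, item in value.items():
            child = to_xml(item, "entry")
            child.set("key", str(key))
            element.append(child)
    elif isinstance(value, (list, tuple)):
        for item in value:
            element.append(to_xml(item, "item"))
    else:
        element.text = str(value)
    return element


def from_xml(element):
    """Rebuild the native value recorded by to_xml."""
    kind = element.get("type")
    if kind == "dict":
        return {child.get("key"): from_xml(child) for child in element}
    if kind in ("list", "tuple"):
        items = [from_xml(child) for child in element]
        return tuple(items) if kind == "tuple" else items
    if kind == "int":
        return int(element.text)
    if kind == "float":
        return float(element.text)
    return element.text or ""


# Hypothetical in-memory object, for illustration only
data = {"name": "Spam", "sizes": [1, 2.5], "tags": ("canned", "pork")}
xml_bytes = ET.tostring(to_xml(data))
print(xml_bytes.decode())
restored = from_xml(ET.fromstring(xml_bytes))
print(restored == data)
```

Because the type names written into the XML are Python's own, a reader in another language would need to agree on their meaning, which is exactly the dialect problem described above.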
The home page for Pyxie (the Python PYX library, and also C versions of the xmlv and xmln tools) is hosted by Sourceforge:
http://pyxie.sourceforge.net/
An introduction to the PYX format written by Sean McGrath can be found at:
http://www.xml.com/pub/a/2000/03/15/feature/index.html
A perl library for working with (and converting to/from) PYX can be found at:
http://search.cpan.org/search?dist=XML-PYX
My articles describing xml_objectify and xml_pickle can be found on IBM developerWorks at the following URLs:
http://www-106.ibm.com/developerworks/library/xml-matters1/index.html
http://www-106.ibm.com/developerworks/library/xml-matters2/index.html
http://www-106.ibm.com/developerworks/library/x-matters11.html
A good starting point for understanding all the diverse XML modules for Perl is the Perl XML FAQ:
http://www.perlxml.com/faq/perl-xml-faq.html
More specifics on Perl XML modules can be found at the Perl-XML Module List:
http://www.perlxml.com/modules/perl-xml-modules.html
A more up-to-date, but slightly less annotated, summary of modules can be found by searching CPAN with the query:
http://search.cpan.org/search?mode=module&query=XML
The CPAN documentation for the XML::Grove module can be found at:
http://search.cpan.org/doc/KMACLEOD/XML-Grove-0.46alpha/lib/XML/Grove.pm
The documentation for the XML::Parser module can be found in the distribution, and also at:
http://search.cpan.org/doc/TWEGNER/XML-Parser-2.30-bin56Mac/Parser.pm
The CPAN copies of the documentation for the Data::DumpXML and Data::DumpXML::Parser modules can be found at:
http://search.cpan.org/doc/GAAS/Data-DumpXML-1.03/DumpXML.pm
and:
http://search.cpan.org/doc/GAAS/Data-DumpXML-1.03/DumpXML/Parser.pm
The homepage for the Ruby REXML library is at:
http://www.germane-software.com/~ser/software/rexml/
The tutorial for REXML is well-written, and provides a great introduction to the library:
http://www.germane-software.com/~ser/Software/rexml/tutorial.html
The homepage for the Java JDOM library is at:
http://www.jdom.org/
JSX is one of several Java XML serialization packages. It can be found at:
http://www.csse.monash.edu.au/~bren/JSX/
The Ruby XMarshall module is another version of XML serialization. It can be found at:
http://www.goto.info.waseda.ac.jp/~fukusima/ruby/xmarshal.rb
Quite a bit of Ruby XML information can be found at:
http://www.rubyxml.com/
My article "XML-RPC as object model" compares XML-RPC with xml_pickle:
http://www-106.ibm.com/developerworks/library/x-matters15.html