David Mertz, Ph.D. <[email protected]>
Gnosis Software, Inc. <http://gnosis.cx/publish/>
November 2001
This series looks at several distinct conceptual models that exist for manipulation of XML documents. The Simple API for XML (SAX) is a widely used, and frequently implemented, approach to the procedural and sequential processing of an XML document. We look at what it means to treat an XML document as a series of events that are handled in an event-driven programming context. Source code examples of using SAX are provided, along with pointers to further resources.
As XML has developed into a widely used data format, a number of programming models have arisen for manipulating XML documents. Some of these models--or "paradigms"--have been enshrined as standards, while others remain only informally specified (but equally widely used nonetheless). In a general way, the several models available for manipulating XML documents closely mirror the underlying approaches and techniques that programmers in different traditions bring to the task of working with XML. It is worth noticing that "models" are at a higher level of abstraction than a particular programming language; most of the models discussed in this series are associated with APIs that have been implemented in multiple programming languages.
In part, the richness of available XML programming models simply allows programmers and projects to work in the ways that are most comfortable and familiar to them. In many ways, there is overlap--at least in achievable outcomes--among all the XML programming models. However, different models also carry with them specific pros and cons in the context of XML manipulation, and these might urge the use of particular models for particular projects. This series of five articles aims to provide readers with an overview of the costs, benefits, and motivations for all of the major approaches to programmatic manipulation of XML documents ("manipulation," here, should also be understood to mean "using XML to drive or communicate other application processes").
This article addresses the Simple API for XML (SAX), which is an event-driven and procedural style of XML programming. The previous article, Part 1, discussed the Document Object Model (DOM), which is a W3C Recommendation. Part 3 will look at XSLT, which brings a declarative programming style to transformations of XML documents. In Part 4, we will see the application of full-fledged Functional Programming (FP) techniques to XML manipulation--these in some ways unify the earlier models (but are less commonly used). The final installment, Part 5, will look briefly at a number of tools and techniques that did not quite fit into the previous discussion, but that readers would do well to be aware of.
There are a number of ways one can think about an XML document. A DOM programmer imagines an XML document as a representation of a hierarchically branching tree. In the DOM picture, each branch-point or leaf is a Node; and depending on exactly which type of Node it is, it gets written slightly differently. For example, element Nodes get written like <elem>...</elem>, text nodes like some text. But according to the Document Object Model--as the name suggests--every XML document is simply the description of a complete object that lives at a higher conceptual plane.
When one thinks about OOP a lot, the DOM picture looks compelling. Everything is an object, and objects do things like contain other objects and have properties. At a conceptual level, SAX appeals more to programmers who are used to thinking about streams and sequences--events following one another in strict synchronous order. A SAX programmer imagines an XML document as a linear sequence of discrete events, each one belonging to one of a small number of event types.
There are many programming tasks that effectively impose the linear thinking style of SAX. For example, readings from a physical instrument arrive in a particular sequence (or at least have a logical sequence), each corresponding to a time at which that physical state existed. A mouse and a keyboard, for what it is worth, are physical instruments of a certain sort. Financial transactions, similarly, exist as a linear sequence of events, each of which must be independently handled (perhaps depending on a state that reflects "past" transactions, but never on "future" ones). A logfile, or other delimited or flat line-oriented data file, is also usually processed as a series of distinct instructions or records.
Of closest similarity to SAX is a language lexer--something that takes a sequence of symbols (bytes) from a source file, and groups them into larger aggregates (tokens such as keywords, strings, symbols, etc). In point of fact, SAX is a parser, rather than merely a lexer; SAX creates events that are slightly more structured than mere aggregations (but only slightly).
One needs to distinguish between SAX itself and a SAX application. SAX (Simple API for XML) is, after all, an API, not a tool or utility in itself. What SAX does is read a specified bytestream, then alert its controlling application whenever one of 10 content events occurs in the document. Technically, things are slightly more complicated, since errors, entities and notations can also cause notices--but of the 10 content events, only 3 are almost always used, 2 are sometimes needed, and the 5 remaining ones are largely "special purpose" (I have never used those 5 for the practical applications I have programmed).
SAX, technically speaking, is an object-oriented programming framework. SAX was first--and probably still most widely--implemented in Java, and that language requires a certain degree of OOP-purity (aside from some minor warts with simple types versus classes). Therefore, a SAX application will implement a custom Handler class that specializes the base handlers in the SAX library, and provides methods corresponding to events. An instance of the custom handler is passed to the previously instantiated SAX parser. From there, you are "off to the races."
The startElement(), endElement() and characters() methods are specialized by just about every SAX application. These methods, predictably, are called when an open tag, close tag, and tag body, respectively, are encountered. Fairly often, the startDocument() and endDocument() methods are used for initialization and cleanup. But equally often, any initialization or cleanup can occur at the application level rather than at the document level (or the two are interchangeable). The remaining methods are startPrefixMapping(), endPrefixMapping(), processingInstruction(), ignorableWhitespace() and skippedEntity(). These each have important reasons for being there, but in practice are used less frequently. You can see that the naming conventions use Java-style word capitalization.
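To make the handler pattern concrete, here is a minimal sketch in Python 3 using the standard library's xml.sax module. It is not one of this article's listings; the class name EventLogger and the sample document are made up for illustration. It simply records each of the commonly used content events in the order the parser reports them.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class EventLogger(ContentHandler):
    """Record each content event as a tuple, in arrival order.

    EventLogger is a hypothetical name used only for this sketch.
    """

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append(("startDocument",))

    def endDocument(self):
        self.events.append(("endDocument",))

    def startElement(self, name, attrs):
        # attrs is a mapping-like object; copy it into a plain dict
        self.events.append(("startElement", name, dict(attrs.items())))

    def endElement(self, name):
        self.events.append(("endElement", name))

    def characters(self, content):
        # A parser is free to deliver one run of text in several pieces
        self.events.append(("characters", content))


# Hypothetical sample input, chosen only to trigger each event type
SAMPLE = b'<memo id="1"><to>Ann</to></memo>'

logger = EventLogger()
xml.sax.parseString(SAMPLE, logger)
for event in logger.events:
    print(event)
```

Because characters() may be called more than once for a single run of text, an application that needs the whole text of an element should concatenate the pieces rather than rely on one call.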
The inheritance and interface design that goes into SAX is pretty nicely thought out. But all the OOP stuff is also fairly irrelevant to the underlying conceptual model of SAX. Having known method names, and providing abstract base classes, makes the programming a little bit easier. But the general idea of handing a bunch of callback operations to a parser is perfectly straightforward in a non-OOP language. As an example, the very popular expat XML parser is written in C (not C++), and is generally called from a C application (although wrappers for other languages exist too). While the function names in expat do not precisely match those in SAX, they are conceptually close to each other. Moreover, a number of full-fledged SAX libraries are built on top of the underlying expat parser; the extra layer is fairly transparent in this case. As the example below demonstrates, there is nothing stopping a programmer from using the exact same names for callback functions with expat as she would use for handler methods in SAX. In fact, doing this is good practice, in general.
At its heart, SAX is a structured programming style of thinking that just happens to have been implemented in an OOP language.
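The same point can be sketched without leaving Python, by bypassing the SAX layer and using the standard library's thin wrapper around expat (xml.parsers.expat). This is an illustration under assumptions, not one of this article's listings: the callbacks are plain functions that borrow the SAX method names, and the sample input and the out list are invented for the sketch.

```python
import xml.parsers.expat

out = []  # collected output fragments (hypothetical helper for this sketch)


def startElement(name, attrs):
    # attrs arrives as a plain dict of attribute names to values
    out.append("<" + name)
    for key, value in attrs.items():
        # No escaping is attempted here; this is only a sketch
        out.append(' %s="%s"' % (key, value))
    out.append(">")


def endElement(name):
    out.append("</%s>" % name)


def characters(data):
    out.append(data)


# No handler class: the callbacks are simply assigned to the parser
parser = xml.parsers.expat.ParserCreate()
parser.StartElementHandler = startElement
parser.EndElementHandler = endElement
parser.CharacterDataHandler = characters

# The second argument marks this as the final piece of input
parser.Parse(b'<a x="1"><b>text</b></a>', True)
print("".join(out))
```

The only OOP left here is the parser object itself; the application logic lives entirely in three ordinary functions.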
Let us look at the same trivial XML processing application written in several programming languages. The first of these is Python. This version is probably the most readable (I like Python), and follows SAX OOP patterns and naming accurately. A Java version would look almost identical. All this application does--in all three implementations--is write out a canonical version of the same XML document it reads in. Sort of an imperfect cat. The output might differ from the input in respect to some whitespace around attributes, and the loss of comments and processing instructions. Some character entities can become confused also. It is not trying to be a great application; but a better application will look quite similar to this simple one.
from xml.sax import make_parser, handler
Call this application with:
% python dumbsax.py myfile.xml > new.xml
A Perl version is largely similar. One thing to notice is that the Perl
implementation is, in a sense, not really SAX. That is, XML::Parser::PerlSAX
does not choose to use exactly the method names defined by SAX proper.
Moreover, OOP in Perl is slightly circuitous compared to languages that
started out as object oriented. Nonetheless, it should be clear that this
version closely matches the Python one.
use XML::Parser::PerlSAX;
Call this application with:
% perl dumbsax.pl myfile.xml > new.xml
Finally, to round out our view of SAX-like XML processing, let us look at a completely non-OOP application in C, using expat as its parser. The most notable feature of the C expat version is that it looks almost exactly the same as the object-oriented SAX versions. Some custom functions are used to set up callback functions rather than calling a method of a parser instance. But that is a very minor spelling variant.
#include <stdio.h>
Compile and call this application with:
% gcc dumbexpat.c -lexpat -o dumbexpat
There are some technologies, certainly, that have a good concept but a weak implementation. SAX is not one of them; the design of SAX is a clean and direct approach to the problem it solves, and really cannot be significantly improved upon. Both the strengths and weaknesses of SAX (or, likewise, of expat) come directly out of its paradigmatic model.
The thing to emphasize about SAX is that it has no concept of an XML document, as such. SAX knows about the small parts that make up an XML document--tags, attributes, bodies, etc.--but simply does not conceive of an XML document as a unified thing, a single data structure. DOM, by contrast, has a whole model of Nodes, children, values, and all sorts of useful structural relations in an XML document. Likewise, the declarative statements about documents that we will see in later installments about XSLT and functional programming techniques conceive of an XML document as a data structure to operate upon holistically. The contrast between SAX and these other paradigms is really quite stark.
Although SAX does allow one to specify validation of an input document, validity is not really treated as a document property. Instead, the violation of a validity constraint is just an event, not fundamentally different from a startElement event or a characters event (this event, however, is handled by means of an error or exception--but that is an implementation strategy, not a paradigmatic distinction). A SAX application can do whatever it wants with an "invalidity" event, including ignoring it entirely or inferring a valid variant.
There are a couple of situations where SAX (or a similar event-driven parser) is pretty well the only reasonable approach. Two clear cases are: (1) When an input XML document is sufficiently large (e.g., megabytes or gigabytes; but maybe mere kilobytes in embedded contexts) that creating an in-memory representation of the entire document is infeasible. It should be noted, moreover, that a DOM representation of an XML document is generally several times the size of the underlying document. (2) When an input XML document is not even available all at once, but rather arrives over a channel over a period of time (and you want to process the content as each portion becomes available). The nice thing about SAX is that it can operate in a space- and time-constrained context. Each sequential event can be processed as it "occurs", with no necessary reliance on future events (nor even necessarily on past events). There is no need to read in an entire file at once for SAX processing, nor does one need to await the completion of a stream.
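The second case can be sketched with the incremental interface that Python's standard xml.sax parser exposes (feed() and close()). The generator below is only a hypothetical stand-in for data arriving over a socket or pipe, and the element name "reading" is invented for the sketch. Note that one chunk boundary deliberately falls in the middle of a tag.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class ReadingCounter(ContentHandler):
    """Count <reading> elements as they are reported (hypothetical names)."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        if name == "reading":
            self.count += 1


def chunks():
    # Stand-in for a network channel: the document arrives in pieces,
    # and the second <reading> tag is split across two of them
    yield b"<log><reading v='1'/><rea"
    yield b"ding v='2'/>"
    yield b"<reading v='3'/></log>"


handler = ReadingCounter()
parser = xml.sax.make_parser()
parser.setContentHandler(handler)

seen = []  # running count observed after each chunk
for chunk in chunks():
    parser.feed(chunk)          # hand over whatever has arrived so far
    seen.append(handler.count)
parser.close()                  # signal that the stream is complete

print(seen, handler.count)
```

The application never holds more than one chunk at a time. Exactly when each event fires relative to the chunk that completes it is up to the underlying parser, so the running counts in seen should be read as illustrative rather than guaranteed.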
The downside of SAX is also sometimes an upside. Since there is no data structure of an XML document provided by SAX, if one wants to represent the document--or even just parts of it--in a SAX application, it is necessary to allocate and create a custom data structure. DOM, by contrast, just gives you the whole structure, along with a number of convenient methods for operating on it. Building and populating a custom data structure is generally much more work than is simply working with a DOM tree. At both a design and an implementation stage there is a lot that can go wrong in a large and complex data structure (DOM implementations have already been debugged for you).
The SAX examples in the above source code were unrealistic in this respect--all they did was read events then write them back out; most real-world SAX applications will accumulate various information contained in the various events. Even something as simple as the relative nesting of startElement events needs to be kept in application-defined data structures (whether this will be a stack, tree, hash or something else depends on the requirement).
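As a minimal sketch of that bookkeeping, the handler below keeps a stack of open element names so that each startElement event can be placed in its nesting context. The class name PathTracker and the sample document are assumptions made for illustration only.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class PathTracker(ContentHandler):
    """Track element nesting with an application-defined stack."""

    def __init__(self):
        super().__init__()
        self.stack = []   # names of the currently open elements
        self.paths = []   # slash-joined path of every element seen

    def startElement(self, name, attrs):
        self.stack.append(name)
        self.paths.append("/".join(self.stack))

    def endElement(self, name):
        # The element being closed is always the one on top of the stack
        self.stack.pop()


tracker = PathTracker()
xml.sax.parseString(b"<a><b><c/></b><d/></a>", tracker)
print(tracker.paths)
```

SAX itself reports only "start of c"; knowing that c sits inside b inside a is entirely the application's doing.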
But there is a plus to creating custom data structures in a SAX application--namely, they are custom. A large amount of the time, a DOM-style tree structure is the relevant way to look at an XML document/stream. But not always--at other times what the application is interested in about an XML stream is not its tree, but some different aggregation of the information in the event stream. In such cases--even apart from the memory and latency issues--SAX is the right approach, precisely because one needs to build a custom data structure. Under a DOM approach, what a programmer winds up doing is first building a DOM tree, then "walking" that tree to extract a brand new data structure with the needed characteristics. SAX bypasses the extra step.
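A small sketch of such a non-tree aggregation, echoing the financial-transaction example from earlier: the handler below builds per-account totals directly from the event stream and never constructs a tree at all. The element and attribute names (txn, account, amount) and the sample data are hypothetical.

```python
import xml.sax
from collections import defaultdict
from decimal import Decimal
from xml.sax.handler import ContentHandler


class AccountTotals(ContentHandler):
    """Sum transaction amounts per account straight from the events."""

    def __init__(self):
        super().__init__()
        # Decimal() is zero, so missing accounts start at 0
        self.totals = defaultdict(Decimal)

    def startElement(self, name, attrs):
        if name == "txn":
            self.totals[attrs["account"]] += Decimal(attrs["amount"])


# Hypothetical input; the element and attribute names are assumptions
LEDGER = b"""<ledger>
  <txn account="A" amount="10.50"/>
  <txn account="B" amount="5.00"/>
  <txn account="A" amount="2.25"/>
</ledger>"""

totals = AccountTotals()
xml.sax.parseString(LEDGER, totals)
print(dict(totals.totals))
```

The resulting dictionary is the data structure the application actually wants; a DOM approach would build the whole tree first and then walk it to produce the same dictionary.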
Like the websites of many open source projects, the official website for SAX is hosted by SourceForge:
http://sax.sourceforge.net/
An apparently identical page can also be found at:
http://www.saxproject.org/
The expat parser can also be found at a SourceForge page:
http://expat.sourceforge.net/