David Mertz, Ph.D. <[email protected]>
   Gnosis Software, Inc.  <http://gnosis.cx/publish/>
   November 2001
This series looks at several distinct conceptual models that exist for manipulation of XML documents. The Simple API for XML (SAX) is a widely used, and frequently implemented, approach to the procedural and sequential processing of an XML document. We look at what it means to treat an XML document as a series of events, handled in an event-driven programming context. Source code examples of using SAX are provided, along with pointers to further resources.
As XML has developed into a widely used data format, a number of programming models have arisen for manipulating XML documents. Some of these models--or "paradigms"--have been enshrined as standards, while others remain only informally specified (but equally widely used nonetheless). In a general way, the several models available for manipulating XML documents closely mirror the underlying approaches and techniques that programmers in different traditions bring to the task of working with XML. It is worth noticing that "models" are at a higher level of abstraction than a particular programming language; most of the models discussed in this series are associated with APIs that have been implemented in multiple programming languages.
In part, the richness of available XML programming models simply allows programmers and projects to work in the ways that are most comfortable and familiar to them. In many ways, there is overlap--at least in achievable outcomes--among all the XML programming models. However, different models also carry with them specific pros and cons in the context of XML manipulation, and these might urge the use of particular models for particular projects. This series of five articles aims to provide readers with an overview of the costs, benefits, and motivations for all of the major approaches to programmatic manipulation of XML documents (manipulation, here, should also be understood to mean "using XML to drive or communicate other application processes").
This article addresses the Simple API for XML (SAX), which is an event-driven and procedural style of XML programming. The previous article, Part 1, discussed the Document Object Model (DOM), which is a W3C Recommendation. Part 3 will look at XSLT, which brings a declarative programming style to transformations of XML documents. In Part 4, we will see the application of full-fledged Functional Programming (FP) techniques to XML manipulation--these in some ways unify the earlier models (but are less commonly used). The final installment, Part 5, will look briefly at a number of tools and techniques that did not quite fit into the previous discussion, but that readers would do well to be aware of.
There are a number of ways one can think about an XML document. A DOM programmer imagines an XML document as a representation of a hierarchically branching tree. In the DOM picture, each branch-point or leaf is a Node; and depending on exactly which type of Node it is, it gets written slightly differently. For example, element Nodes get written like <elem>...</elem>, text Nodes like some text. But according to the Document Object Model--as the name suggests--every XML document is simply the description of a complete object that lives at a higher conceptual plane.
When one thinks about OOP a lot, the DOM picture looks compelling. Everything is an object, and objects do things like contain other objects and have properties. At a conceptual level, SAX appeals more to programmers who are used to thinking about streams and sequences--events following one another in strict synchronous order. A SAX programmer imagines an XML document as a linear sequence of discrete events, each one belonging to one of a small number of event types.
There are many programming tasks that effectively impose the linear thinking style of SAX. For example, readings from a physical instrument arrive in a particular sequence (or at least have a logical sequence), each corresponding to a time at which that physical state existed. A mouse and a keyboard, for what it is worth, are physical instruments of a certain sort. Financial transactions, similarly, exist as a linear sequence of events, each of which must be independently handled (perhaps depending on a state that reflects "past" transactions, but never on "future" ones). Processing a logfile, or other delimited or flat line-oriented data file, is also usually treated as handling a series of distinct instructions or records.
Of closest similarity to SAX is a language lexer--something that takes a sequence of symbols (bytes) from a source file, and groups them into larger aggregates (tokens such as keywords, strings, symbols, etc). In point of fact, SAX is a parser, rather than merely a lexer; SAX creates events that are slightly more structured than mere aggregations (but only slightly).
One needs to distinguish between SAX itself and a SAX application. SAX (Simple API for XML) is, after all, an API, not a tool or utility in itself. What SAX does is read a specified bytestream, then alert its controlling application whenever one of 10 content events occurs in the document. Technically things are slightly more complicated, since errors, entities and notations can also cause notices--but of the 10 content events only 3 are almost always used, 2 are sometimes needed, and the 5 remaining ones are largely "special purpose" (I have never used those 5 for the practical applications I have programmed).
SAX, technically speaking, is an object-oriented programming framework. SAX was first--and probably still most widely--implemented in Java, and that language requires a certain degree of OOP-purity (aside from some minor warts with simple types versus classes). Therefore, a SAX application will implement a custom Handler class that specializes the base handlers in the SAX library, and provides methods corresponding to events. An instance of the custom handler is passed to the previously instantiated SAX parser. From there, you are "off to the races."
The startElement(), endElement() and characters() methods are specialized by just about every SAX application. These methods, predictably, are called when an open tag, a close tag, and a tag body, respectively, are encountered. Fairly often, the startDocument() and endDocument() methods are used for initialization and cleanup. But equally often, any initialization or cleanup can occur at the application level rather than at the document level (or the two are interchangeable). The remaining methods are startPrefixMapping(), endPrefixMapping(), processingInstruction(), ignorableWhitespace() and skippedEntity(). These each have important reasons for being there, but in practice are used less frequently. You can see that the naming conventions use Java-style word capitalization.
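To make the handler pattern concrete, here is a minimal sketch using the xml.sax module from the Python standard library. The class name EventLogger and the sample document are hypothetical, chosen only for illustration; the method names are the SAX ones discussed above.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class EventLogger(ContentHandler):
    """Hypothetical handler that records each SAX event it receives."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append(("startDocument",))

    def endDocument(self):
        self.events.append(("endDocument",))

    def startElement(self, name, attrs):
        # attrs behaves like a read-only mapping of attribute names to values
        self.events.append(("startElement", name, dict(attrs.items())))

    def endElement(self, name):
        self.events.append(("endElement", name))

    def characters(self, content):
        # A parser is free to deliver one run of text in several calls
        self.events.append(("characters", content))


handler = EventLogger()
xml.sax.parseString(b"<greeting lang='en'>hello</greeting>", handler)
for event in handler.events:
    print(event)
```

The application never asks the parser for anything; it only supplies methods and waits to be called, which is the whole of the SAX contract.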
The inheritance and interface design that goes into SAX is pretty nicely thought out. But all the OOP stuff is also fairly irrelevant to the underlying conceptual model of SAX. Having known method names, and providing abstract base classes, makes the programming a little bit easier. But the general idea of handing a bunch of callback operations to a parser is perfectly straightforward in a non-OOP language. As an example, the very popular expat XML parser is written in C (not C++), and is generally called from a C application (although wrappers for other languages exist too).
While the function names in expat do not precisely match those in SAX, they are conceptually close to each other. Moreover, a number of full-fledged SAX libraries are built on top of the underlying expat parser; the extra layer is fairly transparent in this case. As the example below demonstrates, there is nothing stopping a programmer from using the exact same names for callback functions with expat as she would use for handler methods in SAX. In fact, doing this is good practice, in general.
At its heart, SAX is a structured programming style of thinking that just happens to have been implemented in an OOP language.
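The callback idea can be seen without any classes at all. The sketch below uses Python's standard xml.parsers.expat wrapper rather than the C library directly, purely to keep the illustration short; the three function names are borrowed from SAX by choice, and the events list is a hypothetical stand-in for whatever an application would really do.

```python
import xml.parsers.expat

events = []  # hypothetical sink for whatever the callbacks observe


# Plain functions, deliberately named after the SAX handler methods
def startElement(name, attrs):
    events.append(("startElement", name, attrs))


def endElement(name):
    events.append(("endElement", name))


def characters(data):
    events.append(("characters", data))


parser = xml.parsers.expat.ParserCreate()
# expat's own attribute names differ from SAX, but the roles are the same
parser.StartElementHandler = startElement
parser.EndElementHandler = endElement
parser.CharacterDataHandler = characters

parser.Parse(b"<doc><item id='1'>text</item></doc>", True)
for event in events:
    print(event)
```

No inheritance is involved: the parser is simply handed a few functions to call.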
Let us look at the same trivial XML processing application written in several programming languages. The first of these is Python. This version is probably the most readable (I like Python), and follows SAX OOP patterns and naming accurately. A Java version would look almost identical. All this application does--in all three implementations--is write out a canonical version of the same XML document it reads in. Sort of an imperfect cat. The output might differ from the input in respect to some whitespace around attributes, and the loss of comments and processing instructions. Some character entities can become confused also. It is not trying to be a great application; but a better application will look quite similar to this simple one.
| from xml.sax import make_parser,handler | 
Call this application with:
| % python dumbsax.py myfile.xml > new.xml | 
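For readers who want something to run, here is a minimal sketch of an echoing handler in the spirit described above. It is not the original dumbsax.py listing: the names EchoHandler and echo are hypothetical, and the choice to escape text and quote attributes with xml.sax.saxutils is an assumption of this sketch.

```python
import io
import sys
from xml.sax import make_parser, handler
from xml.sax.saxutils import escape, quoteattr


class EchoHandler(handler.ContentHandler):
    """Hypothetical handler that writes each event back out as XML text."""

    def __init__(self, out):
        super().__init__()
        self.out = out

    def startElement(self, name, attrs):
        # Rebuild the open tag; original spacing around attributes is lost
        attr_text = "".join(
            " %s=%s" % (key, quoteattr(value)) for key, value in attrs.items()
        )
        self.out.write("<%s%s>" % (name, attr_text))

    def endElement(self, name):
        self.out.write("</%s>" % name)

    def characters(self, content):
        # Re-escape &, < and > so the output stays well-formed
        self.out.write(escape(content))


def echo(source, out=sys.stdout):
    """Parse a filename or file-like object, echoing its events to out."""
    parser = make_parser()
    parser.setContentHandler(EchoHandler(out))
    parser.parse(source)


# Demonstration on an in-memory document instead of a file on disk
echo(io.BytesIO(b'<a x="1">t &amp; u<b/></a>'))
```

Comments and processing instructions are dropped because no handler method picks them up, and an empty element comes back as an open and close pair.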
A Perl version is largely similar. One thing to notice is that the Perl implementation is, in a sense, not really SAX. That is, XML::Parser::PerlSAX does not choose to use exactly the method names defined by SAX proper. Moreover, OOP in Perl is slightly circuitous compared to languages that started out as object oriented. Nonetheless, it should be clear that this version closely matches the Python one.
| use XML::Parser::PerlSAX; | 
Call this application with:
| % perl dumbsax.pl myfile.xml > new.xml | 
Finally, to round out our view of SAX-like XML processing, let us look at a completely non-OOP application in C, using expat as its parser. The most notable feature of the C expat version is that it looks almost exactly the same as the object-oriented SAX versions. Some custom functions are used to set up callback functions rather than calling a method of a parser instance. But that is a very minor spelling variant.
| #include <stdio.h> | 
Compile and call this application with:
| % gcc dumbexpat.c -lexpat -o dumbexpat | 
There are some technologies, certainly, that have a good concept but a weak implementation. SAX is not one of them; the design of SAX is a clean and direct approach to the problem it solves, and really cannot be significantly improved upon. Both the strengths and weaknesses of SAX (or, likewise, of expat) come directly out of its paradigmatic model.
The thing to emphasize about SAX is that it has no concept of an XML document, as such. SAX knows about the small parts that make up an XML document--tags, attributes, bodies, etc.--but simply does not conceive of an XML document as a unified thing, a single data structure. DOM, by contrast, has a whole model of Nodes, children, values, and all sorts of useful structural relations in an XML document. Likewise, the declarative statements about documents that we will see in later installments about XSLT and functional programming techniques conceive of an XML document as a data structure to operate upon holistically. The contrast between SAX and these other paradigms is really quite stark.
Although SAX does allow one to specify validation of an input document, validity is not really treated as a document property. Instead, the violation of a validity constraint is just an event, not fundamentally different from a startElement event or a characters event (this event, however, is handled by means of an error or exception--but that is an implementational strategy, not paradigmatic). A SAX application can do whatever it wants with an "invalidity" event, including ignore it entirely or infer a valid variant.
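The mechanics of "a problem is just another event" can be sketched with the ErrorHandler interface in Python's xml.sax. One loud caveat: the expat-based parser bundled with Python does not validate, so this sketch uses a well-formedness error as a stand-in for a validity violation. The class name RecordingErrorHandler is hypothetical.

```python
import xml.sax
from xml.sax.handler import ContentHandler, ErrorHandler


class RecordingErrorHandler(ErrorHandler):
    """Hypothetical handler that treats parser complaints as events."""

    def __init__(self):
        self.problems = []

    def warning(self, exception):
        self.problems.append(("warning", exception.getMessage()))

    def error(self, exception):
        # Recoverable problem: note it and let parsing continue
        self.problems.append(("error", exception.getMessage()))

    def fatalError(self, exception):
        # Assumption of this sketch: record the event, then stop the parse
        self.problems.append(("fatalError", exception.getMessage()))
        raise exception


errors = RecordingErrorHandler()
try:
    # Mismatched tags stand in for an "invalidity" here (see the caveat above)
    xml.sax.parseString(b"<a><b></a>", ContentHandler(), errors)
except xml.sax.SAXParseException as exc:
    print("parse stopped at line", exc.getLineNumber())

print(errors.problems)
```

What the application does inside each method is entirely its own business; recording the problem and stopping is only one possible policy.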
There are a couple of situations where SAX (or a similar event-driven parser) is pretty well the only reasonable approach. Two clear cases are: (1) When an input XML document is sufficiently large (e.g., megabytes or gigabytes; but maybe mere kilobytes in embedded contexts) that creating an in-memory representation of the entire document is infeasible. Moreover, it should be noted that a DOM representation of an XML document is generally several times the size of the underlying document. (2) When an input XML document is not even available all at once, but rather arrives over a channel over a period of time (and you want to process the content as each portion becomes available). The nice thing about SAX is that it can operate in a space- and time-constrained context. Each sequential event can be processed as it "occurs", with no necessary reliance on future events (nor even necessarily on past events). There is no need to read in an entire file at once for SAX processing, nor does one need to await the completion of a stream.
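The second case can be sketched with the incremental interface that Python's default SAX parser provides through feed() and close(). The chunks list is a hypothetical stand-in for data arriving over a channel, and the ReadingHandler class and element names are invented for the illustration. Exactly when each event is reported depends on the underlying parser's buffering, so the intermediate output is not something to rely on.

```python
from xml.sax import make_parser
from xml.sax.handler import ContentHandler


class ReadingHandler(ContentHandler):
    """Hypothetical handler that collects the value of each <reading>."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def startElement(self, name, attrs):
        if name == "reading":
            self.seen.append(attrs.get("value"))


handler = ReadingHandler()
parser = make_parser()
parser.setContentHandler(handler)

# Stand-in for a slow channel; note the first chunk ends in the middle of a tag
chunks = [
    b"<readings><reading val",
    b"ue='1'/><reading value='2'/>",
    b"</readings>",
]
for chunk in chunks:
    parser.feed(chunk)  # hand over whatever has arrived so far
    print("values so far:", handler.seen)
parser.close()  # signal end of input
```

At no point does the application hold the whole document, or even a whole tag, in a structure of its own.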
The downside of SAX is also sometimes an upside. Since there is no data structure of an XML document provided by SAX, if one wants to represent the document--or even just parts of it--in a SAX application, it is necessary to allocate and create a custom data structure. DOM, by contrast, just gives you the whole structure, along with a number of convenient methods for operating on it. Building and populating a custom data structure is generally much more work than is simply working with a DOM tree. At both a design and an implementation stage there is a lot that can go wrong in a large and complex data structure (DOM implementations have already been debugged for you).
The SAX examples in the above source code were unrealistic in this respect--all they did was read events then write them back out; most real-world SAX applications will accumulate various information contained in the various events. Even something as simple as the relative nesting of startElement events needs to be kept in application-defined data structures (whether this will be a stack, tree, hash or something else depends on the requirement). But there is a plus to creating custom data structures in a SAX application--namely, they are custom. A large amount of the time, a DOM-style tree structure is the relevant way to look at an XML document/stream. But not always--at other times what the application is interested in about an XML stream is not its tree, but some different aggregation of the information in the event stream. In such cases--even apart from the memory and latency issues--SAX is the right approach, precisely because one needs to build a custom data structure. Under a DOM approach, what a programmer winds up doing is first building a DOM tree, then "walking" that tree to extract a brand new data structure with the needed characteristics. SAX bypasses the extra step.
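As a small sketch of such a custom structure, the handler below keeps a stack of open elements to track nesting and a counter keyed by element path. The class name PathCounter and the sample document are hypothetical; the point is only that the application, not the parser, decides what shape the accumulated data takes.

```python
import xml.sax
from collections import Counter
from xml.sax.handler import ContentHandler


class PathCounter(ContentHandler):
    """Hypothetical handler that counts elements by their nesting path."""

    def __init__(self):
        super().__init__()
        self.stack = []          # names of the currently open elements
        self.counts = Counter()  # e.g. "log/entry" -> occurrences

    def startElement(self, name, attrs):
        self.stack.append(name)
        self.counts["/".join(self.stack)] += 1

    def endElement(self, name):
        self.stack.pop()


doc = b"<log><entry><msg/></entry><entry><msg/><msg/></entry></log>"
counter = PathCounter()
xml.sax.parseString(doc, counter)
print(dict(counter.counts))
```

No tree is ever built here; memory use grows with the depth of nesting and the number of distinct paths, not with the size of the document.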
Like a lot of open source projects, the official website for SAX is hosted by SourceForge:
http://sax.sourceforge.net/
An apparently identical page can also be found at:
http://www.saxproject.org/
The expat parser can also be found at a SourceForge page:
http://expat.sourceforge.net/