|
Sorting through Twisted...
| a good sense of why they are ...
Simple enough, and identical to the [SSAX] utility I presented.
WRITING THE OUTLINE UTILITY
------------------------------------------------------------------------
The main work of 'Outline.java' is performed by the class
[nu.xom.Builder]. This class builds an in-memory [XOM] object based
on an XML source. One surprise I found is that if you pass 'true' to
the constructor, [XOM] will insist on validation, even for XML
documents that do not specify a DTD. In other words, building any
such document throws a 'ValidityException' (though this might depend
on which Java XML parsers are installed). The best approach is
probably to omit the flag, and let [XOM] figure out the best parser.
#-------------------- Outline.java utility ----------------------#
import nu.xom.*;
import java.io.IOException;

public class Outline {

    public static void main(String[] args) {
        try {
            // Use 'Builder(true)' to require validation
            Builder parser = new Builder();
            Document doc = parser.build(args[0]);
            showElement(doc.getRootElement(), 0);
        }
        // ValidityException is a subclass of ParsingException,
        // so it has to be caught first
        catch (ValidityException ex) {
            System.err.println(args[0]+" is invalid.");
        }
        catch (ParsingException ex) {
            System.err.println(args[0]+" is not well-formed.");
        }
        catch (IOException ex) {
            System.err.println(args[0]+" cannot be read");
        }
    }

    private static void showElement(Element element, int level) {
        // Show the tag, along with its attributes
        indent(level, "<"+element.getLocalName());
        for (int i=0; i < element.getAttributeCount(); i++) {
            Attribute attr = element.getAttribute(i);
            System.out.print(" "+attr.getLocalName()+"='"+attr.getValue()+"'");
        }
        System.out.println(">");
        // Now loop through child nodes
        for (int i=0; i < element.getChildCount(); i++) {
            Node node = element.getChild(i);
            if (node instanceof Text) {
                // Only text runs longer than 30 characters are shown
                String text = node.getValue();
                if (text.length() > 30) {
                    indent(level+1, "|"+text.substring(0,30)+"...\n");
                }
            } else if (node instanceof Element) {
                showElement((Element)node, level+1);
            }
        }
    }

    private static void indent(int level, String string) {
        for (int i=0; i < level; i++) { System.out.print(" "); }
        System.out.print(string);
    }
}
The organization here is pretty straightforward. The method
'.showElement()' displays the name and attributes of each element, then
recurses to its children, incrementing an indent level on each
recursion.
In designing this utility, I took an illustrative misstep. The class
'Element' has a method '.getChildElements()' that returns a
traversable list of elements--excluding other 'Node' objects from the
enumeration. On its face, using this enumeration would seem more
straightforward; the method is, in fact, widely useful since you can
optionally limit the enumeration to children with a given name. Since
an 'Element' also has a '.getValue()' method to retrieve the PCDATA,
it would seem like we could grab these content strings with each such
child element.
Unfortunately, the semantics of '.getValue()' are slightly wrong for
my intended use: '.getValue()' retrieves -all- the text inside a given
tag, not only that portion of it leading up to the next child tag. In
the sample document above, for example, the blurb inside a nested
child element is thereby also inside the enclosing '<author>' element,
and 'author.getValue()' retrieves stuff we do not want. What we are left
with is walking through all the child nodes, and deciding what to do
with each based on which subclass of 'Node' we find. In particular,
for purposes of this utility, I am only interested in 'Text' and
'Element', not 'Comment', 'ProcessingInstruction', 'DocType' or
others.
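The difference between "all the text inside a tag" and "only the text
directly inside it" can be tried without [XOM] on the classpath. The
sketch below is a stand-in, not [XOM] code: it uses the DOM bundled
with the JDK, on the assumption that DOM's 'getTextContent()' behaves
like [XOM]'s '.getValue()' for this purpose. The class name, the
helper methods, and the sample element names are all made up for
illustration.

#--------------------- DirectText.java sketch -------------------#
```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DirectText {

    // Parse a small XML string; checked exceptions are wrapped for brevity.
    static Element parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
    }

    // Concatenate only the text nodes that are immediate children,
    // skipping child elements and everything inside them.
    static String directText(Element element) {
        StringBuilder sb = new StringBuilder();
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node node = children.item(i);
            if (node.getNodeType() == Node.TEXT_NODE) {
                sb.append(node.getNodeValue());
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical sample; element names are invented for this sketch.
        Element author =
            parse("<author>David<bio>A long blurb</bio> Mertz</author>");
        // All descendant text, blurb included
        System.out.println(author.getTextContent());
        // Only the text that belongs to 'author' itself
        System.out.println(directText(author));
    }
}
```

Walking the child nodes and testing the type of each one is the same
strategy 'showElement()' uses with 'instanceof Text'.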
CREATING A NEW XML DOCUMENT
------------------------------------------------------------------------
While, in my opinion, the main benefit of XML APIs is in parsing and
traversing existing XML documents, sometimes we also want to create
new documents within a program--or at least modify existing ones. For
the simplest tasks, basic string operations really do suffice. But
it's not hard to make a programming error, and fail to close a tag, or
escape a special value. Using [XOM] for document creation guards
against any such errors.
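To make the risk concrete, here is a minimal sketch of how plain
string concatenation goes wrong once the text contains a special
character. It deliberately uses only the JDK (its bundled parser
serves as a well-formedness check), and the class and method names are
hypothetical. The hand-written 'escape()' covers element content only;
attribute values would need quote handling as well.

#---------------------- NaiveXml.java sketch --------------------#
```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

public class NaiveXml {

    // Build an element by plain concatenation, with no escaping at all.
    static String naive(String tag, String text) {
        return "<" + tag + ">" + text + "</" + tag + ">";
    }

    // Escape the characters that are special in element content.
    // The ampersand must be replaced first.
    static String escape(String text) {
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;");
    }

    // Report whether a string parses as well-formed XML.
    // The JDK parser may also log the error to stderr.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (Exception ex) {
            return false;
        }
    }

    public static void main(String[] args) {
        String text = "Q&A: is 1 < 2?";
        // Unescaped '&' and '<' break the document
        System.out.println(isWellFormed(naive("root", text)));
        // The same text, escaped by hand, parses cleanly
        System.out.println(isWellFormed(naive("root", escape(text))));
    }
}
```

Every call site has to remember 'escape()' for this to stay correct,
which is exactly the bookkeeping a tree-building API takes off your
hands.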
Here is a brief example, mostly taken from the [XOM] tutorial:
#---------------------- HelloWorld.java -------------------------#
import nu.xom.*;
public class HelloWorld {
public static void main(String[] args) {
Element root = new Element("root");
root.appendChild("Hello World!");
Attribute foo = new Attribute("foo","bar");
root.addAttribute(foo);
Document doc = new Document(root);
String result = doc.toXML();
System.out.println(result);
}
}
This outputs the following:
#---------------------- HelloWorld output -----------------------#
$ java HelloWorld
<?xml version="1.0"?>
<root foo="bar">Hello World!</root>
Beyond the basic '.appendChild()' and '.addAttribute()' methods, the
'.copy()' and '.detach()' methods and the '.remove*()' family are
useful for rearranging [XOM] trees. Every tree, and every node inside
it, has a '.toXML()' method, and moreover this is the sole
serialization format for [XOM] objects.
COMPARISONS
------------------------------------------------------------------------
In writing my little 'Outline' utility, I became curious about how
convenient [XOM] really is compared to other APIs. Since the same
utility was written for the last installment on [SSAX], that makes for
an obvious comparison. As it turns out, the Scheme and Java
versions--using [SSAX] and [XOM] respectively--work out to pretty much
the same length in lines, despite Scheme's use of macros and dynamic
typing. The coding style is very different, of course; and the Scheme
is actually shorter in characters (if you ignore the larger number of
comments in the [SSAX] version).
Readers of this column, however, will know that I often advocate
Python--and specifically my own Gnosis Utilities APIs. I decided to
make a quick shot at the same utility using the latest development
version of [gnosis.xml.objectify]:
#---------------------- outline.py utility ----------------------#
from sys import stdin, stdout, stderr
from gnosis.xml.objectify import XML_Objectify, \
     make_instance, tagname, content, attributes
XML_Objectify.expat_kwargs['nspace_sep'] = None

def showNode(node, level=0):
    stdout.write(" "*level+"<"+tagname(node))
    for key,val in attributes(node).items():
        stdout.write(" %s='%s'" % (key,val))
    stdout.write(">\n")
    for child in content(node):
        if isinstance(child, unicode):
            if len(child) > 30:
                stdout.write(" "*(level+1)+"|"+child[:30]+"...\n")
        else:
            showNode(child, level+1)

showNode(make_instance(stdin))
I find it interesting that the Java version with [XOM] is still about
2.5 times as long as the Python version (and runs at very close to the
same speed, once I benchmarked against a large XML version of
Shakespeare's _Hamlet_; Python's smaller startup time biases small
tests).
Much of the extra code in Java relates to the various exception
checking in the method 'Outline.main()'. In Python, I can let the
built-in exception tracebacks do the work for me; of course, if I were to
start doing something more meaningful with exceptions than just report
them, then Python starts to look more like Java.
Obviously, however, programmers who want to use Java, for whatever
good reasons, gain little benefit in knowing that libraries for Python
or Scheme might allow more compact code. And Java certainly has a
number of strengths that can merit the extra verbosity.
CONCLUSION
------------------------------------------------------------------------
The real problem with DOM is that it is -good enough- for many
purposes. There are far too many methods in DOM, many overlapping in
purpose, and not named consistently. Committees and legacies do that.
Despite that, -everyone- already has a DOM library handy--not just
Java programmers, but also programmers of many other programming
languages. It is too easy to just choose DOM because it is widespread
and available.
Even though I would not generally choose to write in Java if I had the
option to write Python (or maybe Ruby, or even Perl), [XOM] really
does everything better than DOM. [XOM] is more correct, easier to
learn, and more consistent. Most of its capabilities have not been
covered in this introduction, but feel assured it has the usual
collection of XML technologies incorporated: XPath, XSLT, XInclude,
ability to interface with SAX and DOM, and so on.
If you are doing XML development in Java, and you are able to include
a custom LGPL library in your application, I strongly recommend you
give [XOM] a serious look.
RESOURCES
------------------------------------------------------------------------
The starting point for [XOM] is its homepage:
http://www.cafeconleche.org/XOM
A good place to get a sense of -why- Elliotte Rusty Harold thinks he
needed to develop [XOM] is his presentation "What's Wrong with XML
APIs (and how to fix them)":
http://www.cafeconleche.org/XOM/whatswrong/text0.html
The API documentation for [XOM] is quite good. It can be found at:
http://www.cafeconleche.org/XOM/apidocs/index.html
ABOUT THE AUTHOR
------------------------------------------------------------------------
{Picture of Author: http://gnosis.cx/cgi-bin/img_dqm.cgi}
David Mertz once led the desperate life of scholarship. David may be
reached at mertz@gnosis.cx; his life pored over at
http://gnosis.cx/publish/. Suggestions and recommendations on this,
past, or future, columns are welcomed. Check out David's new book
_Text Processing in Python_ at http://gnosis.cx/TPiP/.