David Mertz, Ph.D.
Line commander, Gnosis Software, Inc.
Most of the time, processing XML documents utilizes heavy-duty APIs and custom applications. However, the tradition of using small tools with I/O piped between them is a very fine one on Unix-like platforms. XML need not be entirely left out of quick-and-dirty processing with one-liners, which is especially useful during development and debugging cycles.
As much as I hate to say it, XML tools simply have not
reached the level of convenience of the text utilities
available at a Unix-like command-line. For line-oriented,
whitespace- or comma-delimited text files, it is quite amazing
what you can accomplish with clever combinations of
cut, pipes, and short shell scripts.
In my opinion, it is not that XML is inherently resistant to the modular treatment flat text files enjoy. We just need to learn from experience the best ways to componentize XML tools. For example, in writing this tip, I had a few realistic sample tasks in mind; but what I found was that even those tools that have command-line facilities have not yet learned to "play nice" with each other. Working with multiple tools is not intractable; it just requires a little bit of wrapping.
A fact to note is that quite a few people have written
versions--in various programming languages--of similar simple
tools. Each version behaves a bit differently, but they tend
to accomplish the same overall task. For this tip, I look at
xml_indexer, xmlcat, and xpath--the first two
come from my Gnosis Utilities, the last is a Perl module
written by Matt Sergeant (get it from CPAN).
I have written previously about
xml_indexer, which creates an
index of the words within XML documents by their XPath. For
example, you can index then search an XML document with:
    % xml_indexer chap.xml
    % indexer events were
    /Users/dqm/chap.xml::/chapter/sect1/sect2/para
    /Users/dqm/chap.xml::/chapter/sect1/sect2/para
    1 files matched wordlist: ['events', 'mostly']
    Processed in 0.062 seconds (SlicedZPickleIndexer)
These commands display the elements within the XML document
chap.xml that contain the words events and were (not
necessarily in order or proximity). If other XML documents were
added to the index, matching occurrences in them would appear
also. New searches are almost instantaneous, even if multiple
documents are indexed, by the way.
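As a rough illustration of the idea (not the actual xml_indexer code), the Python sketch below builds a toy in-memory index that maps each word to the locations where it occurs, written in the same filename::XPath form as the output above. The names index_words and search, and the sample document, are hypothetical. The paths carry no positional predicates, so sibling elements with the same tag share one entry.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def index_words(xml_source, filename):
    """Map each lowercased word to the set of 'filename::XPath' locations holding it."""
    index = defaultdict(set)

    def walk(elem, parent_path):
        path = f"{parent_path}/{elem.tag}"
        # Words directly inside this element: its leading text plus the
        # text that follows each child (the child's "tail").
        words = (elem.text or "").split()
        for child in elem:
            walk(child, path)
            words.extend((child.tail or "").split())
        for word in words:
            index[word.lower()].add(f"{filename}::{path}")

    walk(ET.fromstring(xml_source), "")
    return index


def search(index, *words):
    """Return the sorted locations that contain every one of the given words."""
    if not words:
        return []
    hits = [index.get(word.lower(), set()) for word in words]
    return sorted(set.intersection(*hits))


# Hypothetical sample document, not the chap.xml used in the article.
sample = (
    "<chapter><sect1>"
    "<para>Some events were odd</para>"
    "<note>Nothing else happened</note>"
    "</sect1></chapter>"
)
index = index_words(sample, "chap.xml")
for location in search(index, "events", "were"):
    print(location)
```

A real indexer would also persist the index to disk, which is what makes repeated searches fast.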
While it tells you a little bit to know that words occur at
particular XPaths within particular documents, the point of a
search is usually to see (or further process) the actual content
matches. For that, you need to employ a command-line XPath
tool; I have installed Perl's XML::XPath, whose behavior I like.
You can cut-and-paste discovered XPaths into the tool:
    % xpath chap.xml '/chapter/sect1/sect2/para'
    Found 1 nodes:
    -- NODE --
    <para>It is not particularly remarkable that...
    ...
    </para>
This points to a nice modularity in the tools. Moreover, if the
XPath passed to
xpath had wildcards in it, it might have
matched more than just the one node. Unfortunately, the output of
indexer does not have quite the right form to pipe to xpath
to automate looking at the nodes with matched words: indexer
separates the filename from the XPath with "::", and xpath
looks at one XPath at a time. We can do better.
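The mismatch comes down to one small transformation: each hit must be split at its first "::" into a filename and an XPath. Here is a minimal Python sketch of that step, assuming every hit has exactly the filename::XPath form shown earlier; split_hit is a hypothetical helper, not part of either tool.

```python
def split_hit(hit):
    """Split one hit of the form 'filename::XPath' into (filename, xpath).

    Only the first '::' is treated as the separator, because an XPath may
    itself contain '::' in an axis specifier such as 'child::para'. This
    assumes the filename contains no '::'.
    """
    filename, separator, xpath = hit.partition("::")
    if not separator:
        raise ValueError(f"not a filename::XPath hit: {hit!r}")
    return filename, xpath


filename, xpath = split_hit("/Users/dqm/chap.xml::/chapter/sect1/sect2/para")
print(filename)
print(xpath)
```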
There might be a way to manage the above "impedance mismatch"
using clever combinations of
apply, pipes, and the
like. But I found it easier to write a short (and reusable)
shell script, find-xml-elements:
    #!/bin/sh
    for hit in `indexer $@ 2> /dev/null`
    do
      echo $hit | sed 's/::/ /' > loc.tmp
      cat loc.tmp | xargs xpath 2> /dev/null
      echo
    done
    rm loc.tmp
As with other well-designed command-line tools, indexer and
xpath send informational messages to STDERR, and the actual results
to STDOUT. For my script, I am not interested in the STDERR
messages. Now I can find all the nodes in which a list of words
occur as easily as:
    % find-xml-elements events were
    <para>Lest we forget some events in a recent decade...
    ...
    Salem and by HUAC.</para>

    <para>It is not particularly remarkable that...
    ...
    being uncovered.</para>
So far, so good. What our search outputs is a series of XML snippets, where each top-level element contains the searched words. However, the result is generally not quite a well-formed XML document, since it is multiply-rooted.
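One common workaround for multiply-rooted output, sketched here in Python, is to enclose the fragments in a single wrapper element before handing them to an XML parser. This is only an illustration under stated assumptions: the wrapper tag name hits and both helper functions are hypothetical, and the approach assumes the fragments carry no XML declaration or DOCTYPE of their own.

```python
import xml.etree.ElementTree as ET


def wrap_fragments(fragments, root_tag="hits"):
    """Enclose a run of XML fragments in one root element (hypothetical tag name)."""
    return f"<{root_tag}>\n{fragments}\n</{root_tag}>"


def is_well_formed(xml_source):
    """Report whether the parser accepts xml_source as a complete document."""
    try:
        ET.fromstring(xml_source)
    except ET.ParseError:
        return False
    return True


snippets = (
    "<para>Lest we forget some events...</para>\n"
    "<para>It is not particularly remarkable...</para>"
)
print(is_well_formed(snippets))                  # two roots: rejected
print(is_well_formed(wrap_fragments(snippets)))  # one root: accepted
```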
One difficulty in analyzing XML data is that XML documents can contain variations in formatting that are irrelevant to their semantic content. Some whitespace is "ignorable", the order of attributes is discarded during parsing, empty elements may be either self-closed or have an end-tag, and entities can be encoded in a few different ways. In truth, even much whitespace that is non-ignorable from a parser's perspective is nonetheless insignificant from an application point-of-view; "pretty" newlines and indenting are useful for people, and many applications (optionally) perform such stylistic formatting.
There are a rather large number of tools that have been written
to compare XML documents in a semantically useful way. Most of
them have chosen the obvious name
xmldiff, or something close
to it (use Google to find versions for various programming
languages). Underlying such a comparison of XML documents is a
canonicalization of the layout of each document. Once
inflexible algorithmic decisions have been made about the exact
rendering of an XML document, semantically similar documents are
easier to compare with generic tools like diff.
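To make the idea concrete, the sketch below uses the canonicalize() function from Python's standard library (Python 3.8 or later), which implements the C14N 2.0 rules. It is a stand-in for illustration, not the algorithm any particular xmldiff tool uses. The two sample documents are invented, and strip_text=True is an aggressive choice that also discards whitespace an application might care about.

```python
import xml.etree.ElementTree as ET  # canonicalize() requires Python 3.8+

# Two invented documents that differ only in attribute order, quoting,
# empty-element style, and "pretty" whitespace.
variant_a = '<doc b="2" a="1"><empty/><p>text</p></doc>'
variant_b = """<doc a='1'
     b='2'>
  <empty></empty>
  <p>text</p>
</doc>"""


def canonical(xml_source):
    """Render xml_source in one fixed layout so equivalent documents match."""
    return ET.canonicalize(xml_source, strip_text=True)


print(variant_a == variant_b)                        # raw text differs
print(canonical(variant_a) == canonical(variant_b))  # canonical forms agree
```

Once both documents are written out in canonical form, a plain line-oriented comparison only reports differences that survive canonicalization.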
I use a Python script I wrote called
xmlcat. The tool is not
complicated--it acts much like the standard
cat utility, but
canonicalizes XML documents along the way. In a chance to use my
favorite word, I can note that the operation of xmlcat is
idempotent. The reason I prefer
xmlcat to similar tools like
xmlpp (see Resources) is that it adds an option inspired by the
text-mode browser lynx. If you pass the
--dump argument to
xmlcat, it outputs only the textual content of an XML document,
eliminating the tags (using vertical whitespace in a moderately
pretty way). For data-oriented XML, this capability is of
little use, but for marked-up prose, it is handy.
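The core of a text dump is small. The following Python sketch shows the general technique of keeping character data and dropping tags; it is not the xmlcat source, dump_text is a hypothetical name, and unlike the real --dump option it collapses all whitespace instead of laying the text out vertically.

```python
import xml.etree.ElementTree as ET


def dump_text(xml_source):
    """Return only the character data of a document, with tags removed."""
    root = ET.fromstring(xml_source)
    # itertext() yields every text node in document order, including
    # text nested inside inline markup.
    text = "".join(root.itertext())
    # Collapse runs of spaces and newlines left over from the markup layout.
    return " ".join(text.split())


print(dump_text("<para>Lest we <emphasis>forget</emphasis> some\n   events.</para>"))
```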
If you search XML documents of prose for content words, most
likely you are interested in the content more than you are the
markup. Filtering with
xmlcat --dump is exactly the trick to
remove unwanted XML tags. However, directly piping the output
of find-xml-elements to xmlcat is not quite right, since
the output of
find-xml-elements is not quite an entire
well-formed XML document (it is fragments, as noted). A short
shell script, find-xml-text, solves the problem:
    #!/bin/sh
    for hit in `indexer $@ 2> /dev/null`
    do
      echo $hit | sed 's/::/ /' > loc.tmp
      cat loc.tmp | xargs xpath 2> /dev/null | xmlcat --dump
      echo
    done
    rm loc.tmp
The output from
find-xml-text plays nice with standard text
utilities. For example, I would like to display all the
paragraphs that contain some search terms, but remove any left
indent from their lines and limit line-length:
    % find-xml-text events were | sed 's/^ *//' | fmt -w 70
    Lest we forget some events in a recent decade...
    ...
    ...those in Salem and by HUAC.

    It is not particularly remarkable...
    ...
    ...being uncovered.
Kip Hampton wrote a worthwhile article last year looking at Perl tools for command-line XML processing:
The Perl tools
xmldiff (compare XML documents) and
xmlpp (XML pretty printer) can be found at:
Gnosis Utilities includes several of the utilities discussed in this article; download it from:
XML Matters #10 discusses full text indexing of XML documents by XPath:
David Mertz uses a wholly unstructured brain to write about structured document formats. David may be reached at [email protected]; his life pored over at http://gnosis.cx/publish/. And buy his book: http://gnosis.cx/TPiP/.