XML ZONE TIP: Command-line XML Processing
Working with XML Documents from the Shell

David Mertz, Ph.D.
Line commander, Gnosis Software, Inc.
April, 2003

    Most of the time, processing XML documents utilizes heavy-duty
    APIs and custom applications. However, the tradition of using
    small tools with I/O piped between them is a very fine one on
    Unix-like platforms. XML need not be entirely left out of
    quick-and-dirty processing with one-liners that is especially
    useful during development and debugging cycles.

INTRODUCTION
------------------------------------------------------------------------

  As much as I hate to say it, XML tools simply have not
  reached the level of convenience of the text utilities
  available at a Unix-like command-line.  For line-oriented,
  whitespace- or comma-delimited text files, it is quite amazing
  what you can accomplish with clever combinations of 'sed',
  'grep', 'xargs', 'wc', 'cut', pipes, and short shell scripts.

  In my opinion, it is not that XML is inherently resistent to
  the modular treatment flat text files enjoy.  We just need to
  learn from experience the best ways to componentize XML tools.
  For example, in writing this tip, I had a few realistic sample
  tasks in mind; but what I found was that even those tools that
  have command-line facilities have not yet learned to "play
  nice" with each other.  Working with multiple tools is not
  intractable, it just requires a little bit of wrapping.

  A fact to note is that quite a few people have written
  versions--in various programming languages--of similar simple
  tools.  Each version behaves a bit differently, but they tend
  to accomplish the same overall task.  For this tip, I look at
  the tools 'xml_indexer', 'xmlcat', and 'xpath'--the first two
  come from my Gnosis Utilities, the last is a Perl module
  written by Matt Sergeant (get it from CPAN).


FINDING WORDS IN XML PROSE
------------------------------------------------------------------------

  I have written previously about 'xml_indexer', which creates an
  index of the words within XML documents by their XPath.  For
  example, you can index then search an XML document with:

      #--------- Indexing and searching on XPaths --------------#
      % xml_indexer chap.xml
      % indexer events were
      /Users/dqm/chap.xml::/chapter/sect1[2]/sect2[1]/para[1]
      /Users/dqm/chap.xml::/chapter/sect1[2]/sect2[4]/para[3]
      1 files matched wordlist: ['events', 'mostly']
      Processed in 0.062 seconds (SlicedZPickleIndexer)

  These commands display the elements within the XML document
  'chap.xml' that contain the words 'events' and 'were' (not
  necessarily in order or proximity). If other XML documents were
  added to the index, matching occurrences in them would appear
  also.   New searches are almost instantaneous, even if multiple
  documents are indexed, by the way.

  While it tells you a little bit to know that words occur at
  particular XPaths within particular documents, the point of a
  search is usually to see (or further process) the actual content
  matches. For that, you need to employ a command-line 'xpath'
  tool; I have installed Perl's XML:XPath, whose behavior I like.

  You -can- cut-and-paste discovered XPaths into the tool 'xpath'.
  E.g.:

      #------------ Manually looking at an XPath ---------------#
      % xpath chap.xml '/chapter/sect1[2]/sect2[4]/para[3]'
      Found 1 nodes:
      -- NODE --
      <para>It is not particularly remarkable that...
      ...
      </para>

  This points to a nice modularity in the tools. Moreover, if the
  XPath passed to 'xpath' had wildcards in it, it might have
  matched more than just the one node. Unfortunately, the output of
  'indexer' does not have quite the right form to pipe to 'xpath',
  to automate looking at the nodes with matched words: 'indexer'
  separates the filename from the XPath with "::", and 'xpath' only
  looks at one XPath at a time. We can do better.

A FIRST LITTLE SHELL SCRIPT
------------------------------------------------------------------------

  There might be a way to manage the above "impedence mismatch"
  using clever combinations of 'xargs', 'apply', pipes, and the
  like.  But I found it easier to write a short (and reusable)
  shell script:

      #------------------ find-xml-elements --------------------#
      #!/bin/sh
      for hit in `indexer $@ 2> /dev/null`
      do
        echo $hit | sed 's/::/ /' > loc.tmp
        cat loc.tmp | xargs xpath 2> /dev/null
        echo
      done
      rm loc.tmp

  As with other well-designed command-line tools, 'indexer' and
  'xpath' send informational messages to STDERR, the actual results
  to STDOUT. For my script, I am not interested in the STDERR
  messages. Now I can find all the nodes in which a list of words
  occur as easily as:

      #------------ Searching XML elements for words -----------#
      % find-xml-elements events were
      <para>Lest we forget some events in a recent decade...
      ...
      Salem and by HUAC.</para>
      <para>It is not particularly remarkable that...
      ...
      being uncovered.</para>

  So far, so good.  What our search outputs is a series of XML
  snippets, where each top-level element contains the searched
  words.  However, the result is generally not quite a
  well-formed XML document, since it is multiply-rooted.

COMPARING XML DOCUMENT AND EXTRACTING TEXT
------------------------------------------------------------------------

  One difficulty in analyzing XML data is that XML documents can
  contain variations in formatting that are irrelevant to their
  semantic content. Some whitespace is "ignorable", the order of
  attributes is discarded during parsing, empty elements may be
  either self-closed or have an end-tag, and entities can be
  encoded in a few different ways. In truth, even much whitespace
  that is non-ignorable from a parser's perspective is nonetheless
  insignificant from an application point-of-view; "pretty"
  newlines and indenting are useful for people, and many
  applications (optionally) perform such stylistic formatting.

  There are a rather large number of tools that have been written
  to compare XML documents in a semantically useful way. Most of
  them have chosen the obvious name 'xmldiff', or something close
  to it (use Google to find versions for various programming
  languages). Underlying such a comparison of XML documents is a
  -canonicalization- of the layout of each document. Once
  inflexible algorithmic decisions have been made about the exact
  rendering of an XML document, semantically similar documents are
  easier to compare with generic tools like 'diff'.

  I use a Python script I wrote called 'xmlcat'. The tool is not
  complicated--it acts much like the standard 'cat' utility, but
  canonicalizes XML documents along the way. In a chance to use my
  favorite word, I can note that the operation of 'xmlcat' is
  idempotent. The reason I like 'xmlcat' over similar tools like
  'xmlpp' (see Resources) is that it adds an option inspired by the
  web browser 'lynx'. If you pass the '--dump' argument to
  'xmlcat', it outputs only the textual content of an XML document,
  eliminating the tags (using vertical whitespace is a moderately
  pretty way).  For data-oriented XML, this capability is of
  little use, but for marked-up prose, it is handy.

A SECOND SHELL SCRIPT FOR VIEWING TEXT
------------------------------------------------------------------------

  If you search XML documents of prose for content words, most
  likely you are interested in the content more than you are the
  markup.  Filtering with 'xmlcat --dump' is exactly the trick to
  remove unwanted XML tags.  However, directly piping the output
  of 'find-xml-elements' to 'xmlcat' is not quite right, since
  the output of 'find-xml-elements' is not quite an entire
  well-formed XML documents (it is fragments, as noted).  A short
  shell script solves the problem:

      #-------------------- find-xml-text ----------------------#
      #!/bin/sh
      for hit in `indexer $@ 2> /dev/null`
      do
        echo $hit | sed 's/::/ /' > loc.tmp
        cat loc.tmp | xargs xpath 2> /dev/null | xmlcat --dump
        echo
      done
      rm loc.tmp

  The output from 'find-xml-text' plays nice with standard text
  utilites.  For example, I would like to display all the
  paragraphs that contain some search terms, but remove any left
  indent from their lines and limit line-length:

      #--------- Searching XML element text for words ----------#
      % find-xml-text events were | sed 's/^ *//' | fmt -w 70
      Lest we forget some events in a recent decade...
      ...
      ...those in Salem and by HUAC.

      It is not particularly remarkable...
      ...
      ...being uncovered.

RESOURCES
------------------------------------------------------------------------

  Kip Hampton wrote a worthwhile article last year looking at
  Perl tools for command-line XML processing:

    http://www.xml.com/lpt/a/2002/04/17/perl-xml.html

  The Perl tools 'xmldiff' (compare XML documents) and 'xmlpp' (XML
  pretty printer) can be found at:

    http://software.decisionsoft.com/tools.html

  Gnosis Utilies includes several of the utilities discussed in
  this article, download it from:

    http://gnosis.cx/download/Gnosis_Utils-current.tar.gz

  _XML Matters #10_ discusses full text indexing of XML documents
  by XPath:

    http://www-106.ibm.com/developerworks/xml/library/x-matters10.html

ABOUT THE AUTHOR
------------------------------------------------------------------------

  {Picture of Author: http://gnosis.cx/cgi-bin/img_dqm.cgi} David
  Mertz uses a wholly unstructured brain to write about structured
  document formats. David may be reached at mertz@gnosis.cx; his
  life pored over at http://gnosis.cx/publish/. And buy his book:
  http://gnosis.cx/TPiP/.