David Mertz, Ph.D.
Objectifier, Gnosis Software, Inc.
April 2001
As XML becomes popular as a document storage format, especially for prose-oriented documents, the task of locating contents within XML document collections becomes more difficult. This column extends the generic full text indexer presented in David's Charming Python #15 to include XML-specific search and indexing features. Discussions of how indexing can take advantage of the hierarchical node structure of XML is addressed in the tool design and in the discussion.
In a technical sense, XML is a simplification and specialization of SGML. At a first approximation, XML documents should also be valid SGML documents. SGML, in turn, has been used extensively for large-scale documents, in both corporate and government circles. One is likely to come across multi-thousand page, multi-megabyte, documents for, e.g. product specifications, regulatory requirements, and computer system documentation in SGML formats.
Culturally, however, XML has evolved from a different direction. In one respect, XML is a successor for EDI; in another respect, a successor for HTML. But having a different cultural history than SGML, XML is undergoing its own process of tool development. And at the same time, as XML is becoming more popular, expect to see more and more of both (usally) informal HTML documents and (usually) formal SGML documents to migrate in the direction of XML formats--particularly using XML dialects like DocBook.
One tool that XML has not yet grown within its own culture is
one to effectively and efficiently locate content within large
XML documents, or within large collections of XML documents.
General file-search tools like grep
on Unix, and other
similar tools on other platforms, are perfectly able to read
the plain text of XML documents (modulo possible Unicode
issues), but a simple grep
search (or even a complicated one)
miss the structure of XML documents.
If files contain thousands of pages of documentation, just
determining that a given file contains a word, phrase, or
regular expression only very loosely targets what you are
likely to know when searching for content in them. Just which
of those agriculture reports, for example, was written by Ms.
June Apple? One imagines a coarse tool like grep
would tend
to find a lot of things that are not the ones of interest.
Moreover, ad hoc tools like grep
, while very efficient at
what they do, need to check the entire contents of large files
each time a search is performed. For frequent searches,
repeated full-file searching is inefficient.
indexer
In response to the need outlined above, I have created the
public-domain utility xml_indexer
. This Python module can
both be used as a runtime utility and also easily extended by
custom applications that use its services. xml_indexer
, in
turn, relies on the services of two earlier public-domain
utilities I have written about in IBM developerWorks articles:
indexer
and xml_objectify
(see Resources).
The "trick" xml_indexer
uses is the same one XPATH uses.
Rather than treat XML documents as simply things in the
filesystem, we can pretend that the hierarchical nodes of an
XML document themselves look much like a hierarchical
filesystem. For purposes of indexing, other than a need for a
little syntax to distinguish an XPATH from a filesystem path,
we can simply treat an XML node as if it were itself a text
file. Fortunately, indexer
was designed with enough
flexibility to use arbitrary identifiers in indexing texts.
Let's look at some search results:
[D:\articles] indexer ibm /articles/tutor/cryptology3.xml::/section[1]/panel[2]/body/text_column/p[1] /temp/Benchmark/Data/addr2.xml::/person[4]/contact_info/email/@address /temp/Benchmark/Data/addr2.xml::/person[2]/contact_info/email/@address /tools/addr2.xml::/person[4]/contact_info/email/@address /tools/addr2.xml::/person[2]/contact_info/email/@address 5 file matched wordlist: ['ibm'] Processed in 0.320 seconds (SlicedZPickleIndexer)
As with XPATH, attribute values are preceded by an @
mark,
and sibling nodes are enumerated within square brackets. The
filesystem path to an XML document acts, in this context, like
an XPATH axis--roughly as a namespace. For comparison, let's
peform a similar indexed search against a file database (some
additional search terms are used to keep the result list
reasonable):
[D:\articles] indexer ibm python xml indexer D:\archive\mail\messages\tenco.cp15.2001-03-06.13+50+35 D:\archive\mail\messages\tenco.cp15.2001-03-01.07+57+26 D:\archive\mail\messages\tenco.cp15.2001-02-28.23+25+26 3 file matched wordlist: ['ibm', 'python', 'xml', 'indexer'] Processed in 2.530 seconds (SlicedZPickleIndexer)
While the first search is against a fairly trivial amount of test data, the second search uses a "production" index against about 100MB of archived email messages (stored in the filesystem, one message per file). Taking just a couple seconds to search 100MB of files (for multiple simultaneous word occurrences) is quite fast, methinks.
Moreover, while these searches utilize different index
databases (because they were done during a testing stage of
xml_indexer
), there is no reason that a compound index of
text files and XML nodes cannot be created. In such a case, it
is even possible (and probably often useful) to index each XML
file both as a collection of nodes and as a plain file.
After doing so, search results will show both types of
identifier, with the filesystem identifier obviously occurring
in every case that an XPATH in its namespace does. For
example:
[D:\articles] indexer actresses /temp/Benchmark/Data/addr_break.xml /temp/Benchmark/Data/addr_break.xml::/person[3]/misc_info 2 file matched wordlist: ['actresses'] Processed in 0.070 seconds (SlicedZPickleIndexer)
In the above examples, readers will have noticed that in the
examples indexer
was used to perform searches, with no
mention of xml_indexer
. This is because the very same index
search tool is used for searching index databases created by
xml_indexer
as for those create by indexer
. In fact
indexer
is simply a call to python indexer.py ...
with the
command-line arguments passed in an OS-appropriate manner.
Creating or enhancing text-file indexes is also performed by
indexer
--run indexer --help
or indexer /?
to get a
breakdown on the needed arguments and switches. You can
recurse across directories when you add files to an index, and
also limited filename patterns in various ways.
At least for now, XML-node index databases are created using
the simpler xml_indexer.py
script. As of this writing, just
a single XML document's nodes are added to an index database
at a time, by specifying its name as a command-line argument.
However, by the time you read this, the command-line syntax for
xml_indexer.py
will probably be enhanced to look more like
that for indexer.py
. Take a look at the output of
python xml_indexer.py --help
before using it.
In order to give search results XPATH wildcard capabilities, a
-filter
option has been added to indexer
. XPATH functions,
however, are not currently supported in search results. As a
transparent, and beneficial, side-effect, this same switch can
be used for filename globbing--just in case you are only
interested in matching files fulfilling some patterns.
Basically, the /filter
option works exactly as you might
expect (adjust for different quoting syntax across shells).
You can specify that you are only interested in XPATH results
using the double colon in the filter.
[D:\articles] indexer "/filter=*::*" actresses /temp/Benchmark/Data/addr_break.xml::/person[3]/misc_info 1 file matched wordlist: ['actresses'] Processed in 0.050 seconds (SlicedZPickleIndexer)
[D:\articles] indexer "/filter=*.xml" actresses /temp/Benchmark/Data/addr_break.xml 1 file matched wordlist: ['actresses'] Processed in 0.050 seconds (SlicedZPickleIndexer)
More complicated XPATH specifiers are possible by specifying the subelements and order required:
[D:\articles] indexer symmetric /tutor/cryptology1.xml::/section[2]/panel[8]/title /tutor/cryptology1.xml::/section[2]/panel[8]/body/text_column/code_listing /tutor/cryptology1.xml::/section[2]/panel[7]/title /tutor/cryptology1.xml::/section[2]/panel[7]/body/text_column/p[1] 4 file matched wordlist: ['symmetric'] Processed in 0.100 seconds (SlicedZPickleIndexer)
[D:\articles] indexer "-filter=*::/*/title" symmetric /tutor/cryptology1.xml::/section[2]/panel[8]/title /tutor/cryptology1.xml::/section[2]/panel[7]/title 2 file matched wordlist: ['symmetric'] Processed in 0.080 seconds (SlicedZPickleIndexer)
It turned out that the design of xml_indexer
was aided
enormously by the object-oriented design principles that went
into indexer
before it. With the overriding of just a few
methods in the GenericIndexer
class (actually, in its
descendent SlicedPickleIndexer
--but one could just as easily
mix in any concrete Indexer class), the use of an entirely new
set of identifiers and data source was possible.
Readers who wish to use xml_indexer
as part of their own
larger Python projects should find its further specialization
equally simple. I look forward to seeing the uses readers are
able to put these helpful base index classes to.
The xml_indexer
module may be downloaded from:
http://gnosis.cx/download/xml_indexer.py
A general background discussion of the indexer
module is
contained in Charming Python #15: Developing a Full-Text
Indexer in Python:
http://gnosis.cx/publish/programming/charming_python_15.html
The indexer
module itself may be found at:
http://gnosis.cx/download/indexer.py
In order to easily descend recursively through XML nodes, I
utilized the high-level Pythonic interface provided by
xml_objectify
. However, it should be noted that until
recently, this option would not have been practical. Older
versions of xml_objectify
used DOM to read XML files, which
proves embarassingly slow for large XML documents (part of the
blame is on the way xml_objectify
handles this DOM). Costas
Malamas has provided an alternative parsing method that uses
the expat
parser and stream-oriented techniques. This new
technique still has a few hickups with some complicated XML
documents, but in most cases works fine, and multiple orders of
magnitude faster. You can find xml_objectify
at:
http://gnosis.cx/download/xml_objectify.py
David Mertz must have mislaid his MacGuffin in one of his other articles. It is bound to show up again soon. David may be reached at [email protected]; his life pored over at http://gnosis.cx/publish/. Suggestions and recommendations on this, past, or future, columns are welcomed.