David Mertz, Ph.D.
Objectifier, Gnosis Software, Inc.
As XML becomes popular as a document storage format, especially for prose-oriented documents, the task of locating contents within XML document collections becomes more difficult. This column extends the generic full text indexer presented in David's Charming Python #15 to include XML-specific search and indexing features. Discussions of how indexing can take advantage of the hierarchical node structure of XML is addressed in the tool design and in the discussion.
In a technical sense, XML is a simplification and specialization of SGML. At a first approximation, XML documents should also be valid SGML documents. SGML, in turn, has been used extensively for large-scale documents, in both corporate and government circles. One is likely to come across multi-thousand page, multi-megabyte, documents for, e.g. product specifications, regulatory requirements, and computer system documentation in SGML formats.
Culturally, however, XML has evolved from a different direction. In one respect, XML is a successor for EDI; in another respect, a successor for HTML. But having a different cultural history than SGML, XML is undergoing its own process of tool development. And at the same time, as XML is becoming more popular, expect to see more and more of both (usally) informal HTML documents and (usually) formal SGML documents to migrate in the direction of XML formats--particularly using XML dialects like DocBook.
One tool that XML has not yet grown within its own culture is
one to effectively and efficiently locate content within large
XML documents, or within large collections of XML documents.
General file-search tools like
grep on Unix, and other
similar tools on other platforms, are perfectly able to read
the plain text of XML documents (modulo possible Unicode
issues), but a simple
grep search (or even a complicated one)
miss the structure of XML documents.
If files contain thousands of pages of documentation, just
determining that a given file contains a word, phrase, or
regular expression only very loosely targets what you are
likely to know when searching for content in them. Just which
of those agriculture reports, for example, was written by Ms.
June Apple? One imagines a coarse tool like
grep would tend
to find a lot of things that are not the ones of interest.
Moreover, ad hoc tools like
grep, while very efficient at
what they do, need to check the entire contents of large files
each time a search is performed. For frequent searches,
repeated full-file searching is inefficient.
In response to the need outlined above, I have created the
xml_indexer. This Python module can
both be used as a runtime utility and also easily extended by
custom applications that use its services.
turn, relies on the services of two earlier public-domain
utilities I have written about in IBM developerWorks articles:
xml_objectify (see Resources).
xml_indexer uses is the same one XPATH uses.
Rather than treat XML documents as simply things in the
filesystem, we can pretend that the hierarchical nodes of an
XML document themselves look much like a hierarchical
filesystem. For purposes of indexing, other than a need for a
little syntax to distinguish an XPATH from a filesystem path,
we can simply treat an XML node as if it were itself a text
indexer was designed with enough
flexibility to use arbitrary identifiers in indexing texts.
Let's look at some search results:
[D:\articles] indexer ibm /articles/tutor/cryptology3.xml::/section/panel/body/text_column/p /temp/Benchmark/Data/addr2.xml::/person/contact_info/email/@address /temp/Benchmark/Data/addr2.xml::/person/contact_info/email/@address /tools/addr2.xml::/person/contact_info/email/@address /tools/addr2.xml::/person/contact_info/email/@address 5 file matched wordlist: ['ibm'] Processed in 0.320 seconds (SlicedZPickleIndexer)
As with XPATH, attribute values are preceded by an
and sibling nodes are enumerated within square brackets. The
filesystem path to an XML document acts, in this context, like
an XPATH axis--roughly as a namespace. For comparison, let's
peform a similar indexed search against a file database (some
additional search terms are used to keep the result list
[D:\articles] indexer ibm python xml indexer D:\archive\mail\messages\tenco.cp15.2001-03-06.13+50+35 D:\archive\mail\messages\tenco.cp15.2001-03-01.07+57+26 D:\archive\mail\messages\tenco.cp15.2001-02-28.23+25+26 3 file matched wordlist: ['ibm', 'python', 'xml', 'indexer'] Processed in 2.530 seconds (SlicedZPickleIndexer)
While the first search is against a fairly trivial amount of test data, the second search uses a "production" index against about 100MB of archived email messages (stored in the filesystem, one message per file). Taking just a couple seconds to search 100MB of files (for multiple simultaneous word occurrences) is quite fast, methinks.
Moreover, while these searches utilize different index
databases (because they were done during a testing stage of
xml_indexer), there is no reason that a compound index of
text files and XML nodes cannot be created. In such a case, it
is even possible (and probably often useful) to index each XML
file both as a collection of nodes and as a plain file.
After doing so, search results will show both types of
identifier, with the filesystem identifier obviously occurring
in every case that an XPATH in its namespace does. For
[D:\articles] indexer actresses /temp/Benchmark/Data/addr_break.xml /temp/Benchmark/Data/addr_break.xml::/person/misc_info 2 file matched wordlist: ['actresses'] Processed in 0.070 seconds (SlicedZPickleIndexer)
In the above examples, readers will have noticed that in the
indexer was used to perform searches, with no
xml_indexer. This is because the very same index
search tool is used for searching index databases created by
xml_indexer as for those create by
indexer. In fact
indexer is simply a call to
python indexer.py ... with the
command-line arguments passed in an OS-appropriate manner.
Creating or enhancing text-file indexes is also performed by
indexer --help or
indexer /? to get a
breakdown on the needed arguments and switches. You can
recurse across directories when you add files to an index, and
also limited filename patterns in various ways.
At least for now, XML-node index databases are created using
xml_indexer.py script. As of this writing, just
a single XML document's nodes are added to an index database
at a time, by specifying its name as a command-line argument.
However, by the time you read this, the command-line syntax for
xml_indexer.py will probably be enhanced to look more like
indexer.py. Take a look at the output of
python xml_indexer.py --help before using it.
In order to give search results XPATH wildcard capabilities, a
-filter option has been added to
indexer. XPATH functions,
however, are not currently supported in search results. As a
transparent, and beneficial, side-effect, this same switch can
be used for filename globbing--just in case you are only
interested in matching files fulfilling some patterns.
/filter option works exactly as you might
expect (adjust for different quoting syntax across shells).
You can specify that you are only interested in XPATH results
using the double colon in the filter.
[D:\articles] indexer "/filter=*::*" actresses /temp/Benchmark/Data/addr_break.xml::/person/misc_info 1 file matched wordlist: ['actresses'] Processed in 0.050 seconds (SlicedZPickleIndexer)
[D:\articles] indexer "/filter=*.xml" actresses /temp/Benchmark/Data/addr_break.xml 1 file matched wordlist: ['actresses'] Processed in 0.050 seconds (SlicedZPickleIndexer)
More complicated XPATH specifiers are possible by specifying the subelements and order required:
[D:\articles] indexer symmetric /tutor/cryptology1.xml::/section/panel/title /tutor/cryptology1.xml::/section/panel/body/text_column/code_listing /tutor/cryptology1.xml::/section/panel/title /tutor/cryptology1.xml::/section/panel/body/text_column/p 4 file matched wordlist: ['symmetric'] Processed in 0.100 seconds (SlicedZPickleIndexer)
[D:\articles] indexer "-filter=*::/*/title" symmetric /tutor/cryptology1.xml::/section/panel/title /tutor/cryptology1.xml::/section/panel/title 2 file matched wordlist: ['symmetric'] Processed in 0.080 seconds (SlicedZPickleIndexer)
It turned out that the design of
xml_indexer was aided
enormously by the object-oriented design principles that went
indexer before it. With the overriding of just a few
methods in the
GenericIndexer class (actually, in its
SlicedPickleIndexer--but one could just as easily
mix in any concrete Indexer class), the use of an entirely new
set of identifiers and data source was possible.
Readers who wish to use
xml_indexer as part of their own
larger Python projects should find its further specialization
equally simple. I look forward to seeing the uses readers are
able to put these helpful base index classes to.
xml_indexer module may be downloaded from:
A general background discussion of the
indexer module is
contained in Charming Python #15: Developing a Full-Text
Indexer in Python:
indexer module itself may be found at:
In order to easily descend recursively through XML nodes, I
utilized the high-level Pythonic interface provided by
xml_objectify. However, it should be noted that until
recently, this option would not have been practical. Older
xml_objectify used DOM to read XML files, which
proves embarassingly slow for large XML documents (part of the
blame is on the way
xml_objectify handles this DOM). Costas
Malamas has provided an alternative parsing method that uses
expat parser and stream-oriented techniques. This new
technique still has a few hickups with some complicated XML
documents, but in most cases works fine, and multiple orders of
magnitude faster. You can find
David Mertz must have mislaid his MacGuffin in one of his other articles. It is bound to show up again soon. David may be reached at firstname.lastname@example.org; his life pored over at http://gnosis.cx/publish/. Suggestions and recommendations on this, past, or future, columns are welcomed.