David Mertz, Ph.D.
Interior Designer, Gnosis Software, Inc.
August 2001
This article looks at a number of approaches to wrapping compression around XML documents. The special structures in XML tend to allow certain improvements over the most naive compression techniques. Code exploring several techniques is provided in the article.
XML as a format has a lot of nice properties--it is a perfectly general way of representing arbitrary data structures, and it is human readable (more of less). But XML has one very notable unpleasant property. XML documents are VERBOSE; not just a little on the wordy side, but grotesquely, morbidly obese, almost unbelievably huge. Much of the time, this drawback of XML really makes no difference--DASD is cheap, and the time on the wire might be only a small part of the total time in the process. But at other times, bandwidth and storage space can be important.
To quantify things, it is not at all unusual for XML documents that represent table-oriented data to be three times as large as a CSV or database representation, or even than a flat file. A similar increase is typical in representing EDI data (such as for ebXML projects). In many of these context, one starts out with multi-megabyte data sources, and making them multiples larger can be inconvenient, especially for transmission purposes. For example, for purposes of this article, I created an XML representation of an approximately 1 megabyte Apache access logfile. This created an XML document 3.18 times the size of the underlying line-oriented log. The only information added in the process was some descriptive names for fields, but that could have also been specified in a single header line of less than a hundred bytes. Moreover, my specific XML representation did not include any namespaces in the tags, which would have increased the size further.
When one thinks about compressing documents, one normally
thinks first of general compression algorithms like Lempel-Ziv
and Huffman, and of the common utilities that implement
variations on them. Specifically, on Unix-like platforms, what
comes first to mind is usually the utility gzip
; on other
platforms, zip
is more common (using utilities such as
pkzip
, info-zip
and WinZip
). gzip
turns out to be
quite consistently better than zip
, but only by small
margins. These utilities indeed tend substantially to reduce
the size of XML files. But it also turns out that one can
obtain considerably better compression rates by two means,
either individually or in combination.
The first technique is to use the Burrows-Wheeler compression
algorithm rather than Lempel-Ziv sequential algorithms. In
particular, the somewhat less common utility bzip2
(or its
associated libraries and APIs) is an implementation of
Burrows-Wheeler for many system platforms (and is accompanied
by full source and a BSD-style license). Burrows-Wheeler
operates by grouping related strings in a uncompressed source,
rather than in the Lempel-Ziv style of building up a dictionary
of string occurences. bzip2
consistently obtains better
compression than gzip
across many file types, but the effect
is especially dramatic for XML documents. On the down side,
bzip2
is generally slower than gzip
. Then again the
slowness of bandwidth will very often swamp speed differences
in CPU or memory-bound algorithms.
The second technique is to take advantage of the very specific structure of XML documents to produce more compressible representations. The XMill utility is one implementation of this technique, and it is available (with C++ source) under a liberal license from AT&T. XMill, however, seems to require certain click-through style limitations on its licensing, and cannot be distributed by other parties directly (at least as I understand it). I have created my own implementation of the same general technique, and the implementation is presented in this article. The code herein is released to the public domain, and the technique as implemented was developed independently and has only a "bird-eye view" similarity to XMill--sometimes XMill does better, and sometimes I do (but XMill is probably always faster than my initial implementation, which only pays attention to compression results).
For purposes of comparison in this article I obtained or
created four base documents. The first was Shakespeare's play
Hamlet as an XML document (see resources). The markup
includes tags such as <PERSONA>
, <SPEAKER>
and <LINE>
which map fairly naturally to the typographic forms one might
encounter in a printed copy. In order to make comparisons of
just how the XML markup contributes to document size and
compressibility, I derived from hamlet.xml
a document
hamlet.txt
that simply removed all the XML tags, but left the
content. This derivation is not reversible, and is a strict
loss of information. A person reading hamlet.txt
would not
have a great deal of difficulty determining semantically which
content is a "speaker" name and which a "line", for example,
but there is no easy way a computer could regenerate the source
XML document.
The next two documents are an Apache Weblog file (a compact set of line-oriented records) and an XML document created from this. Since the source document is the log file, no information is lost in the transformation, and it is trivial to recreate exactly the original format from the XML. Of course, it is not possible to use an XML parser, or DOM, or a validator, or a DTD, with the logfile format. Let's take a look at the sizes of the base documents:
288754 hamlet.xml 188830 hamlet.txt 949474 weblog.txt 3021921 weblog.xml
In both cases, the XML is much larger. In the Hamlet
example, the comparison is not entirely fair since the actual
information content of the text version is also less. But for
the Weblog file, the XML starts to look fairly bad. However,
not everything is quite as it appears. Compression programs do
a fairly good job of boiling documents down to their actual
information content, and meaningless padding tends towards zero
size in the compressed version (asymptotically with compression
effort, and if all is happy). Let's try gzip
, zip
and
bzip2
:
78581 hamlet.xml.gz 72505 hamlet.txt.gz 78696 hamlet.xml.zip 72620 hamlet.txt.zip 57522 hamlet.xml.bz2 56743 hamlet.txt.bz2
-
91029 weblog.txt.gz 115524 weblog.xml.gz 91144 weblog.txt.zip 115639 weblog.xml.zip 56156 weblog.txt.bz2 66994 weblog.xml.bz2
There are a number of interesting things in the above sizes.
For both styles of documents--for every compression
technique--the size differences in compressed files is much
smaller than between the XML versus text originals. It is also
noteworthy that gzip
and zip
cluster very closely together
for corresponding cases, while bzip2
does much better all the
time. Moreover, when using bzip2
on the Shakespeare
document, the compressed size difference between the text and
XML formats is nearly negligible, despite the 53% larger size
of the XML base document.
However, the Weblog stands out as as a problem case. While
compression narrows the bloat of the XML conversion quite a
bit, even the bzip2
version still lets the XML markup
increase the size by about 20%. Not necessarily a disaster,
but it feels like we should ideally be able to compress down to
the true information content
An XML document has a rather inefficient form when it comes to
compression, actually. As we will see bzip2
somewhat
alleviates this inefficency by regrouping strings. But at
heart, XML documents are a jumble of fairly dissimilar
parts--tags, attributes, elements bodies of different types.
If we could take each relatively homogeneous set of things, and
group them close to each other in a transformed file, standard
compressors would have more to work with. For example, if
every <host>
tag-body in our Weblog occurs near the others,
the block of stuff which contains the IP address of hosts will
be easy to compress. The trick here is to come up with a way
of transforming an XML document into something that contains
all the same information, but structures the layout in such a
compressor-friendly style.
The utilities xml2struct.py
and struct2xml.py
do exactly
what is desired. The versions below have a few limitations,
but demonstrate the principles involved. Some limitations are
that each document is limited to 253 distinct tags, and that
attributes and processing instructions are not handled. Fixing
those limits should not change the jist of the results,
however. XMill performs a similar transformation, but with
some extra options and with fewer limitations.
The general format of a "struct" document is as follows:
1. A list of tags that occur in the original XML document, separated by newline characters. 2. A section delimeter: 0x00 (the null byte) 3. A compact representation of the overall document structure, where each start tag is represented by a single byte, and the occurence of content is marked by a 0x02 byte. 4. Another section delimeter: 0x00 (the null byte) 5. The contents of all the elements indicated in the document structure schematic, grouped by element type. Each individual content item is delimited by a 0x02 byte, while the start of elements of a new type is delimited by a 0x01 byte (the last not strictly needed, but it makes reversing the transformation easier).
Below is complete Python code to perform and reverse the described transformation. It would be simple to implement this transformation in another programming language also:
import sys import xml.sax from xml.sax.handler import * class StructExtractor(ContentHandler): """Create a special structure/content form of an XML document""" def startDocument(self): self.taglist = [] self.contentdct = {} self.state = [] # stack for tag state self.newstate = 0 # flag for continuing chars in same elem self.struct = [] # compact document structure def endDocument(self): sys.stdout.write('\n'.join(self.taglist)) # Write out the taglist first sys.stdout.write(chr(0)) # section delimiter \0x00 sys.stdout.write(''.join(self.struct)) # Write out the structure list sys.stdout.write(chr(0)) # section delimiter \0x00 for tag in self.taglist: # Write all content lists sys.stdout.write(chr(2).join(self.contentdct[tag])) sys.stdout.write(chr(1)) # delimiter between content types def startElement(self, name, attrs): if not name in self.taglist: self.taglist.append(name) self.contentdct[name] = [] if len(self.taglist) > 253: raise ValueError, "More than 253 tags encountered" self.state.append(name) # push current tag self.newstate = 1 # chars go to new item # single char to indicate tag self.struct.append(chr(self.taglist.index(name)+3)) def endElement(self, name): self.state.pop() # pop current tag off stack self.newstate = 1 # chars go to new item self.struct.append(chr(1)) # \0x01 is endtag in struct def characters(self, ch): currstate = self.state[-1] if self.newstate: # either add new chars to state item self.contentdct[currstate].append(ch) self.newstate = 0 self.struct.append(chr(2)) # \0x02 content placeholder in struct else: # or append the chars to current item self.contentdct[currstate][-1] += ch if __name__ == '__main__': parser = xml.sax.make_parser() handler = StructExtractor() parser.setContentHandler(handler) parser.parse(sys.stdin)
Using SAX rather than DOM, makes this tranformation fairly time efficient, even though time was not a large consideration in developing it. To reverse the transformation:
def struct2xml(s): tags, struct, content = s.split(chr(0)) taglist = tags.split('\n') # all the tags contentlist = [] # list-of-lists of content items for block in content.split(chr(1)): contents = block.split(chr(2)) contents.reverse() # pop off content items from end contentlist.append(contents) state = [] # stack for tag state skeleton = [] # templatized version of XML for c in struct: i = ord(c) if i >= 3: # start of element i -= 3 # adjust for struct tag index offset tag = taglist[i] # spell out the tag from taglist state.append(tag) # push current tag skeleton.append('<%s>' % tag) # insert the element start tag elif i == 1: # end of element tag = state.pop() # pop current tag off stack skeleton.append('</%s>' % tag) # insert the element end tag elif i == 2: # insert element content tag = state[-1] item = contentlist[taglist.index(tag)].pop() item = item.replace('&','&') skeleton.append(item) # add bare tag to indicate content else: raise ValueError, "Unexpected structure tag: ord(%d)" % i return ''.join(skeleton) if __name__ == '__main__': import sys print struct2xml(sys.stdin.read()),
The real meat of the discussed tranformation comes when we try
to compress the results. If all is as desired, foo.struct
will be significantly more compressible than foo.xml
, even
though the two contain identical information (which is provable
since they are symmetrically derivable). In a sense,
xml2struct.py
is already a sort of compression algorithm (it
produces somewhat smaller files), but the real point is not to
use it directly but to aid further compression.
Let's look at some sizes, including a few repeated from above.
Some results from XMill are thrown in for comparison, these
include the name .xmi.
(XMill is available in versions
using gzip
and bzip2
algorithms):
228610 hamlet.struct 57533 hamlet.struct.bz2 57522 hamlet.xml.bz2 71060 hamlet.struct.gz 78581 hamlet.xml.gz 61823 hamlet.xmi.bz2 75247 hamlet.xmi.gz
The results on this prose document are somewhat mixed.
"Restructuring" the XML document assists gzip
quite a bit.
But it turns out that plain old bzip2
on the original XML
file does 11 bytes better at generating a compressible
structure than do my attempts. Of course, I am comforted that
XMill behaves similarly--but worse--than my tranformation.
We do a quite a bit better with Weblog files. Here it actually pays off.
1764816 weblog.struct 59955 weblog.struct.bz2 66994 weblog.xml.bz2 56156 weblog.txt.bz2 76183 weblog.struct.gz 115524 weblog.xml.gz 59409 weblog.xmi.bz2 74439 weblog.xmi.gz
As before, restructuring the XML Weblog aids gzip
compression extremely significantly. But since gzip
is not
our preferred technique anymore, this is only moderately
interesting. What is of genuine interest is that we have also
improved the compression rate of the--already
wonderful--'bzip2' by 11%. While maybe not earth-shattering,
this is enough to matter when worrying about megabytes. For
what it's worth, XMill does a bit better than xml2struct.py
in this case. What is further interesting is that our
compression of this restructured XML is within 7% of the best
obtained compression of the original textual Weblog.
The utility presented here is a preliminary attempt, but even
in early form it does surprisingly well--at least in some
cases--of squeezing those last bytes out of compressed XML
files. With a little refinement and experimentation, I expect
that a few percent more reduction could be obtained. Part of
what makes writing this utility hard is that bzip2
does
such a good job to start with. I was honestly surprised by
just how effective the Burrows-Wheeler algorithm was when I
started empirical testing.
There are some commercial utilities that attempt to perform XML
compression in a manner that utilizes knowledge of the specific
DTDs of compressed documents. It is quite likely that these
techniques obtain additional compression. However,
xml2struct.py
and XMill have the nice advantage of being
simple command line tools that one can transparently apply to
XML files. Custom programming of every compression is not
always desirable or possible--but where it is, squeezing out
even more bytes might be an obtainable goal.
Much of the inspiration for this article comes from the work of the XMill XML compressor. Information and a downloadable version can be found at the below link. The license requires a click-through, and the download page unfortunately seems to have a buggy script that does not allows downloads from all sites.
http://www.research.att.com/sw/tools/xmill/
The complete plays of Shakespeare can be found in XML form at
the below resource. The document hamlet.xml
used for testing
purposes was obtained there:
http://www.ibiblio.org/xml/examples/shakespeare/
The 1994 paper A Block-sorting Lossless Data Compression
Algorithm, by M. Burrows and D.J. Wheeler, introduced the
algorithm known now as Burrows-Wheeler. This technique is
implemented in the fairly popular bzip2
utility:
http://gatekeeper.dec.com/pub/DEC/SRC/research-reports/abstracts/src-rr-124.html
Many Unix-like systems include bzip2
as part of their
standard distribution. For other platforms--or for newer
versions--'bzip2' can be found at:
http://sources.redhat.com/bzip2/
I wrote what I believe is a good general introduction to data compression. It can be found at:
http://www.gnosis.cx/publish/programming/compression_primer.html
David Mertz believes that less is more David may be reached at [email protected]; his life pored over at http://gnosis.cx/publish/. Suggestions and recommendations on this, past, or future, columns are welcomed.