XML MATTERS #13: XML and Compression -- Exploring the entropy of documents --

XML as a format has a lot of nice properties--it is a perfectly general way of representing arbitrary data structures, and it is human readable (more of less). But XML has one very notable unpleasant property. XML documents are VERBOSE; not just a little on the wordy side, but grotesquely, morbidly obese, almost unbelievably huge. Much of the time, this drawback of XML really makes no difference--DASD is cheap, and the time on the wire might be only a small part of the total time in the process. But at other times, bandwidth and storage space can be important.

To quantify things, it is not at all unusual for XML documents that represent table-oriented data to be three times as large as a CSV or database representation, or even than a flat file. A similar increase is typical in representing EDI data (such as for ebXML projects). In many of these context, one starts out with multi-megabyte data sources, and making them multiples larger can be inconvenient, especially for transmission purposes. For example, for purposes of this article, I created an XML representation of an approximately 1 megabyte Apache access logfile. This created an XML document 3.18 times the size of the underlying line-oriented log. The only information added in the process was some descriptive names for fields, but that could have also been specified in a single header line of less than a hundred bytes. Moreover, my specific XML representation did not include any namespaces in the tags, which would have increased the size further.

When one thinks about compressing documents, one normally thinks first of general compression algorithms like Lempel-Ziv and Huffman, and of the common utilities that implement variations on them. Specifically, on Unix-like platforms, what comes first to mind is usually the utility gzip; on other platforms, zip is more common (using utilities such as pkzip, info-zip and WinZip). gzip turns out to be quite consistently better than zip, but only by small margins. These utilities indeed tend substantially to reduce the size of XML files. But it also turns out that one can obtain considerably better compression rates by two means, either individually or in combination.

The first technique is to use the Burrows-Wheeler compression algorithm rather than Lempel-Ziv sequential algorithms. In particular, the somewhat less common utility bzip2 (or its associated libraries and APIs) is an implementation of Burrows-Wheeler for many system platforms (and is accompanied by full source and a BSD-style license). Burrows-Wheeler operates by grouping related strings in a uncompressed source, rather than in the Lempel-Ziv style of building up a dictionary of string occurences. bzip2 consistently obtains better compression than gzip across many file types, but the effect is especially dramatic for XML documents. On the down side, bzip2 is generally slower than gzip. Then again the slowness of bandwidth will very often swamp speed differences in CPU or memory-bound algorithms.

The second technique is to take advantage of the very specific structure of XML documents to produce more compressible representations. The XMill utility is one implementation of this technique, and it is available (with C++ source) under a liberal license from AT&T. XMill, however, seems to require certain click-through style limitations on its licensing, and cannot be distributed by other parties directly (at least as I understand it). I have created my own implementation of the same general technique, and the implementation is presented in this article. The code herein is released to the public domain, and the technique as implemented was developed independently and has only a "bird-eye view" similarity to XMill--sometimes XMill does better, and sometimes I do (but XMill is probably always faster than my initial implementation, which only pays attention to compression results).

Comparing Basic Algorithms

For purposes of comparison in this article I obtained or created four base documents. The first was Shakespeare's play Hamlet as an XML document (see resources). The markup includes tags such as <PERSONA>, <SPEAKER> and <LINE> which map fairly naturally to the typographic forms one might encounter in a printed copy. In order to make comparisons of just how the XML markup contributes to document size and compressibility, I derived from hamlet.xml a document hamlet.txt that simply removed all the XML tags, but left the content. This derivation is not reversible, and is a strict loss of information. A person reading hamlet.txt would not have a great deal of difficulty determining semantically which content is a "speaker" name and which a "line", for example, but there is no easy way a computer could regenerate the source XML document.

The next two documents are an Apache Weblog file (a compact set of line-oriented records) and an XML document created from this. Since the source document is the log file, no information is lost in the transformation, and it is trivial to recreate exactly the original format from the XML. Of course, it is not possible to use an XML parser, or DOM, or a validator, or a DTD, with the logfile format. Let's take a look at the sizes of the base documents:

Directory listing of base documents

 288754  hamlet.xml
 188830  hamlet.txt
 949474  weblog.txt
3021921  weblog.xml

In both cases, the XML is much larger. In the Hamlet example, the comparison is not entirely fair since the actual information content of the text version is also less. But for the Weblog file, the XML starts to look fairly bad. However, not everything is quite as it appears. Compression programs do a fairly good job of boiling documents down to their actual information content, and meaningless padding tends towards zero size in the compressed version (asymptotically with compression effort, and if all is happy). Let's try gzip, zip and bzip2:

Directory listing of compressed Shakespeare

  78581  hamlet.xml.gz
  72505  hamlet.txt.gz
  78696  hamlet.xml.zip
  72620  hamlet.txt.zip
  57522  hamlet.xml.bz2
  56743  hamlet.txt.bz2

Directory listing of compressed Weblog

  91029  weblog.txt.gz
 115524  weblog.xml.gz
  91144  weblog.txt.zip
 115639  weblog.xml.zip
  56156  weblog.txt.bz2
  66994  weblog.xml.bz2

There are a number of interesting things in the above sizes. For both styles of documents--for every compression technique--the size differences in compressed files is much smaller than between the XML versus text originals. It is also noteworthy that gzip and zip cluster very closely together for corresponding cases, while bzip2 does much better all the time. Moreover, when using bzip2 on the Shakespeare document, the compressed size difference between the text and XML formats is nearly negligible, despite the 53% larger size of the XML base document.

However, the Weblog stands out as as a problem case. While compression narrows the bloat of the XML conversion quite a bit, even the bzip2 version still lets the XML markup increase the size by about 20%. Not necessarily a disaster, but it feels like we should ideally be able to compress down to the true information content

Reversible Transformations

An XML document has a rather inefficient form when it comes to compression, actually. As we will see bzip2 somewhat alleviates this inefficency by regrouping strings. But at heart, XML documents are a jumble of fairly dissimilar parts--tags, attributes, elements bodies of different types. If we could take each relatively homogeneous set of things, and group them close to each other in a transformed file, standard compressors would have more to work with. For example, if every <host> tag-body in our Weblog occurs near the others, the block of stuff which contains the IP address of hosts will be easy to compress. The trick here is to come up with a way of transforming an XML document into something that contains all the same information, but structures the layout in such a compressor-friendly style.

The utilities xml2struct.py and struct2xml.py do exactly what is desired. The versions below have a few limitations, but demonstrate the principles involved. Some limitations are that each document is limited to 253 distinct tags, and that attributes and processing instructions are not handled. Fixing those limits should not change the jist of the results, however. XMill performs a similar transformation, but with some extra options and with fewer limitations.

Struct document format

1.  A list of tags that occur in the original XML document,
    separated by newline characters.
2.  A section delimeter: 0x00 (the null byte)
3.  A compact representation of the overall document
    structure, where each start tag is represented by a
    single byte, and the occurence of content is marked by
    a 0x02 byte.
4.  Another section delimeter: 0x00 (the null byte)
5.  The contents of all the elements indicated in the
    document structure schematic, grouped by element type.
    Each individual content item is delimited by a 0x02
    byte, while the start of elements of a new type is
    delimited by a 0x01 byte (the last not strictly needed,
    but it makes reversing the transformation easier).

Below is complete Python code to perform and reverse the described transformation. It would be simple to implement this transformation in another programming language also:

xml2struct.py

import sys
import xml.sax
from xml.sax.handler import *

class StructExtractor(ContentHandler):
    """Create a special structure/content form of an XML document"""
    def startDocument(self):
        self.taglist = []
        self.contentdct = {}
        self.state = []             # stack for tag state
        self.newstate = 0           # flag for continuing chars in same elem
        self.struct = []            # compact document structure

    def endDocument(self):
        sys.stdout.write('\n'.join(self.taglist))
                                    # Write out the taglist first
        sys.stdout.write(chr(0))    # section delimiter \0x00
        sys.stdout.write(''.join(self.struct))
                                    # Write out the structure list
        sys.stdout.write(chr(0))    # section delimiter \0x00
        for tag in self.taglist:    # Write all content lists
            sys.stdout.write(chr(2).join(self.contentdct[tag]))
            sys.stdout.write(chr(1)) # delimiter between content types

    def startElement(self, name, attrs):
        if not name in self.taglist:
            self.taglist.append(name)
            self.contentdct[name] = []
            if len(self.taglist) > 253:
                raise ValueError, "More than 253 tags encountered"
        self.state.append(name)     # push current tag
        self.newstate = 1           # chars go to new item
                                    # single char to indicate tag
        self.struct.append(chr(self.taglist.index(name)+3))

    def endElement(self, name):
        self.state.pop()            # pop current tag off stack
        self.newstate = 1           # chars go to new item
        self.struct.append(chr(1))  # \0x01 is endtag in struct

    def characters(self, ch):
        currstate = self.state[-1]
        if self.newstate:           # either add new chars to state item
            self.contentdct[currstate].append(ch)
            self.newstate = 0
            self.struct.append(chr(2))
                                    # \0x02 content placeholder in struct
        else:                       # or append the chars to current item
            self.contentdct[currstate][-1] += ch

if __name__ == '__main__':
    parser = xml.sax.make_parser()
    handler = StructExtractor()
    parser.setContentHandler(handler)
    parser.parse(sys.stdin)

Using SAX rather than DOM, makes this tranformation fairly time efficient, even though time was not a large consideration in developing it. To reverse the transformation:

struct2xml.py

def struct2xml(s):
    tags, struct, content = s.split(chr(0))
    taglist = tags.split('\n')      # all the tags
    contentlist = []                # list-of-lists of content items
    for block in content.split(chr(1)):
        contents = block.split(chr(2))
        contents.reverse()          # pop off content items from end
        contentlist.append(contents)
    state =  []                     # stack for tag state
    skeleton = []                   # templatized version of XML
    for c in struct:
        i = ord(c)
        if i >= 3:                  # start of element
            i -= 3                  # adjust for struct tag index offset
            tag = taglist[i]        # spell out the tag from taglist
            state.append(tag)       # push current tag
            skeleton.append('<%s>' % tag)
                                    # insert the element start tag
        elif i == 1:                # end of element
            tag = state.pop()       # pop current tag off stack
            skeleton.append('</%s>' % tag)
                                    # insert the element end tag
        elif i == 2:                # insert element content
            tag = state[-1]
            item = contentlist[taglist.index(tag)].pop()
            item = item.replace('&','&amp;')
            skeleton.append(item)   # add bare tag to indicate content
        else:
            raise ValueError, "Unexpected structure tag: ord(%d)" % i

    return ''.join(skeleton)

if __name__ == '__main__':
    import sys
    print struct2xml(sys.stdin.read()),

Compressing The Transforms

The real meat of the discussed tranformation comes when we try to compress the results. If all is as desired, foo.struct will be significantly more compressible than foo.xml, even though the two contain identical information (which is provable since they are symmetrically derivable). In a sense, xml2struct.py is already a sort of compression algorithm (it produces somewhat smaller files), but the real point is not to use it directly but to aid further compression.

Let's look at some sizes, including a few repeated from above. Some results from XMill are thrown in for comparison, these include the name .xmi. (XMill is available in versions using gzip and bzip2 algorithms):

Directory listing of "structured XML"

 228610  hamlet.struct
  57533  hamlet.struct.bz2
  57522  hamlet.xml.bz2
  71060  hamlet.struct.gz
  78581  hamlet.xml.gz
  61823  hamlet.xmi.bz2
  75247  hamlet.xmi.gz

The results on this prose document are somewhat mixed. "Restructuring" the XML document assists gzip quite a bit. But it turns out that plain old bzip2 on the original XML file does 11 bytes better at generating a compressible structure than do my attempts. Of course, I am comforted that XMill behaves similarly--but worse--than my tranformation.

Directory listing 2 of "structured XML"

1764816  weblog.struct
  59955  weblog.struct.bz2
  66994  weblog.xml.bz2
  56156  weblog.txt.bz2
  76183  weblog.struct.gz
 115524  weblog.xml.gz
  59409  weblog.xmi.bz2
  74439  weblog.xmi.gz

As before, restructuring the XML Weblog aids gzip compression extremely significantly. But since gzip is not our preferred technique anymore, this is only moderately interesting. What is of genuine interest is that we have also improved the compression rate of the--already wonderful--'bzip2' by 11%. While maybe not earth-shattering, this is enough to matter when worrying about megabytes. For what it's worth, XMill does a bit better than xml2struct.py in this case. What is further interesting is that our compression of this restructured XML is within 7% of the best obtained compression of the original textual Weblog.

Conclusion

The utility presented here is a preliminary attempt, but even in early form it does surprisingly well--at least in some cases--of squeezing those last bytes out of compressed XML files. With a little refinement and experimentation, I expect that a few percent more reduction could be obtained. Part of what makes writing this utility hard is that bzip2 does such a good job to start with. I was honestly surprised by just how effective the Burrows-Wheeler algorithm was when I started empirical testing.

There are some commercial utilities that attempt to perform XML compression in a manner that utilizes knowledge of the specific DTDs of compressed documents. It is quite likely that these techniques obtain additional compression. However, xml2struct.py and XMill have the nice advantage of being simple command line tools that one can transparently apply to XML files. Custom programming of every compression is not always desirable or possible--but where it is, squeezing out even more bytes might be an obtainable goal.

Resources

Much of the inspiration for this article comes from the work of the XMill XML compressor. Information and a downloadable version can be found at the below link. The license requires a click-through, and the download page unfortunately seems to have a buggy script that does not allows downloads from all sites.

The complete plays of Shakespeare can be found in XML form at the below resource. The document hamlet.xml used for testing purposes was obtained there:

The 1994 paper A Block-sorting Lossless Data Compression Algorithm, by M. Burrows and D.J. Wheeler, introduced the algorithm known now as Burrows-Wheeler. This technique is implemented in the fairly popular bzip2 utility:

Many Unix-like systems include bzip2 as part of their standard distribution. For other platforms--or for newer versions--'bzip2' can be found at:

I wrote what I believe is a good general introduction to data compression. It can be found at:

About The Author

David Mertz believes that less is more David may be reached at [email protected]; his life pored over at http://gnosis.cx/publish/. Suggestions and recommendations on this, past, or future, columns are welcomed.

Xml Matters #13: Xml And Compression

Exploring the entropy of documents

Introduction