XML MATTERS #13: XML and Compression Exploring the entropy of documents David Mertz, Ph.D. Interior Designer, Gnosis Software, Inc. August 2001 This article looks at a number of approaches to wrapping compression around XML documents. The special structures in XML tend to allow certain improvements over the most naive compression techniques. Code exploring several techniques is provided in the article. INTRODUCTION ------------------------------------------------------------------------ XML as a format has a lot of nice properties--it is a perfectly general way of representing arbitrary data structures, and it is human readable (more of less). But XML has one very notable unpleasant property. XML documents are VERBOSE; not just a -little- on the wordy side, but grotesquely, morbidly obese, almost unbelievably huge. Much of the time, this drawback of XML really makes no difference--DASD is cheap, and the time on the wire might be only a small part of the total time in the process. But at other times, bandwidth and storage space can be important. To quantify things, it is not at all unusual for XML documents that represent table-oriented data to be three times as large as a CSV or database representation, or even than a flat file. A similar increase is typical in representing EDI data (such as for ebXML projects). In many of these context, one starts out with multi-megabyte data sources, and making them multiples larger can be inconvenient, especially for transmission purposes. For example, for purposes of this article, I created an XML representation of an approximately 1 megabyte Apache access logfile. This created an XML document 3.18 times the size of the underlying line-oriented log. The only -information- added in the process was some descriptive names for fields, but that could have also been specified in a single header line of less than a hundred bytes. Moreover, my specific XML representation did not include any namespaces in the tags, which would have increased the size further. When one thinks about compressing documents, one normally thinks first of general compression algorithms like Lempel-Ziv and Huffman, and of the common utilities that implement variations on them. Specifically, on Unix-like platforms, what comes first to mind is usually the utility 'gzip'; on other platforms, 'zip' is more common (using utilities such as 'pkzip', 'info-zip' and 'WinZip'). 'gzip' turns out to be quite consistently better than 'zip', but only by small margins. These utilities indeed tend substantially to reduce the size of XML files. But it also turns out that one can obtain considerably better compression rates by two means, either individually or in combination. The first technique is to use the Burrows-Wheeler compression algorithm rather than Lempel-Ziv sequential algorithms. In particular, the somewhat less common utility 'bzip2' (or its associated libraries and APIs) is an implementation of Burrows-Wheeler for many system platforms (and is accompanied by full source and a BSD-style license). Burrows-Wheeler operates by grouping related strings in a uncompressed source, rather than in the Lempel-Ziv style of building up a dictionary of string occurences. 'bzip2' consistently obtains better compression than 'gzip' across many file types, but the effect is especially dramatic for XML documents. On the down side, 'bzip2' is generally slower than 'gzip'. Then again the slowness of bandwidth will very often swamp speed differences in CPU or memory-bound algorithms. The second technique is to take advantage of the very specific structure of XML documents to produce more -compressible- representations. The XMill utility is one implementation of this technique, and it is available (with C++ source) under a liberal license from AT&T. XMill, however, seems to require certain click-through style limitations on its licensing, and cannot be distributed by other parties directly (at least as I understand it). I have created my own implementation of the same general technique, and the implementation is presented in this article. The code herein is released to the public domain, and the technique as implemented was developed independently and has only a "bird-eye view" similarity to XMill--sometimes XMill does better, and sometimes I do (but XMill is probably always faster than my initial implementation, which only pays attention to compression results). COMPARING BASIC ALGORITHMS ------------------------------------------------------------------------ For purposes of comparison in this article I obtained or created four base documents. The first was Shakespeare's play _Hamlet_ as an XML document (see resources). The markup includes tags such as '', '' and '' which map fairly naturally to the typographic forms one might encounter in a printed copy. In order to make comparisons of just how the XML markup contributes to document size and compressibility, I derived from 'hamlet.xml' a document 'hamlet.txt' that simply removed all the XML tags, but left the content. This derivation is *not* reversible, and is a strict loss of information. A person reading 'hamlet.txt' would not have a great deal of difficulty determining semantically which content is a "speaker" name and which a "line", for example, but there is no easy way a computer could regenerate the source XML document. The next two documents are an Apache Weblog file (a compact set of line-oriented records) and an XML document created from this. Since the source document is the log file, no information is lost in the transformation, and it is trivial to recreate exactly the original format from the XML. Of course, it is not possible to use an XML parser, or DOM, or a validator, or a DTD, with the logfile format. Let's take a look at the sizes of the base documents: #-------- Directory listing of base documents -----------# 288754 hamlet.xml 188830 hamlet.txt 949474 weblog.txt 3021921 weblog.xml In both cases, the XML is much larger. In the _Hamlet_ example, the comparison is not entirely fair since the actual information content of the text version is also less. But for the Weblog file, the XML starts to look fairly bad. However, not everything is quite as it appears. Compression programs do a fairly good job of boiling documents down to their actual information content, and meaningless padding tends towards zero size in the compressed version (asymptotically with compression effort, and if all is happy). Let's try 'gzip', 'zip' and 'bzip2': #----- Directory listing of compressed Shakespeare ------# 78581 hamlet.xml.gz 72505 hamlet.txt.gz 78696 hamlet.xml.zip 72620 hamlet.txt.zip 57522 hamlet.xml.bz2 56743 hamlet.txt.bz2 - #------- Directory listing of compressed Weblog ---------# 91029 weblog.txt.gz 115524 weblog.xml.gz 91144 weblog.txt.zip 115639 weblog.xml.zip 56156 weblog.txt.bz2 66994 weblog.xml.bz2 There are a number of interesting things in the above sizes. For both styles of documents--for every compression technique--the size differences in compressed files is much smaller than between the XML versus text originals. It is also noteworthy that 'gzip' and 'zip' cluster very closely together for corresponding cases, while 'bzip2' does much better all the time. Moreover, when using 'bzip2' on the Shakespeare document, the compressed size difference between the text and XML formats is nearly negligible, despite the 53% larger size of the XML base document. However, the Weblog stands out as as a problem case. While compression narrows the bloat of the XML conversion quite a bit, even the 'bzip2' version still lets the XML markup increase the size by about 20%. Not necessarily a disaster, but it feels like we should ideally be able to compress down to the true information content REVERSIBLE TRANSFORMATIONS ------------------------------------------------------------------------ An XML document has a rather inefficient form when it comes to compression, actually. As we will see 'bzip2' somewhat alleviates this inefficency by regrouping strings. But at heart, XML documents are a jumble of fairly dissimilar parts--tags, attributes, elements bodies of different types. If we could take each relatively homogeneous set of things, and group them close to each other in a transformed file, standard compressors would have more to work with. For example, if every '' tag-body in our Weblog occurs near the others, the block of stuff which contains the IP address of hosts will be easy to compress. The trick here is to come up with a way of transforming an XML document into something that contains *all* the same information, but structures the layout in such a compressor-friendly style. The utilities 'xml2struct.py' and 'struct2xml.py' do exactly what is desired. The versions below have a few limitations, but demonstrate the principles involved. Some limitations are that each document is limited to 253 distinct tags, and that attributes and processing instructions are not handled. Fixing those limits should not change the jist of the results, however. XMill performs a similar transformation, but with some extra options and with fewer limitations. The general format of a "struct" document is as follows: #--------------- Struct document format -----------------# 1. A list of tags that occur in the original XML document, separated by newline characters. 2. A section delimeter: 0x00 (the null byte) 3. A compact representation of the overall document structure, where each start tag is represented by a single byte, and the occurence of content is marked by a 0x02 byte. 4. Another section delimeter: 0x00 (the null byte) 5. The contents of all the elements indicated in the document structure schematic, grouped by element type. Each individual content item is delimited by a 0x02 byte, while the start of elements of a new type is delimited by a 0x01 byte (the last not strictly needed, but it makes reversing the transformation easier). Below is complete Python code to perform and reverse the described transformation. It would be simple to implement this transformation in another programming language also: #------------------- xml2struct.py ----------------------# import sys import xml.sax from xml.sax.handler import * class StructExtractor(ContentHandler): """Create a special structure/content form of an XML document""" def startDocument(self): self.taglist = [] self.contentdct = {} self.state = [] # stack for tag state self.newstate = 0 # flag for continuing chars in same elem self.struct = [] # compact document structure def endDocument(self): sys.stdout.write('\n'.join(self.taglist)) # Write out the taglist first sys.stdout.write(chr(0)) # section delimiter \0x00 sys.stdout.write(''.join(self.struct)) # Write out the structure list sys.stdout.write(chr(0)) # section delimiter \0x00 for tag in self.taglist: # Write all content lists sys.stdout.write(chr(2).join(self.contentdct[tag])) sys.stdout.write(chr(1)) # delimiter between content types def startElement(self, name, attrs): if not name in self.taglist: self.taglist.append(name) self.contentdct[name] = [] if len(self.taglist) > 253: raise ValueError, "More than 253 tags encountered" self.state.append(name) # push current tag self.newstate = 1 # chars go to new item # single char to indicate tag self.struct.append(chr(self.taglist.index(name)+3)) def endElement(self, name): self.state.pop() # pop current tag off stack self.newstate = 1 # chars go to new item self.struct.append(chr(1)) # \0x01 is endtag in struct def characters(self, ch): currstate = self.state[-1] if self.newstate: # either add new chars to state item self.contentdct[currstate].append(ch) self.newstate = 0 self.struct.append(chr(2)) # \0x02 content placeholder in struct else: # or append the chars to current item self.contentdct[currstate][-1] += ch if __name__ == '__main__': parser = xml.sax.make_parser() handler = StructExtractor() parser.setContentHandler(handler) parser.parse(sys.stdin) Using SAX rather than DOM, makes this tranformation fairly time efficient, even though time was not a large consideration in developing it. To reverse the transformation: #------------------- struct2xml.py ----------------------# def struct2xml(s): tags, struct, content = s.split(chr(0)) taglist = tags.split('\n') # all the tags contentlist = [] # list-of-lists of content items for block in content.split(chr(1)): contents = block.split(chr(2)) contents.reverse() # pop off content items from end contentlist.append(contents) state = [] # stack for tag state skeleton = [] # templatized version of XML for c in struct: i = ord(c) if i >= 3: # start of element i -= 3 # adjust for struct tag index offset tag = taglist[i] # spell out the tag from taglist state.append(tag) # push current tag skeleton.append('<%s>' % tag) # insert the element start tag elif i == 1: # end of element tag = state.pop() # pop current tag off stack skeleton.append('' % tag) # insert the element end tag elif i == 2: # insert element content tag = state[-1] item = contentlist[taglist.index(tag)].pop() item = item.replace('&','&') skeleton.append(item) # add bare tag to indicate content else: raise ValueError, "Unexpected structure tag: ord(%d)" % i return ''.join(skeleton) if __name__ == '__main__': import sys print struct2xml(sys.stdin.read()), COMPRESSING THE TRANSFORMS ------------------------------------------------------------------------ The real meat of the discussed tranformation comes when we try to compress the results. If all is as desired, 'foo.struct' will be significantly more compressible than 'foo.xml', even though the two contain identical information (which is provable since they are symmetrically derivable). In a sense, 'xml2struct.py' is already a sort of compression algorithm (it produces somewhat smaller files), but the real point is not to use it directly but to aid further compression. Let's look at some sizes, including a few repeated from above. Some results from XMill are thrown in for comparison, these include the name '*.xmi.*' (XMill is available in versions using 'gzip' and 'bzip2' algorithms): #------- Directory listing of "structured XML" ----------# 228610 hamlet.struct 57533 hamlet.struct.bz2 57522 hamlet.xml.bz2 71060 hamlet.struct.gz 78581 hamlet.xml.gz 61823 hamlet.xmi.bz2 75247 hamlet.xmi.gz The results on this prose document are somewhat mixed. "Restructuring" the XML document assists 'gzip' quite a bit. But it turns out that plain old 'bzip2' on the original XML file does 11 bytes better at generating a compressible structure than do my attempts. Of course, I am comforted that XMill behaves similarly--but worse--than my tranformation. We do a quite a bit better with Weblog files. Here it actually pays off. #------ Directory listing 2 of "structured XML" ---------# 1764816 weblog.struct 59955 weblog.struct.bz2 66994 weblog.xml.bz2 56156 weblog.txt.bz2 76183 weblog.struct.gz 115524 weblog.xml.gz 59409 weblog.xmi.bz2 74439 weblog.xmi.gz As before, restructuring the XML Weblog aids 'gzip' compression extremely significantly. But since 'gzip' is not our preferred technique anymore, this is only moderately interesting. What is of genuine interest is that we have also improved the compression rate of the--already wonderful--'bzip2' by 11%. While maybe not earth-shattering, this is enough to matter when worrying about megabytes. For what it's worth, XMill does a bit better than 'xml2struct.py' in this case. What is further interesting is that our compression of this restructured XML is within 7% of the best obtained compression of the original textual Weblog. CONCLUSION ------------------------------------------------------------------------ The utility presented here is a preliminary attempt, but even in early form it does surprisingly well--at least in some cases--of squeezing those last bytes out of compressed XML files. With a little refinement and experimentation, I expect that a few percent more reduction could be obtained. Part of what makes writing this utility hard is that 'bzip2' does -such- a good job to start with. I was honestly surprised by just how effective the Burrows-Wheeler algorithm was when I started empirical testing. There are some commercial utilities that attempt to perform XML compression in a manner that utilizes knowledge of the specific DTDs of compressed documents. It is quite likely that these techniques obtain additional compression. However, 'xml2struct.py' and XMill have the nice advantage of being simple command line tools that one can transparently apply to XML files. Custom programming of every compression is not always desirable or possible--but where it is, squeezing out even more bytes might be an obtainable goal. RESOURCES ------------------------------------------------------------------------ Much of the inspiration for this article comes from the work of the XMill XML compressor. Information and a downloadable version can be found at the below link. The license requires a click-through, and the download page unfortunately seems to have a buggy script that does not allows downloads from all sites. http://www.research.att.com/sw/tools/xmill/ The complete plays of Shakespeare can be found in XML form at the below resource. The document 'hamlet.xml' used for testing purposes was obtained there: http://www.ibiblio.org/xml/examples/shakespeare/ The 1994 paper _A Block-sorting Lossless Data Compression Algorithm_, by M. Burrows and D.J. Wheeler, introduced the algorithm known now as Burrows-Wheeler. This technique is implemented in the fairly popular 'bzip2' utility: http://gatekeeper.dec.com/pub/DEC/SRC/research-reports/abstracts/src-rr-124.html Many Unix-like systems include 'bzip2' as part of their standard distribution. For other platforms--or for newer versions--'bzip2' can be found at: http://sources.redhat.com/bzip2/ I wrote what I believe is a good general introduction to data compression. It can be found at: http://www.gnosis.cx/publish/programming/compression_primer.html ABOUT THE AUTHOR ------------------------------------------------------------------------ {Picture of Author: http://gnosis.cx/cgi-bin/img_dqm.cgi} David Mertz believes that less is more David may be reached at mertz@gnosis.cx; his life pored over at http://gnosis.cx/publish/. Suggestions and recommendations on this, past, or future, columns are welcomed.