APPENDIX -- A DATA COMPRESSION PRIMER
-------------------------------------------------------------------

SECTION -- Introduction
-------------------------------------------------------------------

See Section 2.2.5 for details on compression capabilities included in the Python standard library. This appendix is intended to give readers who are unfamiliar with data compression a basic background on its techniques and theory. The final section of this appendix provides a practical example--accompanied by some demonstration code--of a Huffman-inspired custom encoding.

Data compression is widely used in a variety of programming contexts. All popular operating systems and programming languages have numerous tools and libraries for dealing with data compression of various sorts. The right choice of compression tools and libraries for a particular application depends on the characteristics of the data and application in question: streaming versus file; expected patterns and regularities in the data; relative importance of CPU usage, memory usage, channel demands, and storage requirements; and other factors.

Just what is data compression, anyway? The short answer is that data compression removes -redundancy- from data; in information-theoretic terms, compression increases the -entropy- of the compressed text. But those statements are essentially just true by definition. Redundancy can come in a lot of different forms. Repeated bit sequences ('11111111') are one type. Repeated byte sequences are another ('XXXXXXXX'). But more often redundancies come on a larger scale, either regularities of the data set taken as a whole, or sequences of varying lengths that are relatively common. Basically, what data compression aims at is finding algorithmic transformations of data representations that will produce more compact representations given "typical" data sets. If this description seems a bit complex to unpack, read on to find some more practical illustrations.

SECTION -- Lossless and Lossy Compression
-------------------------------------------------------------------

There are actually two fundamentally different "styles" of data compression: lossless and lossy. This appendix is generally about lossless compression techniques, but the reader would be well served to understand the distinction first. Lossless compression involves a transformation of the representation of a data set such that it is possible to reproduce -exactly- the original data set by performing a decompression transformation. Lossy compression is a representation that allows you to reproduce only something "pretty much like" the original data set. To compensate, lossy techniques can frequently produce far more compact data representations than lossless compression techniques can. Most often lossy compression techniques are used for images, sound files, and video. Lossy compression may be appropriate in these areas insofar as human observers do not perceive the literal bit-pattern of a digital image/sound, but rather more general "gestalt" features of the underlying image/sound.

From the point of view of "normal" data, lossy compression is not an option. We do not want a program that does "about the same" thing as the one we wrote. We do not want a database that contains "about the same" kind of information as what we put into it. At least not for most purposes (and the writer knows of few practical uses of lossy compression outside of what are already approximate mimetic representations of the real world, like images and sounds).
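The lossless guarantee can be stated in one line of code: decompressing the compressed data must reproduce the input byte-for-byte. A minimal sketch of this (my illustration, using the standard library's `zlib` module):

#*----- Lossless roundtrip illustration -----#
# A minimal sketch (not from the discussion above): any lossless
# codec must satisfy decompress(compress(data)) == data, exactly.
import zlib
data = 'some "typical" data with redundancy: XXXXXXXX' * 100
packed = zlib.compress(data)
assert zlib.decompress(packed) == data   # exact reproduction
print len(packed), '<', len(data)        # and (usually) fewer bytes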
SECTION -- A Data Set Example
-------------------------------------------------------------------

For purposes of this appendix, let us start with a specific hypothetical data representation. Here is an easy-to-understand example. In the town of Greenfield, MA, the telephone prefixes are '772-', '773-', and '774-'. (For non-USA readers: In the USA, local telephone numbers are seven digits and are conventionally represented in the form ###-####; prefixes are assigned in geographic blocks.) Suppose also that the first prefix is the most widely assigned of the three. The suffix portions might be any other digits, in fairly equal distribution. The data set we are interested in is "the list of all the telephone numbers currently in active use." One can imagine various reasons why this might be interesting for programmatic purposes, but we need not specify that herein.

Initially, the data set we are interested in comes in a particular data representation: a multicolumn report (perhaps generated as output of some query or compilation process). The first few lines of this report might look like:

#*----------- Telephone Number Report ----------------------#
=============================================================
772-7628     772-8601     772-0113
773-3429     774-9833     773-4319
774-3920     772-0893     772-9934
773-8923     773-1134     772-4930
772-9390     774-9992     772-2314
[...]

SECTION -- Whitespace Compression
-------------------------------------------------------------------

Whitespace compression can be characterized most generally as "removing what we are not interested in." Even though this is technically a lossy technique, it is still useful for many types of data representations we find in the real world. For example, even though HTML is far more readable in a text editor if indentation and vertical spacing are added, none of this "whitespace" should make any difference to how the HTML document is rendered by a Web browser. If you happen to know that an HTML document is destined only for a Web browser (or for a robot/spider), then it might be a good idea to take out all the whitespace to make it transmit faster and occupy less space in storage. What we remove in whitespace compression never really had any functional purpose to start with.

In the case of our example in this appendix, it is possible to remove quite a bit from the described report. The row of "=" across the top adds nothing functional, nor do the "-" within numbers, nor the spaces between them. These are all useful for a person reading the original report, but do not matter once we think of it as data. What we remove is not precisely whitespace in traditional terms, but the intent is the same.

Whitespace compression is extremely "cheap" to perform. It is just a matter of reading a stream of data and excluding a few specific values from the output stream. In many cases, no "decompression" step is involved at all. But even where we would wish to re-create something close to the original somewhere down the data stream, it should require little in terms of CPU or memory. What we reproduce may or may not be exactly what we started with, depending on just what rules and constraints were involved in the original. An HTML page typed by a human in a text editor will probably have spacing that is idiosyncratic; then again, automated tools often produce "reasonable" indentation and spacing of HTML. In the case of the rigid report format in our example, there is no reason that the original representation could not be precisely produced by a "decompressing formatter" down the data stream.
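To make the "cheapness" concrete, a whitespace compressor for our report can be a single-pass filter. The sketch below is my illustration (not code from any actual report tool); it simply drops the '=', '-', spaces, and newlines, keeping only the digits:

#*----- Whitespace compression sketch -----#
# A minimal sketch of "removing what we are not interested in":
# keep the digits, exclude every decorative byte.
def strip_report(report):
    return ''.join(c for c in report if c.isdigit())

report = """
=============================================================
772-7628     772-8601     772-0113
"""
print strip_report(report)   # -> '772762877286017720113'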
SECTION -- Run-Length Encoding
-------------------------------------------------------------------

Run-length encoding (RLE) is the simplest widely used lossless-compression technique. Like whitespace compression, it is "cheap"--especially to decode. The idea behind it is that many data representations consist largely of strings of repeated bytes. Our example report is one such data representation. It begins with a string of repeated "=", and has strings of spaces scattered through it. Rather than represent each character with its own byte, RLE will (sometimes or always) have an iteration count followed by the character to be repeated.

If repeated bytes are predominant within the expected data representation, it might be adequate and efficient to always have the algorithm specify one or more bytes of iteration count, followed by one character. However, if one-length character strings occur, these strings will require two (or more) bytes to encode them; that is, '00000001 01011000' might be the output bit stream required for just one ASCII "X" of the input stream. Then again, a hundred "X" in a row would be output as '01100100 01011000', which is quite good.

What is frequently done in RLE variants is to selectively use bytes to indicate iterator counts and otherwise just have bytes represent themselves. At least one byte-value has to be reserved to do this, but that value can be escaped in the output, if needed. For example, in our telephone-number report, we know that everything in the input stream is plain ASCII characters. Specifically, they all have the high-order bit of their byte value set to 0. We could use this high bit to indicate that an iterator count is being represented rather than a regular character. The remaining seven bits of the iterator byte could hold the iterator count, and the next byte could represent the character to be repeated. So, for example, we could represent the string "YXXXXXXXX" as:

#*----- Run-length encoding example -----#
"Y"      Iter(8)  "X"
01011001 10001000 01011000

This example does not show how to escape iterator byte-values, nor does it allow iteration of more than 127 occurrences of a character. Variations on RLE deal with issues such as these, if needed.
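The scheme just described is easy to express in code. The following sketch is my illustration of the variant above (assuming pure 7-bit ASCII input and runs capped at 127); a byte with its high bit set holds an iteration count, and the byte after it is the character to repeat:

#*----- RLE sketch for plain-ASCII streams -----#
def rle_encode(text):
    out, i = [], 0
    while i < len(text):
        run = 1
        while run < 127 and i+run < len(text) and text[i+run] == text[i]:
            run += 1
        if run > 2:                        # runs of 3+ pay for the count byte
            out.append(chr(128 | run) + text[i])
        else:                              # short runs stay literal
            out.append(text[i] * run)
        i += run
    return ''.join(out)

def rle_decode(data):
    out, i = [], 0
    while i < len(data):
        if ord(data[i]) & 128:             # iterator byte: count, then char
            out.append(data[i+1] * (ord(data[i]) & 127))
            i += 2
        else:                              # ordinary literal character
            out.append(data[i])
            i += 1
    return ''.join(out)

s = "YXXXXXXXX====="
assert rle_decode(rle_encode(s)) == s

Encoding "YXXXXXXXX" with this sketch produces exactly the three bytes shown in the example above: a literal 'Y', then chr(136) (10001000), then 'X'.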
SECTION -- Huffman Encoding
-------------------------------------------------------------------

Huffman encoding looks at the symbol table of a whole data set. The compression is achieved by finding the "weights" of each symbol in the data set. Some symbols occur more frequently than others, so Huffman encoding suggests that the frequent symbols need not be encoded using as many bits as the less-frequent symbols. There are variations on Huffman-style encoding, but the original (and frequent) variation involves looking for the most common symbol and encoding it using just one bit, say '1'. A leading '0' then signals that a longer, variable-length bit sequence follows.

Let's imagine we apply Huffman encoding to our local phone-book example (assume we have already whitespace-compressed the report). We might get:

#*----- Huffman encoding example -----#
Encoding   Symbol
1          7
010        2
011        3
00000      4
00001      5
00010      6
00011      8
00100      9
00101      0
00111      1

Our initial symbol set of digits could already be straightforwardly encoded (with no compression) as 4-bit sequences (nibbles). The Huffman encoding given will use up to 5 bits for the worst-case symbols, which is obviously worse than the nibble encoding. However, our best case will use only -1- bit, and we know that our best case is also the most frequent case, by having scanned the data set. So we might encode a particular phone number like:

#*----- Huffman translation example -----#
772 7628 --> 1 1 010 1 00010 010 00011

The nibble encoding would take 28 bits to represent a phone number; in this particular case, our encoding takes 19 bits. I introduced spaces into the example above for clarity; you can see that they are not necessary to unpack the encoding, since the code is prefix-free--no encoding is a prefix of another--so the encoding table determines when we have reached the end of an encoded symbol (but you have to keep track of your place in the bits).

Huffman encoding is still fairly cheap to decode, cycle-wise. But because it requires a table lookup, it cannot be quite as cheap as RLE. The encoding side of Huffman is fairly expensive, though; the whole data set has to be scanned and a frequency table built up. In some cases a "shortcut" is appropriate with Huffman coding. Standard Huffman coding applies to a particular data set being encoded, with the set-specific symbol table prepended to the output data stream. However, if the whole type of data encoded--not just the single data set--has the same regularities, we can opt for a global Huffman table. If we have such a global Huffman table, we can hard-code the lookups into our executables, which makes both compression and decompression quite a bit cheaper (except for the initial global sampling and hard-coding). For example, if we know our data set would be English-language prose, letter-frequency tables are well known and quite consistent across data sets.
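Because the code is prefix-free, decoding needs no separators: simply consume bits until they match a table entry. A small sketch of this (my illustration, using the example table above):

#*----- Huffman decoding sketch -----#
# Prefix-free decoding: accumulate bits until they match a codeword.
table = {'1': '7', '010': '2', '011': '3',
         '00000': '4', '00001': '5', '00010': '6', '00011': '8',
         '00100': '9', '00101': '0', '00111': '1'}

def huffman_decode(bits):
    out, code = [], ''
    for b in bits:
        code += b
        if code in table:        # first match ends the symbol
            out.append(table[code])
            code = ''
    return ''.join(out)

print huffman_decode('1101010001001000011')   # -> '7727628'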
SECTION -- Lempel-Ziv Compression
-------------------------------------------------------------------

Probably the most significant lossless-compression technique is Lempel-Ziv. What is explained here is "LZ78," but LZ77 and other variants work in a similar fashion. The idea in LZ78 is to encode a streaming byte sequence using a dynamic table. At the start of compressing a bit stream, the LZ table is filled with the actual symbol set, along with some blank slots. Tables of various sizes are used, but for our (whitespace-compressed) telephone number example above, let's suppose that we use a 32-entry table (this should be OK for our example, although much too small for most other types of data). First, we fill the first ten slots with our alphabet (the digits). As new bytes come in, we first output the existing table entry matching the longest sequence possible, then fill the next available slot with that sequence extended by one byte (the N+1 length sequence). In the worst case, we are using 5 bits instead of 4 bits for a single symbol, but we'll wind up getting to use 5 bits for multiple symbols in a lot of cases. For example, the machine might do this (a table slot is noted with square brackets):

#*----------- LZ78 algorithm --------------#
7 --> Lookup: 7 found       --> nothing to add    --> keep looking
7 --> Lookup: 77 not found  --> add '77' to [11]  --> output [7]=00111
2 --> Lookup: 72 not found  --> add '72' to [12]  --> output [7]=00111
7 --> Lookup: 27 not found  --> add '27' to [13]  --> output [2]=00010
6 --> Lookup: 76 not found  --> add '76' to [14]  --> output [7]=00111
2 --> Lookup: 62 not found  --> add '62' to [15]  --> output [6]=00110
8 --> Lookup: 28 not found  --> add '28' to [16]  --> output [2]=00010

So far, we've got nothing out of it, but let's continue with the next phone number:

#*----------- LZ78 algorithm (cont) -------#
7 --> Lookup: 87 not found  --> add '87' to [17]  --> output [8]=01000
7 --> Lookup: 77 found      --> nothing to add    --> keep looking
2 --> Lookup: 772 not found --> add '772' to [18] --> output [11]=01011
8 --> Lookup: 28 found      --> nothing to add    --> keep looking
6 --> Lookup: 286 not found --> add '286' to [19] --> output [16]=10000
...

The steps should suffice to see the pattern. We have not achieved any net compression yet, but notice that we've already managed to use slot 11 and slot 16, thereby getting two symbols with one output in each case. We've also accumulated the very useful byte sequence '772' in slot 18, which will prove useful later in the stream.

What LZ78 does is fill up one symbol table with (hopefully) helpful entries, then clear it and start a new one when it fills up. (The decompressor can rebuild the same table as it reads the compressed stream, so the table itself need not be transmitted.) In this regard, 32 entries is still probably too small a symbol table, since it will get cleared before a lot of reuse of '772' and the like is achieved. But the small symbol table is easy to illustrate.

In typical data sets, Lempel-Ziv variants achieve much better compression rates than Huffman or RLE. On the other hand, Lempel-Ziv variants are very pricey cycle-wise and can use large tables in memory. Most real-life compression tools and libraries use a combination of Lempel-Ziv and Huffman techniques.
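The table-building loop traced above can be written compactly. The following toy encoder is my illustration, not a production LZ78 (real implementations differ in detail); it emits a list of table indices, each of which would occupy 5 bits with a 32-entry table. Note that this sketch fills slots starting at [10], whereas the worked example above happened to begin at [11], so the indices differ slightly:

#*----- LZ78-style table building sketch -----#
def lz_encode(stream, alphabet='0123456789', tablesize=32):
    table = dict((c, i) for i, c in enumerate(alphabet))
    out, seq = [], ''
    for c in stream:
        if seq + c in table:
            seq += c                       # found: keep looking
        else:
            out.append(table[seq])         # emit longest known match
            if len(table) < tablesize:     # add the N+1 length sequence
                table[seq + c] = len(table)
            seq = c
    if seq:
        out.append(table[seq])             # flush the final sequence
    return out

print lz_encode('77276287728601')
# -> [7, 7, 2, 7, 6, 2, 8, 10, 15, 6, 0, 1]

Notice the outputs 10 and 15: each stands for a two-digit sequence ('77' and '28') learned earlier in the stream, just as in the trace above.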
SECTION -- Solving the Right Problem
-------------------------------------------------------------------

Just as choosing the right algorithm can often create orders-of-magnitude improvements over even heavily optimized wrong algorithms, choosing the right data representation is often even more important than compression methods (which are always a sort of post hoc optimization of desired features). The simple data set example used in this appendix is a perfect case where reconceptualizing the problem would actually be a much better approach than using -any- of the compression techniques illustrated.

Think again about what our data represents. It is not a very general collection of data, and the rigid a priori constraints allow us to reformulate our whole problem. What we have is a maximum of 30,000 telephone numbers (7720000 through 7749999), some of which are active, and others of which are not. We do not have a "duty," as it were, to produce a full representation of each telephone number that is active, but simply to indicate the binary fact that it -is- active. Thinking of the problem this way, we can simply allocate 30,000 bits of memory and storage, and have each bit say "yes" or "no" to the presence of one telephone number. The ordering of the bits in the bit-array can be simple ascending order from the lowest to the highest telephone number in the range.

This bit-array solution is the best in almost every respect. It allocates exactly 3750 bytes to represent the data set; the various compression techniques will use a varying amount of storage depending both on the number of telephone numbers in the set and on the efficiency of the compression. But if 10,000 of the 30,000 possible telephone numbers are active, and even a very efficient compression technique requires several bytes per telephone number, then the bit-array is an order of magnitude better. In terms of CPU demands, the bit-array is not only better than any of the discussed compression methods, it is also quite likely to be better than the naive noncompression method of listing all the numbers as strings. Stepping through a bit-array and incrementing a "current-telephone-number" counter can be done quite efficiently, and mostly within the on-chip cache of a modern CPU.

The lesson to be learned from this very simple example is certainly not that every problem has some magic shortcut (as this one does). A lot of problems genuinely require significant memory, bandwidth, storage, and CPU resources, and in many of those cases compression techniques can help ease--or shift--those burdens. But a more moderate lesson could be suggested: Before compression techniques are employed, it is a good idea to make sure that one's starting conceptualization of the data representation is a good one.
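A sketch of the bit-array representation (my illustration; the helper names are hypothetical) shows how little machinery it takes:

#*----- Bit-array representation sketch -----#
# One bit per number in the range 772-0000 .. 774-9999.
BASE, SLOTS = 7720000, 30000

def make_bitarray():
    return bytearray(SLOTS // 8)          # exactly 3750 bytes

def set_active(bits, number):
    i = number - BASE
    bits[i // 8] |= 1 << (i % 8)

def is_active(bits, number):
    i = number - BASE
    return bool(bits[i // 8] & (1 << (i % 8)))

bits = make_bitarray()
set_active(bits, 7727628)
assert is_active(bits, 7727628) and not is_active(bits, 7727629)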
SECTION -- A Custom Text Compressor
-------------------------------------------------------------------

Most styles of compression require a decompression pass before one is able to do something useful with a source document. Many (de)compressors can operate as a stream, producing only the needed bytes of a compressed or decompressed stream in sequence. In some cases, formats even insert recovery or bookkeeping bytes that allow streams to begin within documents (rather than from the very beginning). Programmatic wrappers can make compressed documents or strings look like plaintext ones at the appropriate API layer. Nonetheless, even streaming decompressors require a computational overhead to get at the plaintext content of a compressed document.

An excellent example of a streaming (de)compressor with an API wrapper is `gzip.GzipFile()`. Although not entirely transparent, this wrapper lets you compress and decompress documents without any explicit call to a (de)compression function. `gzip.GzipFile()` provides a file-like interface, but it is also easy to operate on a purely in-memory file using the support of `cStringIO.StringIO()`. For example:

>>> from gzip import GzipFile
>>> from cStringIO import StringIO
>>> sio = StringIO()
>>> writer = GzipFile(None, 'wb', 9, sio)
>>> writer.write('Mary had a little lamb\n')
>>> writer.write('its fleece as white as snow\n')
>>> writer.close()
>>> sio.getvalue()[:10]
'\x1f\x8b\x08\x00k\xc1\x9c<\x02\xff'
>>> reader = GzipFile(None, 'rb', 9, StringIO(sio.getvalue()))
>>> reader.read()[:20]
'Mary had a little la'
>>> reader.seek(30)
>>> reader.read()
'ece as white as snow\n'

One thing this example shows is that the underlying compressed string is more or less gibberish. Although the file-like API hides the details from an application programmer, the decompression process is also stateful in its dependence on a symbol table built from the byte sequence of the compressed text. You cannot expect to make sense of a few bytes in the middle of the compressed text without knowledge of the prior context.

A different approach to compression can have significant advantages in operating on natural-language textual sources. A group of researchers in Brazil and Chile have examined techniques for "word-based Huffman compression." The general strategy of these researchers is to treat whole words as the symbol set for a Huffman table, rather than merely naive byte values. In natural languages, a limited number of (variable-length, multibyte) words occur with high frequency, and savings result if such words are represented with shorter byte sequences. In general, such reduced representation is common to all compression techniques, but word-based Huffman takes the additional step of retaining byte boundaries (and uses fixed symbol mappings, as do other Huffman variants).

A special quality of word-based Huffman compressed text is that it need not undergo decompression to be searched. This quality makes it convenient to store textual documents in compressed form, without incurring the requirement to decompress them before they are useful. Instead, if one is searching for words directly contained in the symbol table, one can merely precompress the search terms, then use standard searching algorithms. Such a search can be either against an in-memory string or against a file-like source; in general, a search against a precompressed target will be -faster- than one against an uncompressed text. In code, one would use snippets similar to:

#*------------- Word-based Huffman compression ----------#
small_text = word_Huffman_compress(big_text)
search_term = "Foobar"
coded_term = word_Huffman_compress(search_term)
offset = small_text.find(coded_term)
coded_context = small_text[offset-10:offset+10+len(coded_term)]
plain_context = word_Huffman_expand(coded_context)

A sophisticated implementation of word-based Huffman compression can obtain better compression sizes than does `zlib`. For simplicity, the module below sacrifices optimal compression to the goals of clarity and brevity of code. A fleshed-out implementation could add a number of features.

The presented module [word_huffman] uses a fixed number of bytes to encode each word in the symbol table. This number of bytes can be selected to be 1, 2, or 3 (thereby limiting the table to a generous 2 million entries). The module also separates the generation of a symbol table from the actual compression/decompression. The module can be used in a context where various documents get encoded using the same symbol table--the table presumably generated based on a set of canonical documents. In this situation, the computational requirement of symbol table generation can happen just once, and the symbol table itself need not be transmitted along with each compressed document. Of course, nothing prevents you from treating the document currently being processed as said canonical statistical word source (thereby somewhat improving compression).

In the algorithm utilized by [word_huffman], only high-bit bytes are utilized in the symbol table. The lower 128 ASCII characters represent themselves as literals. Any ASCII character sequence that is not in the symbol table is represented as itself--including any short words that would not benefit from encoding. Any high-bit characters that occur in the original source text are escaped by being preceded by a 0xFF byte. As a result, high-bit characters are encoded using two bytes; this technique is clearly only useful for encoding (mostly) textual files, not binary files. Moreover, only character values 0x80-0xFE are used by the symbol table (0xFF -always- signals a literal high-bit character in the encoding).

The [word_huffman] algorithm is not entirely stateless, in the sense that not every subsequence in a compressed text can be expanded without additional context. But very little context is required. Any low-bit character always literally represents itself. A high-bit character, however, might be either an escaped literal, the first byte of a symbol table entry, or a non-first byte of a symbol table entry. In the worst case, where 3-byte symbols are used, it is necessary to look back two bytes from an arbitrary position in the text to determine the full context. Normally, only one byte of lookback is necessary. In any case, words in the symbol table are separated from each other in the uncompressed text by nonalpha low-bit characters (usually whitespace), so parsing compressed entries is straightforward.
#---------- word_huffman.py ----------#
wordchars = '-_ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'

def normalize_text(txt):
    "Convert non-word characters to spaces"
    trans = [' '] * 256
    for c in wordchars:
        trans[ord(c)] = c
    return txt.translate(''.join(trans))

def build_histogram(txt, hist={}):
    "Incrementally build a histogram table from text source(s)"
    # (the mutable default argument is deliberate: it accumulates
    # counts across multiple calls)
    for word in txt.split():
        hist[word] = hist.get(word, 0) + 1
    return hist

def optimal_Nbyte(hist, entrylen=2):
    "Build optimal word list for nominal symbol table byte-length"
    slots = 127**entrylen
    words = []
    for word, count in hist.items():
        gain = count * (len(word) - entrylen)
        if gain > 0:
            words.append((gain, word))
    words.sort()
    words.reverse()
    return [w[1] for w in words[:slots]]

def tables_from_words(words):
    "Create symbol tables for compression and expansion"
    # Determine the ACTUAL best symbol table byte-length
    if len(words) < 128:
        entrylen = 1
    elif len(words) <= 16129:
        entrylen = 2
    else:
        entrylen = 3        # assume < ~2M distinct words
    comp_table = {}
    # Escape hibit characters
    for hibit_char in map(chr, range(128, 256)):
        comp_table[hibit_char] = chr(255) + hibit_char
    # Literal low-bit characters
    for lowbit_char in map(chr, range(128)):
        comp_table[lowbit_char] = lowbit_char
    # Add word entries
    for index, word in enumerate(words):
        comp_table[word] = symbol(index, entrylen)
    # Reverse dictionary for expansion table
    exp_table = {}
    for key, val in comp_table.items():
        exp_table[val] = key
    return (comp_table, exp_table, entrylen)

def symbol(index, entrylen):
    "Determine the actual symbol for a word index and symbol length"
    # Each byte position may take only the 127 values 0x80-0xFE,
    # since 0xFF is reserved to escape literal high-bit characters
    if entrylen == 1:
        return chr(128 + index)
    if entrylen == 2:
        byte1, byte2 = divmod(index, 127)
        return chr(128 + byte1) + chr(128 + byte2)
    if entrylen == 3:
        byte1, rem = divmod(index, 16129)
        byte2, byte3 = divmod(rem, 127)
        return chr(128 + byte1) + chr(128 + byte2) + chr(128 + byte3)
    raise ValueError("symbol byte len must be 1 <= S <= 3: %r" % entrylen)

def word_Huffman_compress(text, comp_table):
    "Compress text based on word-to-symbol table"
    comp_text = []
    maybe_entry = []
    for c in text + chr(0):     # chr(0) forces flush of the final word
        if c in wordchars:
            maybe_entry.append(c)
        else:
            word = ''.join(maybe_entry)
            comp_text.append(comp_table.get(word, word))
            maybe_entry = []
            comp_text.append(comp_table[c])
    return ''.join(comp_text[:-1])   # drop the chr(0) sentinel

def word_Huffman_expand(text, exp_table, entrylen):
    "Expand text based on symbol-to-word table"
    exp_text = []
    offset = 0
    end = len(text)
    while offset < end:
        c = text[offset]
        if ord(c) == 255:       # escaped highbit character
            exp_text.append(text[offset + 1])
            offset += 2
        elif ord(c) >= 128:     # symbol table entry
            sym = text[offset:offset + entrylen]
            exp_text.append(exp_table[sym])
            offset += entrylen
        else:
            exp_text.append(c)
            offset += 1
    return ''.join(exp_text)

def Huffman_find(pat, comp_text, comp_table):
    "Find a (plaintext) substring in compressed text"
    comp_pat = word_Huffman_compress(pat, comp_table)
    return comp_text.find(comp_pat)

if __name__ == '__main__':
    import sys, glob
    big_text = []
    for fpat in sys.argv[1:]:
        for fname in glob.glob(fpat):
            big_text.append(open(fname).read())
    big_text = ''.join(big_text)
    hist = build_histogram(normalize_text(big_text))
    for entrylen in (1, 2, 3):
        comp_words = optimal_Nbyte(hist, entrylen)
        comp_table, exp_table, entrylen_ = tables_from_words(comp_words)
        comp_text = word_Huffman_compress(big_text, comp_table)
        exp_text = word_Huffman_expand(comp_text, exp_table, entrylen_)
        print "Nominal/actual symbol length (entries): %i/%i (%i)" % \
              (entrylen, entrylen_, len(comp_words))
        print "Compression ratio: %i%%" % \
              ((100 * len(comp_text)) / len(big_text))
        if big_text == exp_text:
            print "*** Compression/expansion cycle successful!\n"
        else:
            print "*** Failure in compression/expansion cycle!\n"
    # Just for fun, a search against the compressed text
    pos = Huffman_find('Foobar', comp_text, comp_table)
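Used as a library rather than a script, the module's pieces combine in a few lines. The sketch below is my illustration ('bigfile.txt' is a hypothetical input); it builds the table from the document itself, as suggested above, then searches the compressed text without a full expansion:

#*----- Using word_huffman as a library -----#
from word_huffman import *

big_text = open('bigfile.txt').read()
hist = build_histogram(normalize_text(big_text))
comp_table, exp_table, entrylen = tables_from_words(optimal_Nbyte(hist))
comp_text = word_Huffman_compress(big_text, comp_table)

offset = Huffman_find('Foobar', comp_text, comp_table)
if offset >= 0:
    # the match starts on a symbol boundary, so it is safe to expand
    # from there onward; show the first 60 plaintext characters
    context = word_Huffman_expand(comp_text[offset:],
                                  exp_table, entrylen)[:60]
    print context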
The [word_huffman] module, while simple and fairly short, is still likely to be useful--and it lays the basis for a fleshed-out variant. The compression obtained by the algorithm above is comparatively modest: 50-60 percent of the size of the original text (in informal tests). But given that decompression of local subsegments is both possible and cheap, there is almost no disadvantage to this transformation for stored documents. Word searches become quicker in roughly direct proportion to the length reduction.

One likely improvement would be to add run-length compression of whitespace (or generally of nonalpha characters); doing so would lose none of the direct searchability that this algorithm is designed around, and in typical electronic natural-language texts it would result in significant additional compression. Moreover, a pleasant side effect of the [word_huffman] transformation is that transformed documents become -more- compressible under Lempel-Ziv-based techniques (i.e., cumulatively). In other words, there is a benefit in precompressing documents with [word_huffman] if you intend to later compress them with 'gzip', 'zip', or similar tools.

More aggressive improvements might be obtained by allowing variable byte-length symbol table entries and/or by claiming some additional low-bit control codes for the symbol table (and escaping literals in the original text). You can experiment with such variations, and your results may vary somewhat depending upon the details of the application-specific canonical texts.

Search capabilities might also be generalized--but this would require considerably greater effort. In the research article referenced below, the authors show how to generalize to direct regular-expression searching against word-based Huffman encoded texts. The [word_huffman] implementation allows certain straightforward transformations of regular expressions (where literal words occur within them) for searching against compressed documents, but a number of caveats and restrictions apply. Overcoming most such limitations would involve digging into Python's underlying regular expression engine, but it is possible in principle.

SECTION -- References
-------------------------------------------------------------------

A good place to turn for additional theoretical and practical information on compression is the 'comp.compression' FAQ: http://www.faqs.org/faqs/compression-faq/
A research article on word-based Huffman encoding inspired my simple example of word-based compression. The article "Fast and Flexible Word Searching on Compressed Text," by Edleno Silva de Moura, Gonzalo Navarro, Nivio Ziviani, and Ricardo Baeza-Yates, appeared in ACM Transactions on Information Systems, 2000.