Compression, encoding, entropy

From: David Mertz <voting-project_at_gnosis_dot_cx>
Date: Sat May 01 2004 - 22:15:57 CDT

On May 1, 2004, at 5:23 PM, Alan Dechert wrote:
> After discussing this some with David Mertz, it occurred to me I may
> just
> not be speaking your language.

Wikipedia.org has an excellent entry on data compression (as well as
many great entries on many topics). I also would recommend an
introduction to compression that I wrote as an article, then later
included as an appendix to my book. You can read that at:

        http://gnosis.cx/TPiP/appendix_b.txt

A reason I recommend my article is that it's a sort of mystery story.
I first explain a number of common compression techniques as applied to
a sample data set. Then the surprise ending is that you could have
done better than any of the described compression techniques by
encoding the data differently to start with.

In brief, "compression" is an algorithm that when applied to a data
source, -typically- reduces the size of the data representation. How
well the compression does depends in part on the encoding of the data
that was used to start with. Compression algorithms are designed
around corpora of data. The goal of a compression algorithm is to
reduce the size of the most typical data sets encountered; but a side
effect of their manner of design is that they often fail to reduce the
size of atypical data sets (or even increase their size).

For example, Huffman compression uses variable numbers of bits to
encode different values. In a collection of documents, this works
great. For example, English texts contain a lot of "e"s and not very
many "q"s. For most texts, using fewer bits to encode "e"s than "q"s
makes for an efficient system. But if you apply a general English
Huffman table to the string "qqqqqqqqqqq", you're going to get a big
data representation.

Taking this to ballots, an algorithm might encode the "likely"
candidates with fewer bits than the "unlikely" candidates. Al Gore and
George Bush votes might be encoded with fewer bits than those for Harry
Browne. For most ballots the would make for a compact representation.
But the Libertarian voter needs more bits to store her ballot data.
The "typical" ballot might be represented more compactly, but the
"outlier" might have an expanded representation. Obviously, a
discriminatory data representation is a no-no.

"Encoding" is a related, but distinct concept. This describes how you
inherently represent values from a data set, not any post-processing
(compression) algorithm you might apply to that stored representation.
Taking votes again, you might decide (inefficiently) to represent a
vote on a single-selection 7-candidate race as a string of 7 bits,
where no more than one of those bits is set to a one. That succeeds at
storing the vote, but it takes 7 bits to do it. In fact there are only
8 legal selections in this contest: each candidate, plus "No
Preference." The "entropy" in this 7-candidate race is 3 bits. A good
representation would be:

   No Preference: 000
   Candidate 1: 001
   Candidate 2: 010
   Candidate 3: 011
   Candidate 4: 100
   Candidate 5: 101
   Candidate 6: 110
   Candidate 7: 111

But using this optimal encoding isn't the same thing as applying a
later compression step. It comes earlier in the storage process.

That said, even using the above -optimal- encoding, compression of a
collection of ballots is still well possible. Probably Candidate 1
gets a lot move votes than Candidate 7 (at least as most elections
rules set it up). So the bit-string '001' occurs a lot more than
'111'--Huffman compression could kick it. But this is a feature of
ballot corpora, not something applicable in a non-discriminatory
fashion to an arbitrary ballot.
==================================================================
= The content of this message, with the exception of any external
= quotations under fair use, are released to the Public Domain
==================================================================
Received on Mon May 31 23:17:02 2004

This archive was generated by hypermail 2.1.8 : Mon May 31 2004 - 23:18:15 CDT