Re: Compression, encoding, entropy

From: charlie strauss <cems_at_earthlink_dot_net>
Date: Sun May 02 2004 - 01:19:43 CDT

As someone who actually got their PhD doing information theory, I'm really bewildered by the barcode length thread. I've tried to read all the posts, but maybe I missed something, and I hope what I say next is not utterly newbie.

First, you can't use lossy compression--this was mentioned.

Second, lossless compression cannot reduce the maximum message size; in fact it can only increase it, always. Thus it is only useful if you need to reduce the median or average message size--this too was mentioned.
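
To see why, here is a tiny Python illustration (zlib is just a stand-in for any general-purpose lossless compressor; any other would show the same effect):

    # High-entropy input has no patterns for a lossless compressor to
    # exploit, so the output carries framing overhead and grows.
    import os
    import zlib

    incompressible = os.urandom(1024)          # 1024 random bytes
    compressed = zlib.compress(incompressible, 9)

    print(len(incompressible))                 # 1024
    print(len(compressed))                     # slightly MORE than 1024

A counting argument shows this is unavoidable: there are fewer short outputs than possible inputs, so any compressor that shrinks some messages must grow others.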

Third, don't confuse compression with optimal encoding. You can encode the vote in such a way--without using any compression--that its maximum size is minimal. Namely, here is an optimal recipe--don't bother looking for a better one: enumerate every possible ballot and assign an ordinal number to it. Enumeration, of course, is impossible in practice for any large ballot. Instead, simply divide the ballot into independent sections, each of which is as large as can feasibly be enumerated on a regular computer, and ordinate each of these sections onto its own block of the barcode. Use a look-up table to encode and invert each ordination. This will be very nearly optimal space-wise yet feasible.
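
To make the per-section idea concrete, here is a rough Python sketch. It uses mixed-radix arithmetic in place of the look-up table mentioned above, and the function names and contest sizes are mine, purely for illustration:

    def encode_section(choices, sizes):
        """Map per-contest selections to a single ordinal number.
        choices[i] is an index into the sizes[i] legal options of contest i."""
        ordinal = 0
        for choice, size in zip(choices, sizes):
            ordinal = ordinal * size + choice
        return ordinal

    def decode_section(ordinal, sizes):
        """Invert encode_section back to the per-contest selections."""
        choices = []
        for size in reversed(sizes):
            ordinal, choice = divmod(ordinal, size)
            choices.append(choice)
        return tuple(reversed(choices))

    # Example: three contests with 8, 5, and 3 legal selections each.
    sizes = [8, 5, 3]
    n = encode_section((6, 2, 1), sizes)       # one integer below 8*5*3 = 120
    assert decode_section(n, sizes) == (6, 2, 1)

The ordinal for a section never exceeds the product of its contest sizes, which is exactly the count of legal ballots for that section--hence very nearly optimal.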

Fourth, don't confuse error-correction encoding with ballot design. Designing your error detection scheme around errors showing up as "overvotes" is suboptimal both for data storage and as an error correction device. The proper way to do this is to understand the error model of barcode reading and printing: what sorts of errors occur? I don't know the answer for barcodes, so to illustrate what I mean I'll give an example where I do know the answer. On disk reads it is very common for errors to occur as runs of consecutive lost bits rather than randomly spaced bits, so single-bit error correction would fail. One easy solution is to Hamming-encode each byte for single-bit error correction as before, but then interleave the bits of a block so that, for example, the first bits of a block of 16 bytes are consecutive (i.e., the most significant bit from each of the 16 bytes), then the second most significant bit of the block (another 16 bits), and so on. Now you can tolerate large numbers of consecutive bits being lost without losing any data that the error correction code can't recover. (The example I gave is not optimal, but it is illustrative.)
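
A rough Python sketch of that interleaving (the 16-byte block size is just the illustrative figure from above, not a recommendation):

    def interleave(block):
        """Reorder the bits of a 16-byte block: all the most significant
        bits first, then all the second most significant bits, etc."""
        assert len(block) == 16
        bits = []
        for bitpos in range(7, -1, -1):        # MSB plane down to LSB plane
            for byte in block:
                bits.append((byte >> bitpos) & 1)
        return bits

    def deinterleave(bits):
        """Invert interleave: rebuild the 16 original bytes."""
        block = [0] * 16
        for i, bit in enumerate(bits):
            bitpos = 7 - (i // 16)             # which bit plane we are in
            block[i % 16] |= bit << bitpos     # which byte the bit belongs to
        return bytes(block)

    data = bytes(range(16))
    assert deinterleave(interleave(data)) == data

A burst of up to 16 consecutive damaged bits in the interleaved stream lands on at most one bit of each byte, which a per-byte single-bit Hamming code can then repair.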

Fifth, and I've mentioned this before: as part of the error correction scheme, consider writing error correction data from one ballot on other ballots. One could imagine, for example, putting checksums or full copies of, say, ten earlier ballots onto the back of the current ballot (encrypted and out of order for privacy). On average this would make all the paper interlocked. The only problem with this idea that I can see is if the touch screen is not aware of discarded ballots.
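
A loose sketch of the interlocking idea (the names and details here are hypothetical; a real design would also encrypt the digests, as noted above):

    import hashlib
    import random
    from collections import deque

    recent = deque(maxlen=10)      # digests of the ten most recent ballots

    def back_of_ballot(ballot_bytes):
        """Return digests of earlier ballots to print on this one's back."""
        stamped = list(recent)
        random.shuffle(stamped)    # break any order that could identify voters
        recent.append(hashlib.sha256(ballot_bytes).hexdigest())
        return stamped

Each printed ballot then carries evidence of up to ten predecessors, so removing or forging one sheet becomes detectable from its neighbors.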

-----Original Message-----
From: David Mertz <voting-project@gnosis.cx>
Sent: May 1, 2004 8:15 PM
To: voting-project@lists.sonic.net
Subject: [voting-project] Compression, encoding, entropy

On May 1, 2004, at 5:23 PM, Alan Dechert wrote:
> After discussing this some with David Mertz, it occurred to me I may
> just
> not be speaking your language.

Wikipedia.org has an excellent entry on data compression (as well as
many great entries on many topics). I also would recommend an
introduction to compression that I wrote as an article, then later
included as an appendix to my book. You can read that at:

        http://gnosis.cx/TPiP/appendix_b.txt

A reason I recommend my article is that it's a sort of mystery story.
I first explain a number of common compression techniques as applied to
a sample data set. Then the surprise ending is that you could have
done better than any of the described compression techniques by
encoding the data differently to start with.

In brief, "compression" is an algorithm that when applied to a data
source, -typically- reduces the size of the data representation. How
well the compression does depends in part on the encoding of the data
that was used to start with. Compression algorithms are designed
around corpora of data. The goal of a compression algorithm is to
reduce the size of the most typical data sets encountered; but a side
effect of their manner of design is that they often fail to reduce the
size of atypical data sets (or even increase their size).

For example, Huffman compression uses variable numbers of bits to
encode different values. In a collection of documents, this works
great. For example, English texts contain a lot of "e"s and not very
many "q"s. For most texts, using fewer bits to encode "e"s than "q"s
makes for an efficient system. But if you apply a general English
Huffman table to the string "qqqqqqqqqqq", you're going to get a big
data representation.
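
A toy version of that point, with a made-up (but prefix-free) Huffman
table rather than a real English one:

    # 'e' gets a short code, 'q' a long one, as English frequencies suggest.
    CODES = {'e': '0', 't': '10', 'a': '110', 'q': '111111110'}

    def huffman_bits(text):
        """Total encoded length in bits under the fixed table above."""
        return sum(len(CODES[ch]) for ch in text)

    print(huffman_bits('e' * 11))   # 11 bits vs. 88 bits of plain 8-bit text
    print(huffman_bits('q' * 11))   # 99 bits -- LARGER than uncompressed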

Taking this to ballots, an algorithm might encode the "likely"
candidates with fewer bits than the "unlikely" candidates. Al Gore and
George Bush votes might be encoded with fewer bits than those for Harry
Browne. For most ballots this would make for a compact representation.
But the Libertarian voter needs more bits to store her ballot data.
The "typical" ballot might be represented more compactly, but the
"outlier" might have an expanded representation. Obviously, a
discriminatory data representation is a no-no.

"Encoding" is a related, but distinct concept. This describes how you
inherently represent values from a data set, not any post-processing
(compression) algorithm you might apply to that stored representation.
Taking votes again, you might decide (inefficiently) to represent a
vote on a single-selection 7-candidate race as a string of 7 bits,
where no more than one of those bits is set to a one. That succeeds at
storing the vote, but it takes 7 bits to do it. In fact there are only
8 legal selections in this contest: each candidate, plus "No
Preference." The "entropy" in this 7-candidate race is 3 bits. A good
representation would be:

   No Preference: 000
   Candidate 1: 001
   Candidate 2: 010
   Candidate 3: 011
   Candidate 4: 100
   Candidate 5: 101
   Candidate 6: 110
   Candidate 7: 111

But using this optimal encoding isn't the same thing as applying a
later compression step. It comes earlier in the storage process.
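
A small sketch of using that table in practice: pack one such race into
each 3-bit field (0 = No Preference, 1-7 = the candidates; the helper
names are mine, for illustration):

    def pack_races(selections):
        """Pack a list of 0-7 selections into one integer, 3 bits apiece."""
        packed = 0
        for sel in selections:
            assert 0 <= sel <= 7               # only legal selections fit
            packed = (packed << 3) | sel
        return packed

    def unpack_races(packed, count):
        """Recover the list of selections from the packed integer."""
        sels = [(packed >> (3 * i)) & 0b111 for i in range(count)]
        return list(reversed(sels))

    ballot = [1, 0, 7, 3]                      # four such races on a ballot
    assert unpack_races(pack_races(ballot), len(ballot)) == ballot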

That said, even using the above -optimal- encoding, compression of a
collection of ballots is still entirely possible. Probably Candidate 1
gets a lot more votes than Candidate 7 (at least as most election
rules set it up). So the bit-string '001' occurs a lot more often than
'111'--Huffman compression could kick in. But this is a feature of
ballot corpora, not something applicable in a non-discriminatory
fashion to an arbitrary ballot.

==================================================================
= The content of this message, with the exception of any external
= quotations under fair use, is released to the Public Domain
==================================================================