From: charlie strauss <cems_at_earthlink_dot_net>

Date: Sun May 02 2004 - 01:19:43 CDT

As someone who actually got their PhD doing information theory, I'm really bewildered by the barcode-length thread. I've tried to read all the posts, but maybe I missed something, and I hope what I say next is not utterly newbie.

First, you can't use lossy compression--this was mentioned.

Second, lossless compression cannot reduce the maximum message size; in fact it can only increase it, always. Thus it is only useful if you need to reduce the median or average message size--this too was mentioned.
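To make the point concrete, here's a small sketch using Python's standard zlib module (my choice of library, not anything proposed on this list): incompressible input can only get bigger, while redundant input shrinks.

```python
import os
import zlib

# Incompressible input: lossless compression has no room to shrink it,
# and container overhead (header, checksum) makes the output longer.
random_data = os.urandom(1000)
compressed = zlib.compress(random_data)
assert len(compressed) > len(random_data)

# Highly redundant input: the same algorithm shrinks it dramatically.
repetitive = b"a" * 1000
assert len(zlib.compress(repetitive)) < len(repetitive) // 10
```

A worst-case ballot behaves like the random bytes here, which is why compression can't help with the maximum barcode size.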

Third, don't confuse compression with optimal encoding. You can encode the vote in such a way--without using any compression--that its maximum size is minimal. Here is an optimal recipe (don't bother looking for a better one): enumerate every possible ballot and assign an ordinal number to it. Enumeration is of course impossible in practice for any large ballot. Instead, simply divide the ballot into independent sections, each of which is as large as can feasibly be ordinated on a regular computer. Ordinate each of these for each block on the barcode, and use a look-up table to encode and invert each ordination. This will be very nearly optimal space-wise yet feasible.
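One way to ordinate a section without materializing the full enumeration is mixed-radix arithmetic; here is a sketch (the race sizes are made up for illustration):

```python
from math import ceil, log2, prod

def encode_ballot(choices, radices):
    """Pack one selection per race into a single ordinal (mixed-radix)."""
    n = 0
    for c, r in zip(choices, radices):
        assert 0 <= c < r
        n = n * r + c
    return n

def decode_ballot(n, radices):
    """Invert encode_ballot by peeling off one radix at a time."""
    choices = []
    for r in reversed(radices):
        n, c = divmod(n, r)
        choices.append(c)
    return list(reversed(choices))

radices = [8, 3, 5, 12]          # legal selections per race, incl. "No Preference"
ballot = [3, 2, 0, 11]
n = encode_ballot(ballot, radices)
assert decode_ballot(n, radices) == ballot

# The largest ordinal needs ceil(log2(8*3*5*12)) = 11 bits, versus
# 8+3+5+12 = 28 bits for a one-bit-per-candidate encoding.
assert ceil(log2(prod(radices))) == 11
```

The ordinal is optimal for the worst case because every legal ballot maps to a distinct number below the product of the radices, and no encoding can do with fewer bits than the log of that count.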

Fourth, don't confuse error-correction encoding with ballot design. Designing your error-detection scheme around errors showing up as "overvotes" is suboptimal both for data storage and as an error-correction device. The proper way to do this is to understand the error model of barcode reading and printing: what sorts of errors occur? I don't know the answer for barcodes, so to illustrate what I mean I'll give an example where I do know the answer. On disk reads it is very common for errors to occur in runs of consecutive lost bits rather than randomly spaced bits, so single-bit error correction would fail. One easy solution is to Hamming-encode each byte for single-bit error correction as before, but then interleave the bits of a block so that, for example, the first bits of a block of 16 bytes are consecutive (i.e., the most significant bit from each of the 16 bytes), then the second most significant bits of the block (another 16 bits), and so on. Now you can tolerate large numbers of consecutive bits being lost without losing any data the error-correction code can't recover. (The example I gave is not optimal, but it is illustrative.)
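A sketch of just the interleaving step (the per-byte Hamming code is left out; the assertion at the end shows why a single-error-per-byte code would suffice after a burst):

```python
def interleave(block):
    """Bit-matrix transpose of a 16-byte block: emit bit i of every
    byte before bit i+1 of any byte."""
    assert len(block) == 16
    bits = [(b >> (7 - i)) & 1 for i in range(8) for b in block]
    out = bytearray()
    for k in range(0, 128, 8):
        byte = 0
        for bit in bits[k:k + 8]:
            byte = (byte << 1) | bit
        out.append(byte)
    return bytes(out)

def deinterleave(data):
    """Invert interleave: original byte b's bit i sits at flat index i*16+b."""
    bits = [(byte >> (7 - i)) & 1 for byte in data for i in range(8)]
    out = bytearray()
    for b in range(16):
        byte = 0
        for i in range(8):
            byte = (byte << 1) | bits[i * 16 + b]
        out.append(byte)
    return bytes(out)

def flip_burst(data, start, length):
    """Corrupt `length` consecutive bits, as a scratch across a barcode might."""
    data = bytearray(data)
    for j in range(start, start + length):
        data[j // 8] ^= 1 << (7 - j % 8)
    return bytes(data)

original = bytes(range(16))
assert deinterleave(interleave(original)) == original

# A 16-bit burst in the interleaved stream lands as at most one flipped
# bit per original byte -- exactly what a per-byte Hamming code can fix.
damaged = deinterleave(flip_burst(interleave(original), 40, 16))
for a, b in zip(original, damaged):
    assert bin(a ^ b).count("1") <= 1
```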

Fifth, and I've mentioned this before: as part of the error-correction scheme, consider writing error-correction data from one ballot onto other ballots. It's conceivable that one could, for example, put checksums or full copies of, say, ten earlier ballots onto the back of the current ballot (encrypted and out of order for privacy). On average this would make all the paper interlocked. The only problem with this idea that I can see is if the touch screen is not aware of discarded ballots.
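A toy sketch of the interlocking idea, using plain SHA-256 digests in place of the encrypted checksums described above (the record layout and function names are invented for illustration):

```python
import hashlib
import random

def stamp_ballot(ballot, earlier, n=10):
    """Pair this ballot with digests of up to n earlier ballots,
    shuffled so their order reveals nothing about when each was cast."""
    digests = [hashlib.sha256(b).hexdigest() for b in earlier[-n:]]
    random.shuffle(digests)
    return {"ballot": ballot, "backrefs": digests}

def audit(record, all_ballots):
    """Every back-reference must match some ballot actually cast."""
    known = {hashlib.sha256(b).hexdigest() for b in all_ballots}
    return all(d in known for d in record["backrefs"])

cast = [f"ballot {i}".encode() for i in range(25)]
record = stamp_ballot(cast[-1], cast[:-1])
assert audit(record, cast)
assert len(record["backrefs"]) == 10
```

A real design would encrypt the back-references as the message suggests, and would need the discarded-ballot problem solved before the digests could be trusted.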

-----Original Message-----

From: David Mertz <voting-project@gnosis.cx>

Sent: May 1, 2004 8:15 PM

To: voting-project@lists.sonic.net

Subject: [voting-project] Compression, encoding, entropy

On May 1, 2004, at 5:23 PM, Alan Dechert wrote:

> After discussing this some with David Mertz, it occurred to me I may
> just not be speaking your language.

Wikipedia.org has an excellent entry on data compression (as well as

many great entries on many topics). I also would recommend an

introduction to compression that I wrote as an article, then later

included as an appendix to my book. You can read that at:

http://gnosis.cx/TPiP/appendix_b.txt

A reason I recommend my article is that it's a sort of mystery story.

I first explain a number of common compression techniques as applied to

a sample data set. Then the surprise ending is that you could have

done better than any of the described compression techniques by

encoding the data differently to start with.

In brief, "compression" is an algorithm that when applied to a data

source, -typically- reduces the size of the data representation. How

well the compression does depends in part on the encoding of the data

that was used to start with. Compression algorithms are designed

around corpora of data. The goal of a compression algorithm is to

reduce the size of the most typical data sets encountered; but a side

effect of their manner of design is that they often fail to reduce the

size of atypical data sets (or even increase their size).

For example, Huffman compression uses variable numbers of bits to

encode different values. In a collection of documents, this works

great. For example, English texts contain a lot of "e"s and not very

many "q"s. For most texts, using fewer bits to encode "e"s than "q"s

makes for an efficient system. But if you apply a general English

Huffman table to the string "qqqqqqqqqqq", you're going to get a big

data representation.
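The "qqqqqqqqqqq" point can be demonstrated with a small Huffman builder (the letter frequencies below are toy numbers loosely modeled on English, not a real corpus table):

```python
import heapq

def huffman_code(freqs):
    """Build a Huffman code table (symbol -> bit string) from frequencies."""
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

freqs = {"e": 127, "t": 91, "a": 82, "o": 75, "n": 67, "q": 1}
codes = huffman_code(freqs)
assert len(codes["q"]) > len(codes["e"])  # rare symbol gets the longer code

# A fixed-width code for 6 symbols needs 3 bits each; the Huffman table
# wins on typical text but loses badly on the atypical all-"q" string.
atypical = "qqqqqqqqqqq"
huffman_bits = sum(len(codes[ch]) for ch in atypical)
assert huffman_bits > 3 * len(atypical)
```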

Taking this to ballots, an algorithm might encode the "likely"

candidates with fewer bits than the "unlikely" candidates. Al Gore and

George Bush votes might be encoded with fewer bits than those for Harry

Browne. For most ballots this would make for a compact representation.

But the Libertarian voter needs more bits to store her ballot data.

The "typical" ballot might be represented more compactly, but the

"outlier" might have an expanded representation. Obviously, a

discriminatory data representation is a no-no.

"Encoding" is a related, but distinct concept. This describes how you

inherently represent values from a data set, not any post-processing

(compression) algorithm you might apply to that stored representation.

Taking votes again, you might decide (inefficiently) to represent a

vote on a single-selection 7-candidate race as a string of 7 bits,

where no more than one of those bits is set to a one. That succeeds at

storing the vote, but it takes 7 bits to do it. In fact there are only

8 legal selections in this contest: each candidate, plus "No

Preference." The "entropy" in this 7-candidate race is 3 bits. A good

representation would be:

No Preference: 000

Candidate 1: 001

Candidate 2: 010

Candidate 3: 011

Candidate 4: 100

Candidate 5: 101

Candidate 6: 110

Candidate 7: 111

But using this optimal encoding isn't the same thing as applying a

later compression step. It comes earlier in the storage process.
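The table above is just a binary count-up; a few lines suffice to generate it and check the bit budget:

```python
from math import ceil, log2

selections = ["No Preference"] + [f"Candidate {i}" for i in range(1, 8)]
bits = ceil(log2(len(selections)))   # 8 legal selections -> 3 bits of entropy
table = {name: format(i, f"0{bits}b") for i, name in enumerate(selections)}

assert bits == 3
assert table["No Preference"] == "000"
assert table["Candidate 7"] == "111"
# Contrast with the one-bit-per-candidate scheme, which takes 7 bits.
```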

That said, even using the above -optimal- encoding, compression of a

collection of ballots is still well possible. Probably Candidate 1

gets a lot more votes than Candidate 7 (at least as most election

rules set it up). So the bit-string '001' occurs a lot more than

'111'--Huffman compression could kick in. But this is a feature of

ballot corpora, not something applicable in a non-discriminatory

fashion to an arbitrary ballot.

==================================================================

= The content of this message, with the exception of any external

= quotations under fair use, are released to the Public Domain

==================================================================

Received on Mon May 31 23:17:02 2004

This archive was generated by hypermail 2.1.8 : Mon May 31 2004 - 23:18:15 CDT