From: David Mertz <voting-project_at_gnosis_dot_cx>

Date: Sat May 01 2004 - 22:15:57 CDT

On May 1, 2004, at 5:23 PM, Alan Dechert wrote:

> After discussing this some with David Mertz, it occurred to me I may
> just not be speaking your language.

Wikipedia.org has an excellent entry on data compression (as well as

many great entries on many topics). I also would recommend an

introduction to compression that I wrote as an article, then later

included as an appendix to my book. You can read that at:

http://gnosis.cx/TPiP/appendix_b.txt

A reason I recommend my article is that it's a sort of mystery story.

I first explain a number of common compression techniques as applied to

a sample data set. Then the surprise ending is that you could have

done better than any of the described compression techniques by

encoding the data differently to start with.

In brief, "compression" is an algorithm that, when applied to a data

source, -typically- reduces the size of the data representation. How

well the compression does depends in part on the encoding of the data

that was used to start with. Compression algorithms are designed

around corpora of data. The goal of a compression algorithm is to

reduce the size of the most typical data sets encountered; but a side

effect of their manner of design is that they often fail to reduce the

size of atypical data sets (or even increase their size).

For example, Huffman compression uses variable numbers of bits to

encode different values. In a collection of documents, this works

great. For example, English texts contain a lot of "e"s and not very

many "q"s. For most texts, using fewer bits to encode "e"s than "q"s

makes for an efficient system. But if you apply a general English

Huffman table to the string "qqqqqqqqqqq", you're going to get a big

data representation.
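The effect is easy to see in code. Below is a minimal Huffman-code
builder in Python -- a sketch I am adding for illustration, with
made-up letter frequencies loosely modeled on English, not any table
from a real corpus:

```python
import heapq

def huffman_codes(freqs):
    """Build a prefix-free Huffman code table from {symbol: frequency}."""
    # Heap entries are (weight, tiebreak, {symbol: code-so-far});
    # the tiebreak integer keeps heapq from ever comparing the dicts.
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        # Merging two subtrees prepends one more bit to every code inside.
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, count, merged))
        count += 1
    return heap[0][2]

# Hypothetical frequencies: common letters dominate, "q" is rare.
freqs = {"e": 120, "t": 90, "a": 80, "o": 75, "i": 70,
         "n": 65, "s": 60, "h": 55, "r": 50, "q": 1}
codes = huffman_codes(freqs)

# Frequent "e" gets a short code; rare "q" gets the longest one,
# so an all-"q" text comes out large relative to typical English.
print(len(codes["e"]), len(codes["q"]))
print(len("".join(codes[c] for c in "qqqqqqqqqqq")))
```

Running it shows the "e" code is shorter than the "q" code, which is
exactly why a table tuned for typical English inflates "qqqqqqqqqqq".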

Taking this to ballots, an algorithm might encode the "likely"

candidates with fewer bits than the "unlikely" candidates. Al Gore and

George Bush votes might be encoded with fewer bits than those for Harry

Browne.  For most ballots this would make for a compact representation.

But the Libertarian voter needs more bits to store her ballot data.

The "typical" ballot might be represented more compactly, but the

"outlier" might have an expanded representation. Obviously, a

discriminatory data representation is a no-no.

"Encoding" is a related, but distinct concept. This describes how you

inherently represent values from a data set, not any post-processing

(compression) algorithm you might apply to that stored representation.

Taking votes again, you might decide (inefficiently) to represent a

vote on a single-selection 7-candidate race as a string of 7 bits,

where no more than one of those bits is set to a one. That succeeds at

storing the vote, but it takes 7 bits to do it. In fact there are only

8 legal selections in this contest: each candidate, plus "No

Preference." The "entropy" in this 7-candidate race is 3 bits. A good

representation would be:

No Preference: 000

Candidate 1: 001

Candidate 2: 010

Candidate 3: 011

Candidate 4: 100

Candidate 5: 101

Candidate 6: 110

Candidate 7: 111

But using this optimal encoding isn't the same thing as applying a

later compression step. It comes earlier in the storage process.
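The arithmetic behind that table can be sketched in a few lines of
Python (my illustration, not part of the original discussion):

```python
import math

NUM_CANDIDATES = 7
choices = NUM_CANDIDATES + 1   # each candidate, plus "No Preference"

# One bit per candidate (at most one set) costs 7 bits per race...
one_hot_bits = NUM_CANDIDATES

# ...but a dense binary encoding needs only ceil(log2(8)) = 3 bits.
dense_bits = math.ceil(math.log2(choices))

def encode(selection):
    """Selection 0 is 'No Preference'; selection k is Candidate k."""
    return format(selection, "03b")

def decode(bits):
    return int(bits, 2)

print(one_hot_bits, dense_bits)   # 7 vs. 3 bits per race
print(encode(5))                  # '101', matching the table above
```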

That said, even using the above -optimal- encoding, compression of a

collection of ballots is still quite possible.  Candidate 1 probably

gets a lot more votes than Candidate 7 (at least as most election

rules set it up).  So the bit-string '001' occurs a lot more than

'111'--Huffman compression could kick in.  But this is a feature of

ballot corpora, not something applicable in a non-discriminatory

fashion to an arbitrary ballot.
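To make that concrete: Shannon entropy puts a floor on the average
bits per ballot a corpus-level compressor could achieve.  Here is a
quick calculation over entirely made-up vote shares (two front-runners
dominating), just to show the entropy falls well below the 3 bits the
fixed encoding uses:

```python
import math

# Hypothetical vote shares for the 8 choices in the race above:
shares = [0.02, 0.45, 0.40, 0.05, 0.03, 0.02, 0.02, 0.01]
assert abs(sum(shares) - 1.0) < 1e-9

# Shannon entropy: average bits per ballot for this distribution.
entropy = -sum(p * math.log2(p) for p in shares)

print(round(entropy, 2))   # well under the 3 bits of the fixed encoding
```

A skewed corpus like this one compresses; a uniform distribution over
the 8 choices would have entropy of exactly 3 bits, and no compressor
could improve on the fixed 3-bit encoding.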

==================================================================

= The content of this message, with the exception of any external

= quotations under fair use, is released to the Public Domain

==================================================================

Received on Mon May 31 23:17:02 2004

This archive was generated by hypermail 2.1.8 : Mon May 31 2004 - 23:18:15 CDT