Re: Compression, encoding, entropy

From: David Mertz <voting-project_at_gnosis_dot_cx>
Date: Sun May 02 2004 - 10:35:54 CDT

On May 2, 2004, at 2:19 AM, charlie strauss wrote:
> As someone who actually got their PHd doing information theory I'm
> really bewildered by the barcod length thread

Me too. I would have thought that presenting a provably optimal answer
(along with a script to calculate it relative to a novel/historical
election) would be a sufficient answer to stop further speculation
(about either how to do better than optimal or on various ways to do
worse).

However, Charlie, I certainly welcome your expertise. Please let me
know if I make any errors in my calculations or characterizations. For
example, is my election-entropy.py script correct as far as you can
see? (it's answer look sensible, but it's always possible I made a
programming error).

> Second, lossless compression cannot reduce the maximum message size in
> fact it will only increace it, always.

Almost. As a general principle this is true, but it assumes that all
messages are possible as encodings.

As a counter-example, take a corpora that is specified to contain only
ASCII texts, with each character encoded as an 8-bit byte. We know the
high-bit is never 1 for this corpora. So we can compress by squeezing
the 8 bits down to 7 bits for each character. This compresses every
input, and expands none. We can do this because of an inherent
inefficiency in the initial encoding.

It is conceivable that the demo vote encoding contains an analogous
structural inefficiency that would allow compression in all cases. I'm
not sure whether it does, but it might given its known non-optimality.
However, rather than take our adequate first pass encoding technique as
something set in stone, and try to invent clever compression, it makes
a heck of a lot more sense just to choose an optimal encoding for the
next round.

That optimal encoding is exactly what I told the thread I would write
in the near future; and I will.

> third, dont confuse compression with optimal encoding. you can encode
> the vote in such a way--not using any compression--that its maximum
> size is a minimal. Namely, here is an optimal receipe--dont bother
> looking for a better one: enumerate every possible ballot and assign
> a ordinal number too it.

That's basically what I will do (I have the mentioned script sketched
out mentally, I just need to write it down). But as I indicated, I
will enumerate per-contest, thereby letting each contest fall on a
bit-boundary in its representation. I feel that has enough debugging
advantages to warrant the extra 1/2-bit per contest it will require, on
average, for the encoding. Plus it makes it WAY easier to program in
the first place (and for someone else to program a compatible
reimplementation--e.g. for a BVA).

Btw. Charlie may have missed the large size of ranked preference
contest possibility spaces. You cannot divide one such contest into
lookup blocks, it's part of a single enumeration. But using my posted
ranks.py script to calculate the enumeration size for ranked preference
contests, you find some big numbers. For example, the demos
'rank-8-of-8' County Commissioner race has 109601 distinct legal votes.
  That's a pretty big lookup table. Rank-8-or-10 gets to 2606501, which
is genuinely too big for a table. As an implementation question, I'm
going to use a slower-but-space-efficient style that will count through
the vote space repeatedly, but not store a table. It may take a while
to encode/decode, but it won't take much memory.

> fourth, dont confuse error-correction encoding with ballot design.

Yep. Though this is really a matter of not overselling it. The demo
used a simple, naive encoding design that just in passing provided a
minimal sort of error-detection because of its inefficiency. But
that's just the demo--nothing wrong with what it did, but not something
to design around going forward.

> Fifth, and I've mentioned this before, as part of the error correction
> scheme, consider writing error correction data from one ballot on
> other ballots.

Nope. I still violently disagree with this. For newcomers, this was
hashed out before on the list.

Anything that provides even partial information about the order in
which votes were casts is an unacceptable compromise of anonymity.
Even statistical weakening of anonymity guarantees is a no-go.
Charlie's cross-ballot ECCs looks good from a data-integrity
perspective, but it looks just awful from a cryptographic anonymity
perspective.

Yours, David...
(Ph.D. not in information theory, but I've been around that block).
==================================================================
= The content of this message, with the exception of any external
= quotations under fair use, are released to the Public Domain
==================================================================
Received on Mon May 31 23:17:02 2004

This archive was generated by hypermail 2.1.8 : Mon May 31 2004 - 23:18:15 CDT