Re: Compression, encoding, entropy

From: David Mertz <voting-project_at_gnosis_dot_cx>
Date: Tue May 04 2004 - 02:20:18 CDT

On May 4, 2004, at 2:56 AM, Arthur Keller wrote:
> You can code in hexadecimal (4 bits, base 16), octal (3 bits, base 8),
> or quad (2 bits, base 4). d is the number of bits in your digital
> encoding.

I managed to figure out in conversation with Charlie Strauss what
Arthur was talking about. The missing part of Arthur's description is
that he is talking about the symbology of a particular barcode. I.e.
one barcode has a 16-letter alphabet, another has a 4-letter alphabet,
etc.

So Arthur's "base" is the size of the barcode symbol set.
Unfortunately, this doesn't allow us to specify in a general way what
the bit-length required for an optimal self-delimited encoding is.
It's always relative to the particular alphabet we've already chosen.
That is, it's of less use in deciding the parameters for what is an
acceptable barcode to use.

I kinda wish Arthur would have indicated that this was what he was
talking about... but I guess since it was a hidden assumption in his
own thinking, it seemed obvious to him (even though it wasn't stated
on-thread).

However, we can still produce an example, assuming a priori a
particular symbology for our yet-undecided barcode. For example, if we
assume our barcode uses a 16-letter alphabet, applying this to the demo
election we get:

$ ./election-entropy.py < demo-election.data
Election summary for OVC demo ballot (write-ins count as candidate)

269995136716800 distinct votes are possible
Optimal encoding is approximately 48 bits
Contests at bit-boundaries, approx 53 bits
Contests self-delimited, approx 88 bits

Vote Space Optimal Self-Delim
---------- ------- ----------
       9 4 8
       9 4 8
       4 2 4
       4 2 4
       5 3 4
       4 2 4
       5 3 4
       4 2 4
       3 2 4
       3 2 4
       3 2 4
     176 8 12
  109601 17 24

To me the self-delimited case looks enough less efficient that optimal
encoding to rule it out if we are still contemplating Code128.
Especially since Code128 in decimal code includes some error-detection
of its own. And even more so if we use Karl's globally unique
identifier for ballots: that ID will include reference to the
particular ballot template that the ballot refers to, i.e. the bit
positions of each contest. So self-delimiting doesn't get us anything
extra.

Yours, David...
==================================================================
= The content of this message, with the exception of any external
= quotations under fair use, are released to the Public Domain
==================================================================
Received on Mon May 31 23:17:09 2004

This archive was generated by hypermail 2.1.8 : Mon May 31 2004 - 23:18:15 CDT