Re: Compression, encoding, entropy

From: charlie strauss <cems_at_earthlink_dot_net>
Date: Mon May 03 2004 - 08:53:16 CDT

Arthur, I have a couple questions
I gather the intent of self-delimiting codes is to divide a message into phrases each of which is self-demimiting. And that the virtue of self delimiting codes used on a series of phrases is that if you damage an arbitrarily large number of bits within a phrase then while that phrase may be unreadable the next phrase is isolated from being corrupted or getting out-of-frame. Is that right?

Dont self-delimiting codes imply a model where phrases are written consecutively in the bar codes, as to oppose being interleaved? if so would the the decision to use a self delmiting code prevent the ability to re-arragne the bits of the message into an order that might be more robust to certain types of bar-code misreads (say a string of bits in error) for certain types of error-correcting codes.

finally are you proposing a new desideratta that if there is a read/write error that is so large that it cannot be compensated by an error correcting code that at least it's damage should be isolated to a single contest or set of contests and that the remainder of the ballot be readable? As opposed to simply invaidating the whole ballot?

It seems like there might be some virtue in at least separating the main message and a checksum into different self-demilited phrases to increase the chances that a read-write error is so sever that the ECC cant detect the corruption (making it appear to a valid message).

-----Original Message-----
From: Arthur Keller <arthur@kellers.org>
Sent: May 3, 2004 2:57 AM
To: voting-project@lists.sonic.net
Cc: voting-project@lists.sonic.net
Subject: Re: [voting-project] Compression, encoding, entropy

At 12:50 AM -0400 5/3/04, David Mertz wrote:
>>>>>With 3 bits (it's octal, remember?), you can code up to 3
>>>>>choices, and with 6 bits you can code
>>>>>up to 19 choices. And 9 bits can code up to 147 choices.
>>>I haven't read the paper (I glanced at it when it was referenced
>>>some months ago). However, just based on the abstract, I can see
>>>that it's not an -optimal- encoding.
>
>On May 3, 2004, at 12:22 AM, Arthur Keller wrote:
>>There is an optimality proof for self-delimiting encodings
>
>Right... optimal *for self-delimiting encodings*
>
>Which is to say: NOT optimal among all encodings. Or specifically,
>an optimal (non-delimited) 3-bit code encodes 8 options, not just 3
>(and similarly for other bit lengths).

True enough.

>>>My hunch is that this isn't really important, since we can decide
>>>based on the type and number of options the bit length--and we
>>>HAVE the ballot-election.xml configuration in advance.
>>It allows for a uniform algorithm that is independent of the number
>>of possibilities. The advantage of such a scheme is that the
>>decoding is simpler when the encoding relies less on context.
>
>I tend to disagree here. But not so much that I couldn't perhaps be
>convinced otherwise.
>
>To my mind, requiring the encoding to EXPLICITLY rely on prior
>knowledge of the contest context acts as an additional cross-check
>on consistency between ballot template and cast ballot. ECCs do a
>better job with the integrity constraint you want than do
>self-delimiting contests.

It's a both - and, not an either - or.

>FWIW though, could you (Arthur) produce a little utility similar to
>election-entropy.py that would let members determine the exact
>bit-length requirement to encode ballots using optimal
>self-delimiting encodings? That would let us do a comparison of the
>info requirements of historical elections, comparing delimited and
>non-delimited optimal encoding styles.

I'm not a python programmer, so I can't. But the formula is:

Let n be the highest number being encoded
Let d be the number of bits in a digit
Let b be the base (2**n)
Let m be one-half of the base
The number of digits, g = floor (log_m(n) +1)
The number of bits is d*g
The efficiency in encoding compared with straight binary is:
(d*g)/floor(lg(n)+1)

>Another possible thing is to explicitly list where the contest
>breaks occur within the ballot, i.e. "contest 1: 3 bits; contest 2:
>7 bits; etc." I'm not sure exactly how much space that description
>would take (if encoded compactly). But the info would not normally
>need to be part of the BRP or BVA scan, it would just act as a
>validation during debugging or a challenge (i.e. it could go
>elsewhere on the ballot: another barcode or printed numbers or an
>annotation to the barcode).

That can be done with a length prefix. The problem is how long is a
length prefix is needed?

>>Relying on context is problematic and a potential source of errors.
>>Redundancy is helpful to reduce errors.
>
>I think the contest delimiter is a much less useful type of
>redundancy than are ECCs or election context. Or Karl's globally
>unique election identifier prefix.

I'm saying that it is easier to parse without context, and therefore
less bug prone than using context. Certainly ECCs (and digital
signatures) are useful as well, and context is a further check.
Remember, redundancy is your friend.

>>was it Steve Chessin who suggested placing phonemes (for
>>pronouncing the name) as part of each vote in the bar code, so the
>>BVA wouldn't need its own collection of sound files? I haven't
>>though enough about the implications of this idea.
>
>That's a neat idea.
>
>Certainly if we do that, we are nowhere near the 100, or even 150,
>bits we can fit in Code128. But something like that should fit in
>common 2-D codes.

That's correct. Personally, I like the idea of a fixed station under
which you place the ballot rather than a wand scanner type device.
If you place the ballot against a guide, you get easy registration.
And if you place a scanner above and below the ballot, you can read
the ballot in any orientation, although at a higher cost.

Best regards,
Arthur

-- 
-------------------------------------------------------------------------------
Arthur M. Keller, Ph.D., 3881 Corina Way, Palo Alto, CA  94303-4507
tel +1(650)424-0202, fax +1(650)424-0424
==================================================================
= The content of this message, with the exception of any external 
= quotations under fair use, are released to the Public Domain    
==================================================================
Received on Mon May 31 23:17:04 2004

This archive was generated by hypermail 2.1.8 : Mon May 31 2004 - 23:18:15 CDT