Re: OCR/barcode reliability

From: David Mertz <voting-project_at_gnosis_dot_cx>
Date: Thu Jun 03 2004 - 13:20:17 CDT

Arthur Keller <arthur@kellers.org> wrote:
|Modularity precludes using the contests for decoding the OCR. (For
|verifying a correct read, yes, but not for determining *what* was
|read.)

Did you read my note that gave a Levenshtein distance example?

I'm not sure what principle you have in mind under the name
modularity. Certainly the OCR software should be generic, if only
because such packages are widely tested outside OVC (and it takes
thousands of programmer hours to develop such software; we might as well
benefit from the Free Software community).

But that's only a first pass. After you generically read a ballot, you
can apply another pass to make sense of it. That is, if a given name
does not exactly match any available candidate (and is not marked as a
write-in), there is no reason not to figure out what the intention was.
Certainly not anything having to do with modularity.

Obviously, we need to decide parameters. If what was read has a
Levenshtein distance of 2 from one valid candidate, and a distance of
over 50 from every other candidate in that contest, I feel entirely
comfortable declaring the intention as the near match. However, if two
of the candidates are:

    Maria Cruz
    Mario Crump

I wouldn't want to make any guesses about the apparent value of:

    Ma*ic Cruo
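
To make the ambiguity concrete, here is a minimal sketch that computes the distances for the example above. It assumes a plain dynamic-programming Levenshtein function; the names `levenshtein`, `candidates`, and `read` are illustrative only and are not taken from any OVC code.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character inserts, deletes, substitutions."""
    # prev[j] is the distance between the prefix of a handled so far and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute, or match for free
        prev = curr
    return prev[-1]


candidates = ["Maria Cruz", "Mario Crump"]
read = "Ma*ic Cruo"  # what the OCR pass reported

distances = {name: levenshtein(read, name) for name in candidates}
print(distances)  # {'Maria Cruz': 3, 'Mario Crump': 4}
```

Distances of 3 and 4 are too close to separate the two names, which is why no guess should be made here.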

It's not just the absolute Levenshtein distance we should look at, but
the distribution of distances across all the valid names. Enough skew
gives enough confidence. In any case, it's easy to flag EVERY
non-perfect match as requiring manual confirmation (call the fuzzy
match "provisional results").
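
One way such a rule might look, as a sketch only: the function below takes a mapping from candidate name to distance and returns a name plus a status. The function name `classify`, the status strings, and the thresholds `max_best` and `min_gap` are all assumptions made for illustration; the actual parameters are exactly what still needs to be decided.

```python
def classify(distances, max_best=2, min_gap=10):
    """Pick a candidate from a {name: distance} mapping, or defer to a human.

    max_best: largest distance accepted for the nearest candidate (placeholder).
    min_gap:  required lead over the runner-up, i.e. the "skew" (placeholder).
    """
    ranked = sorted(distances.items(), key=lambda item: item[1])
    best_name, best = ranked[0]
    if best == 0:
        return best_name, "exact"
    # With a single candidate there is no runner-up to confuse it with.
    runner_up = ranked[1][1] if len(ranked) > 1 else float("inf")
    if best <= max_best and runner_up - best >= min_gap:
        # Every fuzzy match stays provisional until manually confirmed.
        return best_name, "provisional"
    return None, "manual review"


# One near match, everything else far away: accept provisionally.
print(classify({"Candidate A": 2, "Candidate B": 51}))
# Two near matches: refuse to guess.
print(classify({"Maria Cruz": 3, "Mario Crump": 4}))
```

Under these placeholder thresholds the first call yields a provisional match for "Candidate A" and the second is sent to manual review.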

In edge cases, Karl's chocolate-covered voters might necessitate manual
examination of ballots.
==================================================================
= The content of this message, with the exception of any external
= quotations under fair use, are released to the Public Domain
==================================================================
Received on Wed Jun 30 23:17:05 2004