Re: OCR/barcode reliability

From: charlie strauss <cems_at_earthlink_dot_net>
Date: Thu Jun 03 2004 - 13:59:31 CDT

I'd like to suggest a requirement that any method of reading, barcode or OCR, has to be fast enough to keep up with a sheet feeder. That is, it would perhaps be a good idea to be able to recount 100,000 ballots per day on a reasonably sized piece of machinery. Or make up your own reasonable number. The point is that if you have to do a recount after the precincts have closed, it would be nice to be able to do it with a small staff using automated scanners, and that is going to require high throughput.
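
For a rough sense of what that target means per ballot, here is a back-of-the-envelope sketch. The 100,000 figure is the one suggested above; the 8-hour and 24-hour operating windows and the name `required_rate` are assumptions made only for illustration.

```python
# Back-of-the-envelope throughput for a recount target.
# The operating windows (8 h and 24 h) are illustrative assumptions.

BALLOTS_PER_DAY = 100_000


def required_rate(ballots, hours):
    """Ballots per second needed to finish `ballots` in `hours` of scanning."""
    return ballots / (hours * 3600)


for hours in (8, 24):
    rate = required_rate(BALLOTS_PER_DAY, hours)
    print(f"{hours:>2} h: {rate:.2f} ballots/s, {1000 / rate:.0f} ms per ballot")
#  8 h: 3.47 ballots/s, 288 ms per ballot
# 24 h: 1.16 ballots/s, 864 ms per ballot
```

So on a single scanner and a single working shift, feeding, imaging and decoding each ballot would all have to fit within roughly a third of a second.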

Barcodes can do this. Bank checks, which are pretty simple, can do this. Generalized OCR of the kind you use to scan arbitrary documents will be hard pressed. So my supposition is that it's going to have to be very tailored OCR if you want to meet the above criteria.

As long as we are brainstorming about how to resolve ambiguous OCR reads (e.g. Levenshtein distance), I'll make two comments. One is that I would assume this is a well-plowed row in some body of literature, but I'm not personally familiar with it (aside from sequence alignment in genomics). Two is that recognizing and then aligning characters might not be as good or as fast as simply matching the image directly. For example, once the page orientation is known, simply use a standard Fourier transform to convolve it against the possible correctly typed items, and voila. One could do better than a Fourier convolution given a proper error model (e.g. use wavelets if image stretching is allowed). Given that we're matching to a very, very, very small set of possible answers, it makes sense to match the responses holistically as an image, not character by character. The FT approach would potentially be very fast and able to keep up with a sheet feeder, if this were to be done on a lot of ballots.
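
A minimal sketch of the whole-image matching idea, under loud assumptions: tiny 8x8 binary bitmaps stand in for the scanned field and for the renderings of the valid choices, and the score is the peak of the circular cross-correlation (computed via the Fourier transform) divided by the image energies. The names `fft`, `fft2`, `ncc_peak`, `best_match` and the bitmaps themselves are hypothetical. The pure-Python radix-2 FFT only handles power-of-two sizes; a real scanner pipeline would presumably use an optimized FFT library.

```python
import cmath
import math


def fft(x, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def fft2(img, inverse=False):
    """2-D FFT: transform every row, then every column."""
    rows = [fft(r, inverse) for r in img]
    cols = [fft(list(c), inverse) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]


def ncc_peak(image, template):
    """Peak of the circular cross-correlation, scaled so a shifted copy scores 1."""
    f, g = fft2(image), fft2(template)
    product = [[a * b.conjugate() for a, b in zip(fr, gr)] for fr, gr in zip(f, g)]
    corr = fft2(product, inverse=True)
    size = len(image) * len(image[0])  # the inverse FFT above is unnormalized
    peak = max(v.real for row in corr for v in row) / size
    energy = math.sqrt(sum(p * p for r in image for p in r)
                       * sum(p * p for r in template for p in r))
    return peak / energy


def best_match(image, templates):
    """Return (name, score) of the template that correlates best with image."""
    scores = {name: ncc_peak(image, t) for name, t in templates.items()}
    name = max(scores, key=scores.get)
    return name, scores[name]


def bitmap(rows):
    return [[1.0 if ch == "#" else 0.0 for ch in row] for row in rows]


# Hypothetical "valid answers": two toy glyphs standing in for printed names.
TEMPLATES = {
    "cross": bitmap(["........", "...#....", "...#....", ".#####..",
                     "...#....", "...#....", "........", "........"]),
    "box": bitmap(["........", ".#####..", ".#...#..", ".#...#..",
                   ".#...#..", ".#####..", "........", "........"]),
}

# A "scan": the cross shifted down 1 and right 2, plus one speck of dirt.
SCANNED = bitmap(["........", "........", ".....#..", ".....#..",
                  "...#####", ".....#..", ".....#..", "#......."])

name, score = best_match(SCANNED, TEMPLATES)
print(name, f"{score:.2f}")  # cross 0.95
```

Because the correlation is evaluated at every shift at once, small registration errors from the sheet feeder do not need to be corrected before matching; rotation and stretching are not handled by this sketch.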

-----Original Message-----
From: David Mertz <voting-project_at_gnosis_dot_cx>
Sent: Jun 3, 2004 11:20 AM
To: voting-project@lists.sonic.net
Subject: Re: [voting-project] OCR/barcode reliability

Arthur Keller <arthur@kellers.org> wrote:
|Modularity precludes using the contests for decoding the OCR. (For
|verifying a correct read, yes, but not for determining *what* was
|read.)

Did you read my note that gave a Levenshtein distance example?

I'm not sure what principle you think you have in mind under the name
modularity. Certainly the OCR software should be generic, if only
because such packages are widely tested outside OVC (and it takes
thousands of programmer hours to develop such software; we might as well
benefit from the Free Software community).

But that's only a first pass. After you generically read a ballot, you
can apply another pass to make sense of it. That is, if a given name
does not exactly match any available candidate (and is not marked as a
write-in), there is no reason not to figure out what the intention was.
Certainly not anything having to do with modularity.

Obviously, we need to decide parameters. If what was read has a
Levenshtein distance of 2 from one valid candidate, and a distance of
over 50 from every other candidate in that contest, I feel entirely
comfortable declaring the intention as the near match. However, if two
of the candidates are:

    Maria Cruz
    Mario Crump

I wouldn't want to make any guesses about the apparent value of:

    Ma*ic Cruo

It's not just the absolute Levenshtein distance we should look at, but
the distribution of them to all the valid names. Enough skew is enough
confidence. In any case, it's easy to flag EVERY non-perfect match as
requiring manual confirmation (call the fuzzy match "provisional
results").
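
The rule described above can be sketched roughly as follows. This is only an illustration: the thresholds `max_dist` and `min_gap`, the function name `resolve`, and the status labels are hypothetical placeholders for whatever parameters are eventually agreed on.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def resolve(read, candidates, max_dist=2, min_gap=5):
    """Map an OCR read to a candidate, or defer to a human.

    Returns (name, status): "exact" for a perfect match, "provisional" for a
    fuzzy match that is both close (<= max_dist) and well separated from the
    runner-up (>= min_gap), and (None, "manual") otherwise.
    """
    ranked = sorted((levenshtein(read, c), c) for c in candidates)
    best_dist, best = ranked[0]
    if best_dist == 0:
        return best, "exact"
    runner_up = ranked[1][0] if len(ranked) > 1 else float("inf")
    if best_dist <= max_dist and runner_up - best_dist >= min_gap:
        return best, "provisional"
    return None, "manual"


contest = ["Maria Cruz", "Mario Crump"]
print(resolve("Maria Cruz", contest))   # ('Maria Cruz', 'exact')
print(resolve("Ma*ic Cruo", contest))   # (None, 'manual')
print(resolve("Maria Crux", ["Maria Cruz", "John Smith"]))
# ('Maria Cruz', 'provisional')
```

The read `Ma*ic Cruo` is at distance 3 from one name and 4 from the other, so the skew is far too small and the ballot goes to manual examination.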

In edge cases, Karl's chocolate-covered voters might necessitate manual
examination of ballots.
==================================================================
= The content of this message, with the exception of any external
= quotations under fair use, are released to the Public Domain
==================================================================
Received on Wed Jun 30 23:17:05 2004
