extract text from PDF, was Re: Shamos Rebuttal, Draft 1

From: Cameron L. Spitzer <cls_at_truffula_dot_sj_dot_ca_dot_us>
Date: Mon May 02 2005 - 14:09:04 CDT

>From: David Mertz <voting-project@gnosis.cx>
>Subject: Re: [OVC-discuss] Shamos Rebuttal, Draft 1
>Date: Sun, 1 May 2005 22:31:17 -0400
>To: Open Voting Consortium discussion list <ovc-discuss@listman.sonic.net>

>> I'll post a DOC for the next version. I generated the PDF by "save as
>> PDF" in Mac Word. Odd that it doesn't allow editing.

>PDF is not an editable format, after all (well... it *is*, but not
>easily). But it should allow cut-and-paste. Maybe Word adds the nasty
>permission restrictions that PDF enables. FWIW, some readers other
>than Adobe Acrobat are sensible enough to ignore those
>permissions--e.g. GhostView (not sure about OSX's Preview in this
>regard).

Please take a look at Xpdf. http://www.foolabs.com/xpdf/
http://packages.debian.org/stable/text/xpdf
http://cygwin.com/cgi-bin2/package-cat.cgi?file=xpdf%2Fxpdf-3.00-1&grep=xpdf
There's a GUI doc viewer that's better than Acroread in
some ways and worse in others. You can select-copy a rectangle of
text and paste it elsewhere, a critical feature missing from Acroread.
It also comes with some command-line utilities for extracting text
from a PDF. As shipped, it respects the PDF copy-permission flag
(the "copyright bit"), but it's easy to find that check in the
source and disable it, if you do things like that.
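
For anyone who wants to script the extraction, here is a rough
sketch of calling the pdftotext utility from Python. It assumes
pdftotext is installed and on your PATH. The function names and
the file name "rebuttal.pdf" are made up for illustration; the
-f, -l and -layout options and "-" for standard output are the
ones I know pdftotext accepts.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path, first_page=None, last_page=None, keep_layout=False):
    """Build the argument list for pdftotext (hypothetical helper).

    The trailing "-" asks pdftotext to write the text to standard output
    instead of creating a .txt file next to the PDF.
    """
    cmd = ["pdftotext"]
    if first_page is not None:
        cmd += ["-f", str(first_page)]   # first page to convert
    if last_page is not None:
        cmd += ["-l", str(last_page)]    # last page to convert
    if keep_layout:
        cmd.append("-layout")            # try to keep the physical layout
    cmd += [pdf_path, "-"]
    return cmd


def extract_text(pdf_path, **options):
    """Run pdftotext and return the extracted text as a string."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH; is Xpdf installed?")
    result = subprocess.run(
        pdftotext_command(pdf_path, **options),
        capture_output=True,
        check=True,  # raise if pdftotext exits with an error
    )
    return result.stdout.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # "rebuttal.pdf" is a placeholder file name, not a real attachment.
    print(extract_text("rebuttal.pdf", first_page=1, last_page=2))
```

If the PDF has copying disabled, a stock pdftotext will refuse
and exit with an error, which shows up here as an exception.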

Please consider sharing documents in .sxw (OpenOffice.org Writer) rather
than .DOC (Microsoft Word). The functionality is similar, but the tools
are free (as in speech), the format is open, and comparable document
files are usually much smaller. (A .sxw is actually
a collection of XML and image files in a ZIP archive.)
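
To illustrate the point about the ZIP container, here is a small
Python sketch that lists what is inside such a file and pulls the
plain text out of content.xml, which is where Writer keeps the
document body. The function names are my own, and sample_archive()
builds a tiny in-memory stand-in (not a valid Writer document) so
the example runs without a real .sxw on hand.

```python
import io
import xml.etree.ElementTree as ET
import zipfile


def list_members(sxw_file):
    """Return the names of the files stored inside the archive."""
    with zipfile.ZipFile(sxw_file) as archive:
        return archive.namelist()


def body_text(sxw_file):
    """Return the text found in content.xml, with all markup dropped."""
    with zipfile.ZipFile(sxw_file) as archive:
        root = ET.fromstring(archive.read("content.xml"))
    # itertext() walks every element and yields only the text between tags.
    return " ".join(chunk.strip() for chunk in root.itertext() if chunk.strip())


def sample_archive():
    """Build a minimal stand-in archive in memory (hypothetical test data)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("content.xml", "<doc><p>Hello</p><p>world</p></doc>")
    return buffer


if __name__ == "__main__":
    demo = sample_archive()
    print(list_members(demo))
    print(body_text(demo))
```

Both functions also accept a path, so you can point them at a
real .sxw file instead of the in-memory sample.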

Sorry if this is old news; I'm new here.

Cameron

_______________________________________________
OVC discuss mailing lists
Send requests to subscribe or unsubscribe to arthur@openvotingconsortium.org
==================================================================
= The content of this message, with the exception of any external
= quotations under fair use, are released to the Public Domain
==================================================================