extract text from PDF, was Re: Shamos Rebuttal, Draft 1

From: Cameron L. Spitzer <cls_at_truffula_dot_sj_dot_ca_dot_us>
Date: Mon May 02 2005 - 14:09:04 CDT

>From: David Mertz <voting-project@gnosis.cx>
>Subject: Re: [OVC-discuss] Shamos Rebuttal, Draft 1
>Date: Sun, 1 May 2005 22:31:17 -0400
>To: Open Voting Consortium discussion list <ovc-discuss@listman.sonic.net>

>> I'll post a DOC for the next version. I generated the PDF by "save as
>> PDF" in Mac Word. Odd that it doesn't allow editing.

>PDF is not an editable format, after all (well... it *is*, but not
>easily). But it should allow cut-and-paste. Maybe Word adds the nasty
>permission restrictions that PDF enables. FWIW, some readers other
>than Adobe Acrobat are sensible enough to ignore those
>permissions--e.g. GhostView (not sure about OSX's Preview in this

Please take a look at Xpdf. http://www.foolabs.com/xpdf/
There's a GUI doc viewer that's better than Acroread in
some ways and worse in others. You can select-copy a rectangle of
text and paste it elsewhere, a critical feature missing from Acroread.
It also comes with some command-line utilities for extracting text
from a PDF. As shipped, it respects the PDF copy-permission bit,
but it's easy to find that check and disable it in the source
if you do things like that.
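
For anyone who wants to script the extraction, here is a minimal
sketch that wraps pdftotext, the text extractor shipped with Xpdf.
It assumes pdftotext is installed and on your PATH; the function
names are my own, made up for illustration. Passing "-" as the
output file sends the text to stdout.

```python
import subprocess


def pdftotext_command(pdf_path, keep_layout=False):
    """Build the argument list for Xpdf's pdftotext, writing to stdout."""
    cmd = ["pdftotext", "-enc", "UTF-8"]  # ask for UTF-8 output
    if keep_layout:
        cmd.append("-layout")  # try to keep the page's physical layout
    cmd += [pdf_path, "-"]  # "-" as the output file means stdout
    return cmd


def extract_text(pdf_path, keep_layout=False):
    """Run pdftotext (assumed to be on PATH) and return the text."""
    result = subprocess.run(
        pdftotext_command(pdf_path, keep_layout),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Calling extract_text("rebuttal.pdf") (a hypothetical file name) would
return the document's text as one string. If the PDF has the
copy-permission bit set, a stock pdftotext will likely refuse and
the call will raise an error.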

Please consider sharing documents in .sxw (OpenOffice.org Writer) rather
than .DOC (Microsoft Word). Similar functionality but the tools are
free (as in speech) and the format is open, and comparable document
files are usually much smaller. (A .sxw is actually
a collection of XML and image files in .ZIP archive format.)
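
Because a .sxw is an ordinary ZIP archive, any ZIP tool can open it.
A minimal sketch using Python's standard zipfile module follows; the
function names are hypothetical, and it assumes the document body
lives in a member called content.xml, as it does in the Writer files
I have looked at.

```python
import zipfile


def list_members(path_or_file):
    """Return the names of all files stored inside the .sxw archive."""
    with zipfile.ZipFile(path_or_file) as archive:
        return archive.namelist()


def read_member(path_or_file, name="content.xml"):
    """Return one archive member, decoded as UTF-8 text."""
    with zipfile.ZipFile(path_or_file) as archive:
        return archive.read(name).decode("utf-8")
```

For example, list_members("draft.sxw") (a hypothetical file name)
lists the XML and image files in the archive, and
read_member("draft.sxw") returns the raw XML of the document body.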

Sorry if this is old news; I'm new here.


