manuals in pdf (resolution, compression) from Antonio Carlini on 2004-06-27 (2004-June)

From: Antonio Carlini <a.carlini_at_ntlworld.com>
Date: Sun Jun 27 16:47:56 2004

> It seems to be bloody horrible on modern versions of Linux, too :-(
> (well, at least on Redhat 9) Slow as hell, plus the rendering quality is
> pretty awful.

I have a 2.8GHz Intel box running Debian at
work and that works quite nicely (Firefox
and xpdf, IIRC). I also have a 450MHz Intel
box running Windows, and Acrobat 6 on that
runs like something that died quite a
while ago :-)

> True that most systems have PDF viewers, but they're more likely to have
> an image displayer and a text editor ;-)

For anything current, I expect all three to be
available. For anything classic, you cannot
expect more than a text editor to be around.
But I don't expect to read schematics on a
WT78!

> What's the current licencing for PDF tools? I've pretty much avoided it
> since the days when the reader was free, but anything (at least from
> Adobe) which created or manipulated PDF files cost $$$
>
> I believe the data format itself was copyrighted - but presumably isn't
> these days what with all the 3rd-party viewers out there?

The Adobe stuff you pay for (although the reader is free
on a few platforms). The specifications are freely
available and I've not seen anything that says you
cannot implement your own reader. Perhaps they want
you to pay if you implement a writer but I expect
not since Eric Smith's tumble is free.

> Hmm, how editable are PDF files by the way? On the OCR front, I'd expect
> anyone OCRing anything to proofread it afterwards and correct mistakes
> (which is of course vital for technical data anyway - technical data
> with mistakes in is useless!). So unless wordprocessor-like tools exist
> to edit PDF files then I wouldn't think they're much good as an
> intermediate format, because people need to be able to go in there and
> easily correct mistakes made by the OCR software.

You can edit a PDF but not quite in the same way as
you would edit a word processed document. You can
replace characters within a line but there seems to
be no "wrap" to the next line and so on. It's OK (if
somewhat painful) for the sort of OCR-error-correction
that you have in mind. Obviously you cannot edit
images (without extracting them, mangling them in
an image editor and re-inserting them).

I've scanned bucket loads of docs (small buckets
compared to others, but enough to fill 10 CDs or
so - probably 100 docs). That was not too bad with
an automated sheet-feeder and a decent scanner
that would drop a file onto a share. A bit of
post-processing (convert to G4, check for missing
pages) and you are done. Even that probably took
a day per doc on average. Throw in OCR and you
would magnify that by maybe a factor of three.
Throw in proof-redaing and you'd go mad. It's
trivially easy to miss errors so you would really
have to have each doc verified by multiple people.
And then you need to measure their error rate,
perhaps by injecting deliberate misatkes (did
you see it, or have you had to go back to look)
and verifying that they spot them.
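
The arithmetic behind injecting deliberate mistakes can be
sketched roughly as below. This is a plain error-seeding
estimate, not something anyone in this thread has actually
run: the function name and the example figures are made up
for illustration, and it assumes a proofreader spots real
OCR errors at the same rate as the planted ones, which may
well not hold in practice.

```python
def estimate_remaining_errors(seeded_total, seeded_found, real_found):
    """Estimate how many real errors a proofreader missed.

    seeded_total: deliberate mistakes planted in the text
    seeded_found: how many of those the proofreader caught
    real_found:   genuine errors the proofreader reported

    Returns (detection_rate, estimated_remaining). The second value is
    None when no seeded errors were found, since nothing can be inferred.
    Assumes real errors are caught at the same rate as seeded ones.
    """
    if seeded_total <= 0:
        raise ValueError("need at least one seeded error")
    if not 0 <= seeded_found <= seeded_total:
        raise ValueError("seeded_found must be between 0 and seeded_total")
    if real_found < 0:
        raise ValueError("real_found cannot be negative")

    detection_rate = seeded_found / seeded_total
    if seeded_found == 0:
        return detection_rate, None

    # estimated total real errors = real_found / detection_rate,
    # so the part still missing is real_found * missed / found.
    missed_seeds = seeded_total - seeded_found
    remaining = real_found * missed_seeds / seeded_found
    return detection_rate, remaining


# Hypothetical figures: 20 planted mistakes, 16 caught, 40 real errors reported.
rate, remaining = estimate_remaining_errors(20, 16, 40)
print(f"detection rate {rate:.0%}, about {remaining:.0f} real errors left")
```

With several proofreaders per doc you would run this per
person and compare the rates, which is where the effort
really starts to multiply.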

On top of that, if you go to all the trouble
of scanning stuff and verifying it to a certain
extent (and putting together a checksum in case
something goes wrong), surely you need to have
some kind of backup for when the disk dies? (I've
spent the last 24 hours recovering a dead 40GB
drive and the previous 24 hours - while waiting
for the replacement to arrive - wondering just
how dead it was and how much I'd not yet backed
up ...). I keep everything I've put any real
effort into on two CDs and I try to get it
hosted on at least one site somewhere.
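
For the checksum part, something along these lines would
do as a minimal sketch. It is only an illustration, not
the tooling I actually use: the function names are
invented, SHA-256 is an arbitrary choice of hash, and it
assumes the scans sit as ordinary files under a single
directory.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file under root (as a relative POSIX path) to its digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(root, manifest):
    """List manifest entries that are now missing or have a different digest."""
    current = build_manifest(root)
    return sorted(name for name, digest in manifest.items()
                  if current.get(name) != digest)
```

You would call build_manifest() once after verifying a
set of scans and keep the result alongside each backup
copy; running changed_files() against a restored copy
then lists anything that went missing or got corrupted
along the way.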

Antonio

-- 
---------------
Antonio Carlini arcarlini_at_iee.org

Received on Sun Jun 27 2004 - 16:47:56 BST
